Remote Workspaces#
Oxen has the concept of a "remote workspace" to enable easy data collection and labeling workflows. There are two main types of data you might want to stage:
Unstructured data files (images, videos, audio, text)
Structured annotations (rows for tabular data frames)
Instead of cloning the entire dataset locally (which can take a lot of time, bandwidth, and storage) you can stage data directly on the remote server.
The commands you are used to working with in your local workspace (status, add, commit, etc.) now work with the remote workspace. Each user's changes are sandboxed to their own identity, so what you add to a remote workspace will not overlap with other users' changes.
Staging Files#
One problem with extending a dataset today is that you have to download the whole data repository locally to add a single data point. This is not ideal for large datasets. To avoid this extra step, Oxen has the remote subcommand.
Shallow Clone#
To start, you can clone a repository with the --shallow flag. This flag downloads the metadata about the remote files, but not the files themselves. To make sure you are on the correct branch, you should also pass the -b flag.
$ oxen clone https://hub.oxen.ai/ox/CatDogBoundingBox --shallow -b branch-name
$ cd CatDogBoundingBox
$ ls # note that no files have been pulled, leaving your repo in a shallow state
Note: When you do a shallow clone, your local commands will not work until you run oxen pull to fetch the data. Pulling a branch will get you back to a fully synced state.
Create Remote Branch#
After you have a shallow clone, you can create a local branch and push it to the remote. Every remote branch has a remote workspace that is tied to the branch.
$ oxen checkout -b add-images
$ oxen push origin add-images
Check Remote Status#
Now that you have created a remote branch, you can interact with the remote workspace with the oxen remote subcommand. By default, oxen remote operates on the remote server's copy of the branch you currently have checked out.
$ oxen remote status
Remote Add File#
To add a file to the remote workspace, use oxen remote add.
$ oxen remote add image.jpg
For relative paths, oxen will mirror the directory structure you have locally.
$ mkdir my-images/ # create local dir
$ cp /path/to/image.jpg my-images/ # add image to local dir
$ oxen remote add my-images/image.jpg # upload image to the remote workspace in the my-images/ directory
For absolute paths to a file, you will also need to specify the destination directory in the remote workspace with the -p flag.
$ oxen remote add /path/to/image.jpg -p my-images # upload image to the remote workspace
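The relative versus absolute path rule above can be sketched as a small shell helper. This is only an illustration: remote_add_cmd is a hypothetical function name, and it prints the command rather than running it, so it makes no assumptions about oxen beyond the two invocations already shown.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oxen): prints the `oxen remote add`
# command that matches the path rules described above.
#   $1 = path to the file
#   $2 = destination directory, only needed when $1 is an absolute path
# Note: paths containing spaces would need extra quoting before reuse.
remote_add_cmd() {
  local path="$1"
  local dest_dir="${2:-}"
  case "$path" in
    /*)
      # Absolute path: a destination directory must be given via -p.
      if [ -z "$dest_dir" ]; then
        echo "error: absolute path needs a destination directory" >&2
        return 1
      fi
      printf 'oxen remote add %s -p %s\n' "$path" "$dest_dir"
      ;;
    *)
      # Relative path: the local directory structure is mirrored remotely.
      printf 'oxen remote add %s\n' "$path"
      ;;
  esac
}

remote_add_cmd my-images/image.jpg
remote_add_cmd /path/to/image.jpg my-images
```

Printing the command first makes it easy to review what would be uploaded, and where, before running it by hand.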
You can now use the oxen remote status command to see the files that are staged on the remote branch.
$ oxen remote status
Checking remote branch add-images -> 6f98e855fbc0fd1
Directories to be committed
added: my-images with 1 file
Files to be committed:
new file: my-images/image.jpg
Delete Remotely Added File#
If you accidentally add a file to the remote workspace and want to remove it, you can unstage it with oxen remote rm.
(TODO: right now the functionality only operates on the workspace regardless of the --staged flag; we might want to allow remote removal of files and directories.)
$ oxen remote rm --staged my-images/image.jpg
Commit Staged Files#
When you are ready to commit the staged data, you can call the oxen remote commit command.
$ oxen remote commit -m "adding my file without pulling the whole repo"
You have now committed data to the remote branch without cloning the full repo.
Note: If the remote branch cannot be merged cleanly, the remote commit will fail, and you will have to resolve the merge conflicts with some more advanced commands which we will cover later.
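Putting the steps so far together, the branch, add, and commit flow might be scripted roughly as follows. This is a sketch under assumptions: run and stage_and_commit are hypothetical helper names, the DRY_RUN variable is an invention of this example, and the script is assumed to start inside an existing shallow clone. Only oxen commands already shown above are used.

```shell
#!/usr/bin/env bash
# Sketch of the remote staging workflow described above.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: create a remote branch, stage one file, and commit it.
#   $1 = branch name, $2 = relative file path, $3 = commit message
stage_and_commit() {
  local branch="$1"
  local file="$2"
  local message="$3"
  run oxen checkout -b "$branch"        # create the local branch
  run oxen push origin "$branch"        # create the matching remote branch
  run oxen remote add "$file"           # stage the file in the remote workspace
  run oxen remote status                # review what is staged
  run oxen remote commit -m "$message"  # commit on the remote branch
}

stage_and_commit add-images my-images/image.jpg "adding my file without pulling the whole repo"
```

Setting DRY_RUN=0 would execute the commands for real. The sketch does not handle the merge conflict case mentioned in the note above.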
Remote Log#
To see a list of remote commits on the branch, you can use oxen remote log. Your latest commit will be at the top of this list.
$ oxen remote log
Staging Tabular Data#
Commonly, you will want to tie some sort of annotation to your unstructured data. For example, you might want to label an image with a bounding box, or a video with a bounding box and a class label.
Oxen has native support for extending and managing structured DataFrames in the form of csv, jsonl, or parquet files. To interact with these files remotely, you can use the oxen remote df command.
We will be focusing on adding data to these files, but you can also use the oxen remote df command to view the contents of a DataFrame with all the same parameters you would use locally. (TODO: add link to df docs.)
$ oxen remote df annotations/train.csv # get a summary of the DataFrame
Full shape: (9000, 6)
Slice shape: (10, 6)
┌─────────────────────────┬────────┬───────┬────────┬────────┬────────┐
│ file                    ┆ height ┆ label ┆ min_x  ┆ min_y  ┆ width  │
│ ---                     ┆ ---    ┆ ---   ┆ ---    ┆ ---    ┆ ---    │
│ str                     ┆ f64    ┆ str   ┆ f64    ┆ f64    ┆ f64    │
╞═════════════════════════╪════════╪═══════╪════════╪════════╪════════╡
│ images/000000128154.jpg ┆ 129.58 ┆ cat   ┆ 0.0    ┆ 19.27  ┆ 130.79 │
│ images/000000544590.jpg ┆ 188.35 ┆ cat   ┆ 9.75   ┆ 13.49  ┆ 214.25 │
│ images/000000000581.jpg ┆ 116.08 ┆ dog   ┆ 49.37  ┆ 67.79  ┆ 74.29  │
│ images/000000236841.jpg ┆ 42.29  ┆ cat   ┆ 115.21 ┆ 96.65  ┆ 93.87  │
│ ...                     ┆ ...    ┆ ...   ┆ ...    ┆ ...    ┆ ...    │
│ images/000000201969.jpg ┆ 64.94  ┆ dog   ┆ 167.24 ┆ 73.99  ┆ 37.0   │
│ images/000000201969.jpg ┆ 38.95  ┆ dog   ┆ 110.81 ┆ 83.87  ┆ 18.02  │
│ images/000000201969.jpg ┆ 18.55  ┆ dog   ┆ 157.04 ┆ 133.63 ┆ 38.63  │
│ images/000000201969.jpg ┆ 71.11  ┆ dog   ┆ 97.72  ┆ 110.2  ┆ 35.9   │
└─────────────────────────┴────────┴───────┴────────┴────────┴────────┘
Say you want to add a bounding box annotation to this DataFrame without cloning it locally. You can use the --add-row flag on the oxen remote df command to remotely stage a row on the DataFrame.
TODO: change the remote status to not be modified but be added
$ oxen remote df annotations/train.csv --add-row "my-images/image.jpg,dog,100,100,200,200"
shape: (1, 7)
┌──────────────────────────────────┬──────────────────────┬───────┬───────┬───────┬───────┬────────┐
│ _id                              ┆ file                 ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │
│ ---                              ┆ ---                  ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    │
│ str                              ┆ str                  ┆ str   ┆ f64   ┆ f64   ┆ f64   ┆ f64    │
╞══════════════════════════════════╪══════════════════════╪═══════╪═══════╪═══════╪═══════╪════════╡
│ 744bc2f5736472a0b8fec3339bf14615 ┆ my-images/image3.jpg ┆ dog   ┆ 100.0 ┆ 100.0 ┆ 200.0 ┆ 200.0  │
└──────────────────────────────────┴──────────────────────┴───────┴───────┴───────┴───────┴────────┘
This returns a unique ID for the row that we can use as a handle to interact with the specific row in the remote workspace. To list the added rows on the DataFrame, you can use the oxen remote diff command.
$ oxen remote diff annotations/train.csv
Added Rows
shape: (2, 7)
┌──────────────────────────────────┬──────────────────────┬───────┬───────┬───────┬───────┬────────┐
│ _id                              ┆ file                 ┆ label ┆ min_x ┆ min_y ┆ width ┆ height │
│ ---                              ┆ ---                  ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    │
│ str                              ┆ str                  ┆ str   ┆ f64   ┆ f64   ┆ f64   ┆ f64    │
╞══════════════════════════════════╪══════════════════════╪═══════╪═══════╪═══════╪═══════╪════════╡
│ 822ac1facbd79444f1f33a2a0b2f909d ┆ my-images/image2.jpg ┆ dog   ┆ 100.0 ┆ 100.0 ┆ 200.0 ┆ 200.0  │
│ ab8e28d66d21934f35efcb9af7ce866f ┆ my-images/image3.jpg ┆ dog   ┆ 100.0 ┆ 100.0 ┆ 200.0 ┆ 200.0  │
└──────────────────────────────────┴──────────────────────┴───────┴───────┴───────┴───────┴────────┘
If you want to delete a staged row, you can delete it with the --delete-row flag and the value in the _id column.
$ oxen remote df annotations/train.csv --delete-row 822ac1facbd79444f1f33a2a0b2f909d
To clear all staged rows, you can use the restore subcommand to restore the file.
$ oxen remote restore --staged annotations/train.csv
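Since --add-row takes a single comma-separated string, it can help to check the number of fields before staging a row. The sketch below is an illustration only: add_row_cmd and EXPECTED_FIELDS are hypothetical names, the field order (file, label, min_x, min_y, width, height) is assumed from the staged-row output shown earlier, and no field is assumed to contain a comma. It prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oxen): validates a comma-separated row
# and prints the matching `oxen remote df --add-row` command.
# Assumed column order: file,label,min_x,min_y,width,height
EXPECTED_FIELDS=6

#   $1 = path of the DataFrame in the repo, $2 = comma-separated row
add_row_cmd() {
  local df_path="$1"
  local row="$2"
  local n
  # Count comma-separated fields; assumes no field contains a comma.
  n=$(printf '%s\n' "$row" | awk -F',' '{ print NF }')
  if [ "$n" -ne "$EXPECTED_FIELDS" ]; then
    echo "error: expected $EXPECTED_FIELDS fields, got $n: $row" >&2
    return 1
  fi
  printf 'oxen remote df %s --add-row "%s"\n' "$df_path" "$row"
}

add_row_cmd annotations/train.csv "my-images/image.jpg,dog,100,100,200,200"
```

A row with the wrong field count is rejected before anything reaches the remote workspace, which avoids a later --delete-row or restore to clean it up.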