Hi Ruslan,
I would like to ask about the best practice for handling the following situation with DVC.
Let’s suppose we have a big raw dataset stored in the cloud.
- First, we initialize a project (git + dvc) to work with this dataset. We would like to record how to get the dataset locally, so we use `dvc run` with a command like `wget <url>` to download the raw dataset. We commit the changes, etc.
- Next, we develop some scripts to preprocess the raw dataset into a dataset we call v0.1.0. We then use `dvc run` to execute the preprocessing on the raw dataset and record how dataset v0.1.0 is obtained. Thus, we have two folders: `raw_data` and `dataset_v0.1.0`. We commit and push with git and dvc. At this step we would like to keep only `dataset_v0.1.0` in the cache and remove `raw_data` from it.
- Another developer works on the project, so he/she uses git to clone it and runs `dvc checkout` or `dvc pull` to get the latest data version. After the last command he/she gets both `raw_data` and `dataset_v0.1.0` from the remote. (A sketch of these steps in commands follows below.)
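In commands, the workflow above would look roughly like this (the URL, script name, and repo path are placeholders, not actual values from our project):
$ git init
$ dvc init
$ dvc run -o raw_data wget https://example.com/raw_data
$ dvc run -d preprocess.py -d raw_data -o dataset_v0.1.0 \
          python preprocess.py raw_data dataset_v0.1.0
$ git add . && git commit -m "Add raw_data and dataset_v0.1.0 stages"
$ dvc push
# On the other developer's machine:
$ git clone <repo-url> myrepo && cd myrepo
$ dvc pull    # currently fetches both raw_data and dataset_v0.1.0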
Is it possible to handle the cache in such a manner that we keep only the latest version, without using in-place preprocessing, for a given commit?
Thank you
Hi @vfdev.5!
Great question! The proper way to go about it would be to lock your preprocessing stage (`dvc lock preprocess.dvc`) after you are done with it (i.e. when your `dataset_v0.1.0` is ready). By locking the stage you prevent dvc from looking into its dependencies, so that when you `dvc push`, it will not push the cache for your `wget` stage to the cloud. When you would like to update your raw data or dataset, you would simply `dvc unlock preprocess.dvc`, do what you need to do, and then `dvc lock` it back before pushing. Here is how it would look in commands:
$ mkdir myrepo
$ cd myrepo
$ git init
$ dvc init
$ dvc run -o raw_data wget https://example.com/raw_data
$ dvc run -f preprocess.dvc \
          -d preprocess.py -d raw_data \
          -o dataset_v0.1.0 \
          python preprocess.py raw_data dataset_v0.1.0
# Lock the stage to prevent dvc from checking its dependencies
# and also from pushing/pulling them.
$ dvc lock preprocess.dvc
$ git commit -am "Preprocess"
$ git push
# Will push dataset_v0.1.0 but won't push raw_data
$ dvc push

# Now on the colleague's side, dvc pull will not pull raw_data
$ git pull
$ dvc pull
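And when it is time to update things, the unlock/update/lock cycle described above would look something like this (just a sketch, assuming the same file names as in the example):
$ dvc unlock preprocess.dvc
# ...update raw_data and/or preprocess.py as needed...
$ dvc repro preprocess.dvc
$ dvc lock preprocess.dvc
$ git commit -am "Update dataset"
$ dvc push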
Also, if your `raw_data` is stored on one of the supported remotes (external directory/s3/gs/ssh/hdfs), you can also utilize our external dependency feature, which lets you remember the state of the external file and easily `dvc repro` if it is changed (i.e. s3://mybucket/raw_data is replaced with a new version). E.g. it would look something like:
# This will download `s3://mybucket/raw_data` into `raw_data` and
# create `raw_data.dvc`, which will have `s3://mybucket/raw_data`
# listed as a dependency (i.e. it will save its ETag). Later, when
# you call `dvc repro`, it will automatically check whether
# `s3://mybucket/raw_data` has changed and, if it has, download it again.
$ dvc import s3://mybucket/raw_data raw_data
$ dvc run -f preprocess.dvc \
          -d preprocess.py -d raw_data \
          -o dataset_v0.1.0 \
          python preprocess.py raw_data dataset_v0.1.0
# Now you can utilize dvc lock the same way as in the example above.
# But when it is time to check whether your original raw_data has been
# updated, you can simply run
$ dvc repro
# That will check whether s3://mybucket/raw_data is any different
# from what you had before and, if so, automatically reproduce the
# whole pipeline for you.
Here are some useful links:
- dvc lock command reference
- dvc import command reference
- some info about external dependencies
Please feel free to follow up with any questions; we are always happy to help!
Thanks,
Ruslan
Ruslan, thanks a lot for your prompt reply!