Handle cache to keep the latest version of data

Hi Ruslan,

I would like to ask about the best practice for handling the following situation with DVC.
Let’s suppose we have a big raw dataset stored in the cloud.

  • We initialize a project (git + dvc) to work with this dataset. At first we would like to trace how to get the dataset locally. For this we use dvc run with a command such as wget url to download the raw dataset. We commit the changes, etc.
  • Next we develop some scripts to preprocess the raw dataset into a dataset we call v0.1.0. We then use dvc run to run the preprocessing on the raw dataset and record how to obtain the dataset v0.1.0. Thus, we have two folders, raw_data and dataset_v0.1.0. We commit and push with git and dvc. At this step we would like to keep only dataset_v0.1.0 in the cache and remove raw_data from it.
  • Another developer works on the project, so he/she uses git to clone the project and runs dvc checkout or dvc pull to get the latest data version. After the last command he/she pulls both raw_data and dataset_v0.1.0 from the remote.

Is it possible to handle the cache in such a manner that we keep only the latest version, without using in-place preprocessing, for a given commit?

Thank you

Hi @vfdev.5 !

Great question! The proper way to go about it is to lock your preprocessing stage (dvc lock preprocess.dvc) once you are done with it (i.e. when your dataset_v0.1.0 is ready). Locking the stage prevents dvc from looking into its dependencies, so when you dvc push it will not push the cache for your wget stage to the cloud. When you would like to update your raw_data or dataset, you simply dvc unlock preprocess.dvc, do what you need to do, and then dvc lock it back before pushing. Here is how it would look in commands:

$ mkdir myrepo
$ cd myrepo
$ git init
$ dvc init
$ dvc run -o raw_data wget https://example.com/raw_data
$ dvc run -f preprocess.dvc \
          -d preprocess.py -d raw_data \
          -o dataset_v0.1.0 \
          python preprocess.py raw_data dataset_v0.1.0
# Locking the stage to prevent dvc from checking its dependencies
# and also from pushing/pulling them.
$ dvc lock preprocess.dvc
$ git add .
$ git commit -m "Preprocess"
$ git push
# Will push dataset_v0.1.0 but won't push raw_data
$ dvc push
# Now on the colleague's side dvc pull will not pull raw_data
$ git pull
$ dvc pull
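To make the unlock/lock cycle mentioned above concrete, here is a sketch of what a later update of the raw data might look like. This assumes the same preprocess.dvc stage file as above; the exact commit message and stage names are illustrative.

```shell
# Unlock the preprocessing stage so dvc checks its dependencies again
$ dvc unlock preprocess.dvc
# Re-run the pipeline; this re-downloads raw_data if needed and
# regenerates dataset_v0.1.0
$ dvc repro preprocess.dvc
# Lock the stage again so raw_data is excluded from push/pull
$ dvc lock preprocess.dvc
$ git commit -am "Update dataset"
# Pushes the new dataset_v0.1.0 cache, but not raw_data
$ dvc push
$ git push
```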

Also, if your raw_data is stored on one of the supported remotes (external directory/s3/gs/ssh/hdfs), you can utilize our external dependency feature, which remembers the state of the external file and lets you easily dvc repro when it changes (e.g. when s3://mybucket/raw_data is replaced with a new version). It would look something like this:

# This will download `s3://mybucket/raw_data` into `raw_data` and
# create `raw_data.dvc` with `s3://mybucket/raw_data` listed as a
# dependency (i.e. it will save its ETag). Later, when you call
# `dvc repro`, dvc will automatically check whether
# `s3://mybucket/raw_data` has changed and, if it has, download it again.
$ dvc import s3://mybucket/raw_data raw_data
$ dvc run -f preprocess.dvc \
          -d preprocess.py -d raw_data \
          -o dataset_v0.1.0 \
          python preprocess.py raw_data dataset_v0.1.0
# Now you can use dvc lock the same way as in the example above.
# When it is time to check whether your original raw_data has been
# updated, simply run
$ dvc repro
# This will check whether s3://mybucket/raw_data differs from what
# you had before and, if it does, automatically reproduce the whole
# pipeline for you.
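If you only want to see whether anything changed without reproducing the pipeline, dvc status can be used as a lighter-weight check. This is a sketch assuming the raw_data.dvc file created by the dvc import above:

```shell
# Compare the saved state (ETag) of s3://mybucket/raw_data against
# the current object in the bucket, without downloading anything
$ dvc status raw_data.dvc
```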

Here are some useful links:
dvc lock command reference
dvc import command reference
some info about external dependencies

Please feel free to follow up with any questions, we are always happy to help 🙂



Ruslan, thanks a lot for your prompt reply !