Hello, I am very new to DVC, so maybe I'm wrong about what DVC can do.
Here is my problem: I have a dataset of videos that is quite big (so I would rather not duplicate it). I am doing an ML project on it, so I have a Git repo for my code. The dataset may evolve in the future (addition of new data, new labels, new features such as transcriptions, etc.). To stay reproducible, I was thinking of versioning the dataset alongside my code, so that when I run experiments I know which version of the dataset was used.

However, it seems impossible to add a dataset if it is not stored within the project repo, even with a symlink. I don't want to move the dataset into the repo because it may be used for other projects in the future. Is it possible to do this with DVC? I cannot find in the documentation a way to add a dataset that is not in the project repo but is stored locally. Note that there is no plan to put the dataset online or on a cloud service.
Thanks for the help,
Ravi
If the dataset is always going to be local, but never in the repo, you could add an always_changed pipeline stage that creates some identifying information about it (like a hash) and writes it to a file. Then have your other stages depend on that file. For example:
stages:
  calc-hash:
    cmd: python scripts/calc-hash.py
    deps:
      - scripts/calc-hash.py
    outs:
      - data/my-dataset-info.json
    always_changed: true
  gen-features:
    cmd: python scripts/gen-features.py
    deps:
      - scripts/gen-features.py
      - data/my-dataset-info.json
    outs:
      - data/features.csv
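For reference, scripts/calc-hash.py could look something like the sketch below. It is only an illustration: the dataset path (/mnt/storage/video-dataset) is a hypothetical stand-in for wherever your videos actually live, and MD5 is just one choice of hash.

# Minimal sketch of scripts/calc-hash.py: hash an external dataset and
# write a small summary file that DVC stages can depend on.
import hashlib
import json
from pathlib import Path

DATASET_DIR = Path("/mnt/storage/video-dataset")  # hypothetical external location
OUTPUT_FILE = Path("data/my-dataset-info.json")


def file_md5(path: Path, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Return the MD5 hex digest of a single file, read in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def main() -> None:
    # Hash every file in the dataset (sorted, so the result is deterministic),
    # then combine the per-file hashes into one overall dataset hash.
    files = sorted(p for p in DATASET_DIR.rglob("*") if p.is_file())
    per_file = {str(p.relative_to(DATASET_DIR)): file_md5(p) for p in files}

    overall = hashlib.md5()
    for rel_path, digest in per_file.items():
        overall.update(rel_path.encode())
        overall.update(digest.encode())

    info = {
        "dataset_dir": str(DATASET_DIR),
        "num_files": len(per_file),
        "dataset_hash": overall.hexdigest(),
    }
    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    OUTPUT_FILE.write_text(json.dumps(info, indent=2))


if __name__ == "__main__":
    main()

Because the stage is marked always_changed: true, DVC re-runs it on every dvc repro, so the JSON file (and every stage depending on it) is refreshed whenever the external data changes.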
Hello Pete. In your suggestion, it is not clear whether the calc-hash pipeline stage uses DVC to version the external dataset into the DVC cache, or whether it is manual versioning without saving to the cache (defeating the purpose of DVC).
It does not use DVC to version the dataset. It uses DVC to manage what happens if the dataset changes, and its hash is kept in the repo (actually it should just be kept in Git, not the DVC cache, i.e., using cache: false in the output definition). This way, the outputs of the gen-features stage, for example, can be traced back to an exact hash of the source data, even though the data itself is kept out of the repo.
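Concretely, that would mean changing the outs entry of the calc-hash stage to something like this (same file names as in the sketch above):

stages:
  calc-hash:
    cmd: python scripts/calc-hash.py
    deps:
      - scripts/calc-hash.py
    outs:
      - data/my-dataset-info.json:
          cache: false
    always_changed: true

With cache: false, DVC still tracks the file as a stage output but leaves versioning it to Git, which suits a small JSON file like this.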