Hello, I am very new to DVC, so I may be wrong about what DVC can do.
Here is my problem: I have a fairly large dataset of videos (so I would rather not duplicate it). I am doing an ML project on it, so I have a Git repo for my code. The dataset may evolve in the future (new data, new labels, new features such as transcriptions, etc.). For reproducibility, I was thinking of versioning the dataset alongside my code, so that when I run experiments I know which version of the dataset they used. However, it seems impossible to add a dataset that is not stored inside the project repo, even through a symlink. I don't want to move the dataset into the repo because it may be used for other projects in the future. Is it possible to do this with DVC? I cannot find anything in the documentation about adding datasets that are stored locally but outside the project repo. Note that there is no plan to put the dataset online or on a cloud service.
Thanks for the help,
Ravi