Using DVC to version dataset outside a project

Hello, I am very new to DVC, so maybe I am wrong about what DVC can do.

Here is my problem: I have a dataset of videos that is quite big (so I would rather not duplicate it). I am doing an ML project on it, so I have a Git repo for my code. The dataset may evolve in the future (addition of new data, new labels, new features such as transcriptions, etc.). To be reproducible, I was thinking of versioning the dataset alongside my code, so that when I run experiments I know which version of the dataset was used. However, it seems impossible to add a dataset if it is not stored within the project repo, even with a symlink. I don't want to move the dataset into the repo because it may be used for other projects in the future. Is it possible to do this with DVC? I cannot find in the documentation a way of adding datasets that are not in the project repo but are stored locally. Note that there is no plan to put the dataset online or on some cloud service.

Thanks for the help,

Ravi

If the dataset is always going to be local, but never in the repo, you could add a pipeline stage with `always_changed: true` that computes some identifying information about it (like a hash) and writes it to a file. Then have your other stages depend on that file, so they re-run whenever the dataset's fingerprint changes.

stages:
  calc-hash:
    cmd: python scripts/calc-hash.py
    deps:
      - scripts/calc-hash.py
    outs:
      - data/my-dataset-info.json
    always_changed: true
  gen-features:
    cmd: python scripts/gen-features.py
    deps:
      - scripts/gen-features.py
      - data/my-dataset-info.json
    outs:
      - data/features.csv
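
A minimal sketch of what `scripts/calc-hash.py` could look like. The dataset path and the fingerprint scheme (hashing file names, sizes, and mtimes rather than full contents, which would be slow for large videos) are assumptions; adjust them to your layout:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical location of the external dataset -- adjust to your setup.
DATASET_DIR = Path("/mnt/storage/video-dataset")
OUT_FILE = Path("data/my-dataset-info.json")


def dataset_fingerprint(root: Path) -> str:
    """Hash relative paths, sizes, and mtimes of all files so that any
    addition, removal, or modification changes the fingerprint."""
    h = hashlib.md5()
    for p in sorted(root.rglob("*")):
        if p.is_file():
            stat = p.stat()
            h.update(str(p.relative_to(root)).encode())
            h.update(str(stat.st_size).encode())
            h.update(str(int(stat.st_mtime)).encode())
    return h.hexdigest()


if __name__ == "__main__":
    info = {"path": str(DATASET_DIR), "hash": dataset_fingerprint(DATASET_DIR)}
    OUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    OUT_FILE.write_text(json.dumps(info, indent=2))
```

Because the stage is `always_changed`, DVC reruns it on every `dvc repro`, but downstream stages like `gen-features` only rerun when the content of `data/my-dataset-info.json` actually changes.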