Hello, I am very new to DVC, so maybe I'm wrong about what DVC can do.
Here is my problem: I have a dataset of videos that is quite big (so I would rather not duplicate it). I am doing an ML project on it, so I have a Git repo for my code. The dataset may evolve in the future (addition of new data, new labels, new features such as transcriptions, etc.). To stay reproducible, I was thinking of versioning the dataset alongside my code, so that when I run experiments I know which version of the dataset was used.

However, it seems impossible to add a dataset if it is not stored within the project repo, even with a symlink. I don't want to move the dataset into the repo because it may be used for other projects in the future. Is it possible to do this with DVC? I cannot find in the documentation a way to add a dataset that is not in the project repo but is stored locally. Note that there is no plan to put the dataset online or on a cloud service.
Thanks for the help,
Ravi
If the dataset is always going to be local, but never in the repo, you could add an always_changed pipeline stage that creates some identifying information about it (like a hash) and writes it to a file. Then have your other stages depend on that file. For example:
stages:
  calc-hash:
    cmd: python scripts/calc-hash.py
    deps:
      - scripts/calc-hash.py
    outs:
      - data/my-dataset-info.json
    always_changed: true
  gen-features:
    cmd: python scripts/gen-features.py
    deps:
      - scripts/gen-features.py
      - data/my-dataset-info.json
    outs:
      - data/features.csv
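For reference, scripts/calc-hash.py could look something like the sketch below. It is only an illustration: the dataset path (/mnt/storage/video-dataset) is a hypothetical stand-in for wherever your videos actually live, and MD5 is just one choice of hash.

# Minimal sketch of scripts/calc-hash.py: hash an external dataset and
# write a small summary file that DVC stages can depend on.
import hashlib
import json
from pathlib import Path

DATASET_DIR = Path("/mnt/storage/video-dataset")  # hypothetical external location
OUTPUT_FILE = Path("data/my-dataset-info.json")


def file_md5(path: Path, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Return the MD5 hex digest of a single file, read in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def main() -> None:
    # Hash every file in the dataset (sorted, so the result is deterministic),
    # then combine the per-file hashes into one overall dataset hash.
    files = sorted(p for p in DATASET_DIR.rglob("*") if p.is_file())
    per_file = {str(p.relative_to(DATASET_DIR)): file_md5(p) for p in files}

    overall = hashlib.md5()
    for rel_path, digest in per_file.items():
        overall.update(rel_path.encode())
        overall.update(digest.encode())

    info = {
        "dataset_dir": str(DATASET_DIR),
        "num_files": len(per_file),
        "dataset_hash": overall.hexdigest(),
    }
    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    OUTPUT_FILE.write_text(json.dumps(info, indent=2))


if __name__ == "__main__":
    main()

Because the stage is marked always_changed: true, DVC re-runs it on every dvc repro, so the JSON file (and every stage depending on it) is refreshed whenever the external data changes.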
Hello Pete. In your suggestion, it is not clear whether the calc-hash pipeline stage uses DVC to version the external dataset into the DVC cache, or whether it is manual versioning without saving to the cache (defeating the purpose of DVC).
It does not use DVC to version the dataset. It uses DVC to manage what happens if the dataset changes, and its hash is kept in the repo (actually it should just be kept in Git, not the DVC cache, i.e., using cache: false in the output definition). This way, the outputs of the gen-features stage, for example, can be traced back to an exact hash of the source data, even though the data itself is kept out of the repo.
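Concretely, that would mean changing the outs entry of the calc-hash stage to something like this (same file names as in the sketch above):

stages:
  calc-hash:
    cmd: python scripts/calc-hash.py
    deps:
      - scripts/calc-hash.py
    outs:
      - data/my-dataset-info.json:
          cache: false
    always_changed: true

With cache: false, DVC still tracks the file as a stage output but leaves versioning it to Git, which suits a small JSON file like this.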