Hello, I am very new to DVC, and I have a question.
I am trying to add my data to DVC. Because my dataset is very heavy, it was not included in my GitHub repository. I cloned the repository and initialized DVC, and now I want to add the dataset with dvc add data, but the path to the dataset is outside the GitHub repository, so I ended up with this error:
ERROR: Cached output(s) outside of DVC project
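For reference, the layout looks roughly like this (paths are illustrative):

```
~/projects/my-project/        # the cloned GitHub repo, DVC initialized here
/data/Datasets/my-dataset/    # 40 GB of images + JSONL annotations

$ cd ~/projects/my-project
$ dvc add /data/Datasets/my-dataset    # fails with the error above
```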
So I was wondering: how can I add a dataset that is located on the same machine but outside the project repository? I searched the documentation but didn't find anything about this.
@dimrom by default the assumption is that your data is part of your project (and thus, let’s say, a training dataset is located in the same directory as the project). When you do dvc add data, data is put into .gitignore, so it won’t go into Git.
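For example, with a dataset inside the project, the typical flow looks like this:

```
$ dvc add data
$ cat .gitignore      # dvc add appended the data path here
/data
$ git add data.dvc .gitignore
$ git commit -m "Track dataset with DVC"   # only the small .dvc pointer goes to Git
```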
Could you describe a bit what kind of data it is, why it is located outside, whether that is a hard requirement, etc.? There are other options, but I would need to know more to answer and recommend something.
Hello @shcheklein, thanks for your response.
Because the dataset is too big (40 GB), I didn't push it to GitHub. The dataset is composed of images and their annotations in JSONL.
The project is on GitHub without the dataset. I cloned the repository on a new machine where the dataset is located in another directory, but on the same machine. It is common practice, when you work with a big dataset, to keep it outside the project repository, because the same dataset can serve many projects.
We have a Datasets directory where we keep many different datasets, so someone just needs to change the path in their project to access the dataset they want to work on.
To summarize: I cloned a project from GitHub, and I have the dataset on the same machine but outside the project repository, and I would like to know how to use DVC with this setup.
> We have a Datasets directory where we keep many different datasets, so someone just needs to change the path in their project to access the dataset they want to work on.
But then you have a bunch of directories with duplicate files inside them in Datasets, right? (Just trying to see whether you have a more sophisticated setup or not.)
> I would like to know how to use DVC with this setup.
You could use a shared cache (see How to Share a Cache Among Projects): in that case there will be no duplicates, and people can work on their own versions of the datasets without affecting each other.
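A minimal sketch of that setup (the cache path here is just an example):

```
# one-time: create the shared cache directory outside any project
$ mkdir -p /data/dvc-cache

# in each project that should use it
$ dvc cache dir /data/dvc-cache
$ dvc config cache.shared group     # relax permissions so teammates can reuse objects
$ dvc config cache.type symlink     # link into the workspace instead of copying 40 GB
$ git add .dvc/config
$ git commit -m "Use shared DVC cache"
```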
You could also use a Data Registry (plus the shared cache): one repo to track all datasets and their versions. Project repos then use dvc import to “fetch” data into them. The shared cache helps avoid duplication.
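E.g., assuming a hypothetical registry repo at github.com/example-org/dataset-registry:

```
# in a project repo: pull one dataset out of the registry
$ dvc import https://github.com/example-org/dataset-registry \
      datasets/my-dataset -o data/my-dataset

# later, pick up a newer version published in the registry
$ dvc update data/my-dataset.dvc
```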
You could use DataChain to index that directory of files (from each project), extract metadata from the JSON files (to then filter by it, etc.), and then use the to_storage command to materialize the data with symlinks (again, to avoid copying).
If you are interested, we can jump on a call and discuss different options and I can show you some examples.
I will try the first option this morning; I hope it will work. I am not in the US, so it is very late here for a call; we could have one in ten hours or more, I don't know.