Hello, I am very new to DVC, and I have a question.
I am trying to add my data to DVC. Because my dataset is very heavy, it was not included in my GitHub repository. I cloned the repository and initialized DVC, and now I want to add the dataset with dvc add data, but the path to the dataset is outside the GitHub repository, so I ended up with this error:
ERROR: Cached output(s) outside of DVC project
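For reference, the layout looks roughly like this (paths are illustrative):

```
~/projects/my-project/        # the cloned GitHub repo, DVC initialized here
/data/Datasets/my-dataset/    # 40 GB of images + JSONL annotations

$ cd ~/projects/my-project
$ dvc add /data/Datasets/my-dataset    # fails with the error above
```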
So I was wondering: how can I add a dataset that is located on the same machine but outside the project repository? I searched the documentation but didn't find anything about this.
@dimrom by default the assumption is that your data is part of your project (and thus, let’s say, a training dataset is located in the same directory as the project). When you do dvc add data, data is put into .gitignore, so it won’t go into Git.
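For example, with a dataset inside the project, the typical flow looks like this:

```
$ dvc add data
$ cat .gitignore      # dvc add appended the data path here
/data
$ git add data.dvc .gitignore
$ git commit -m "Track dataset with DVC"   # only the small .dvc pointer goes to Git
```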
Could you describe a bit what kind of data it is, why it is located outside, whether that is a hard requirement, etc.? There are other options, but I would need to know more to answer and recommend something.
Hello @shcheklein, thanks for your response.
Because the dataset is too big (40 GB), I didn't push it to GitHub. The dataset is composed of images and their annotations in JSONL.
The project is on GitHub without the dataset. I cloned the repository on a new machine where the dataset is located in another directory, but on the same machine. It is common practice, when you work with a big dataset, to keep it outside the project repository, because the same dataset can serve many projects.
We have a Datasets directory where we keep many different datasets, so someone just needs to change the path in their project to access the dataset they want to work on.
To summarize: I cloned a project from GitHub, and I have the dataset on the same machine but outside the project repository, and I would like to know how to use DVC with this setup.
> We have a Datasets directory where we keep many different datasets, so someone just needs to change the path in their project to access the dataset they want to work on.
But then you have a bunch of directories with duplicate files inside them in Datasets, right? (Just trying to see whether you have a more sophisticated setup or not.)
> I would like to know how to use DVC with this setup.
You could use a shared cache (see How to Share a Cache Among Projects): in that case there will be no duplicates, and people can work on their own versions of the datasets without affecting each other.
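A minimal sketch of that setup (the cache path here is just an example):

```
# one-time: create the shared cache directory outside any project
$ mkdir -p /data/dvc-cache

# in each project that should use it
$ dvc cache dir /data/dvc-cache
$ dvc config cache.shared group     # relax permissions so teammates can reuse objects
$ dvc config cache.type symlink     # link into the workspace instead of copying 40 GB
$ git add .dvc/config
$ git commit -m "Use shared DVC cache"
```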
You could also use a Data Registry (plus the shared cache): one repo to track all datasets and their versions. Project repos then use dvc import to “fetch” data into them. The shared cache helps avoid duplication.
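E.g., assuming a hypothetical registry repo at github.com/example-org/dataset-registry:

```
# in a project repo: pull one dataset out of the registry
$ dvc import https://github.com/example-org/dataset-registry \
      datasets/my-dataset -o data/my-dataset

# later, pick up a newer version published in the registry
$ dvc update data/my-dataset.dvc
```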
You could use DataChain to index that directory of files (from each project), extract metadata from the JSON files (to then filter by it, etc.), and then use the to_storage command to materialize the data with symlinks (again, to avoid copying).
If you are interested, we can jump on a call and discuss different options and I can show you some examples.
I will try the first option this morning; I hope it will work. I am not in the US, so it is very late here for a call; we could have one in ten hours or more, I don't know.