I’m new to DVC and have just created my first DVC-controlled project, and it has been a great experience until now.
Following the documentation, I’ve added the directory that contained all my data with
dvc add my_data_directory
DVC accordingly added my_data_directory to the top-level .gitignore file.
The problem now is that I am not able to check out my data from the remote storage on a different machine, since
the root data directory (my_data_directory) doesn’t exist in the Git repository and is accordingly missing when I clone the repo.
Additionally, our data is organized in sub-folders, which I would like to track with Git.
How could I achieve this? Does the .gitignore file need some adaptation?
The structure of my data directory is the following:
|--> subject_1_dir --> data
|--> subject_2_dir --> data
|--> subject_3_dir --> data
How could I tell Git to track the directory names, and DVC to track the files they contain?
Tracking every file or sub-folder with its own .dvc file may be slow, as DVC has to parse many .dvc files.
You can try separating Git-tracked datasets and DVC-tracked datasets into two subdirectories and adding the DVC-tracked dataset as a whole.
This of course depends on whether or not those subdirectories are individual datasets, in which case it may make sense to add them separately.
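If you do go with adding the subject folders separately, here is a rough sketch of how that could look, assuming the layout from the question (my_data_directory/<subject>/data). The helper name list_data_dirs is made up for this example and is not part of DVC; it only prints the folders you would then pass to dvc add. The DVC and Git commands are left as comments because they depend on your repository state.

```shell
#!/bin/sh
# Hypothetical helper (not a DVC command): print the "data" folder of every
# subject directory below the given root, one path per line.
list_data_dirs() {
    root="$1"
    for subject in "$root"/*/; do
        # Only report subjects that actually contain a data folder.
        if [ -d "${subject}data" ]; then
            printf '%s\n' "${subject}data"
        fi
    done
    return 0
}

# Possible usage from the repository root. This assumes the directory was
# previously added as a whole, so that tracking is dropped first:
#   dvc remove my_data_directory.dvc
#   list_data_dirs my_data_directory | while IFS= read -r d; do dvc add "$d"; done
#   git add my_data_directory
#   git commit -m "Track each subject's data folder with DVC"
```

With this, dvc add writes a data.dvc file and a .gitignore entry inside each subject folder, so Git tracks the folder names through those small files while DVC tracks the data itself. After cloning on another machine, dvc pull should then recreate the data folders from the remote storage.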