Hi,
I’m new to DVC and have just created my first DVC-controlled project; it has been a great experience so far.
Following the documentation, I added the directory that contains all my data with
dvc add my_data_directory
DVC then added my_data_directory to the top-level .gitignore file.
The problem now is that I cannot check out my data from the remote storage on a different machine, because
the root data directory (my_data_directory) doesn’t exist in the Git repository and is therefore missing when I clone the repo.
Additionally, our data is organized in sub-folders, which I would like to track with Git.
How can I achieve this? Does the .gitignore file need to be adapted?
The structure of my data directory is the following:
my_data_directory
|
|--> subject_1_dir --> data
|--> subject_2_dir --> data
|--> subject_3_dir --> data
How can I tell Git to track the directory names and DVC to track the files they contain?
Thank you @skshetry !
I removed the root data directory as you suggested and
re-added its sub-directories individually using the --glob option, like this:
dvc add --glob 'my_data_directory/sub-*'
This works great.
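For anyone following along, the steps above could be sketched roughly as below. This is only a sketch under assumptions: "removed" is read here as `dvc remove` on the old pointer file, the commands run from the project root, and `retrack_by_subdir` is a made-up helper name, not a DVC command.

```shell
# Sketch only: assumes DVC and Git are installed and the current directory
# is the project root. retrack_by_subdir is a hypothetical helper name.
retrack_by_subdir() {
    data_dir=$1   # e.g. my_data_directory
    pattern=$2    # e.g. 'sub-*' (quote it so DVC, not the shell, expands it)

    # Stop tracking the directory as a single unit. This deletes the .dvc
    # file and the matching .gitignore entry; the data stays on disk.
    dvc remove "$data_dir.dvc"

    # Track every matching sub-directory on its own. DVC writes one .dvc
    # file per match, plus a .gitignore inside "$data_dir".
    dvc add --glob "$data_dir/$pattern"

    # Stage the new pointer files so the directory itself exists in Git.
    git add "$data_dir"

    # Review before committing: the root .gitignore change and the removed
    # top-level .dvc file still need to be staged as well.
    git status --short
}

# Usage: retrack_by_subdir my_data_directory 'sub-*'
```

After committing and running `dvc push`, a fresh `git clone` on the other machine should contain my_data_directory with its .dvc files, and `dvc pull` should then fetch the data (assuming a remote is configured).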
I realize that this approach creates a lot of .dvc files (around 16,000 in my case).
Do you think this might lead to any trouble down the road?
It may be slow, as DVC has to parse many .dvc files.
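A rough way to see how many pointer files DVC would have to read is to count them. The helper name `count_dvc_files` below is made up for illustration; it only uses `find`, so it does not depend on DVC itself.

```shell
# Hypothetical helper: count .dvc pointer files below a directory
# (defaults to the current directory).
count_dvc_files() {
    # -type f skips directories such as the internal .dvc/ folder;
    # tr strips the padding some wc implementations add.
    find "${1:-.}" -type f -name '*.dvc' | wc -l | tr -d ' '
}

# Usage: count_dvc_files my_data_directory
```

Timing a command such as `dvc status` before and after restructuring would give a more direct measure of whether the count matters in practice.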
You can try separating the Git-tracked and DVC-tracked datasets into two subdirectories and adding the DVC-tracked dataset as a whole.
This of course depends on whether those subdirectories are individual datasets, in which case it may make sense to add them separately.
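As a minimal sketch of the two-subdirectory idea: the directory names `git_tracked` and `dvc_tracked` and the helper `add_dvc_part` are placeholders invented here, and both directories are assumed to exist already with files in them.

```shell
# Hypothetical layout (directory names are placeholders):
#   my_data_directory/git_tracked/   small files committed to Git directly
#   my_data_directory/dvc_tracked/   large data, added to DVC as one unit
add_dvc_part() {
    data_dir=$1   # e.g. my_data_directory

    # A single `dvc add` on the whole directory produces one pointer file,
    # "$data_dir/dvc_tracked.dvc", instead of one .dvc file per sub-directory.
    dvc add "$data_dir/dvc_tracked"

    # Git gets the small files, the pointer file, and the .gitignore
    # that DVC wrote next to the pointer file.
    git add "$data_dir/git_tracked" "$data_dir/dvc_tracked.dvc" "$data_dir/.gitignore"
}

# Usage: add_dvc_part my_data_directory
```

The trade-off is granularity: with one pointer file the dataset is versioned as a whole, whereas per-directory .dvc files let each sub-directory be versioned independently.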