Data directory not tracked by git

Hi,
I’m new to DVC and have just created my first DVC-controlled project. It has been a great experience so far.
Following the documentation, I’ve added the directory that contains all my data with:

dvc add my_data_directory

DVC then added my_data_directory to the top-level .gitignore file.

The problem now is that I am not able to check out the remote storage of my data on a different machine, since
the root data directory (my_data_directory) doesn’t exist in the Git repository and is therefore missing when I clone the repo.

Additionally, our data is organized in sub-folders, which I would like to track with Git.

How could I achieve this? Does the .gitignore file need to be adapted?

The structure of my data directory is the following:

my_data_directory
|
|–> subject_1_dir —> data
|–> subject_2_dir —> data
|–> subject_3_dir —> data

How could I tell git to track directory names, and dvc to track the containing files?

Thanks for your help

Hi @landge. A dataset has to be tracked either by DVC or by Git, not by both.

Instead of tracking the whole data directory at the root, you can try selectively tracking some directories with DVC.

E.g.:

dvc add my_data_directory/subject_1_dir

"...and dvc to track the containing files?"

You cannot; you have to selectively track those datasets with DVC.

On your local machine, you can run dvc remove my_data_directory.dvc to stop tracking the whole directory with DVC, and then selectively dvc add the directories that you want tracked.
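
The remove-then-add steps above could be scripted roughly as follows. This is only a sketch: retrack_subdirs is a hypothetical helper name, it assumes DVC is installed and that it is called from the directory holding the .dvc file, and it passes the .dvc file to dvc remove because that command takes .dvc files (or stage names) as targets. A loop is used for illustration; it tracks every immediate subdirectory, so adjust the pattern if only some should be tracked.

```shell
# retrack_subdirs DATA_DIR
# Hypothetical helper: replace one directory-level .dvc file with one
# .dvc file per immediate subdirectory of DATA_DIR.
retrack_subdirs() {
    data_dir=${1%/}   # drop a trailing slash, if any

    # Stop tracking the directory as a whole. Without --outs, dvc remove
    # deletes only the .dvc file and the .gitignore entry, not the data.
    dvc remove "$data_dir.dvc" || return 1

    # Track each immediate subdirectory on its own. Each call writes
    # <subdir>.dvc next to the subdirectory and git-ignores the subdirectory.
    for sub in "$data_dir"/*/; do
        [ -d "$sub" ] || continue   # no subdirectories: pattern stayed literal
        dvc add "${sub%/}" || return 1
    done
}

# Usage, from the repository root:
#   retrack_subdirs my_data_directory
# Then review `git status` and commit the new .dvc and .gitignore files.
```

The data files themselves stay in the workspace throughout; only the small .dvc files and the .gitignore entries change, and those are what Git needs to commit afterwards.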

Thank you @skshetry !
I removed the root data directory from DVC as you suggested and
added its sub-directories again with the --glob option, like this:

dvc add --glob 'my_data_directory/sub-*'

This works great.
I realize that this approach creates a lot of .dvc files (around 16,000 in my case).
Do you think this might lead to any kind of trouble down the road?

Thanks again.

Do you have 16,000 subdirectories?

It may be slow as dvc has to parse many .dvc files.

You can try separating Git-tracked datasets and DVC-tracked datasets into two subdirectories and adding the DVC-tracked dataset as a whole.
This of course depends on whether or not those subdirectories are individual datasets, in which case it may make sense to add them separately.

No, sorry! I have more than 16,000 files but only 800 subdirectories…
So 800 .dvc files.
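
To check how many .dvc files a repository actually contains, something like the following may help. It is a minimal sketch using only find and wc; count_dvc_files is a hypothetical helper name, not a DVC command.

```shell
# count_dvc_files [ROOT]
# Hypothetical helper: print the number of regular files named *.dvc
# under ROOT (default: the current directory). The .dvc/ directory that
# DVC itself creates is not counted, because -type f skips directories.
count_dvc_files() {
    find "${1:-.}" -type f -name '*.dvc' | wc -l | tr -d ' '
}

# Usage, from the repository root:
#   count_dvc_files
#   count_dvc_files my_data_directory
```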

Performance will depend on what dvc operations you will use.

But 800 may still be too many. If you can, try to reduce them. Are those subdirectories individual datasets or parts of one dataset?

It’s the same dataset.
I’ll try to change the structure of my data directory.
Thank you.