Data directory not tracked by git

landge · March 14, 2023, 9:35am

Hi,
I’m new to DVC and have just created my first DVC-controlled project, and it has been a great experience so far.
Following the documentation, I’ve added the directory that contains all my data with:

dvc add my_data_directory

DVC accordingly added my_data_directory to the top-level .gitignore file.

The problem now is that I am not able to check out my data from the remote storage on a different machine, since
the root data directory (my_data_directory) doesn’t exist in the git repository and is accordingly missing when I clone the repo.

Additionally, our data is organized in sub-folders, which I would like to track with git.

How could I achieve this? Does the .gitignore file need some adaptation?

The structure of my data directory is the following:

How could I tell git to track directory names, and dvc to track the containing files?

Thanks for your help

skshetry · March 14, 2023, 10:18am

Hi @landge. Each dataset has to be tracked by either DVC or Git.

Instead of tracking the whole data directory at the root, you can try selectively tracking some directories with DVC.

For example:

dvc add my_data_directory/subject_1_dir

“... and dvc to track the containing files?”

You cannot; you have to selectively track those datasets with DVC.

On your local machine, you can use dvc remove my_data_directory.dvc to stop tracking the directory with DVC, and then selectively dvc add the directories that you want tracked.
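For what it's worth, here is a rough sketch of why per-subdirectory tracking should fix the missing-directory problem. When a subdirectory is added on its own, DVC puts the .dvc file and a .gitignore entry inside my_data_directory, so git has files to track there and the directory shows up in a clone. The script below only simulates that in a scratch repository without calling DVC: the .gitignore and the .dvc placeholder are written by hand, and the file name scan.dat is made up.

```shell
# Simulation only: the .gitignore and .dvc file are written by hand to mimic
# what `dvc add my_data_directory/subject_1_dir` leaves behind.
demo=$(mktemp -d)                                   # hypothetical scratch repo
cd "$demo"
git init -q

mkdir -p my_data_directory/subject_1_dir
echo "raw data" > my_data_directory/subject_1_dir/scan.dat   # stands in for real data
echo "/subject_1_dir" > my_data_directory/.gitignore         # entry DVC would add
echo "outs:" > my_data_directory/subject_1_dir.dvc           # placeholder, not a real .dvc file

# Stage everything under the data directory; the ignored data is skipped.
git add my_data_directory

# Lists what git tracks: the .gitignore and the .dvc file, not scan.dat.
git ls-files
```

After cloning such a repository on another machine, my_data_directory exists because of those two small files, and dvc pull can then restore the data into it.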

landge · March 14, 2023, 11:59am

Thank you @skshetry !
I removed the root data directory from DVC as you suggested and
re-added its sub-directories using the --glob option, like this:

dvc add --glob 'my_data_directory/sub-*'

This works great.
I realize that this approach creates a lot of .dvc files (around 16,000 in my case).
Do you think this might lead to any kind of trouble down the road?

Thanks again.

skshetry · March 14, 2023, 12:17pm

Do you have 16,000 subdirectories?

It may be slow as dvc has to parse many .dvc files.

You can try separating the git-tracked datasets and the DVC-tracked datasets into two subdirectories and adding the DVC-tracked dataset as a whole.
This of course depends on whether or not those subdirectories are individual datasets, in which case it may make sense to add them separately.
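A minimal sketch of that layout, run in a scratch directory so nothing real is touched. The names git_tracked, dvc_tracked, README.txt and scan.dat are hypothetical, and the final dvc add is left as a comment because it only makes sense inside an actual DVC repository.

```shell
# Build a small stand-in for the data directory in a scratch location.
demo=$(mktemp -d)
cd "$demo"
mkdir -p my_data_directory/sub-01 my_data_directory/sub-02
echo "notes" > my_data_directory/README.txt        # small file meant for git
echo "raw" > my_data_directory/sub-01/scan.dat     # large data meant for DVC
echo "raw" > my_data_directory/sub-02/scan.dat

# Split the directory into a git-tracked part and a DVC-tracked part.
mkdir my_data_directory/git_tracked my_data_directory/dvc_tracked
mv my_data_directory/README.txt my_data_directory/git_tracked/
mv my_data_directory/sub-* my_data_directory/dvc_tracked/

# Show the resulting layout.
find my_data_directory -type f | sort

# In a real repository, one command would then track all the data:
#   dvc add my_data_directory/dvc_tracked
```

This leaves a single .dvc file for the whole dataset instead of one per subdirectory.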

landge · March 14, 2023, 12:39pm

No, sorry! I have more than 16,000 files but only 800 subdirectories.
So 800 .dvc files.

skshetry · March 14, 2023, 12:48pm

Performance will depend on what dvc operations you will use.

But 800 may still be too many. If you can, try to reduce them. Are those subdirectories individual datasets or parts of one dataset?

landge · March 14, 2023, 12:59pm

It’s the same dataset.
I’ll try to change the structure of my data directory.
Thank you.
