Best Practices: How to track data?

Hello everyone,

I wonder if there are any best practices for tracking data with DVC, because there are multiple ways to do it and I would like to know the advantages and disadvantages of each.

For example, consider a project set up as shown below, where the data directory contains either just a few really large files (>= 1 GB each) or a large number of smaller files such as images or MP3s.

Large files:

|--- data/
|  |--- # raw data files
|  |--- train.csv
|  |--- test.csv

Many files:

|--- data/
|  |--- img_01.jpg
|  |--- [...]
|  |--- img_10000000.jpg

I can add the whole folder with dvc add data/, which creates a single data.dvc file that tracks the entire directory under one checksum. Or I can add the files individually with dvc add data/*, which creates one .dvc file per file in the folder, and then I have to add every new file manually.
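To make the difference concrete, here is a rough sketch of the two approaches on the command line, using the example layout above (commit messages are just illustrative):

# Option 1: one .dvc file for the whole directory
dvc add data/
git add data.dvc .gitignore
git commit -m "Track data/ as a single DVC-tracked directory"

# Option 2: one .dvc file per file (the shell expands the glob),
# creating data/img_01.jpg.dvc, data/img_02.jpg.dvc, ...
dvc add data/*
git add data/*.dvc data/.gitignore
git commit -m "Track each file in data/ individually"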

Which way is recommended, and what are the benefits of each approach? I can imagine that tracking a lot of files with one .dvc per file can be a nightmare (tons of files to track with Git). I have also noticed that DVC's checksum calculation can take a while when you track a whole folder.

Regarding the checksum calculation: overall, no matter which approach you use, the time will be similar. My rule of thumb would be to keep the number of .dvc files low.

Let’s take the image example: if you ever want to reuse the dataset, it’s much easier to run a single dvc import {url} data_dir than thousands of dvc import commands, one per file.
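For example, with the single data.dvc approach, pulling the whole folder into another project is one command (the repository URL below is hypothetical):

# Import the DVC-tracked data/ directory from the original DVC+Git repo
dvc import https://github.com/your-org/dataset-repo data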

The approach really depends on your use case. Do you have something specific in mind?


Thanks for your answer. It was more of a general question; I don't have a specific use case in mind.