Best Practices: How to track data?

Hello everyone,

I wonder whether there are best practices for tracking data with DVC. There are multiple ways to do it, and I'd like to know the advantages and disadvantages of each.

For example, consider a project set up like one of the following: either just a few files in a directory, but really large ones ( >= 1 GB ), or a lot of smaller files such as images or MP3s.

Large files:

|--- data/
|  |--- # raw data files
|  |--- train.csv
|  |--- test.csv

Many files:

|--- data/
|  |--- img_01.jpg
|  |--- [...]
|  |--- img_10000000.jpg

I can add the folder to DVC with dvc add data/, which creates a single data.dvc that tracks all files under one checksum. Or I can add the files with dvc add data/*, which creates one .dvc file per file in the folder, and then I need to add every new file manually.
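The two options can be sketched like this (the dvc commands are commented out so the snippet runs outside a DVC repo; the file names are placeholders standing in for the real dataset):

```shell
set -e

# Placeholder files standing in for the images from the example above.
mkdir -p data
for i in 01 02 03; do
  touch "data/img_$i.jpg"
done

# Option A: track the directory as a single unit -> one data.dvc for everything.
# New files under data/ are picked up the next time the command is re-run.
#dvc add data

# Option B: track each file individually -> one .dvc file per image,
# and every new file has to be added by hand.
#dvc add data/*

ls data
```

With option A, Git only ever sees one data.dvc; with option B it sees one .dvc file per image.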

Which way is recommended, and what are the benefits of each approach? I can imagine that tracking a lot of files with one .dvc per file can become a nightmare (tons of files to track with Git). I have also noticed that DVC's checksum calculation can take a while when you track a whole folder.

Regarding the checksum calculation: overall, no matter which approach you use, the time will be similar, since the same bytes have to be hashed either way. I would say the rule of thumb is to keep the number of .dvc files low.

Let’s take the image example: if you ever want to reuse the dataset, it’s much easier to run a single dvc import {url} data_dir than thousands of dvc import calls, one per file.
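To make the difference concrete, here is a sketch that just prints the commands you would end up running in each case ({url} stays a placeholder for the dataset repository, and the three file names are illustrative):

```shell
# Directory-tracked dataset: reuse is a single command.
cmds='dvc import {url} data_dir'

# File-tracked dataset: reuse means one import per file.
for f in img_01.jpg img_02.jpg img_03.jpg; do
  cmds="$cmds
dvc import {url} data_dir/$f"
done

printf '%s\n' "$cmds"
```

With a directory of ten million images, the second list becomes ten million commands, which is the nightmare scenario from the question.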

The approach really depends on your use case. Do you have something specific in mind?

Thanks for your answer. It was more of a general question; I don't have a specific use case in mind.