Best Practices: How to track data?

Hello everyone,

I wonder if there are any best practices for tracking data with DVC, because there are multiple ways to do it and I would like to know the advantages and disadvantages of each.

For example, consider a project set up as shown below, where the data directory contains either just a few really large files (>= 1 GB each) or a large number of smaller files such as images or MP3s.

Large files:

|--- data/
|  |--- # raw data files
|  |--- train.csv
|  |--- test.csv

Many files:

|--- data/
|  |--- img_01.jpg
|  |--- [...]
|  |--- img_10000000.jpg

I can add the whole folder with dvc add data/, which creates a single data.dvc file that tracks the entire directory under one checksum. Or I can add the files individually with dvc add data/*, which creates one .dvc file per file in the folder, and then I have to add every new file manually.
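To make the difference concrete, here is a rough sketch of the two approaches on the command line, using the example layout above (commit messages are just illustrative):

# Option 1: one .dvc file for the whole directory
dvc add data/
git add data.dvc .gitignore
git commit -m "Track data/ as a single DVC-tracked directory"

# Option 2: one .dvc file per file (the shell expands the glob),
# creating data/img_01.jpg.dvc, data/img_02.jpg.dvc, ...
dvc add data/*
git add data/*.dvc data/.gitignore
git commit -m "Track each file in data/ individually"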

Which way is recommended, and what are the benefits of each approach? I can imagine that tracking a lot of files with one .dvc per file can be a nightmare (tons of files to track with Git). I have also noticed that DVC's checksum calculation can take a while when you track a whole folder.

Regarding the checksum calculation: overall, no matter which approach you use, the time will be similar. My rule of thumb would be to keep the number of .dvc files low.

Let’s take the image example: if you ever want to reuse the dataset, it’s much easier to run a single dvc import {url} data_dir than thousands of dvc import commands, one per file.
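For example, with the single data.dvc approach, pulling the whole folder into another project is one command (the repository URL below is hypothetical):

# Import the DVC-tracked data/ directory from the original DVC+Git repo
dvc import https://github.com/your-org/dataset-repo data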

The approach really depends on your use case. Do you have something specific in mind?


Thanks for your answer. It was more of a general question; I don't have a specific use case in mind.