Hello everyone,
I wonder if there are best practices for tracking data with DVC, because there are multiple ways to do it and I would like to know the advantages and disadvantages of each.
For example, consider a project set up as follows, where the data directory contains either just a few very large files (>= 1 GB each) or a large number of smaller files such as images or MP3s.
Large files:
|--- data/
| |--- # raw data files
| |--- train.csv
| |--- test.csv
Many files:
|--- data/
| |--- img_01.jpg
| |--- [...]
| |--- img_10000000.jpg
I can add the whole folder to DVC by running dvc add data/, which would create one data.dvc file that tracks all files under a single checksum. Or I can add the files with dvc add data/*, which would create one .dvc file per file in the folder, and then I would need to add every new file manually.
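To make the difference between the two options concrete, here is a minimal Python sketch of "one checksum per file" versus "one checksum for the whole folder". It is only an illustration under my own assumptions, not DVC's actual implementation: the function names file_md5, per_file_checksums and directory_checksum are made up for this example, and hashing a JSON manifest of the per-file hashes is just one plausible way to get a folder-level checksum.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash one file in chunks so large files are never loaded fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def per_file_checksums(root: Path) -> dict[str, str]:
    """One checksum per file (comparable to having one .dvc file per data file)."""
    return {
        path.relative_to(root).as_posix(): file_md5(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def directory_checksum(root: Path) -> str:
    """One checksum for the whole folder (comparable to a single data.dvc).

    Assumption for this sketch: the folder checksum is the MD5 of a sorted
    JSON manifest that maps relative paths to per-file checksums.
    """
    manifest = json.dumps(per_file_checksums(root), sort_keys=True)
    return hashlib.md5(manifest.encode("utf-8")).hexdigest()
```

In this sketch, adding a single new file changes the folder-level checksum but leaves the per-file checksums of the existing files untouched. It also shows why a folder-level checksum can be slow to compute: every file has to be hashed before the single checksum is known, unless the per-file hashes are cached.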
Which way is recommended, and what are the benefits of each approach? I can imagine that tracking a lot of files with one .dvc file per data file can become a nightmare (having tons of files to track with Git). I have also noticed that DVC's checksum calculation can take a while if you track a whole folder.