I wonder whether there are best practices for tracking data with DVC. There are multiple ways to do it, and I would like to know the advantages and disadvantages of each.
For example, consider the two project setups below: one where the data directory holds just a few files that are very large (>= 1 GB each), and one where it holds a lot of smaller files such as images or MP3s.
    |--- data/
    |    |--- # raw data files
    |    |--- train.csv
    |    |--- test.csv

    |--- data/
    |    |--- img_01.jpg
    |    |--- [...]
    |    |--- img_10000000.jpg
I can add the folder to DVC with dvc add data/, which would create one data.dvc file that tracks all files under a single checksum. Or I can add the files with dvc add data/*, which would create one .dvc file per file in the folder, and then I would need to add every new file manually.
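To make the difference concrete, here is a rough sketch in Python that counts how many .dvc files each approach would leave in the repository. It assumes what I described above (one .dvc file per target passed to dvc add) and imitates the shell's expansion of data/* with glob. The name count_dvc_files is made up for illustration, and nothing here calls DVC itself.

```python
import glob
import os
import tempfile


def count_dvc_files(data_dir):
    """Return (per_directory, per_file) counts of expected .dvc files.

    Assumption: `dvc add` writes one .dvc file per target it is given.
    """
    # `dvc add data/` passes a single target, so a single .dvc file.
    per_directory = 1
    # `dvc add data/*` lets the shell expand the glob into one target per
    # top-level entry; like the shell, glob skips hidden (dot) files.
    pattern = os.path.join(glob.escape(data_dir), "*")
    per_file = len(glob.glob(pattern))
    return per_directory, per_file


if __name__ == "__main__":
    # Build a throwaway folder with 1000 empty placeholder "images".
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(1000):
            open(os.path.join(tmp, f"img_{i:04d}.jpg"), "w").close()
        per_directory, per_file = count_dvc_files(tmp)
        print(f"dvc add data/  -> {per_directory} .dvc file")
        print(f"dvc add data/* -> {per_file} .dvc files")
```

With the second layout above, the per-file count grows with every image, and each of those files would also have to be committed to Git.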
Which way is recommended, and what are the benefits of the different approaches? I can imagine that tracking a lot of files with one .dvc file each could become a nightmare (tons of files to track with Git). I have also found that DVC's checksum calculation can take a while when you track a whole folder.
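To get a feeling for the checksum cost, I put together a small Python sketch that hashes every file under a folder and times it. As far as I understand, DVC uses MD5 by default, but this is my own approximation and not DVC's actual implementation (it does no caching, for instance). The helper names md5_of_file and hash_tree are made up, and the data path is just the folder from my example.

```python
import hashlib
import os
import time


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Return {relative_path: md5_hex} for every file below root."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            hashes[os.path.relpath(full_path, root)] = md5_of_file(full_path)
    return hashes


if __name__ == "__main__":
    # "data" is the example folder from the question; adjust as needed.
    start = time.perf_counter()
    hashes = hash_tree("data")
    elapsed = time.perf_counter() - start
    print(f"hashed {len(hashes)} files in {elapsed:.2f}s")
```

Running this against both layouts should show whether the time is dominated by the total number of bytes (few large files) or by per-file overhead (millions of small files).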