Thanks for the info! So it sounds like you don’t append to existing files, right? This is the most efficient way to version large datasets with DVC at the moment (although when chunking is implemented, it won’t matter as much). That’s because DVC has no awareness of data formats inside files, so changing even a single byte requires storing the whole file again as a separate version.
Were you able to check out the Data Versioning Tutorial? The Second model version section shows how to add files to tracked directories to create a new dataset version.
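In case a concrete sketch helps, the flow is roughly this (paths are just an example of your Client*/<Y-m> layout, assuming data/Client1 is already tracked):

```bash
# New files land inside a directory that's already tracked
cp ~/incoming/*.csv data/Client1/2021-03/

# Re-running dvc add on the same directory updates its .dvc file;
# only new or changed files are hashed and stored in the cache
dvc add data/Client1

# Commit the updated .dvc file to record the new dataset version, then upload
git add data/Client1.dvc
git commit -m "Client1: add 2021-03 batch"
dvc push
```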
It’s just important to think through which directories you consider dataset “units” and will be tracking with dvc add. For example, in your case you could pick between tracking each Client*/<Y-m> directory or going a level higher (just each Client*/ directory), as sketched below.
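Just to illustrate the two strategies (directory names are made up):

```bash
# Option A: fine-grained, one .dvc file per client per month
dvc add data/Client1/2021-01 data/Client1/2021-02 data/Client2/2021-01

# Option B: coarse-grained, one .dvc file per client
dvc add data/Client1 data/Client2
```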
Each tracked directory will produce a .dvc file, so each strategy leaves you with fewer or more .dvc files to manage (with Git). But keep in mind that the “sync” commands support targeting files/dirs inside tracked directories granularly (see e.g. dvc checkout), so in either case your ability to pull and push specific files is the same.
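For instance, even with the coarse-grained option you could still do something like this (hypothetical paths):

```bash
# Fetch and check out only one month of one client's data,
# even though the whole Client1 directory is what's tracked
dvc pull data/Client1/2021-02

# Or restore a single file from the local cache
dvc checkout data/Client1/2021-02/part-0001.csv
```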
For this pattern we have the Data Registries use case: you can have a DVC repo dedicated to versioning all your datasets, and then secondary DVC projects that dvc import the specific dirs or files they need (imports support that same granularity).
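A rough sketch of the consuming side (the registry URL and paths are made up):

```bash
# In a downstream project, pull in just the piece you need from the registry repo
dvc import https://github.com/your-org/dataset-registry data/Client1/2021-02

# Later, sync with whatever new version the registry has for that path
dvc update 2021-02.dvc
```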
Please also take a look at Pipelines as a way to start codifying your experiments in a manageable way.
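If it helps, a minimal stage definition could look something like this (DVC 2.x syntax; the script names and paths are placeholders):

```bash
# Define a stage that depends on the raw data and produces a processed dataset;
# this writes/updates dvc.yaml
dvc stage add -n preprocess \
  -d preprocess.py -d data/Client1 \
  -o data/prepared \
  python preprocess.py data/Client1 data/prepared

# Re-run only the stages whose dependencies actually changed
dvc repro
```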
Yep. What did you think of the “shared external cache” pattern? Does it help in your team’s org? In this case that setup would apply mainly to the secondary DVC projects consuming from the data registry (although even that repo could share it too).
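For reference, setting that up in each project is basically this (the cache path is just an example):

```bash
# Point this project's cache at a shared location (e.g. an NFS mount everyone can reach)
dvc cache dir /mnt/shared/dvc-cache

# Let everyone in the same Unix group write to the cache
dvc config cache.shared group

# Prefer links over copies so workspaces don't duplicate the cached data
dvc config cache.type "reflink,symlink,hardlink,copy"
```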