Hi, I’m trying to understand whether DVC is a good solution for our company use-case. We currently have several tens of TB of data, and we are constantly adding to it every week. I would like to add versioning to this, so that our scientists can run experiments on various subsets and track all changes.
All the data is stored locally on a network drive accessible from the scientists’ computers. My question is whether DVC can be used this way for this use-case, and how to get around copying the data multiple times (e.g. if 5 scientists access 40TB worth of data, it shouldn’t be copied into each of their versioned repos).
Can you give a bit more detail on what your data storage looks like? Is it a single dataset or several? Are you adding new files or appending to existing files? The best implementation of data versioning with DVC depends on those factors, but in general, yes: this is something we aim to solve.
For each project we have several experiments that can use subsets of these datasets for debugging the feature extraction stages.
At any point we can add data to the existing datasets / projects.
There can be 2-3 people working on one project at once, with various subsets of the datasets.
I think this is a very common use-case in the industry. Right now we are organizing these with simple folders, plus files that list the current dataset each experiment is using. This is getting out of hand as we add more and more data, and it’s very easy to miss something.
The problem is that although there are many tutorials on DVC, there isn’t one that goes over more complex on-prem setups with ‘big’ data such as this.
Thanks for the info! So it sounds like you don’t append to existing files, right? This is the most efficient way to version large datasets with DVC at the moment (although when chunking is implemented, it won’t matter as much). That’s because DVC has no awareness of data formats inside files, so changing even a single byte requires storing the whole file again as a separate version.
Were you able to check out the Data Versioning Tutorial? The Second model version section shows how to add files to tracked directories to create a new dataset version.
It’s just important to think through which directories you consider dataset “units” and will be tracking with dvc add. For example, in your case you could track each Client*/<Y-m> directory, or go one level higher (just each Client*/).
Each tracked directory will produce a .dvc file, so each strategy results in fewer or more .dvc files to manage (with Git). But keep in mind that the “sync” commands support granular targeting of files/dirs inside tracked directories (see, for example, dvc checkout), so in either case your ability to pull and push specific files is the same.
For this pattern we have the Data Registries use case: you can have a DVC repo dedicated to versioning all your datasets, and then secondary DVC projects that get or import the specific dirs or files you need (this also supports granular targeting).
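As a rough sketch of how that looks (the repo URL and paths here are hypothetical): a consuming project runs `dvc import <registry-url> <path>`, and DVC records the exact source revision in a .dvc file, something like:

```yaml
# produced by: dvc import git@example.com:org/data-registry.git datasets/Client1
# (URL and paths are made up for illustration)
md5: ...
frozen: true
deps:
- path: datasets/Client1
  repo:
    url: git@example.com:org/data-registry.git
    rev_lock: 6c2f5afd...   # the registry commit this import is pinned to
outs:
- md5: ...
  path: Client1
```

Because the registry commit is pinned in `rev_lock`, each experiment repo stays reproducible even as the registry keeps growing; `dvc update` bumps the pin when you want newer data.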
Please also take a look at Pipelines as a way to start codifying your experiments in a manageable way.
Yep. What did you think of the “shared external cache” pattern? Would it work for your team’s setup? In this case that setup would apply mainly to the secondary DVC projects consuming from the data registry (although even that repo could share it too).
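For reference, a minimal shared-cache setup is just pointing every project’s cache at the same directory on the network drive and choosing a link type (the mount path below is hypothetical):

```ini
; .dvc/config in each project
[cache]
    dir = /mnt/nas/dvc-cache
    type = "hardlink,symlink"
    shared = group
```

With `hardlink`/`symlink` link types, `dvc checkout` links workspace files to the single cached copy instead of copying them, which is exactly the “5 scientists, one 40TB copy” scenario you described.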
@cotton84 we have essentially an append-only immutable pool of many, many files, and each data scientist could pick any subset of those files to do some experimentation with, right?
In this case I usually don’t recommend using the current version of DVC at all. You are totally fine versioning a “filter” file - a file that lists the specific files being used in a specific commit and project. That should be enough in terms of reproducibility and versioning, as long as data in the storage is immutable and only new files are being added.
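To make the “filter” file idea concrete (all names and paths below are hypothetical, not part of DVC): keep a plain-text manifest of the pool files an experiment uses, commit it to Git alongside the code, and resolve it against the immutable pool at training time:

```python
from pathlib import Path

# Hypothetical location of the immutable, append-only data pool
POOL = Path("/mnt/nas/pool")

def write_filter(manifest: Path, relative_paths):
    """Record the exact subset of pool files this experiment uses.

    The manifest is a sorted, newline-separated list of paths relative
    to the pool root, so Git diffs between commits stay readable.
    """
    manifest.write_text("\n".join(sorted(relative_paths)) + "\n")

def read_filter(manifest: Path, pool: Path = POOL):
    """Resolve a committed manifest back to absolute file paths."""
    return [pool / line for line in manifest.read_text().splitlines() if line]
```

Since the pool never mutates existing files, the Git history of the manifest alone is enough to reproduce exactly which data any past experiment saw.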
Now, you have another requirement though - avoiding copying the data multiple times! This is exactly what the shared cache / shared dev server DVC use case is for. It’s very powerful and solves the problem. The question here is how to solve both simultaneously: flexibility (giving data scientists an interface to checkout (in DVC terms) an arbitrary subset of files from the pool) while avoiding copies.
To come up with a workaround, it would be helpful to understand how you create the file mentioned here:
with files that list the current dataset that each experiment is using
How do data scientists select the files they need from the pool of all those directories?
Thank you both for the quick replies! I was able to go through the tutorial and set up an external cache with a small-ish dataset yesterday, so everything is a bit more clear.
To answer your questions:
Yes, the files are generally immutable; we only add more files, or, if necessary, cull some (e.g. corrupted files or very old ones).
We would like fine granularity when working with datasets (individual files), but you are saying it doesn’t matter whether I add specific files (or recurse through directories) or the directories themselves - they can be accessed individually either way, right?
With regards to external caches, I have a question: right now, after I push the commit, the data gets copied to the external cache. Does DVC automatically replace the files in the repository when it is set to hard link them, so that there is only one copy of each file - the one in the cache?
The scientists have the big pool of files in the dataset. Depending on their experiment needs, they can select subsets, e.g. files that were recorded in one specific location, or within a specific timeframe. The files are not copied, just the lists are used to access them during training. To filter the files they use labels and metadata that are in separate ground-truth files, which brings me to another question:
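To sketch that selection step (the field names and the line-per-record JSON format here are my assumptions, not your actual ground-truth format), filtering the pool by metadata could look like:

```python
import json
from pathlib import Path

def select_files(ground_truth: Path, location=None, start=None, end=None):
    """Pick dataset files whose metadata matches the experiment's criteria.

    Assumes one JSON record per line with hypothetical 'file',
    'location' and 'recorded' (ISO date string) fields.
    """
    selected = []
    for line in ground_truth.read_text().splitlines():
        rec = json.loads(line)
        if location and rec["location"] != location:
            continue
        if start and rec["recorded"] < start:  # ISO dates compare lexically
            continue
        if end and rec["recorded"] > end:
            continue
        selected.append(rec["file"])
    return sorted(selected)
```

The output of such a query is exactly the kind of list that can be committed as the “filter” file discussed above, tying each experiment to a reproducible, named subset.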
Is there a recommended way to handle label files? These could change over time, adding or modifying metadata. Right now they are stored in a separate repository, and each one points to a particular dataset file (they share the same name). Would it be a good idea to let DVC track these as well? Or is there a preferred tool (e.g. a database) that can better manage them?
This is a good point: the proposed file structure already implements basic versioning by file name (the Client*/<Y-m> format). I think this goes back to asking ourselves what we want to consider “data versioning” (more info: Data on the Web Best Practices).
But, regardless, the patterns previously discussed could still make it desirable to track those assets with DVC. It’s up to you.
They can be in the same repository, and you can version those directly with Git along with any other code or config files.
But it’s not clear to me that you’d still need some of those files (e.g. the lists that define a data subset) if/after adopting DVC. For example, dvc.yaml files already codify the dependencies for each stage in a data pipeline.
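For example (the stage name, script, and paths below are hypothetical), a dvc.yaml stage already pins exactly which inputs an experiment depends on, which can make a separate subset-list file redundant:

```yaml
stages:
  featurize:
    cmd: python featurize.py data/Client1 --out features/
    deps:
      - featurize.py
      - data/Client1      # the exact tracked subset this experiment uses
    outs:
      - features/
```

When either the code or the tracked data directory changes, `dvc repro` detects it and re-runs the stage, so the pipeline file doubles as the record of what data each experiment used.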