I have a Git/DVC repository that contains multiple commits and versions of datasets. Each dataset version is stored as a single binary file containing numerous records. My goal is to remove a specific subset of records from each version in the repository’s history. Essentially, I need to modify all dataset files throughout the repository’s history (rather than deleting the files from history as discussed in other threads using methods like dvc gc ...).
One possible straightforward solution is to crawl the DVC remote (S3), identify each data file, load it, remove the desired records, and save it back to the remote. However, I want to ensure that this approach won’t cause any issues in the future, so I would appreciate any input or advice.
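To make the idea concrete, here is a minimal sketch of the per-file rewrite step. It assumes (hypothetically) that records are newline-delimited within the binary blob; the split/join would have to be adapted to the real record format, and for an S3 remote the file would be downloaded and re-uploaded (e.g. with boto3) instead of edited on disk. All names here are made up for illustration:

```python
import hashlib
from pathlib import Path


def scrub_records(data: bytes, is_bad) -> bytes:
    """Remove unwanted records from one dataset blob.

    Assumes newline-delimited records; adapt the split/join
    to the actual binary format.
    """
    kept = [rec for rec in data.split(b"\n") if not is_bad(rec)]
    return b"\n".join(kept)


def rewrite_file(path: Path, is_bad) -> tuple[str, int]:
    """Overwrite `path` in place; return the new (md5, size).

    For an S3 remote you would download the object, scrub it,
    and upload it back under the same key instead.
    """
    cleaned = scrub_records(path.read_bytes(), is_bad)
    path.write_bytes(cleaned)
    return hashlib.md5(cleaned).hexdigest(), len(cleaned)
```

Note that the function returns the new MD5 and size, which is exactly the metadata that will no longer match what the `.dvc`/`dvc.lock` files recorded.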
Here are some thoughts and tests regarding this potential solution:
The main concern is that the modified files will be inconsistent with the checksums stored in the “*.dvc” file. As far as I know, there are four points where checksums play a role:
dvc pull: Initially, I expected this to be a major issue when pulling the modified files from the remote. However, it seems that by default, checksum validation is not performed during the pull process, as per my understanding of this issue. Also, in tests where we modified remote files and then pulled them, no errors were encountered.
dvc repro: After pulling the modified files, I would expect dvc repro to re-execute all stages that take the modified file as input, since its content is inconsistent with the previously stored MD5 checksum; we plan to test this next. Surprisingly, running dvc status after pulling the modified files reported no changes. If dvc repro behaves similarly, then perhaps the comparison (if any) is between the MD5 of the file in the cache and the MD5 of the file in the workspace, rather than between the workspace file's MD5 and the MD5 hash recorded in the past. In summary, it appears that this approach might work.
dvc push: This step should not pose any problems.
dvc add: Similarly, this step should not cause any issues.
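For context on why overwritten remote files can go unnoticed at pull time: DVC addresses remote objects by their MD5, storing each file under the first two hex characters of the hash and the remainder as the file name (DVC 3.x nests this under an additional files/md5/ prefix). Overwriting content in place leaves the path, and hence the lookup, unchanged. A small illustration (function name is mine):

```python
import hashlib


def remote_path_for(data: bytes) -> str:
    """Derive the content-addressed path DVC uses for a blob.

    Classic (2.x) layout: first two hex chars of the MD5 as a
    directory, the remaining 30 as the file name, e.g.
    5d/41402abc4b2a76b9719d911017c592.
    """
    digest = hashlib.md5(data).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"
```

This is also why modifying a file in place breaks the invariant the layout relies on: the path still encodes the *old* MD5, while the content now hashes to a new one.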
It’s worth noting that initially, we considered using dvc/git for this process, involving checking out each commit, loading the files, modifying them, and using “git-filter-branch.” However, we decided against this approach due to its complexity and the risk of making mistakes.
You are right, wrong checksums may cause issues once you work with old commits. DVC may not spot it right away at pull, since the file path in the remote won't change, but when working with those commits DVC will recalculate hashes, which can cause problems.
I wouldn’t suggest tinkering with the files in the remote unless there are no other options. The exact way files in the remote are processed may change from one version to another, so even if this works now, it’s not guaranteed to keep working in future DVC versions.
Thanks for your reply! I can see that there’s a substantial risk even if the modification works with the current DVC version, and even that is difficult to test fully.
We are now looking into using git-filter-repo (it seems much more robust than git-filter-branch). The approach we have in mind has three steps. We would like to build a relatively general tool, since this problem may occur again in the future.
1. Based on a list of target file paths, traverse the repository’s history, inspect the dvc.lock files, and extract all MD5s and sizes for each file in the list.
2. Go through the remote storage, find the files from the list, load them, modify as needed, overwrite the original files, compute the new MD5 and size, and retain that information. We’ll also need to rename the folders and files according to the new MD5, I suppose.
3. Using git-filter-repo, find the old MD5 strings in the dvc.lock files and replace each MD5 and size with the new ones. It’s not yet clear whether this list-based replacement is easily done with git-filter-repo, especially for the sizes (for the hashes a plain string replacement would suffice, but since sizes might not be unique we’ll have to actively navigate to the YAML key that holds the MD5).
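The dvc.lock replacement step could sidestep the non-unique-size problem by anchoring each size to the md5 line directly above it, rather than replacing sizes on their own. Below is a sketch of such a targeted text rewrite (the function and the assumed `md5:` line followed by `size:` layout are my assumptions; this would run inside a git-filter-repo blob callback in practice):

```python
import re


def patch_lock_text(text: str, mapping: dict) -> str:
    """Replace each old MD5, and the size on the line right
    after it, with the new values in dvc.lock content.

    `mapping` maps old md5 -> (new_md5, new_size). Anchoring
    the size to its md5 line avoids clobbering an equal but
    unrelated size elsewhere in the file.
    """
    pattern = re.compile(
        r"(?P<pre>md5:\s*)(?P<md5>"
        + "|".join(map(re.escape, mapping))
        + r")(?P<mid>\s*\n\s*size:\s*)\d+"
    )

    def repl(m: re.Match) -> str:
        new_md5, new_size = mapping[m.group("md5")]
        return f"{m.group('pre')}{new_md5}{m.group('mid')}{new_size}"

    return pattern.sub(repl, text)
```

A pure string/regex rewrite like this keeps YAML formatting untouched, which matters because dvc.lock files are also diffed and hashed by Git; round-tripping through a YAML parser could reorder keys or change quoting.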
As you can see, it’s pretty complex. Of course, we will try it on a clone/copy of the repository and backend storage first.
Thanks again! Further comments, thoughts, warnings strongly appreciated!
PS: I wondered why this scenario is not covered by DVC, as opposed to removing complete files, which seems much easier using gc. Maybe it’s too complex? Maybe storing monolithic files rather than individual records is bad practice in the first place? Then again, storing each record individually would not be an option for us, since it would amount to a huge number of files, and that’s probably a common situation.