Deleting a DVC-tracked file from history

I have a dataset tracked by DVC which contains personal information about a group of people. Keeping privacy and security in mind, if someone asks to have their data deleted from the dataset, how do I remove it from the dataset's entire history? Just deleting the file and adding another DVC commit obviously won’t do, because the data is still easily recoverable from the history.

I can see that the dvc gc command does something along these lines, but it’s not clear to me from the documentation whether you can specify the file that you want deleted. It looks like it only deletes unused files in the history by default.

Is there any command that allows the deletion of targeted files from the entire DVC history?

@ananda
Hello!
So the answer depends on how you structure your project. If you only need to make sure the data related to the latest commit is stored, the solution is simple:

  1. Delete the user's file.

  2. Commit the change to DVC and Git.

  3. (I would recommend creating a backup of the cache here; gc needs to be used with caution.)

  4. Run dvc gc --workspace. DVC will remove all the data from the project history that is not in the current workspace, including this particular user's file. Please note: if you are using a single cache/remote for different git repositories, data from the other repositories will be deleted too!

  5. Check that the current version of the data does not contain the data in question (e.g. with dvc checkout) and delete the backup (if you made one in step 3).
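As a rough sketch, the five steps could be wrapped in a small shell function like the one below. The function name and the example paths (data/person_123.csv inside a DVC-tracked data directory) are hypothetical. It also assumes the cache is in its default location, .dvc/cache, and that the directory is tracked through a data.dvc file created by dvc add. Adapt it to your own layout before running anything.

```shell
# Sketch only: names and paths are hypothetical; the cache is assumed to
# live in the default location (.dvc/cache).
remove_person_file() {
    target="$1"        # file to erase, e.g. data/person_123.csv
    tracked_dir="$2"   # DVC-tracked directory that contains it, e.g. data
    backup_dir="$3"    # where to keep a temporary copy of the cache

    # 1. Delete the user's file from the workspace.
    rm -f -- "$target" || return 1

    # 2. Record the change in DVC and Git.
    dvc add "$tracked_dir" || return 1
    git add "$tracked_dir.dvc" || return 1
    git commit -m "Remove personal data on request" || return 1

    # 3. Back up the cache before garbage collection.
    cp -R .dvc/cache "$backup_dir" || return 1

    # 4. Drop everything from the cache that the current workspace does not
    #    use. dvc gc asks for confirmation unless -f is passed.
    dvc gc --workspace || return 1

    # 5. Re-check the workspace against the cache.
    dvc checkout
}

# Usage (commented out so that nothing is deleted by accident):
# remove_person_file data/person_123.csv data ../dvc-cache-backup
```

The backup is deliberately not deleted by the function. Remove it by hand once you have confirmed that the remaining data is intact, since the backup still contains the file you were asked to erase.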

To remove the data from the remote too, you need to rerun dvc gc --workspace -r {remote_name}. Please take into consideration what I wrote in point 4: it is really important to keep that in mind so that you do not delete other data.
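A minimal sketch of the remote cleanup, assuming a hypothetical remote called myremote. The function name is made up for illustration. Depending on your DVC version, the -c/--cloud flag may also be needed for the remote to be touched, so it is worth checking dvc gc --help first.

```shell
# Sketch only: "myremote" below is a hypothetical remote name.
gc_remote_workspace_only() {
    remote_name="$1"

    # Show the configured remotes so the name can be double-checked first.
    dvc remote list || return 1

    # Same cleanup as for the local cache, but also applied to the given
    # remote. Anything on that remote that the current workspace does not
    # use is deleted, including data pushed there from other repositories.
    dvc gc --workspace -r "$remote_name"
}

# Usage (commented out so that nothing is deleted by accident):
# gc_remote_workspace_only myremote
```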

If you need to preserve other data in the remote/cache, please ping us; we would need to consider our options here.

Also, can I ask you to create an issue for a command that removes a particular file/dir from the cache? It seems to me we should have one, even a hidden one, so that this kind of use case can be handled.

Thank you for the response and the explanation. I’ll open an Issue :slightly_smiling_face:

Edit: Opened the issue here. https://github.com/iterative/dvc/issues/4903

@ananda
Thank you very much for that!
Is the solution with dvc gc --workspace enough to solve your problem?