We are currently testing DVC for our ML project. We have to deal with a lot of large files that we only need occasionally. To avoid filling up the disk, I am wondering whether it is possible to pull the actual data only when needed and delete it again when the work is done.
When I delete the actual file from disk, the DVC status obviously changes. I’d like to avoid this behavior, as I still want the file to be tracked; I just don’t want an actual copy on my disk anymore.
This can be done with the dvc gc command and its flags, but it might not be as granular as you’d like.
With this command you specify explicitly, at the level of commits, which data to keep in the local cache (with -c, the cleanup is also applied to the remote storage).
If you don’t use -c and have already run dvc push + git commit + git push, you are safe: the files are still tracked, and the references to them are intact.
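As a rough sketch of that sequence, assuming a DVC remote and a Git remote are already configured: the helper names `run` and `push_then_gc` and the commit message are made up for illustration, and `--workspace` is just one possible scope for dvc gc. Note that `--workspace` keeps everything the current workspace references, so it only frees data belonging to other versions.

```shell
#!/bin/sh
# Sketch only: assumes a DVC remote and a Git remote are already configured.

# Hypothetical helper: with DRY_RUN=1 it prints a command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical name for the push/commit/push sequence, followed by the cleanup.
push_then_gc() {
    # Upload cached data to the DVC remote.
    run dvc push || return 1
    # Record the staged .dvc pointer files (fails if nothing is staged).
    run git commit -m "track-data" || return 1
    # Publish the commit so the pointers are not only local.
    run git push || return 1
    # Keep only cache entries referenced by the current workspace; no -c,
    # so the remote storage is left untouched.
    run dvc gc --workspace
}
```

Calling `DRY_RUN=1 push_then_gc` prints the four commands without running them. Without DRY_RUN, dvc gc asks for confirmation before it deletes anything.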
Another option is to run dvc push + git commit + rm -rf .dvc/cache. Again, it’s not granular, but it is safe as long as the files are already on the remote storage.
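A minimal sketch of this second option, together with pulling a file back when it is needed again. It assumes the default cache location .dvc/cache; the helper names `run`, `drop_local_cache` and `pull_when_needed`, the commit message, and the path `data/raw.dvc` are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the data has a configured DVC remote to go to.

# Hypothetical helper: with DRY_RUN=1 it prints a command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical name: push everything, commit the pointers, then wipe the cache.
drop_local_cache() {
    # Make sure the data is on the remote before deleting anything locally.
    run dvc push || return 1
    # Record the staged .dvc pointer files (fails if nothing is staged).
    run git commit -m "track-data" || return 1
    # Remove the whole local cache; this is all-or-nothing, not per file.
    run rm -rf .dvc/cache
}

# Hypothetical name: fetch one tracked target from the remote again.
pull_when_needed() {
    run dvc pull "$1"
}
```

For example, `DRY_RUN=1 drop_local_cache` prints the three commands, and `pull_when_needed data/raw.dvc` downloads that one target again from the remote.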