We have an HDFS data directory that only grows: files are never deleted or edited. We use these files for some machine learning work. Analysis A uses files for time period [1-a] and Analysis B uses files for time period [1-b] (see the example below).
Since our (HDFS) data directory only grows, we wonder whether it is possible to use DVC to track just a “file list” or “file reference”, instead of copying the current directory contents into a cache.
For our case, something like a pointer would be enough; no cache directory for tracking deleted/edited files is needed.
For example:
Data File of Day 1
Data File of Day 2 ---> Run Analysis A (DVC must hold a reference to the files of days 1 to 2)
Data File of Day 3
Data File of Day 4
Data File of Day 5 ---> Run Analysis B (DVC must hold a reference to the files of days 1 to 5)
Data File of Day 6 ---> Run Analysis C (DVC must hold a reference to the files of days 1 to 6)
Data File of Day 7
...
=> No cache is needed, since the data only grows and is never deleted or edited.
=> Every analysis uses data from day 1 up to the current day.
Is there a way to realize something like this with DVC? E.g., when I want to re-run Analysis A, DVC knows which files to download from our HDFS data directory, without using a separate cache directory?
Thank you for your help
This is not really something you can do with DVC. The way that DVC tracks files is by storing them in the DVC cache so that they can be addressed by their hashes.
However, DVC only stores one copy of a file, so if your concern is that DVC would be storing multiple copies of everything in your directory (even though it has not changed across days/versions), that is not an issue. Using your example, the DVC cache would only contain a single version of a file that exists across “days 1 through 7”.
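To make the deduplication concrete, here is a rough sketch of how directory tracking behaves (the `data/` path and file layout are made up for illustration; the example uses a local directory, but the caching behavior is the same regardless of where the data comes from):

```bash
# Day 2: track the directory containing the day 1 and day 2 files
dvc add data/
git add data.dvc .gitignore
git commit -m "Data through day 2 (Analysis A)"

# Day 5: three new files have been appended to data/
dvc add data/            # only the day 3-5 files are added to the cache;
                         # the day 1-2 copies already there are reused
git commit -am "Data through day 5 (Analysis B)"
```

Each `dvc add` rewrites the `data.dvc` pointer file, so each Git commit records which snapshot of the directory a given analysis used.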
Thank you for the fast reply. And yes, one of our concerns was that we want to avoid storing the files multiple times. If that is not the case with the DVC cache, we’ll give that solution a try, thank you.
@pmrowla Do I need to track each file separately to avoid storing multiple copies of a file in the cache? Or can I also use DVC to track a whole HDFS directory so that only the new files are cached from commit to commit?
You can track the whole directory; DVC will identify which files already exist in the cache. It also matches files by their actual binary content, so if you have two files with different paths/names but identical content, DVC will only store one copy.
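For the re-run scenario from the first post, the workflow could look roughly like this (the remote URL and the `analysis-a` tag are made-up placeholders; HDFS is one of the supported remote types, via the `dvc[hdfs]` extra):

```bash
# One-time setup: point DVC at an HDFS remote and upload the cache
dvc remote add -d storage hdfs://namenode/path/to/dvc-storage
dvc push

# After the dvc add for day 2, tag the Git commit:
git tag analysis-a

# Later, to re-run Analysis A against exactly the day 1-2 files:
git checkout analysis-a   # restores the data.dvc pointer for that state
dvc pull                  # fetches only the files referenced by that pointer
```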
Hi, I’m new to DVC and this is the only thread where I could find this cache topic. I would like to disable the use of the cache folder so that my tracked (large) files are not effectively duplicated. I’m currently using DVC to track a 5 GB dataset folder, which of course makes the cache folder equally large, and there may come a time when the dataset is too big to allow keeping a duplicate copy of the data as a cache.
@TheYisus96
You cannot disable the use of the cache directory, but what you can do is configure DVC to use the appropriate link types, so that files are only stored in the cache directory rather than keeping a full copy of each file in both your workspace and the cache. This way your workspace only contains links (symlink/hardlink/reflink) to the files in the cache directory.
Please refer to https://dvc.org/doc/user-guide/large-dataset-optimization for more information
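As a concrete example, the relevant settings look roughly like this (`cache.type` and `--relink` are documented DVC options; the order shown is just one reasonable preference):

```bash
# Prefer reflinks where the filesystem supports them, then hardlinks,
# then symlinks; fall back to full copies only as a last resort
dvc config cache.type "reflink,hardlink,symlink,copy"

# Re-link files that are already checked out so the new setting takes effect
dvc checkout --relink
```

Note that with hardlinks or symlinks DVC makes the workspace files read-only, so you would run `dvc unprotect` before modifying a tracked file.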