We have an HDFS data directory that only grows: files are never deleted or edited. We use these files for machine learning tasks. Analysis A uses the files of time period [1-a] and Analysis B uses the files of time period [1-b] (see the example below).
Since our (HDFS) data directory only grows, we wonder whether DVC can track just a “file list” or “file reference”, instead of copying the current directory content into a cache.
For our case, something like a pointer would be enough; no cache directory for tracking deleted or edited files is needed.
```
Data File of Day 1
Data File of Day 2  ---> Run Analysis A (DVC must hold a reference to files of days 1 to 2)
Data File of Day 3
Data File of Day 4
Data File of Day 5  ---> Run Analysis B (DVC must hold a reference to files of days 1 to 5)
Data File of Day 6  ---> Run Analysis C (DVC must hold a reference to files of days 1 to 6)
Data File of Day 7
...
```

=> No cache needed, since data only grows and is never deleted or edited.
=> Every analysis uses data from day 1 up to the current day.
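To make the idea concrete, here is a minimal sketch of the kind of “file list” we have in mind (all names are made up: the `day_N.parquet` naming and the `namenode` host are just placeholders). Each analysis would only need a manifest of path references, not a copy of the data:

```python
import json
from pathlib import Path

def build_manifest(data_dir: str, up_to_day: int) -> list:
    """Return path references for days 1..up_to_day.

    In our real setup these would be HDFS paths; only the reference
    strings are kept -- no data is copied anywhere.
    """
    return [f"{data_dir}/day_{d}.parquet" for d in range(1, up_to_day + 1)]

# Analysis B would reference the files of days 1 to 5:
manifest = build_manifest("hdfs://namenode/data", 5)
Path("analysis_b_files.json").write_text(json.dumps(manifest, indent=2))
```

Tracking such a small JSON manifest with DVC would be cheap; the open question for us is whether DVC itself can hold these references natively.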
Is there a way to realize this with DVC? E.g. when I want to re-run Analysis A, DVC knows which files to download from our HDFS data directory, without using a separate cache directory?
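One direction we looked at (just a sketch, the paths and script names are hypothetical): if we understand DVC's external dependencies correctly, `hdfs://` URLs can be listed directly as `deps` in `dvc.yaml`, and DVC records only their checksums in `dvc.lock` rather than copying them into the cache. Something like:

```yaml
stages:
  analysis_a:
    cmd: python analysis_a.py          # hypothetical analysis script
    deps:
      - hdfs://namenode/data/day_1     # external deps: DVC stores a
      - hdfs://namenode/data/day_2     # checksum, not a cached copy
    outs:
      - results_a
```

Would this be the recommended way to keep per-analysis references to a growing set of HDFS files, or is there a better mechanism for this?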
Thank you for your help