External dependency or import-url?


I have a large-ish set of files on a NAS. In my pipeline I preprocess these files into more convenient forms for ML training. I am trying to have repo A preprocess the raw files, and then have repo B import the preprocessed files and perform model training. Both repo A and repo B use the NAS as their remote location.

The NAS is not mine alone, so I don’t want to take up too much space on it. It sounds like if I run import-url on the raw files in repo A, DVC will want to push a copy of these files to its remote, which is also on the NAS, essentially creating two copies of the files in the same place. Is this correct?

To avoid this, should I instead specify the raw files as an external stage dependency? I am hesitant to do this because then I can’t ignore the .DS_Store files that are created whenever a Mac user browses the raw file directories on the NAS. I was hoping that I could ignore those files if I used import-url instead.
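
To make the .DS_Store concern concrete, here is a rough sketch of one possible workaround, not something DVC provides: a small script that writes a manifest of the raw files while skipping .DS_Store, so that the manifest file rather than the raw directory could be listed as the stage dependency. The names `IGNORED_NAMES`, `build_manifest`, and `write_manifest` are hypothetical, and the manifest layout (one "md5, two spaces, relative path" line per file) is an assumption made for illustration.

```python
import hashlib
import os
from pathlib import Path

# Assumption: .DS_Store is the only clutter that should be skipped.
IGNORED_NAMES = {".DS_Store"}


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large raw files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir):
    """Return sorted (relative_path, md5) pairs for files under raw_dir,
    skipping any file whose name is in IGNORED_NAMES."""
    root = Path(raw_dir)
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in IGNORED_NAMES:
                continue
            path = Path(dirpath) / name
            entries.append((path.relative_to(root).as_posix(), file_md5(path)))
    return sorted(entries)


def write_manifest(raw_dir, manifest_path):
    """Write one 'md5  relative/path' line per raw file (hypothetical layout)."""
    lines = [f"{md5}  {rel}" for rel, md5 in build_manifest(raw_dir)]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
```

With something like this, a new .DS_Store file appearing on the NAS would leave the manifest unchanged, while an added or modified raw file would change it. The obvious cost is that every raw file has to be read to compute its hash, which may be slow over a network share; comparing size and modification time instead would be cheaper but less strict.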

Any tips for this situation?