I am writing to a directory on a NAS in one stage and using that directory as a dependency in another stage. The NAS directory is an external dependency / output. When people using a mac go to this directory, it creates a .DS_Store file which makes DVC think that the dependency has changed. How can I have DVC ignore this file?
Unfortunately, there is no way to ignore files inside an external dependency with DVC.
IIRC, there is a way to tell macOS not to write .DS_Store files to network drives.
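For reference, this is a per-user macOS setting rather than anything configured on the NAS, so each Mac user would need to run it themselves (it typically takes effect after logging out and back in):

```shell
# Tell the macOS Finder not to create .DS_Store files on network volumes.
# Per-user setting; must be run on every Mac that mounts the NAS.
defaults write com.apple.desktopservices DSDontWriteNetworkStores -bool true
```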
Are external dependencies not recommended? They seem important for large datasets, but not being able to ignore files makes them significantly less convenient to use. DVC sees that the mtime of the .DS_Store files is recent and so wants to re-hash the whole folder. This seems to contradict this post Maximum data size - #4 by skshetry, which implies that we won't have to re-hash large datasets often.
Any additional recommendations here?
I’m not sure I can disable .DS_Store files on the NAS; it would have to be done on everyone’s computer individually. I did find this article about using veto files to block .DS_Store files on the NAS, but I don’t have admin access to it.
This issue has caused me to re-run stages, which deletes the data on my remote; then I have to restore from a backup, which takes a while.
I’m not sure how specifically you use that data in your pipeline, but an easy workaround might be to add a stage that runs a script to download the data before feeding it into the next stage. That way you could ignore .DS_Store yourself.
We don’t plan on adding support for dvcignore to external deps any time soon, but contributions are welcome.
I can’t fit the dataset on the machine that runs training. And for a large dataset it would be pretty costly to re-download the data every time you want to train anyway, right?
What is the intended workflow for large datasets? Are we not supposed to use NAS? Won’t this issue be present any time users want to use NAS?
The main issue is the .DS_Store files, really; I’m just suggesting potential workarounds for your scenario. If the data is big then indeed my suggestion won’t work, so it looks like you’ll have to deal with .DS_Store one way or another. Maybe consider creating a first stage in your pipeline that finds and deletes all the .DS_Store files it can before the other stages use the data.
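A minimal sketch of such a cleanup step (shown here against a throwaway temp directory for illustration; in a real pipeline you would point `find` at your NAS data path instead):

```shell
# Demo: strip stray .DS_Store files from a data directory before
# downstream stages hash it. The temp dir stands in for the NAS path.
data_dir=$(mktemp -d)
mkdir -p "$data_dir/images"
touch "$data_dir/.DS_Store" "$data_dir/images/.DS_Store" "$data_dir/images/cat.png"

# Recursively delete every .DS_Store file under the data directory.
find "$data_dir" -name '.DS_Store' -type f -delete

ls -A "$data_dir/images"   # only cat.png should remain
```

Running this as the first stage of the pipeline would keep the directory's contents stable from DVC's point of view, at the cost of re-walking the tree on each run.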
I thought DVC uses file mtime to determine whether to re-compute hashes for commands such as dvc status. Is that correct? If so, how should I set this up so that the mtime doesn’t appear changed?