I am writing to a directory on a NAS in one stage and using that directory as a dependency in another stage. The NAS directory is an external dependency / output. When people using a mac go to this directory, it creates a .DS_Store file which makes DVC think that the dependency has changed. How can I have DVC ignore this file?
Unfortunately, there is no way to ignore files inside an external dependency with DVC.
IIRC, there is a way to tell macOS not to write .DS_Store files to network drives.
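For reference, this is a per-user macOS setting rather than anything configured on the NAS, so each Mac user would need to run it themselves (it typically takes effect after logging out and back in):

```shell
# Tell the macOS Finder not to create .DS_Store files on network volumes.
# Per-user setting; must be run on every Mac that mounts the NAS.
defaults write com.apple.desktopservices DSDontWriteNetworkStores -bool true
```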
Are external dependencies not recommended? They seem important for large datasets, but not being able to ignore files makes them significantly less convenient to use. DVC sees that the mtime of the .DS_Store files is recent and so wants to re-hash the whole folder. This seems to contradict this post Maximum data size - #4 by skshetry, which implies that we won't have to re-hash large datasets often.
Any additional recommendations here?
I’m not sure I can disable .DS_Store files on the NAS; it would have to be done on everyone’s computer individually. I did find this article about using veto files to block .DS_Store files on the NAS, but I don’t have admin access to it.
This issue has caused me to re-run stages, which deletes the data on my remote; then I have to restore from a backup, which takes a while.
I’m not sure how specifically you use that data in your pipeline, but an easy workaround might be to add a stage that runs a script to download the data before feeding it into the next stage. That way you could ignore .DS_Store yourself.
We don’t plan on adding support for dvcignore to external deps any time soon, but contributions are welcome.
I can’t fit the dataset on the machine that runs training. And for a large dataset it would be pretty costly to re-download the data every time you want to train anyway, right?
What is the intended workflow for large datasets? Are we not supposed to use NAS? Won’t this issue be present any time users want to use NAS?
The main issue is the .DS_Store files, really; I’m just suggesting potential workarounds for your scenario. If the data is big then indeed my suggestion won’t work, so it looks like you’ll have to deal with .DS_Store one way or another. Maybe consider creating a first stage in your pipeline that finds and deletes all the .DS_Store files it can before the other stages use the data.
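A minimal sketch of such a cleanup step (shown here against a throwaway temp directory for illustration; in a real pipeline you would point `find` at your NAS data path instead):

```shell
# Demo: strip stray .DS_Store files from a data directory before
# downstream stages hash it. The temp dir stands in for the NAS path.
data_dir=$(mktemp -d)
mkdir -p "$data_dir/images"
touch "$data_dir/.DS_Store" "$data_dir/images/.DS_Store" "$data_dir/images/cat.png"

# Recursively delete every .DS_Store file under the data directory.
find "$data_dir" -name '.DS_Store' -type f -delete

ls -A "$data_dir/images"   # only cat.png should remain
```

Running this as the first stage of the pipeline would keep the directory's contents stable from DVC's point of view, at the cost of re-walking the tree on each run.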
I thought DVC uses file mtime to determine whether to re-compute hashes for commands such as dvc status. Is that correct? If so, how should I set this up so that the mtime doesn’t appear changed?