Secondly, I want to be able to easily check whether there are new log files. This should work with dvc update data\file.txt.dvc, right?
Yes, it'll handle this.
Another question: Will the file be sent over the network to calculate the md5 sum? I really want to reduce network usage. Thanks again.
Unfortunately, I think yes: it'll read the files over the network to calculate the md5 again (to see whether an update is needed and which files we need to pass). @kupruser can confirm this.
A workaround here is to write custom stages (dvc stage add) and not rely on import-url. The first stage would check whether there are new updates. For example, if that log directory is immutable and append-only, you could run ls dir | wc -l > files.count or just ls dir > files.list (list_files below is just an example stage name):
dvc stage add -n list_files -o files.list "ls dir > files.list"
Then add a second stage that depends on files.list (sync_data is just an example stage name):
dvc stage add -n sync_data -d files.list -o data [command to sync missing files to data]
In the second stage, you can consider using --outs-persist to keep the previous version of the data when you run the stage (so that you can calculate the delta and copy only the missing files).
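As a rough illustration of what the sync command in the second stage might look like, here is a minimal Python sketch. It is not part of DVC: the script name sync_missing.py and the function sync_missing are hypothetical. It assumes that files.list holds one file name per line (as produced by ls dir), that the log directory is flat and append-only, and that it is reachable as a normal (possibly mounted) path.

```python
import shutil
import sys
from pathlib import Path
from typing import List


def sync_missing(listing: Path, source: Path, dest: Path) -> List[str]:
    """Copy files named in `listing` from `source` to `dest` if not already there.

    Hypothetical helper, not a DVC API. Returns the names that were copied.
    """
    dest.mkdir(parents=True, exist_ok=True)
    names = [line.strip() for line in listing.read_text().splitlines() if line.strip()]
    copied = []
    for name in names:
        target = dest / name
        if target.exists():
            # Append-only assumption: a file that is already present never
            # changes, so it is skipped without being read over the network.
            continue
        shutil.copy2(source / name, target)
        copied.append(name)
    return copied


if __name__ == "__main__" and len(sys.argv) == 4:
    # Assumed invocation: python sync_missing.py files.list dir data
    listing_path, source_dir, dest_dir = (Path(arg) for arg in sys.argv[1:4])
    for copied_name in sync_missing(listing_path, source_dir, dest_dir):
        print(f"copied {copied_name}")
```

With something like this, the placeholder command could become python sync_missing.py files.list dir data. Only file names are compared, so existing files are never re-read to compute a checksum. This only helps if the data directory is kept between runs, which is what --outs-persist is for.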