Hi,
I have a question regarding the import-url function of DVC. I have Windows machines on a local network and want to import files (logfiles) from one machine to another. I have the Windows path, password and IP address, and I am already able to robocopy the files I want. Is it possible to import those files with DVC?
The copy command that works is: robocopy \\{ipaddress}\c$\data\file.txt C:\Path\file.txt
To my mind, import-url should work in that case. Do you see any problems with it?
Could you also clarify the scenario a bit? Do you want to keep that data external (at \\{ipaddress}\c$\data\file.txt) and keep importing newer versions when updates are available? Or do you want to take its current state and add it to DVC?
Hi,
Thank you for your quick reply. I have no idea what the issue was yesterday, but today it seems to work. (Maybe the error message could be more verbose?) The idea is to copy the new logfiles to the local machine from time to time, so that network usage is minimal. Secondly, I want to be able to easily check whether there are new logfiles. This should work with dvc update data\file.txt.dvc, right?
Another question: Will the file be sent over the network to calculate the md5 sum? I really want to reduce network usage. Thanks again.
Secondly I want to be able to easily check if there are new logfiles. This should work with dvc update data\file.txt.dvc , right?
yes, it’ll handle this
Another question: Will the file be sent over the network to calculate the md5 sum? I really want to reduce network usage. Thanks again.
I think yes, unfortunately. It'll read the files over the network to calculate the md5 again (to see whether an update is needed and which files need to be transferred). @kupruser can confirm this.
A workaround here is to write custom stages (dvc stage add) instead of relying on import-url. The first stage would check whether there are new updates. For example, if that log directory is immutable and append-only, you could run ls dir | wc -l > files.count or just ls dir > files.list (the stage name below is only an example):
dvc stage add -n list_files -o files.list "ls dir > files.list"
Then add a second stage that depends on files.list (again, the stage name is only an example):
dvc stage add -n sync_data -d files.list -o data [command to sync missing files to data]
In the second stage, consider using --outs-persist so that the previous version of data is kept when the stage runs (this lets you calculate the delta and copy only the missing files).
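To make the "[command to sync missing files to data]" placeholder more concrete, here is a minimal Python sketch of what such a command might do. It assumes the log directory is append-only and that files.list holds one file name per line; the function name sync_missing and its parameters are made up for illustration and are not part of DVC.

```python
import shutil
from pathlib import Path
from typing import List


def sync_missing(listing: Path, source_dir: Path, target_dir: Path) -> List[str]:
    """Copy files named in `listing` from `source_dir` that `target_dir` lacks.

    Hypothetical helper: assumes an append-only source directory, so a file
    that already exists in `target_dir` is never copied again.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for line in listing.read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines in the listing
        destination = target_dir / name
        if destination.exists():
            continue  # already synced by an earlier run
        shutil.copy2(source_dir / name, destination)  # keeps timestamps
        copied.append(name)
    return copied
```

Only the missing files are read over the network here, which is why keeping the previous contents of data between runs matters. A call could look like sync_missing(Path("files.list"), Path(r"\\{ipaddress}\c$\data"), Path("data")).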
The command is not pretty because robocopy uses more than one exit code for success. Is there an easier way to update the data? At the moment, I have to run dvc unfreeze ImportData && dvc repro ImportData && dvc freeze ImportData && git commit ...
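One way to tidy up the exit-code handling is a small wrapper that maps robocopy's result to a plain 0/1. Robocopy's exit code is a bit mask, and values below 8 mean that no copy failure occurred. The sketch below is only an illustration; the names normalize_robocopy_exit and run_robocopy are hypothetical.

```python
import subprocess
from typing import Sequence

# Robocopy exit codes are a bit mask; 8 and above signal a failed copy
# or a fatal error, everything below 8 counts as success.
ROBOCOPY_FAILURE_THRESHOLD = 8


def normalize_robocopy_exit(code: int) -> int:
    """Map a robocopy exit code to 0 (success) or 1 (failure)."""
    return 0 if 0 <= code < ROBOCOPY_FAILURE_THRESHOLD else 1


def run_robocopy(args: Sequence[str]) -> int:
    """Run robocopy with `args` and return the normalized exit code."""
    completed = subprocess.run(["robocopy", *args])
    return normalize_robocopy_exit(completed.returncode)
```

A script that ends with sys.exit(run_robocopy(sys.argv[1:])) could then be used as the stage command, so that DVC sees a conventional zero exit code when the copy succeeded.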
So then it depends on how exactly it detects the changes. If it has to go and calculate a hash (as DVC does), then we need to improve this (and we can discuss how). If it relies only on timestamps or some other simple check, then you should be fine, right? The single stage that you have should efficiently bring in only new files as they arrive. Unless I'm missing something.
Sorry that I didn’t point out my questions very well. In general, it is now working as I want. Two little notes:
For me, it would be convenient if DVC could filter the external files by timestamps first (maybe as an option).
With the stage as in one of the previous answers, it would be convenient if I could do something like dvc update ImportData. Right now, I have to run dvc unfreeze ImportData && dvc repro ImportData && dvc freeze ImportData just to update one data folder/file. It is probably not a good idea to wrap this command into a new stage, right?
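Until something like that exists, the unfreeze/repro/freeze sequence could be wrapped in a small helper script rather than in another stage. This is only a sketch under that assumption; update_frozen_stage and the run parameter are hypothetical names, and the only DVC commands used are the three from the post.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def _run_checked(command: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(list(command), check=True)


def update_frozen_stage(stage: str, run: Runner = _run_checked) -> None:
    """Unfreeze `stage`, reproduce it, then freeze it again."""
    run(["dvc", "unfreeze", stage])
    try:
        run(["dvc", "repro", stage])
    finally:
        # Re-freeze even when repro fails, so the stage is never left unfrozen.
        run(["dvc", "freeze", stage])
```

Calling update_frozen_stage("ImportData") would then replace the three chained commands. The git commit step is left out on purpose so that the result can be reviewed first.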