Hello!
I need some help understanding how DVC stores data. I have a somewhat complicated setup and am not sure how to configure it correctly.
- Data needs to live outside of the workspace
- The initial data processing needs to be done on another server (server A); this data is then copied to the main server (server B), and the rest of development takes place from there (but in the initial stages I need to redo the processing a few times)
- I have a remote set up, but it’s very slow, so I’d prefer to avoid relying on it
- The DVC cache is configured on the data partition
What I want:
- All my data stored on the data partition on server B (either in the DVC cache or in my own data folder)
- The data symlinked into the data dir in the workspace
What I don’t understand is how to keep all the data out of the workspace while still having it symlinked there.
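To make the layout I have in mind concrete, here is a minimal sketch in plain Python. It is not DVC, and it does not show how DVC itself creates links; it only illustrates the end result I am after. All paths are hypothetical stand-ins created in a temporary directory so the snippet is safe to run.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the real locations, created under a temp dir.
root = Path(tempfile.mkdtemp())
data_partition = root / "data_partition" / "project_data"  # where files should physically live
workspace = root / "home" / "project"                       # the git/DVC workspace

data_partition.mkdir(parents=True)
workspace.mkdir(parents=True)

# A file that physically lives on the (stand-in) data partition.
(data_partition / "example.txt").write_text("processed output\n")

# The workspace only contains a symlink named data/ that points at the data partition.
link = workspace / "data"
os.symlink(data_partition, link, target_is_directory=True)

# Reading through the workspace path reaches the file on the data partition.
content = (link / "example.txt").read_text()
resolved = (link / "example.txt").resolve()
print(link.is_symlink(), resolved)
```

In other words, code in the workspace keeps using data/ as its path, while the bytes stay on the data partition.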
Some questions:
- If I process data on server B, where do I set the output path? If I write to the data/ folder, it’s stored in my home directory, which I don’t want.
- If I scp data from server A, where do I store it on server B?
Part of the problem with the slow remote is that I have a lot of image files. Initially I worked around this by adding the tarballs to version control, but the untarred image folder is a dependency of a pipeline step, so when I run dvc push it takes prohibitively long. The progress bar starts with a reasonable number of files, but it keeps resetting, apparently as it discovers new files.
So how do I solve this? I’m okay with starting the DVC versioning from scratch, since I haven’t done much data processing yet.
I’m currently struggling a lot with the setup because I mostly discover and learn things about DVC through its error messages. For example, I first added my whole data folder, then discovered that I need to add files and folders individually. By then the data folder was already tracked, and my dvc.yaml has a parametrised stage that I cannot remove from the command line, so I have to manually edit dvc.yaml and dvc.lock to remove it.
I hope this is clear. I’ve tried to describe it as best I can, but I’m a bit confused about what the right path is!