We are fairly new to DVC, but we’ve been quite successful with it so far. Before introducing DVC into our project, we already had a well-established practice of storing our generated artifacts under a fully qualified root: /cv-pipeline/io/<something>. For example, here is a stage in our pipeline:
    cmd: python imputation_3.py enc
    deps:
    - path: imputation_3.py
      md5: 0cfd78615e83fd2fa0f39498858fd693
    - path: /cv-pipeline/io/merged_clean_enc.pkl
      md5: 292a590473f0507708f8369acff897dd
    outs:
    - path: /cv-pipeline/io/merged_miced_enc.pkl
      cache: true
      metric: false
      persist: false
      md5: 87f81f8604c6236d1bf9a9eb5f3e82fb
    md5: d2c4019299ef38a1cb335d460c6e2c2e
We find that dvc pull and dvc push work fine with this setup, and that DVC recalculates each absolute path as a path relative to the root of the git repo. So we see lots of path-backups while dvc pull is processing stuff, e.g. ../../../../../merged_miced_enc.pkl. We use DVC in an entirely containerized setting where everything is highly automated and uniform. We know that /cv-pipeline/io is in the same location relative to every git repo in every workspace/container.
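To illustrate where those ../ chains come from, here is a minimal sketch using only the Python standard library. It is not DVC's code, and the checkout location /builds/core/cv-pipeline is a hypothetical example; the real number of ../ segments depends on how deep the repo sits in the container.

```python
import posixpath

# Hypothetical checkout location (an assumption for illustration only).
repo_root = "/builds/core/cv-pipeline"
# Absolute output path taken from the stage shown above.
artifact = "/cv-pipeline/io/merged_miced_enc.pkl"

# Express the absolute artifact path relative to the repo root.
relative = posixpath.relpath(artifact, start=repo_root)
print(relative)  # ../../../cv-pipeline/io/merged_miced_enc.pkl

# Resolving the relative form from the repo root yields the original path,
# which is why the layout must be identical in every container.
resolved = posixpath.normpath(posixpath.join(repo_root, relative))
print(resolved)  # /cv-pipeline/io/merged_miced_enc.pkl
```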
I only just tried using dvc get for the first time and discovered that it will not work against our project:
    dvc get -o /tmp/bern/merged_miced_enc.pkl git@gitlab:core/cv-pipeline.git /cv-pipeline/io/merged_miced_enc.pkl
    ERROR: failed to get '/cv-pipeline/io/merged_miced_enc.pkl' from 'git@gitlab:core/cv-pipeline.git' - unable to find DVC-file with output 'cv-pipeline/io/merged_miced_enc.pkl'
Something is removing the leading '/' from my filepath specification. I sort of get what is going on there, but since the output is fully qualified, it seems that dvc get could try finding the path precisely as I specified it before just giving up.
You may question the use of fully qualified outputs, and that is fair, but I have come to be thankful that these large and numerous artifacts are not nested within our git repositories. It lets us use trivial, normal methods to recursively search our source code without getting sophisticated about excluding these files. It actually feels great to have the outputs elsewhere, in a well-known location outside of our git repo. It also makes it less likely that devs still learning git and dvc will make the mistake of accidentally committing these artifacts to git while they are messing with generated artifacts that they are introducing.
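The lookup order suggested for dvc get (try the path exactly as given, then fall back to the repo-relative form) could be sketched roughly as follows. This is only an illustration of the idea, not DVC's implementation: find_output and the declared set are hypothetical names, and the set stands in for whatever outputs the stage files declare.

```python
def find_output(requested, declared_outputs):
    """Return the declared output matching `requested`, or None."""
    # First try the path exactly as the user typed it, so that
    # fully qualified outputs can match.
    if requested in declared_outputs:
        return requested
    # Fall back to the repo-relative interpretation (leading '/' removed).
    relative = requested.lstrip("/")
    if relative in declared_outputs:
        return relative
    return None


# Hypothetical stand-in for the outputs declared in the repo's stage files.
declared = {"/cv-pipeline/io/merged_miced_enc.pkl"}

print(find_output("/cv-pipeline/io/merged_miced_enc.pkl", declared))
# /cv-pipeline/io/merged_miced_enc.pkl
```

With this ordering, a repo that declares relative outputs still matches through the fallback, while an absolute output is found before its leading '/' is dropped.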