dvc get doesn't work with absolute paths, but everything else seems to

We are fairly new to DVC but we’ve been quite successful with it so far. Before introducing DVC into our project we already had the well-established practice of storing our generated artifacts in a fully qualified root: /cv-pipeline/io/<something>. For example, here is a stage in our pipeline:

cmd: python imputation_3.py enc
deps:
- path: imputation_3.py
  md5: 0cfd78615e83fd2fa0f39498858fd693
- path: /cv-pipeline/io/merged_clean_enc.pkl
  md5: 292a590473f0507708f8369acff897dd
outs:
- path: /cv-pipeline/io/merged_miced_enc.pkl
  cache: true
  metric: false
  persist: false
  md5: 87f81f8604c6236d1bf9a9eb5f3e82fb
md5: d2c4019299ef38a1cb335d460c6e2c2e

We find that dvc pull and dvc push work fine the way we are doing this and that it recalculates the absolute path as a path relative to the root of the git repo. So we see lots of path-backups (i.e. ../) when dvc pull is processing stuff: ../../../../../merged_miced_enc.pkl. We use DVC in an entirely containerized setting where everything is highly automated and uniform. We know that /cv-pipeline/io is in the same location relative to every git repo in every workspace/container.
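To illustrate why an absolute output shows up with a run of ../ segments, here is a minimal sketch of the relative-path calculation. The repo checkout location (`repo_root`) is a made-up assumption, since the thread does not say where the repo lives, and this is plain Python path arithmetic rather than DVC's own code:

```python
import posixpath

# Hypothetical checkout location of the git repo (not given in the thread).
repo_root = "/cv-pipeline/workspaces/dev/src/cv-pipeline"

# Absolute output path, as declared in the stage file.
output = "/cv-pipeline/io/merged_miced_enc.pkl"

# Express the absolute output relative to the repo root.
relative = posixpath.relpath(output, start=repo_root)
print(relative)  # ../../../../io/merged_miced_enc.pkl
```

The number of ../ segments depends only on how deep the repo root sits below the directory the two paths have in common.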

I only just tried using dvc get for the first time and discovered that it will not work against our project:

dvc get -o /tmp/bern/merged_miced_enc.pkl git@gitlab:core/cv-pipeline.git /cv-pipeline/io/merged_miced_enc.pkl
ERROR: failed to get '/cv-pipeline/io/merged_miced_enc.pkl' from 'git@gitlab:core/cv-pipeline.git' - unable to find DVC-file with output 'cv-pipeline/io/merged_miced_enc.pkl'

Something is removing the leading '/' of my filepath specification. I sort of get what is going on there, but since the output is fully qualified, it seems that dvc get could try finding the path precisely as I specified it before just giving up. You may question the use of fully qualified outputs and that is fair, but I have come to be thankful that these large and numerous artifacts are not nested within our git repositories. This makes it easier to use trivial, normal methods to recursively search our source code without getting sophisticated and excluding these files. It actually feels great to have the outputs elsewhere in a well-known location outside of our git repo. It makes it less likely that devs still learning git and dvc make the mistake of accidentally committing these artifacts to git somehow while they are messing with generated artifacts that they are introducing.


Hi @bemccarty !

Indeed, we did path.lstrip("/") on that as we were thinking that people are mostly misusing / in those commands. We didn’t think that anyone would be using it to access an external local output yet. But we were wrong :) I’ve reverted that PR and am releasing 0.66.1 right now (should be on pip in half an hour or so, but other packages will be updated a bit later in the day). Please give it a try and see if that works for you as expected.

Thank you so much for the feedback! :)
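For anyone wondering how the path.lstrip("/") call mentioned above produces the error message from the original post, here is a small sketch in plain Python. It only demonstrates the string operation, not the surrounding DVC code:

```python
import posixpath

# The path exactly as passed to dvc get in the original post.
requested = "/cv-pipeline/io/merged_miced_enc.pkl"

# Stripping leading slashes silently turns an absolute path into a relative one.
stripped = requested.lstrip("/")

print(stripped)                     # cv-pipeline/io/merged_miced_enc.pkl
print(posixpath.isabs(requested))   # True
print(posixpath.isabs(stripped))    # False
```

The stripped value matches the path quoted in the "unable to find DVC-file with output" error, which is why the lookup could not match the absolute output recorded in the stage file.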

I have no idea whether this is related to the change just made (I’m guessing not), but now that I can run dvc get, I find that it downloads the specified artifact just fine into some kind of temporary cache and then creates the target filepath I specified with -o as a symlink into that temporary cache. I don’t understand why it is using symlinks, but it is. Even stranger, it then deletes the temporary cache folder, thus deleting the target of the symlink, and I am left with a dangling symlink. See the final DEBUG line below for where it deletes the temporary cache root.

> dvc get -v --rev bemccarty-dvc-files-committed-experiment -o ~/scrap/dvc-get-test/ git@gitlab.tstbcpacdc1lxv.geisinger.edu:fornwaltlab/core/cv-pipeline.git /cv-pipeline/io/merged_miced_enc.pkl 
DEBUG: Writing '/tmp/PowerShellTemp/64181/tmp1hrykzecdvc-erepo/.dvc/config.local'.
DEBUG: Writing '/tmp/PowerShellTemp/64181/tmp1hrykzecdvc-erepo/.dvc/config'.
DEBUG: Preparing to download data from 's3://cv-pipeline/dvc'
DEBUG: Preparing to collect status from s3://cv-pipeline/dvc
DEBUG: Collecting information from local cache...
DEBUG: cache '.dRu2xNgargGHTtJzsdAyZ4/87/f81f8604c6236d1bf9a9eb5f3e82fb' expected '87f81f8604c6236d1bf9a9eb5f3e82fb' actual 'None'
DEBUG: Collecting information from remote cache...
DEBUG: Downloading 's3://cv-pipeline/dvc/87/f81f8604c6236d1bf9a9eb5f3e82fb' to '.dRu2xNgargGHTtJzsdAyZ4/87/f81f8604c6236d1bf9a9eb5f3e82fb'
Computing md5 for a large file .dRu2xNgargGHTtJzsdAyZ4/87/f81f8604c6236d1bf9a9eb5f3e82fb. This is only done once.
DEBUG: cache '.dRu2xNgargGHTtJzsdAyZ4/87/f81f8604c6236d1bf9a9eb5f3e82fb' expected '87f81f8604c6236d1bf9a9eb5f3e82fb' actual '87f81f8604c6236d1bf9a9eb5f3e82fb'
DEBUG: Checking out 'merged_miced_enc.pkl' with cache '87f81f8604c6236d1bf9a9eb5f3e82fb'.
DEBUG: checking if 'merged_miced_enc.pkl'('{'md5': '87f81f8604c6236d1bf9a9eb5f3e82fb'}') has changed.
DEBUG: 'merged_miced_enc.pkl' doesn't exist.
DEBUG: Created protected 'symlink': .dRu2xNgargGHTtJzsdAyZ4/87/f81f8604c6236d1bf9a9eb5f3e82fb -> merged_miced_enc.pkl
DEBUG: Removing '.dRu2xNgargGHTtJzsdAyZ4'
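
The failure mode in the log above can be reproduced without DVC. This is a minimal sketch, assuming a POSIX system where symlinks can be created; the directory and file names are made up for illustration and do not come from DVC:

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()

# Stand-in for the temporary cache that the download lands in.
cache_dir = os.path.join(workdir, "tmp-cache")
os.makedirs(cache_dir)
cached = os.path.join(cache_dir, "artifact")
with open(cached, "wb") as f:
    f.write(b"data")

# The requested output is created as a symlink into the cache...
target = os.path.join(workdir, "merged_miced_enc.pkl")
os.symlink(cached, target)

# ...and then the cache is removed, as in the final DEBUG line.
shutil.rmtree(cache_dir)

is_link = os.path.islink(target)   # the link itself still exists
resolves = os.path.exists(target)  # but it no longer points at anything
print(is_link, resolves)           # True False

shutil.rmtree(workdir)  # clean up the demo directory
```

A quick `os.path.islink(p) and not os.path.exists(p)` check like this is a handy way to detect the dangling link after running dvc get.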

It is true that the git repository that I am using is indeed configured to use symlinks. I will hazard a guess that that is why dvc get ends up trying to use symlinks, but I don’t think it should do that. It seems to me that dvc get should just download the file where it was told to. If it has to go through the fancy cache.type honoring code to do that because that is how it is implemented, then it should take steps to ignore the cache.type value from the remote git repo and always try a hardlink first, then a copy, in that order.
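
The "hardlink, then copy" order suggested above might look roughly like this. It is only a sketch of the idea in plain Python, not DVC's implementation; `place_file` and the file names are hypothetical:

```python
import os
import shutil
import tempfile


def place_file(src, dst):
    """Hardlink src to dst if possible, otherwise copy it. Returns the method used."""
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError:
        # e.g. a cross-device link or a filesystem without hardlink support
        shutil.copyfile(src, dst)
        return "copy"


workdir = tempfile.mkdtemp()
cache_dir = os.path.join(workdir, "tmp-cache")
os.makedirs(cache_dir)
cached = os.path.join(cache_dir, "artifact")
with open(cached, "wb") as f:
    f.write(b"data")

target = os.path.join(workdir, "merged_miced_enc.pkl")
method = place_file(cached, target)

# Removing the temporary cache no longer affects the target.
shutil.rmtree(cache_dir)

with open(target, "rb") as f:
    content = f.read()
print(method, content)

shutil.rmtree(workdir)  # clean up the demo directory
```

Unlike a symlink, both a hardlink and a copy keep the data reachable after the temporary cache directory is deleted.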

@bemccarty it should be resolved in the latest DVC version. @kupruser could you please confirm that?

I just tried this with v0.66.6 and observed the same incorrect behavior as previously: The file is downloaded into a cache but is deleted at the very end and I end up with a dangling symlink.


@bemccarty I’ve checked and it should not be using symlinks with dvc get - there is already code that changes the cache type to avoid symlinks. But … I was able to reproduce, so there is a bug somewhere. Looking into this right now, thanks!

@bemccarty So sorry for the delay! Good catch! Looks like we have a bug where dvc tries to use the link type that is configured for the external repo, instead of using reflink or copy for get. Same with import, where it should use the link type from the repo you are importing into. Created https://github.com/iterative/dvc/issues/2744, we will fix it ASAP. Thank you so much for the feedback!

The issue should be fixed by https://github.com/iterative/dvc/pull/2745. Thanks again! Stay tuned for the new DVC version - it’s coming really soon.

@bemccarty 0.66.8 is now available through pip and conda (brew will be available later today). Please upgrade and give it a try :) Thanks a lot for the feedback!

I just tested with v0.68.1 and verified that dvc get is working against my symlink’ish repo now. Thanks!
