How to force dvc get

From my CI pipeline, I would like to download some files tracked using DVC.

data
└── test_data
    ├── tests_subfolder_1
    │   ├── file1.test
    │   └── file2.test
    ├── tests_subfolder_1.dvc
    ├── tests_subfolder_2

… etc.

When I run dvc get . data/test_data --out data/, I get the error:

ERROR: failed to get 'data/test_data/' from '.' - unable to remove 'data/test_data/test_subfolder.dvc' without a confirmation. Use `-f` to force.

This is surprising to me, because I would expect test_subfolder.dvc and test_subfolder/ to be able to coexist.

However, I don’t know where to put the -f argument. I tried:

  • dvc -f get . data/test_data --out data/
  • dvc get . data/tests_data --out data/ --force

I also tried to work around the problem with:
dvc get . data/test_data/* --out data/

However, my final solution was:
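# for each .dvc file under data/test_data, strip the .dvc suffix and dvc get the corresponding path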
ls data/test_data/*.dvc | sed -e 's/\.dvc$//' | xargs -I {} dvc get . {} --out data/test_data/

This work-around feels over-complicated. What was I supposed to do?

Hi

-f or --force can go anywhere after dvc get.
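For example, with the paths from your post, either of these placements should work:

$ dvc get . data/test_data --out data/ -f
$ dvc get -f . data/test_data --out data/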

However, I’m trying to understand your scenario here and I don’t quite get it. You are trying to “download” data already tracked in your own project, and put it in the same location where it already exists? That’s what dvc get . data/test_data --out data/ looks like it’s doing.

TBH I’m not even sure why we support . as the URL provided to get (and import). See the command reference here: https://dvc.org/doc/command-reference/get

Also, is there a data/test_data.dvc file in your project? Otherwise the command should fail, I believe. Maybe the error message is wrong and that’s the problem. If so, would you mind opening a bug report at https://github.com/iterative/dvc/issues? 🙂

p.s. please note we’ve officially released DVC 1.0! Your project looks like it’s still using 0.x — we highly recommend migrating.

Six months later, I’m more comfortable with DVC and bash, so I can better explain my problem. At the highest level, I wanted a one-liner that downloads all .dvc files in a given sub-folder, and I didn’t want to write a for-loop. I don’t want to do dvc pull, because I’m not going to be using the version-control features in a CI pipeline and don’t need a cache.

The original directory structure I posted wasn’t clear. Here is an improved illustration where each DVC file points to a full folder.

data/
├── other_data.dvc
└── test_assets
    ├── rough_tests.dvc
    └── smooth_tests.dvc

DVC has gone through a lot of improvements over the last six months. Is there now a better alternative to my one-liner?


You would use Git to sync .dvc files though. I’ll assume you mean to get the actual data tracked by DVC.

I’m still confused about the . URL given to get. So the environment in CI is a copy of the DVC repo, but you’re just trying to avoid dvc pull? If so: a) you’re already using Git, so avoiding SCM doesn’t seem to be a concern (or I don’t get what the problem is); and b) similarly, “not needing a cache” shouldn’t be a concern either. Think of the cache as an internal mechanism DVC uses during the download process, not a problem in itself: the final result is that you get the data files you want, placed where you want them.

BTW, note that pull accepts targets, so you can tell it to download only the data in that folder (not everything in the project).
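For example, something along these lines (using the .dvc file names from your layout) should fetch just that test data into the workspace:

$ dvc pull data/test_assets/rough_tests.dvc data/test_assets/smooth_tests.dvc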

The only other path I can think of now would be to change the structure of your DVC repo. Instead of dvc adding each item in data/test_assets separately (which results in multiple .dvc files), you can add the whole directory, and then you can do this from anywhere in CI (you don’t even need to clone the repo):

$ dvc get https://<Git URL to DVC repo> data/test_assets --out data/
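If it helps, here is a rough sketch of that restructuring (assuming the data already lives in the repo, the old per-folder .dvc files are simply deleted, and the names match your example):

$ rm data/test_assets/rough_tests.dvc data/test_assets/smooth_tests.dvc
$ dvc add data/test_assets    # a single data/test_assets.dvc now tracks the whole folder
$ git add data/test_assets.dvc data/.gitignore
$ dvc push                    # upload the data so dvc get can fetch it from anywhere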

As for changes in DVC 1.x or 2.x (coming up very soon!), some commands are getting a new --glob option to accept wildcards, but get doesn’t have it yet. I’m not sure if it’s planned, but feel free to request that feature in our repo!

In hindsight, I don’t know why I was fixated on “not needing a cache”. I guess I just wanted my CI containers to consume as few resources (including disk space) as possible. However, I never checked my assumption that the cache even took up that much space in this case, so I’m just going to use dvc pull now.


Sounds good, @Seanny123! Here’s some info on how the cache is optimized in DVC (avoiding file duplication on the file system): https://dvc.org/doc/user-guide/large-dataset-optimization

Best