dvc pull --glob

Hi, I want to share just some of my files on a remote. For that I used dvc push --glob data/**/*.txt data, which seemed to work. But how can someone else download and check out only this subset? All other data files are stored only locally and were never pushed. The command dvc pull --glob data/**/*.txt throws a lot of errors, because dvc tries to download files from the remote that I never uploaded.
Thanks

Hi!

--glob currently has very limited functionality: it only matches files that exist locally, even for dvc pull. However, dvc pull does support specific targets (e.g. a specific directory or a specific file). Full glob support will be coming sometime in the future.

Hi,
if I understand correctly, it is currently possible to download a single folder or file, but not a subset of a folder (with a glob expression). Is that correct?
Thanks

Correct. E.g. dvc pull path/to/datafile, dvc pull path/to/datadir/path/to/subdir, dvc pull path/to/datadir/path/to/subdir/subfile, etc


Is there any update to this? I have lots of folders with lots of images and metadata files. I just need the metadata files from the remote.

Unfortunately, we have not been able to implement it yet. Please subscribe to the GitHub issue "pull: glob should not depend on what's in the workspace" (Issue #7491, iterative/dvc) for any updates. Thanks.


@skshetry unfortunately issue 7491 has been closed as a duplicate of #5864. So interested parties should subscribe to #5864.

Is the only current workaround to this issue to use a shell script to pull the files we want?

  • assume the git repo is fully checked out
  • the script is given a combination of a path and file pattern (possibly in quotes)
    • for all .dvc files at the path:
      • script removes the .dvc extension from each file and checks the pattern against the filename or path
      • script performs dvc pull for the specific file

extra points:

  • parse each .dvc file for the outs/path and compare that value against the filename or path.
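The steps above could be sketched in Python roughly as follows. This is only an illustration, not an official DVC tool: the function names tracked_outputs, select_targets and pull_matching are made up for this post, the output path is assumed to be the .dvc file name without its extension (the "extra points" parsing of outs/path is not done), and matching uses Python's fnmatch, where * also matches across "/" and ** has no special meaning. The only DVC call relied on is dvc pull with explicit targets, as mentioned earlier in the thread.

```python
import fnmatch
import subprocess
from pathlib import Path


def tracked_outputs(root="."):
    """Return output paths (relative to root) inferred from .dvc files.

    Assumption: each foo.dvc file tracks an output named foo next to it.
    """
    root = Path(root)
    outputs = []
    for dvc_file in sorted(root.rglob("*.dvc")):
        rel = dvc_file.relative_to(root)
        # Skip the internal .dvc/ directory itself and anything inside it.
        if not dvc_file.is_file() or ".dvc" in rel.parts[:-1]:
            continue
        # Drop the trailing ".dvc" extension to get the tracked path.
        outputs.append(rel.with_suffix("").as_posix())
    return outputs


def select_targets(outputs, pattern):
    """Keep only the outputs whose path matches the fnmatch-style pattern."""
    return [path for path in outputs if fnmatch.fnmatchcase(path, pattern)]


def pull_matching(pattern, root="."):
    """Run `dvc pull` for every tracked output that matches the pattern."""
    targets = select_targets(tracked_outputs(root), pattern)
    if targets:
        # dvc pull accepts explicit file or directory targets.
        subprocess.run(["dvc", "pull", *targets], cwd=root, check=True)
    return targets
```

With the git repo fully checked out, calling pull_matching("data/*/*.txt") from the repo root would then pull only the matching .txt outputs, one level below data/ or deeper. Quote the pattern if you pass it in from a shell so the shell does not expand it first.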

"Is the only current workaround to this issue to use a shell script to pull the files we want?"
Yes. We only implement a very naive glob, which can only match what is already in your workspace. This is useful for dvc push or dvc add, but with dvc pull the outputs will not exist in your workspace until after you have pulled them down.
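To make the limitation concrete, here is a small stand-alone illustration using Python's own glob module in a temporary directory. It is not DVC's code, and the helper name workspace_matches and the file names are invented for the example. It only shows the general behaviour of a glob that is expanded against the workspace: a tracked file that has not been pulled yet has nothing on disk to match.

```python
import glob
import os
import tempfile
from pathlib import Path


def workspace_matches(workspace, pattern):
    """Expand pattern against files that currently exist under workspace."""
    hits = glob.glob(os.path.join(workspace, pattern), recursive=True)
    return sorted(Path(hit).relative_to(workspace).as_posix() for hit in hits)


with tempfile.TemporaryDirectory() as ws:
    root = Path(ws)
    (root / "data" / "a").mkdir(parents=True)
    (root / "data" / "b").mkdir(parents=True)

    # data/a/notes.txt is present in the workspace (e.g. pulled earlier).
    (root / "data" / "a" / "notes.txt").write_text("present\n")
    (root / "data" / "a" / "notes.txt.dvc").write_text("outs:\n- path: notes.txt\n")

    # data/b/notes.txt is tracked, but only its .dvc file exists locally.
    (root / "data" / "b" / "notes.txt.dvc").write_text("outs:\n- path: notes.txt\n")

    matches = workspace_matches(ws, "data/**/*.txt")
    print(matches)  # ['data/a/notes.txt'] -- data/b/notes.txt is never matched
```

The file under data/b is exactly the one you would want a pull to fetch, but a workspace-based glob cannot see it.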