Workflow for pulling data added to project using dvc import

lusk · May 7, 2024, 4:31pm

Hi all, I’m having trouble wrapping my head around how to combine dvc import and dvc pull.

I have two DVC projects. One is a registry, and the other is a separate, standalone project that relies on some data in the registry. Let’s call them registry and project.

On Computer A, I used dvc import to import some data from registry into project. registry and project do not share a cache, so the imported data was then copied into project’s cache. I also created some DVC pipeline stages, ran them, and the outputs were checked into DVC as well.

I then added a DVC Google Cloud Storage remote and ran dvc push from Computer A to push the data to the cloud. I also pushed the code and all DVC-related files to my git remote.

On Computer B I pulled the code (git pull) and then attempted to dvc pull. It pulled the outputs of the pipeline stages, but not the data that I had imported to project from registry on Computer A.

Am I missing a part of the puzzle here? I expected the combination of git push + dvc push from Computer A and git pull + dvc pull on Computer B to let me replicate the project across machines relatively seamlessly, but that doesn’t seem to be the case. Should I have added the registry data to project using dvc get instead of dvc import? Is there a way to ensure that data added using dvc import can also be pulled to other machines in the future?

gregstarr · May 7, 2024, 7:24pm

dvc import should create an xyz.dvc file, right? Is this file in git, and was it pulled to Computer B?

lusk · May 7, 2024, 9:22pm

Yes, but it seems like the issue is that the repo defined in the .dvc files points to the registry on the local filesystem of Computer A. I suppose that the solution was probably to import the files from the registry GitHub URL as opposed to directly from the local folder, or to set multiple remotes for those files such that they could be specified when pulling the data later…
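
A quick way to confirm this on Computer A is to look at the source recorded in the import's .dvc file. The helpers below are only a sketch: they assume the import source is stored under deps -> repo -> url (as in files like data.xml.dvc) and that it is the first url: key in the file. The function names and the list of URL schemes are my own and not part of DVC.

```shell
# Print the source repo recorded in an import's .dvc file.
# Assumption: the first "url:" key in the file is the deps -> repo -> url entry.
import_source_url() {
    sed -n 's/^[[:space:]]*url:[[:space:]]*//p' "$1" | head -n 1
}

# Succeed if the source looks like a network URL rather than a local folder.
# The scheme list is a heuristic for illustration, not something DVC defines.
is_network_url() {
    case "$1" in
        http://*|https://*|ssh://*|git@*) return 0 ;;
        *) return 1 ;;
    esac
}
```

If import_source_url prints something like a path under a home directory and is_network_url fails on it, other machines will most likely not be able to resolve that import.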

dberenbaum · May 8, 2024, 12:20pm

dvc pull should pull imported data as well, although it will try to pull it from the registry remote rather than make another copy of it on the project remote. Take a look at data/data.xml.dvc in the iterative/example-get-started repository on GitHub. This is an import from the iterative/dataset-registry project. If you clone iterative/example-get-started and run dvc pull, you will see that it pulls the imported data.
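
The steps described above could be tried roughly like this. It is a sketch that needs git, dvc, and network access; the wrapper function name is made up, and the body runs in a subshell so the caller's working directory is left alone.

```shell
# Clone the example project, show the import's .dvc file, and pull the data.
try_example_import() (
    git clone https://github.com/iterative/example-get-started &&
    cd example-get-started &&
    # The deps section of this file records the repo the data was imported from.
    cat data/data.xml.dvc &&
    # Pulls the tracked data, including the imported data/data.xml.
    dvc pull
)
```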

lusk wrote: “I suppose that the solution was probably to import the files from the registry GitHub URL as opposed to directly from the local folder”

Yes, this could be the issue. Computer B also needs to be able to access the URL you used in dvc import.
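
One way to check that from Computer B, before running dvc pull, is to ask git whether it can reach the source repo at all. This is a minimal sketch using git ls-remote; the function name is hypothetical, and the URL to pass is whatever was used in dvc import.

```shell
# Succeed if git can list refs at the given repo URL or path from this machine.
# GIT_TERMINAL_PROMPT=0 makes git fail instead of prompting for credentials.
can_reach_repo() {
    GIT_TERMINAL_PROMPT=0 git ls-remote "$1" >/dev/null 2>&1
}
```

If this fails for the URL stored in the import's .dvc file, dvc pull on that machine will not be able to fetch the imported data from the registry either.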
