Dvc_api.get_url is not working with external data

Hello,
I’m fairly new to DVC. I would like to explain my setup first and then seek help to make the best use of DVC.

My Setup:

  • I want to use DVC to track and version external data. The data is large (~500 GB+ currently) and located on a NAS server that can be accessed over SSH. I’m referring to the ‘Managing External Data’ tutorial.

  • I want to create a data registry with all of my external data and share it with multiple users, avoiding multiple copies of the data by using a shared cache as explained in ‘How to share external cache’.

The steps I followed:

  1. On my local machine, inside a Git-tracked directory, I created a DVC workspace with dvc init.
  2. Configured a remote directory as the DVC remote, let’s say at ssh://server/home/temp_registry.
  3. Created an external cache at ssh://server/home/temp_registry/temp_cache as stated in the managing external data tutorial.
  4. I kept a folder named 'Dogs' on the DVC remote; this folder has 2 subfolders and 2 text files, with each subfolder holding some images. From my local machine I used dvc add --external ssh://server/home/temp_registry/Dogs/ to put the external data under DVC tracking.
  5. Committed .dvc/config to the GitLab repo from my local machine as I went.

The output I got:

  1. A Dogs.dvc file was created on my local machine, which was committed to GitLab.
  2. In the cache directory ssh://server/home/temp_registry/temp_cache, I found the cached files.

The exploration:
I tried to use the Python API on my local machine to fetch the URL of the Dogs folder on the DVC remote.
I used:
dvc.api.get_url(path='Dogs.dvc', repo='gitlab_repo/project_name.git', rev='temp_registry_branch')

The expected output was a URL pointing to the Dogs dir; instead I received the error dvc.exceptions.OutputNotFoundError: Unable to find DVC file with output 'Dogs.dvc'.

Now the questions:

  1. Is my approach of creating a data registry using external data management techniques correct? Is it advisable to use DVC in such a way?
  2. Did I miss any step that should be carried out while setting up a registry on a remote?
  3. Was my usage of dvc.api.get_url correct? What can be done to get the URL of the Dogs folder located on the remote?

Thanks :slight_smile:


Hi @mvish7 !

  1. Is my approach of creating a data registry using external data management techniques correct? Is it advisable to use DVC in such a way?

Yes and no. It’s a valid approach, but I’m not sure it achieves what you are looking for. The external data management workflow means that you (and your users) are reading data from the initial location ssh://server/home/temp_registry/Dogs/. DVC helps to version it, save previous versions, get back to them, etc. ‘How to share external cache’ won’t work this way, at least not for multiple users, since all of them will share the same location for the currently active version (dvc checkout) of the dataset. I hope that makes some sense. Happy to clarify further.

Was my usage of dvc.api.get_url correct? What can be done to get the URL of the Dogs folder located on the remote?

I would read this issue: api: get_url() returns path to `.dir` for directory · Issue #3182 · iterative/dvc · GitHub. Also, DVC now has an fsspec-compatible API - you can kind of “mount” the whole repo and use a walk_files-style interface on it. I’m not 100% sure what the status is, but we can give you more details if we see that it’s needed. For now, I’m not sure you need get_url.
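One hedged note on the error itself (an assumption on my part, not verified against your repo): both the API and the CLI expect the path of the tracked data itself, e.g. Dogs, rather than the Dogs.dvc file, which may be why you got OutputNotFoundError. The CLI equivalent would be something like this, with a placeholder repo URL:

% # placeholder repo URL; note the target is "Dogs", not "Dogs.dvc"
% dvc get --show-url https://gitlab.com/org/project_name.git Dogs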

Now, going back to the initial question: is this the correct way of creating a data registry with a NAS and a shared cache? The way it is usually done (a consolidated sketch follows the list):

  1. You mount your NAS so that ssh://server/home/temp_registry is available as a local dir. Ideally you should make the path the same for all data scientists. (If they always work via SSH - run their code there, etc. - the steps are similar but simpler, and you can avoid mounting.)
  2. You set up your project to use an external cache: dvc cache dir /local/path/to/temp_cache
  3. You set up your project to use symlinks: dvc config cache.type symlink
  4. Make sure that you have a backup. Then use something like --to-cache to move your data into the cache.
  5. Run dvc checkout, and in the project tree you will get a directory that uses symlinks into the cache, thus avoiding copies and keeping users from affecting each other when they check out different versions.
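To make those steps concrete, here is a minimal shell sketch of the list above. The mount point /mnt/nas and the project path are assumptions for illustration:

% # 1. NAS mounted (e.g. via sshfs) at /mnt/nas - same path for every user
% cd ~/project
% git init
% dvc init
% # 2. shared cache on the same NAS volume as the data
% dvc cache dir /mnt/nas/temp_cache
% # 3. symlinks instead of copies
% dvc config cache.type symlink
% git add .dvc/config
% # 4. back up the data first, then transfer it into the cache
% dvc add /mnt/nas/temp_registry/Dogs -o Dogs
% # 5. the workspace now holds a Dogs directory of symlinks into the cache
% dvc checkout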

Again, happy to go into details.

Btw, could you please describe how you manage your data in the initial location, e.g. ssh://server/home/temp_registry/Dogs/ … how do you add data there, and do you add / remove it at all?

Hi
Thanks for your reply.

About how we manage data currently:
We keep usable data on NAS/temp_registry/Dogs, accompanied by a database. We add data manually using scp/sftp. Data removal from ssh://server/home/temp_registry/Dogs/ happens rarely, as it holds usable data.

I think I did not fully understand your suggestions, and hence I ran into miscellaneous errors while working through them. Could you please explain the steps (maybe as an MWE) I need to take if I want to achieve the following:

  1. Get the data on the NAS under DVC version control.
  2. Let multiple users use the data without creating several copies of the whole data / subsets of the data in their ‘home’ directories on the NAS. (I intend to keep the shared cache on the NAS, so other users can access it.)

I tried to refer to this, but it did not help much.

Once you have your NAS mounted locally, you would set up a new DVC repository for your data registry and configure it to use the new shared cache location and symlinks, as @shcheklein described in his reply.

  1. Get the data on the NAS under DVC version control.

To add your existing data to the new repository, you would use dvc add with the -o flag as described here: add (this is the --to-cache feature @shcheklein mentioned; again, be sure you have a backup of your original data before doing this transfer).

The end result here is that your data on the NAS will be transferred to its new location in the shared DVC cache directory (still on the NAS), and your repository will contain symlinks that point to the data in the shared cache.
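For instance, with hypothetical paths and a made-up hash, a file in the workspace would resolve like this:

% # hypothetical example: the workspace file is a symlink into the shared cache
% readlink Dogs/dog1.jpg
/mnt/nas/temp_cache/4a/1b2c3d...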

  2. Let multiple users use the data without creating several copies of the whole data / subsets of the data in their ‘home’ directories on the NAS. (I intend to keep the shared cache on the NAS, so other users can access it.)

When working in their own home directories, users would set up their own DVC projects and use dvc import to import the specific data they need from your data registry.

When configuring their DVC projects, the users would follow the same setup as earlier, i.e. mount the NAS locally, then configure the same shared cache location and configure DVC to use symlinks. This way, when a user runs dvc import, their own DVC project will contain symlinks to your data (which still lives only on the NAS, in the single shared cache directory).
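A rough sketch of what an individual user’s setup could look like (the registry repo URL and the mount point are placeholders):

% cd /home/user2/my-project
% git init
% dvc init
% # same shared cache and link type as the registry project
% dvc cache dir /mnt/nas/temp_cache
% dvc config cache.type symlink
% # import only the dataset that's needed
% dvc import https://gitlab.com/org/data-registry.git Dogs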


Sure. Let’s try to set up an MWE together:

# Mock up the setup.
# In reality user1 would be something like `/home/user1`; the cache is better
# located on the same volume as `/home` and the data (which is `NAS/temp_registry/Dogs`)
% mkdir example-shared-cache
% cd example-shared-cache
% mkdir data
% mkdir user1
% mkdir user2
% echo "dog1" > data/dog1.txt
% echo "dog2" > data/dog2.txt
% mkdir cache
% cd user1
% mkdir project
% cd project
% git init

# Initialize the DVC repo with an external cache and enable all possible link types to avoid copies
% dvc init
% dvc config cache.type "reflink,symlink,hardlink,copy"
% dvc cache dir /Users/ivan/Projects/example-shared-cache/cache
% git add .dvc/config

# Now we finally add the data. See https://dvc.org/doc/command-reference/add#example-transfer-to-an-external-cache and https://dvc.org/doc/command-reference/add#-o
% dvc add ../../data -o data
% ls
data     data.dvc

% git add .gitignore data.dvc
% git commit -a -m "add data"
% # git push should go here to GH/GitLab/etc

# Now the second user comes ...
% cd ../user2
% git clone ../user1/project # in reality it would be a clone from GH/GitLab or something
% cd project
% ls
data.dvc
% dvc checkout
% ls
data     data.dvc

Now, let’s say we’d like to add one more dog to the initial dataset:

% echo "dog3" > example-shared-cache/data/dog3.txt

To update the data in the repository, do this:

% cd user1/project
# It's a bug that you have to remove these first; I'll create a ticket
% rm -rf data data.dvc
% dvc add ../../data -o data
% git add data.dvc ...
% git commit -m "update data"
% git push

There is an important caveat to keep in mind: check the cache.shared option and configure it appropriately in the initial setup as well, if needed.
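For example, if all the users belong to one POSIX group:

% dvc config cache.shared group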

An interesting alternative to using dvc add ... -o is to use dvc import-url like this:

% dvc import-url /Users/ivan/Projects/example-shared-cache/data data

It’s similar to dvc add ... -o, but it also saves the source /Users/ivan/Projects/example-shared-cache/data in the .dvc file, so later you can use dvc update data.dvc or something like that to update your data.
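So an update could then look something like this (a sketch, same hypothetical flow as above):

% echo "dog3" > /Users/ivan/Projects/example-shared-cache/data/dog3.txt
% cd user1/project
% dvc update data.dvc
% git commit -am "update data"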

Another alternative is to set up an extra data registry repo, as @pmrowla mentioned.

Hi @pmrowla and @shcheklein,

Thank you for the detailed explanation and also for your patience.
I tried to follow the steps and got stuck on an error. Below I’ll explain the steps I took.

First, I mounted ssh://my_server/mvish7 to my local machine using sshfs.

  1. On my local machine, created a DVC repo at /home/mvish7/Desktop/dvc_repo using dvc init.

  2. Defined a remote cache dir with dvc cache dir ssh://my_server/mvish7/registry_cache.

  3. Set the cache type with dvc config cache.type symlink.

  4. A sample of the original data is kept at ssh://my_server/mvish7/Dogs; to put it under DVC tracking, I used the following (inside a terminal on my local machine):
    dvc add ssh://my_server/Dogs -o Dogs

    4.a. At this stage, I also tried to use the locally mounted NAS, as in:
    dvc add local_machine/NAS_mount_point/Dogs -o Dogs

At this point, I hit an error: ERROR: configuration error - config file error: extra keys not allowed @ data['protected']

To investigate:

  1. I checked the config file under /home/mvish7/Desktop/dvc_repo/.dvc; it contains only the following configuration:

[cache]
dir = ssh://my_server/home/mvish7/registry_cache
type = symlink

Also, elsewhere on my local machine under /home/mvish7/Desktop/dvc_repo, I did not find any config file that has a data['protected'] field.

  2. I changed permissions using chmod 777 for ssh://my_server/mvish7/Dogs and ssh://my_server/mvish7/registry_cache.

  3. I tried to keep a subset of the data on my local machine and configure a remote as suggested in the data registry tutorial.
    In this case I could use dvc add and then dvc push without hitting any error.


Also, could you please explain why I should mount the NAS locally? And once it is mounted, where should it be used in the steps I took?

Many Thanks :slight_smile:

@mvish7 okay, before we jump into debugging the actual setup, let’s first agree on basic requirements.

You initially mentioned:

  2. Let multiple users use the data without creating several copies of the whole data / subsets of the data in their ‘home’ directories on the NAS. (I intend to keep the shared cache on the NAS, so other users can access it.)

I assumed that you expect your DS/ML folks to SSH into that machine and do their experiments there. Is that correct? Or do you expect them to do their work on their own machines?

At this point, I hit an error: ERROR: configuration error - config file error: extra keys not allowed @ data['protected']

Could you please run it with -v?

[cache]
dir = ssh://my_server/home/mvish7/registry_cache
type = symlink

This configuration is not supported, I believe. Only a DVC remote can be SSH / S3, etc. The cache should be on a (local or mounted) file system where you can at least symlink a file.
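In other words, a supported layout would be something like this (paths are hypothetical): the cache sits on a local or mounted file system, and SSH is used only for a remote:

% # cache on a path where symlinks are possible (local disk or mounted NAS)
% dvc cache dir /mnt/nas/mvish7/registry_cache
% dvc config cache.type symlink
% # SSH is fine as a *remote* for push/pull
% dvc remote add -d exp_reg ssh://my_server/home/mvish7/exp_remote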

Hi,

I think this was the cause of the issue I was facing

With my current configs:

[core]
    remote = exp_reg
[cache]
    ssh = exp_reg
    type = symlink
    shared = group
['remote "exp_reg"']
    url = ssh://my_server/home/mvish7/exp_cache

I can use dvc add ssh://my_server/home/mvish7/Dogs -o Dogs. This:

  1. Populates the exp_cache with cached files.
  2. Creates a folder named Dogs and a Dogs.dvc file under the dvc_repo on my local machine. The Dogs folder now contains symlinks to the actual data on my remote.

Is this the expected behavior?


To answer this part:

I expect all of us to SSH into the machine and run experiments there.
In some cases, small / proof-of-concept experiments can be run on a local machine. In that scenario, we are unsure whether we need to add DVC-based data versioning.

In my scenario, we want to keep everything on the remote server.
Do you encourage setting up the data registry using a Python + DVC environment on the remote server itself? I.e., instead of using my local machine to set up the registry, should I set it up directly on the remote?

Then please disregard anything where you mount the NAS, where you do something like dvc add ssh://, etc. In your case everything is happening pretty much on a single machine, right? Everyone already has the storage?

Have you tried to follow this: Dvc_api.get_url is not working with external data - #5 by shcheklein … I made it specifically with “everyone is working from their home directory on the same machine” in mind.

On my local machine … The Dogs folder now contains symlinks to the actual data on my remote.

Hmm, how can a local machine symlink via SSH? It’s not possible. Most likely those symlinks go to something like local_machine/dvc_repo/.dvc/cache? And the remote location ssh://my_server/home/mvish7/exp_cache should be populated only after you do dvc push?

I think this was the cause of the issue I was facing

If you can still reproduce this, it would be good to create a ticket so we can fix it.

Yes. In my case, everything is happening on a single machine, which has the storage space and computing power.


I have not yet tried it out. Just to clarify: as the machine (where we have the storage + computing power) is SSH-accessible from my local machine, should I try to execute these steps directly on the remote machine instead of my local machine?


Upon double-checking, the symlinks do go into local_machine/dvc_repo/.dvc/cache, and I had to use dvc push to populate the remotely located exp_cache.

Is there a way to avoid the symlinks going to local_machine/dvc_repo/.dvc/cache?

Yes, you could definitely try it. I was trying to imitate the workflow with a single directory, imitating multiple user directories, etc., just to try it faster. But my intention was to make it work for the scenario you were describing - multiple people SSH-ing into a single box, where everyone has their own /home/<name>/project directory (which they clone with git clone initially) and they run everything inside that project.