"dvc get" gets empty files for self-hosted data registry

Hi! I am new to DVC. It seems like a very helpful tool. However, I have been trying to configure a self-hosted data registry and I have run into an issue when downloading the data from the remote storage. I have followed the Data Registry tutorial, but it has not worked for me.

I have a server that I can access through ssh. My idea is to use this server as a Data Registry, so that my colleagues and I can use it to centralize our data.

So first I created the data registry on the server:

mkdir /shared/test_dvc
cd /shared/test_dvc
git init
dvc init

And copied some data with the following structure:

.
└── data
   ├── 001.wav
   ├── 002.wav
   └── ... 

Then:

dvc add data
git add .gitignore data.dvc
git commit -m "first commit"

Then I created the storage on my server and pushed the data:

dvc remote add storage /shared/storage
dvc push -r storage

Then, on my laptop, in a new folder, I ran the following:

dvc list -R ssh://<ssh_server_name>:/shared/test_dvc

This shows only these files:

  • .dvcignore
  • .gitignore
  • data.dvc

And when running:

dvc get ssh://<ssh_server_name>:/shared/test_dvc data

It creates an empty directory named data. Why am I not getting the files stored under the data directory on my remote server?
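
One thing that may be worth checking here, offered as a possibility rather than a confirmed diagnosis: dvc get takes its remote configuration from the source repository's committed .dvc/config and, as far as the documentation describes, expects a default remote there. In the steps above, the remote was added without -d (so it is not the default), and the change that dvc remote add made to .dvc/config was not committed to Git. A minimal sketch of a fix, to be run inside /shared/test_dvc on the server; the helper name set_default_remote is hypothetical, while dvc remote default and the Git commands are standard:

```shell
# Hypothetical helper: mark an already-configured DVC remote as the default
# and commit the updated config so that clones of the Git repo can see it.
set_default_remote() {
    remote_name=$1
    # Writes "remote = <name>" under [core] in .dvc/config
    dvc remote default "$remote_name" &&
    # The remote settings only reach other users once .dvc/config is committed
    git add .dvc/config &&
    git commit -m "Set default DVC remote to $remote_name"
}
```

Usage would be: cd /shared/test_dvc, then set_default_remote storage. Note also that a remote URL given as a local path such as /shared/storage would likely be resolved on whichever machine runs dvc get, so an ssh:// remote URL may be needed for access from a laptop.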

Hello,

I’m experiencing the same issue described by @jbs06 .

The only difference, as far as I can tell, is that I use the GitLab repository URL when attempting to download the data,

dvc get https://gitlab.com/<path-to-repo> data

I’m running,

  • Ubuntu 20.04
  • DVC 2.55.0
  • Git 2.25.1

Thanks for your support,
Oliver

Could you please share the verbose output? Try adding --verbose to the end of that command.
Thanks.

Running,

$ dvc get https://gitlab.com/<path-to-repo> data --verbose

I get,

2023-05-01 07:48:53,240 DEBUG: v2.55.0 (snap), CPython 3.8.10 on Linux-5.15.0-67-generic-x86_64-with-glibc2.29
2023-05-01 07:48:53,240 DEBUG: command: /snap/dvc/1385/bin/dvc get https://gitlab.com/<path-to-repo> data --verbose
2023-05-01 07:48:53,444 DEBUG: Creating external repo https://gitlab.com/<path-to-repo>@None
2023-05-01 07:48:53,444 DEBUG: erepo: git clone 'https://gitlab.com/<path-to-repo>' to a temporary dir
2023-05-01 07:49:03,338 DEBUG: Removing '<local-dir-path>/.6FZjKHTuUVJ4cL7nFae5hW'
2023-05-01 07:49:03,339 DEBUG: Analytics is enabled.
2023-05-01 07:49:03,361 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpnzgqbvia']'
2023-05-01 07:49:03,361 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpnzgqbvia']'

(I’ve replaced the actual path names with <x>)

If I try to use the get function to fetch a dataset that is a single file (instead of a directory), I get a different error,

2023-05-01 07:59:34,032 DEBUG: v2.55.0 (snap), CPython 3.8.10 on Linux-5.15.0-67-generic-x86_64-with-glibc2.29
2023-05-01 07:59:34,032 DEBUG: command: /snap/dvc/1385/bin/dvc get https://gitlab.com/<path-to-repo> data3.txt --verbose
2023-05-01 07:59:34,240 DEBUG: Creating external repo https://gitlab.com/<path-to-repo>@None
2023-05-01 07:59:34,240 DEBUG: erepo: git clone 'https://gitlab.com/<path-to-repo>' to a temporary dir
2023-05-01 07:59:42,679 DEBUG: Removing '<local-dir-path>/.8Bonut9yrX4rwyqmyw2nWG'
2023-05-01 07:59:42,679 ERROR: unexpected error - Invalid private key
Traceback (most recent call last):
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc/cli/__init__.py", line 210, in main
    ret = cmd.do_run()
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc/cli/command.py", line 40, in do_run
    return self.run()
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc/commands/get.py", line 26, in run
    return self._get_file_from_repo()
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc/commands/get.py", line 33, in _get_file_from_repo
    Repo.get(
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc/repo/get.py", line 76, in get
    fs.get(
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc_objects/fs/base.py", line 609, in get
    return get_file(from_info, to_info)
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc_objects/fs/callbacks.py", line 69, in func
    return wrapped(path1, path2, **kw)
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc_objects/fs/callbacks.py", line 41, in wrapped
    res = fn(*args, **kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc_objects/fs/base.py", line 596, in get_file
    self.fs.get_file(rpath, lpath, **kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc/fs/dvc.py", line 388, in get_file
    return dvc_fs.get_file(dvc_path, lpath, **kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc_objects/fs/base.py", line 522, in get_file
    self.fs.get_file(from_info, to_info, callback=callback, **kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc_data/fs.py", line 130, in get_file
    fs, _, path = self._get_fs_path(rpath)
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc_data/fs.py", line 62, in _get_fs_path
    if fs.exists(fs_path):
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc_objects/fs/base.py", line 321, in exists
    return self.fs.exists(path)
  File "/snap/dvc/1385/lib/python3.8/site-packages/funcy/objects.py", line 47, in __get__
    return prop.__get__(instance, type)
  File "/snap/dvc/1385/lib/python3.8/site-packages/funcy/objects.py", line 25, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/snap/dvc/1385/lib/python3.8/site-packages/dvc_ssh/__init__.py", line 119, in fs
    return _SSHFileSystem(**self.fs_args)
  File "/snap/dvc/1385/lib/python3.8/site-packages/fsspec/spec.py", line 76, in __call__
    obj = super().__call__(*args, **kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/sshfs/spec.py", line 66, in __init__
    self._client, self._pool = self.connect(
  File "/snap/dvc/1385/lib/python3.8/site-packages/fsspec/asyn.py", line 115, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in sync
    raise return_result
  File "/snap/dvc/1385/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in _runner
    result[0] = await coro
  File "/snap/dvc/1385/usr/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/snap/dvc/1385/lib/python3.8/site-packages/sshfs/utils.py", line 27, in wrapper
    return await func(*args, **kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/sshfs/spec.py", line 83, in _connect
    client = await self._stack.enter_async_context(_raw_client)
  File "/snap/dvc/1385/usr/lib/python3.8/contextlib.py", line 568, in enter_async_context
    result = await _cm_type.__aenter__(cm)
  File "/snap/dvc/1385/lib/python3.8/site-packages/asyncssh/misc.py", line 274, in __aenter__
    self._coro_result = await self._coro
  File "/snap/dvc/1385/lib/python3.8/site-packages/asyncssh/connection.py", line 8037, in connect
    new_options = cast(SSHClientConnectionOptions, await _run_in_executor(
  File "/snap/dvc/1385/lib/python3.8/site-packages/asyncssh/connection.py", line 515, in _run_in_executor
    return await loop.run_in_executor(
  File "/snap/dvc/1385/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/asyncssh/connection.py", line 6423, in __init__
    super().__init__(options=options, last_config=last_config, **kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/asyncssh/misc.py", line 350, in __init__
    self.prepare(**self.kwargs)
  File "/snap/dvc/1385/lib/python3.8/site-packages/asyncssh/connection.py", line 7282, in prepare
    load_keypairs(cast(KeyPairListArg, client_keys), passphrase,
  File "/snap/dvc/1385/lib/python3.8/site-packages/asyncssh/public_key.py", line 3470, in load_keypairs
    read_private_key_and_certs(key_to_load, passphrase)
  File "/snap/dvc/1385/lib/python3.8/site-packages/asyncssh/public_key.py", line 3284, in read_private_key_and_certs
    key, cert = import_private_key_and_certs(read_file(filename), passphrase)
  File "/snap/dvc/1385/lib/python3.8/site-packages/asyncssh/public_key.py", line 3162, in import_private_key_and_certs
    raise KeyImportError('Invalid private key')
asyncssh.public_key.KeyImportError: Invalid private key

2023-05-01 07:59:42,707 DEBUG: Version info for developers:
DVC version: 2.55.0 (snap)
--------------------------
Platform: Python 3.8.10 on Linux-5.15.0-67-generic-x86_64-with-glibc2.29
Subprojects:
        dvc_data = 0.47.2
        dvc_objects = 0.21.1
        dvc_render = 0.3.1
        dvc_task = 0.2.0
        scmrepo = 1.0.2
Supports:
        azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
        gdrive (pydrive2 = 1.15.3),
        gs (gcsfs = 2023.4.0),
        hdfs (fsspec = 2023.4.0, pyarrow = 11.0.0),
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = 2023.4.0, boto3 = 1.26.76),
        ssh (sshfs = 2023.4.1),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2023.4.0)

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-05-01 07:59:42,707 DEBUG: Analytics is enabled.
2023-05-01 07:59:42,730 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpwn5i9xph']'
2023-05-01 07:59:42,731 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpwn5i9xph']'

In this case, it looks as if the get function fails to establish an ssh/scp connection due to an invalid private key, which I don’t understand since the default key in ~/.ssh/id_rsa is the right one.

Happy to do more debugging at my end, if you can point me in the right direction. Thanks for your help!

Oliver

One additional observation: Running get with the --show-url flag, I get the correct path,

ssh://<ip>:22/<path-on-server>/d1/6fb36f0911f878998c136191af705e

My ssh credentials are saved in the default location ~/.ssh/id_rsa and an appropriate alias is defined in ~/.ssh/config.
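
To narrow down an error like Invalid private key, it may help to know which on-disk format the key file uses, since SSH libraries may not all accept the same set of formats. The sketch below is only a diagnostic aid and assumes nothing about the actual cause in this thread; key_format is a hypothetical helper that inspects the first line of a key file and recognizes a few common PEM headers:

```shell
# Hypothetical helper: report the container format of a private key file
# based on its first line. Only a handful of common headers are recognized.
key_format() {
    case "$(head -n 1 "$1")" in
        "-----BEGIN OPENSSH PRIVATE KEY-----")   echo openssh ;;
        "-----BEGIN RSA PRIVATE KEY-----")       echo pem-rsa ;;
        "-----BEGIN ENCRYPTED PRIVATE KEY-----") echo pkcs8-encrypted ;;
        "-----BEGIN PRIVATE KEY-----")           echo pkcs8 ;;
        *)                                       echo unknown ;;
    esac
}
```

For example: key_format ~/.ssh/id_rsa. Running ssh-keygen -y -f ~/.ssh/id_rsa is another quick check, since it prints the public key when OpenSSH itself can read the private key.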

Hi, are you able to ssh to that url?

Yes, I am able to ssh into the server with

ssh <ip>

I can also successfully do,

ssh ssh://<ip>:22