DVC with git sparse-checkout

Hello
I have a large number of datasets, each in its own subdirectory. So far, I have always pulled all datasets with git clone ... & dvc pull, which works perfectly.

Now, I would like to load only a subset of my datasets. I tried to achieve this by doing a sparse checkout of the git repository. However, if I run git sparse-checkout init & git sparse-checkout add .dvc dataset_name followed by dvc pull, I get:

ERROR: unexpected error - (b'worktreeConfig', b'true')

Is there a way to fix this, or maybe a better way to solve the problem in the first place?
I am aware that I could simply run dvc pull --recursive dataset_name, but given the large number of directories I am dealing with, I’d prefer to clone only those that I actually need.

Any hint is highly appreciated.

Hi, can you provide a full traceback by running pull with the -v flag?

Here is the traceback:

Traceback (most recent call last):
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/dvc/cli/__init__.py", line 183, in main
    cmd = args.func(args)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/dvc/cli/command.py", line 20, in __init__
    self.repo: "Repo" = Repo(uninitialized=self.UNINITIALIZED)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/dvc/repo/__init__.py", line 248, in __init__
    self._ignore()
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/dvc/repo/__init__.py", line 416, in _ignore
    self.scm_context.ignore(file)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/funcy/objects.py", line 28, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/dvc/repo/__init__.py", line 313, in scm_context
    return SCMContext(self.scm, self.config)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/funcy/objects.py", line 28, in __get__
    res = instance.__dict__[self.fget.__name__] = self.fget(instance)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/dvc/repo/__init__.py", line 301, in scm
    return SCM(self.root_dir, no_scm=no_scm)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/dvc/scm.py", line 102, in SCM
    return Git(
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 98, in __init__
    first_ = first(self.backends.values())
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/funcy/seqs.py", line 55, in first
    return next(iter(seq), None)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/_collections_abc.py", line 869, in __iter__
    yield self._mapping[key]
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/scmrepo/git/__init__.py", line 49, in __getitem__
    initialized = backend(*self.args, **self.kwargs)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/scmrepo/git/backend/dulwich/__init__.py", line 146, in __init__
    self.repo = Repo.discover(start=root_dir)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/dulwich/repo.py", line 1240, in discover
    return cls(path)
  File "/Users/marc/miniconda3/envs/py39/lib/python3.9/site-packages/dulwich/repo.py", line 1172, in __init__
    raise UnsupportedExtension(extension)
dulwich.repo.UnsupportedExtension: (b'worktreeConfig', b'true')

Unfortunately, our default git backend (dulwich) does not currently support the worktreeConfig extension. The good news is that support has just been merged upstream (Add support for worktreeconfig extension · jelmer/dulwich@a9bbc16 · GitHub) and will be included in the next dulwich release, meaning we’ll be able to fix the issue as soon as the new dulwich release is out. For now, if it is an option, I’d suggest disabling sparse checkout.

I created "dulwich broken with sparse checkout" (Issue #175 · iterative/scmrepo · GitHub) to track the issue.
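
If you want to check whether a given clone is affected, here is a minimal sketch in Python that looks for the worktreeConfig extension in the text of a git config file. It rests on the assumption, consistent with the traceback above, that the error appears when extensions.worktreeConfig is set to true in the repository config. The function name has_worktree_config_extension is made up for this example and is not part of DVC, scmrepo, or dulwich. The parser is deliberately simplified: it handles only plain [section] headers and key = value lines, not includes, subsections, or inline comments.

```python
def has_worktree_config_extension(config_text: str) -> bool:
    """Return True if the config text sets extensions.worktreeConfig to a true value.

    Hypothetical helper for illustration only. Simplified parsing: plain
    `[section]` headers and `key = value` lines; the last matching key wins.
    """
    section = None
    enabled = False
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and full-line comments.
        if not line or line[0] in "#;":
            continue
        # Track the current section; git section names are case-insensitive.
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section != "extensions" or "=" not in line:
            continue
        key, value = (part.strip().lower() for part in line.split("=", 1))
        if key == "worktreeconfig":
            enabled = value in {"true", "yes", "on", "1"}
    return enabled
```

To use it on a clone, pass in the contents of .git/config, for example has_worktree_config_extension(Path(".git/config").read_text()) with Path imported from pathlib. A True result suggests the dulwich backend would raise UnsupportedExtension until the fixed release is available.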
