Trouble modifying and saving dvc data file which lives outside the repo

I have a dvc setup where the git/dvc repo, the data-file directory and the dvc-cache directory are all peers.
e.g.
my-test-repo
test-files
my-test-dvc-cache
are all directories at the same level

All the files under test-files were added using dvc add test-files

my-test-repo/test-files.dvc contains this line:
path: …/test-files

.dvc/config contains these lines:
[cache]
dir = …/…/my-test-dvc-cache

I did a dvc push and all the files went to the specified remote.
I did a git clone and a dvc pull (on several different machines) and all the files came down in the directory structure specified above.
Code which runs from the repo and uses the files works.

However, when I modify a data file, I am having trouble saving it.
I changed a file in the test-files directory and dvc status shows that test-files is modified.
$ dvc status
unit_test_input.dvc:
changed outs:
modified: …\test-files
changed checksum

When I do dvc commit it gives this error message:
$ dvc commit
ERROR: failed to commit - unable to commit changed stage: ‘test-files.dvc’. Use -f|--force to force.

So I entered dvc commit -f and it complains about files outside of the repo.
I read that it is ok to have files outside the repo and the original push and pull operations worked fine.

ERROR: unexpected error - Cmd(‘git’) failed due to: exit code(128)
cmdline: git ls-files C:\test-files
stderr: ‘fatal: C:\test-files: ‘C:\test-files’ is outside repository at ‘C:\test-files/my-test-repo’’

It seems like dvc doesn’t care that the data files are outside the repo, but the commit command is trying to run git commands on those files, and git doesn’t like them being outside the repo.

Is that what’s going on?
Is there something I can do about it?

Hi @RSL !

Could you also show $ dvc version output, please?

Also, could you show the full log of dvc commit -f -v, please?

We do support files outside of repo, but I can’t say that we really recommend it, because there are quite a few caveats (like isolation if you are on a shared server) that make it a bit tricky to get right. Could you elaborate on why you can’t store your file in the repo?

As to the workaround, could you try dvc repro unit_test_input.dvc?

Thanks for the quick reply. Here is the version output:
dvc version
DVC version: 1.1.7
Python version: 3.7.5
Platform: Windows-10-10.0.18362-SP0
Binary: True
Package: exe
Supported remotes: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache: reflink - not supported, hardlink - supported, symlink - not supported
Filesystem type (cache directory): (‘NTFS’, ‘C:\’)
Repo: dvc, git
Filesystem type (workspace): (‘NTFS’, ‘C:\’)

$ dvc commit -f
ERROR: unexpected error - Cmd(‘git’) failed due to: exit code(128)
cmdline: git ls-files C:\Algo\unit_test_input
stderr: ‘fatal: C:\Algo\unit_test_input: ‘C:\Algo\unit_test_input’ is outside repository at ‘C:/Algo/sw_algo’’

Here is the output of trying to use repro:

$ dvc repro unit_test_input.dvc
Verifying data sources in stage: ‘unit_test_input.dvc’
ERROR: failed to reproduce ‘unit_test_input.dvc’: Cmd(‘git’) failed due to: exit code(128)
cmdline: git ls-files C:\TestData_Zmirror\Algo\unit_test_input
stderr: ‘fatal: C:\TestData_Zmirror\Algo\unit_test_input: ‘C:\TestData_Zmirror\Algo\unit_test_input’ is outside repository at ‘C:/TestData_Zmirror/Algo/sw_algo’’

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

@RSL Could you please run dvc commit -f -v (notice the -v)?

In regards to why the data lives outside:
The data set is pretty large, and we want to have more than one git/dvc repo access the same data set.
Otherwise we would have to make copies of it.
For example, we have a Jenkins server with two jobs;
One to run the dev version of the code, one to run the release version of the code.
So they check out different branches to get different code, but both branches point to the same copy of the data, and we don’t have to have separate copies of the same data.

$ dvc commit -f -v
2020-07-10 18:47:10,826 DEBUG: Trying to spawn '['C:\\Program Files (x86)\\DVC (Data Version Control)\\dvc.exe', 'daemon', '-q', 'updater']'
2020-07-10 18:47:16,692 DEBUG: Spawned '['C:\\Program Files (x86)\\DVC (Data Version Control)\\dvc.exe', 'daemon', '-q', 'updater']'
2020-07-10 18:47:16,699 DEBUG: fetched: [(3,)]
2020-07-10 18:47:17,520 DEBUG: Path 'C:\TestData_Zmirror\Algo\unit_test_input' inode '3166988651231289851'
2020-07-10 18:47:17,520 DEBUG: fetched: [('d2deb874d59d018befb6a630a9080efb', '40165274755', '3e19d2d4a064727a5a3ee41091cdb433.dir', '1594430389307446784')]
2020-07-10 18:47:17,521 DEBUG: Computed stage: 'unit_test_input.dvc' md5: 'None'
2020-07-10 18:47:17,521 DEBUG: 'md5' of stage: 'unit_test_input.dvc' changed.
2020-07-10 18:47:17,587 DEBUG: fetched: [(2,)]
2020-07-10 18:47:17,596 ERROR: unexpected error - Cmd('git') failed due to: exit code(128)
  cmdline: git ls-files C:\TestData_Zmirror\Algo\unit_test_input
  stderr: 'fatal: C:\TestData_Zmirror\Algo\unit_test_input: 'C:\TestData_Zmirror\Algo\unit_test_input' is outside repository at 'C:/TestData_Zmirror/Algo/sw_algo''
------------------------------------------------------------
Traceback (most recent call last):
  File "dvc\main.py", line 53, in main
  File "dvc\command\commit.py", line 22, in run
  File "dvc\repo\__init__.py", line 36, in wrapper
  File "dvc\repo\commit.py", line 45, in commit
  File "dvc\stage\__init__.py", line 380, in save
  File "dvc\stage\__init__.py", line 391, in save_outs
  File "dvc\output\base.py", line 262, in save
  File "dvc\output\base.py", line 239, in ignore
  File "dvc\scm\git\__init__.py", line 277, in is_tracked
  File "site-packages\git\cmd.py", line 542, in <lambda>
  File "site-packages\git\cmd.py", line 1005, in _call_process
  File "site-packages\git\cmd.py", line 822, in execute
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git ls-files C:\TestData_Zmirror\Algo\unit_test_input
  stderr: 'fatal: C:\TestData_Zmirror\Algo\unit_test_input: 'C:\TestData_Zmirror\Algo\unit_test_input' is outside repository at 'C:/TestData_Zmirror/Algo/sw_algo''
------------------------------------------------------------
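
The traceback ends in is_tracked, which runs git ls-files on the output path, and Git rejects any path that is not under the repository root. Below is a minimal sketch of that containment check. It is not DVC's or Git's actual code, and the helper name is_inside_repo and the "inside" path are hypothetical; the other two paths come from the error message in this thread.

```python
from pathlib import PureWindowsPath


def is_inside_repo(path: str, repo_root: str) -> bool:
    """Return True if `path` is the repo root itself or lies beneath it."""
    try:
        # relative_to() raises ValueError when `path` is not under `repo_root`.
        PureWindowsPath(path).relative_to(PureWindowsPath(repo_root))
        return True
    except ValueError:
        return False


# Paths taken from the error message in this thread.
repo_root = "C:/Algo/sw_algo"
outside = r"C:\Algo\unit_test_input"
# Hypothetical location once the data directory is moved into the repo.
inside = r"C:\Algo\sw_algo\unit_test_input"

print(is_inside_repo(outside, repo_root))  # False
print(is_inside_repo(inside, repo_root))   # True
```

A peer directory fails the check, which matches the "is outside repository" message; the same directory placed under the repo root passes.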

The data set is pretty large, and we want to have more than one git/dvc repo access the same data set.

That seems like a bad idea, because if someone runs dvc checkout, your files (accessed by multiple users/apps) will change as well. This is what I meant by isolation caveats above.

You don’t have to copy it, just use shared cache directory and a link type (symlink or hardlink) https://dvc.org/doc/user-guide/large-dataset-optimization .
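
To illustrate why a shared cache plus links avoids copies, here is a small self-contained simulation using plain hardlinks in a temporary directory. It only mimics the idea and is not how DVC is implemented; the directory names (shared-cache, repo-dev, repo-release), the file name data.bin and both helper functions are made up for the example.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def add_to_cache(cache_dir: Path, data: bytes) -> Path:
    """Store `data` once in the cache, named after its MD5 hash."""
    digest = hashlib.md5(data).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        cached.write_bytes(data)
    return cached


def checkout_as_hardlink(cached: Path, workspace_file: Path) -> None:
    """Expose the cached file in a workspace without copying its contents."""
    workspace_file.parent.mkdir(parents=True, exist_ok=True)
    os.link(cached, workspace_file)


root = Path(tempfile.mkdtemp())
cached = add_to_cache(root / "shared-cache", b"large dataset contents")

# Two hypothetical checkouts (e.g. a dev job and a release job).
dev_copy = root / "repo-dev" / "data.bin"
release_copy = root / "repo-release" / "data.bin"
checkout_as_hardlink(cached, dev_copy)
checkout_as_hardlink(cached, release_copy)

same = os.path.samefile(dev_copy, release_copy)
link_count = os.stat(cached).st_nlink
print(same, link_count)
```

Both workspace files point at the same data on disk as the cache entry, so the dataset is stored once no matter how many repos check it out.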

So each repo winds up containing links to the same shared cache directory.
Can the shared cache directory live outside the dvc/git repo?

Sure, take a look at the dvc cache dir command and https://dvc.org/doc/use-cases/shared-development-server . :slightly_smiling_face:

Thanks, this is very helpful.
I would like to ask a couple of clarifying questions to make sure I’ve got the right idea now.

First question: Does this scenario sound correct?
We place the data files in the repo, and run dvc add
The files are replaced by links to the (outside) cache, and are put in a .gitignore file so git doesn’t track them.
Do a dvc push to get the files in the remote.
Do a dvc pull to get them to another machine.
A parallel git repo can also do a pull, and it will receive links to the same external cache.
This scenario is for running tests in parallel on a machine which does not modify data files.
That seems pretty straightforward.

Second question:
On a developer’s machine, what sequence should a developer follow to modify and push a dvc data file?

  1. That’s correct :slight_smile:
  2. Similar to a regular git workflow. Modify -> dvc add (or dvc repro) -> git add/commit the resulting metafiles -> dvc push -> git push.
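
The sequence in point 2 could be scripted roughly like this. It is only a sketch: the function update_commands and the commit message are hypothetical, and it assumes the data directory sits at the repo root so that dvc add writes its metafile next to it as unit_test_input.dvc. It only builds and prints the commands; nothing is executed.

```python
import shlex
from typing import List


def update_commands(target: str, message: str) -> List[List[str]]:
    """Commands to run, in order, after editing files under `target`."""
    return [
        ["dvc", "add", target],            # re-hash the data, update the metafile
        ["git", "add", target + ".dvc"],   # stage the updated metafile
        ["git", "commit", "-m", message],
        ["dvc", "push"],                   # upload the new data to the DVC remote
        ["git", "push"],                   # publish the metafile change
    ]


commands = update_commands("unit_test_input", "Update unit test input data")
for argv in commands:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually run the steps, each argv could be passed to subprocess.run(argv, check=True) from inside the repo, so the sequence stops at the first command that fails.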

I looked at the shared development server page.
It looks like we could keep the existing cache setup, and just put the data files inside the repo.
Am I right about this?

But if we’re using links we need to follow special procedures to do the modification; is that correct?

Also, we’re not using pipelines. The data is modified by hand at this time. What would repro do if you just modified a file?

I think so, yes.

It depends on the type of link (e.g. for reflinks you don’t need to do anything, but reflinks are not supported everywhere and specifically don’t seem supported on your fs, judging by the dvc version output). But yes, symlinks and hardlinks will be read-only, so in order to modify them you need to dvc unprotect them (it pretty much creates a copy that is ready to be edited). The rest of the workflow is the same. :slightly_smiling_face:
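
As a rough picture of what "creates a copy that is ready to be edited" means for a hardlinked file, here is a standalone simulation. It is not DVC's implementation; the unprotect function below is a hypothetical stand-in for the dvc unprotect command, and the file names are invented.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def unprotect(path: Path) -> None:
    """Replace a hardlinked file with an independent, writable copy."""
    tmp = path.with_name(path.name + ".tmp")
    shutil.copyfile(path, tmp)  # new file with the same contents
    os.replace(tmp, path)       # swap the link for the copy
    os.chmod(path, stat.S_IREAD | stat.S_IWRITE)


root = Path(tempfile.mkdtemp())
cached = root / "cache-entry"
cached.write_bytes(b"version 1")
workspace_file = root / "data.bin"
os.link(cached, workspace_file)  # workspace file shares data with the cache

unprotect(workspace_file)
workspace_file.write_bytes(b"version 2")

print(cached.read_bytes())          # b'version 1'
print(workspace_file.read_bytes())  # b'version 2'
```

Without the unprotect step, writing to the hardlinked workspace file in place would also rewrite the cache entry, since both names refer to the same data on disk.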

It would see the changes, same as with git. You could restore the committed version with dvc checkout, for example.

Thanks for the help, this information is very useful.
We’re now keeping the data files in the repo but have left the cache files in a dir which is a peer to the repo, and I am able to modify data files, add data files, use dvc to commit and push the changes, and successfully pull them elsewhere.

This question is a bit off topic for the thread, but I’m not sure it’s worthy of a whole new thread.
Is there a simple way to ask for history of changes to specific files?
All these files are under the auspices of one .dvc file, and when I invoke dvc diff HEAD HEAD~ (for example) it just tells me the dvc file changed, but not which data files changed.

Is there a simple way to ask for history of changes to specific files?

There is no special command, but you can simply use git log for the specific .dvc file. Would that work for you?