I have a dvc setup where the git/dvc repo, the data-file directory and the dvc-cache directory are all peers.
e.g.
my-test-repo
test-files
my-test-dvc-cache
are all directories at the same level
All the files under test-files were added using dvc add test-files
my-test-repo/test-files.dvc contains this line:
path: …/test-files
.dvc/config contains this line:
[cache]
dir = …/…/my-test-dvc-cache
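For reference, a rough sketch of how this peer layout would be set up (directory names are the ones from this thread; the exact relative paths and the `dvc add` invocation are assumptions based on the description above, since DVC resolves `cache.dir` relative to the location of `.dvc/config`):

```shell
# Sketch of the peer layout described above (paths are assumptions).
mkdir my-test-repo test-files my-test-dvc-cache
cd my-test-repo
git init
dvc init

# Point the cache at the peer directory, relative to .dvc/config:
dvc config cache.dir ../../my-test-dvc-cache

# Track the peer data directory, which lives outside the repo:
dvc add ../test-files
```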
I did a dvc push and all the files went to the specified remote.
I did a git clone and a dvc pull (on several different machines) and all the files came down in the directory structure specified above.
Code which runs from the repo and uses the files works.
However, when I modify a data file, I am having trouble saving it.
I changed a file in the test-files directory and dvc status shows that test-files is modified.
$ dvc status
unit_test_input.dvc:
changed outs:
modified: …\test-files
changed checksum
When I do dvc commit it gives this error message:
$ dvc commit
ERROR: failed to commit - unable to commit changed stage: 'test-files.dvc'. Use -f|--force to force.
So I entered dvc commit -f and it complains about files outside of the repo.
I read that it is ok to have files outside the repo and the original push and pull operations worked fine.
ERROR: unexpected error - Cmd('git') failed due to: exit code(128)
cmdline: git ls-files C:\test-files
stderr: 'fatal: C:\test-files: 'C:\test-files' is outside repository at 'C:\test-files/my-test-repo''
It seems like dvc doesn’t care that the data files are outside the repo, but the commit command is trying to perform git commands on those files, and git doesn’t like them being outside the repo.
Is that what’s going on?
Is there something I can do about it?
Also, could you show the full log of dvc commit -f -v, please?
We do support files outside of repo, but I can’t say that we really recommend it, because there are quite a few caveats (like isolation if you are on a shared server) that make it a bit tricky to get right. Could you elaborate on why you can’t store your file in the repo?
As to the workaround, could you try dvc repro unit_test_input.dvc?
$ dvc repro unit_test_input.dvc
Verifying data sources in stage: 'unit_test_input.dvc'
ERROR: failed to reproduce 'unit_test_input.dvc': Cmd('git') failed due to: exit code(128)
cmdline: git ls-files C:\TestData_Zmirror\Algo\unit_test_input
stderr: 'fatal: C:\TestData_Zmirror\Algo\unit_test_input: 'C:\TestData_Zmirror\Algo\unit_test_input' is outside repository at 'C:/TestData_Zmirror/Algo/sw_algo''
In regards to why the data lives outside:
The data set is pretty large, and we want to have more than one git/dvc repo access the same data set.
Otherwise we would have to make copies of it.
For example, we have a Jenkins server with two jobs: one to run the dev version of the code, and one to run the release version.
So they check out different branches to get different code, but both branches point to the same copy of the data, and we don’t have to have separate copies of the same data.
$ dvc commit -f -v
2020-07-10 18:47:10,826 DEBUG: Trying to spawn '['C:\\Program Files (x86)\\DVC (Data Version Control)\\dvc.exe', 'daemon', '-q', 'updater']'
2020-07-10 18:47:16,692 DEBUG: Spawned '['C:\\Program Files (x86)\\DVC (Data Version Control)\\dvc.exe', 'daemon', '-q', 'updater']'
2020-07-10 18:47:16,699 DEBUG: fetched: [(3,)]
2020-07-10 18:47:17,520 DEBUG: Path 'C:\TestData_Zmirror\Algo\unit_test_input' inode '3166988651231289851'
2020-07-10 18:47:17,520 DEBUG: fetched: [('d2deb874d59d018befb6a630a9080efb', '40165274755', '3e19d2d4a064727a5a3ee41091cdb433.dir', '1594430389307446784')]
2020-07-10 18:47:17,521 DEBUG: Computed stage: 'unit_test_input.dvc' md5: 'None'
2020-07-10 18:47:17,521 DEBUG: 'md5' of stage: 'unit_test_input.dvc' changed.
2020-07-10 18:47:17,587 DEBUG: fetched: [(2,)]
2020-07-10 18:47:17,596 ERROR: unexpected error - Cmd('git') failed due to: exit code(128)
cmdline: git ls-files C:\TestData_Zmirror\Algo\unit_test_input
stderr: 'fatal: C:\TestData_Zmirror\Algo\unit_test_input: 'C:\TestData_Zmirror\Algo\unit_test_input' is outside repository at 'C:/TestData_Zmirror/Algo/sw_algo''
------------------------------------------------------------
Traceback (most recent call last):
File "dvc\main.py", line 53, in main
File "dvc\command\commit.py", line 22, in run
File "dvc\repo\__init__.py", line 36, in wrapper
File "dvc\repo\commit.py", line 45, in commit
File "dvc\stage\__init__.py", line 380, in save
File "dvc\stage\__init__.py", line 391, in save_outs
File "dvc\output\base.py", line 262, in save
File "dvc\output\base.py", line 239, in ignore
File "dvc\scm\git\__init__.py", line 277, in is_tracked
File "site-packages\git\cmd.py", line 542, in <lambda>
File "site-packages\git\cmd.py", line 1005, in _call_process
File "site-packages\git\cmd.py", line 822, in execute
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
cmdline: git ls-files C:\TestData_Zmirror\Algo\unit_test_input
stderr: 'fatal: C:\TestData_Zmirror\Algo\unit_test_input: 'C:\TestData_Zmirror\Algo\unit_test_input' is outside repository at 'C:/TestData_Zmirror/Algo/sw_algo''
------------------------------------------------------------
The data set is pretty large, and we want to have more than one git/dvc repo access the same data set.
That seems like a bad idea, because if someone runs dvc checkout, your file (accessed by multiple users/apps) will change as well. This is what I meant by isolation caveats above.
You don’t have to copy it; just use a shared cache directory and a link type (symlink or hardlink). See Large Dataset Optimization.
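Concretely, that shared-cache setup comes down to a few config commands run in each repo (the cache path below is just the one from this thread; the `cache.type` list and `cache.shared` value are standard DVC config options, shown here as a sketch):

```shell
# Point every repo that needs the data at one shared cache directory.
dvc config cache.dir ../../my-test-dvc-cache

# Prefer links over copies so checkouts don't duplicate the data;
# DVC tries each link type in order and falls back to the next.
dvc config cache.type "reflink,hardlink,symlink,copy"

# On a shared server, relax cache permissions so all users can link to it:
dvc config cache.shared group
```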
Thanks, this is very helpful.
I would like to ask a couple of clarifying questions to make sure I’ve got the right idea now.
First question: Does this scenario sound correct?
1. We place the data files in the repo and run dvc add.
2. The files are replaced by links to the (outside) cache and are put in a .gitignore file so git doesn’t track them.
3. Do a dvc push to get the files into the remote.
4. Do a dvc pull to get them onto another machine.
5. A parallel git repo can also do a pull, and it will receive links to the same external cache.
This scenario is for running tests in parallel on a machine which does not modify data files.
That seems pretty straightforward.
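The scenario above can be sketched as the following command sequence (repo URL and commit messages are placeholders, not from this thread):

```shell
# Machine A: track the data inside the repo and publish it.
dvc add test-files            # hashes data into the cache; test-files is gitignored
git add test-files.dvc .gitignore
git commit -m "Track test-files with DVC"
dvc push                      # upload cached files to the DVC remote
git push

# Machine B (or a parallel clone): fetch code, then data.
git clone <repo-url>
cd <repo>
dvc pull                      # restores test-files as links into the shared cache
```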
Second question:
On a developer’s machine, what sequence should a developer follow to modify and push a dvc data file?
I looked at the shared development server page.
It looks like we could keep the existing cache setup, and just put the data files inside the repo.
Am I right about this?
It depends on the type of link (e.g. for reflinks you don’t need to do anything, but reflinks are not supported everywhere and specifically don’t seem supported on your fs, judging by the dvc version output). But yes, symlinks and hardlinks will be read-only, so in order to modify them you need to dvc unprotect them (it pretty much creates a copy that is ready to be edited). The rest of the workflow is the same.
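So on a developer’s machine, the edit cycle would look roughly like this (a sketch assuming symlinks or hardlinks are in use; file names are the ones from this thread):

```shell
dvc unprotect test-files      # replace read-only links with editable copies
# ... edit files under test-files/ ...
dvc add test-files            # re-hash the directory and re-link it into the cache
git add test-files.dvc
git commit -m "Update test data"
dvc push                      # upload the new file versions to the remote
```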
It would see the changes, same as with git. You could restore the committed version with dvc checkout, for example.
Thanks for the help, this information is very useful.
We’re now keeping the data files in the repo but have left the cache in a directory that is a peer of the repo. I can modify and add data files, use dvc to commit and push the changes, and successfully pull them elsewhere.
This question is a bit off topic for the thread, but I’m not sure it’s worthy of a whole new thread.
Is there a simple way to ask for history of changes to specific files?
All these files are under the auspices of one .dvc file, and when I invoke dvc diff HEAD HEAD~ (for example) it just tells me that the .dvc file changed, but not which data files changed.