I’m trying to add a .csv file to an S3 bucket.
poetry run dvc add <data file>
gives the following error:
Adding…
ERROR: output ‘filename’ is already tracked by SCM (e.g. Git)
The file doesn’t get added to the bucket.
Hi @saumya31,
This error means that <data file> is already tracked by Git, so it can’t also be tracked by DVC, unfortunately. If you want to switch from Git to DVC tracking, run git rm --cached <data file> first, then run dvc add again.
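Here is a minimal sketch of that switch, run in a throwaway repository so it cannot touch a real project. The file name data.csv and the commit messages are made up for illustration, and the DVC steps are skipped if dvc is not installed.

```shell
# Throwaway repository for the demo.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# data.csv is a hypothetical stand-in for <data file>.
printf 'id,value\n1,10\n' > data.csv
git add data.csv
git commit -q -m "Track data.csv with Git"

# Stop tracking the file in Git, but keep it on disk.
git rm -q --cached data.csv
git commit -q -m "Stop tracking data.csv with Git"

# Git no longer lists the file, although it still exists in the working tree.
tracked_files="$(git ls-files)"

# With Git out of the way, DVC can take over (only if dvc is available).
if command -v dvc >/dev/null 2>&1; then
    dvc init -q
    dvc add data.csv
    # Git now tracks only the small .dvc pointer file and the .gitignore entry.
    git add data.csv.dvc .gitignore
    git commit -q -m "Track data.csv with DVC"
fi
```

The key point is that git rm --cached only removes the file from Git's index. The file itself stays where it is, so dvc add can pick it up right afterwards.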
However, I’m not sure I understand your goal. What do you mean by “add it to an S3 bucket”? DVC can track the file for you, and if you set up an S3 data remote and then run dvc push, its contents will be stored there. But you won’t find <data file> in the bucket under the same file name: DVC stores it in a special cache structure (which you can see in .dvc/cache in your project without needing to push anything to a remote).
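For reference, a rough sketch of that workflow, assuming you already have write access to a bucket and have DVC installed with S3 support. The remote name storage, the helper name track_and_push, and the commit message are placeholders that do not come from this thread. The commands are wrapped in a function so nothing runs until you call it with your own file and bucket URL.

```shell
# Hypothetical helper: track a file with DVC and push it to an S3 remote.
# Usage: track_and_push <data file> s3://<bucket>/<prefix>
track_and_push() {
    data_file="$1"
    remote_url="$2"

    # Register the bucket as the default (-d) remote; "storage" is an arbitrary name.
    dvc remote add -d storage "$remote_url" || return 1

    # Track the file with DVC; this writes <data file>.dvc and a .gitignore entry.
    dvc add "$data_file" || return 1

    # Commit the small metadata files to Git (the data itself stays out of Git).
    git add "$data_file.dvc" "$(dirname "$data_file")/.gitignore" .dvc/config || return 1
    git commit -m "Track $data_file with DVC" || return 1

    # Upload the cached contents to the bucket.
    dvc push
}
```

After a push, the bucket holds content-addressed objects rather than a file with the original name, so you get the data back with dvc pull, not by browsing the bucket.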
I am relatively new to DVC/Git. I am confused as to why DVC cannot track a file that is already tracked by Git, given that DVC works alongside Git.
I keep finding myself in a loop where every time I try to run dvc repro <insert stage>, I end up having to git rmv → git commit the dependencies. How can I stop doing git rmv?
@Tom_K can you clarify your question? Are you getting an actual error message when you run dvc repro? git rmv is not a Git command, so it may depend on what you actually have git rmv aliased to.
If dvc repro makes changes to a file which is tracked by DVC (i.e. the stage output changed), you will have to git add/git commit the respective changes to the dvc.lock file. That’s how data tracking works in DVC: the data file itself is stored by DVC (in .dvc/cache and in DVC remote storage), but the dvc.lock or .dvc files are tracked via Git.
If you have a pipeline output which you want to be tracked only by Git (and not as DVC data), you need to mark it as cache: false.
Regarding automating git add/git commit when running DVC commands: if you enable the core.autostage config option in DVC, it will automatically do the git add when the dvc.lock file is modified by dvc repro (but you will still have to do the git commit yourself).
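As a sketch, the repro-then-commit routine could be wrapped like this. The helper name repro_and_commit and the commit message are made up for illustration; the function only defines the steps and runs nothing until you call it inside a DVC project.

```shell
# Hypothetical helper: reproduce the pipeline and commit the updated dvc.lock.
# Usage: repro_and_commit [stage ...]
repro_and_commit() {
    # One-time setting: let DVC run git add on the files it modifies.
    dvc config core.autostage true || return 1

    # Re-run stages whose dependencies changed; dvc.lock is updated and staged.
    dvc repro "$@" || return 1

    # The commit is still manual: do it only if something was actually staged.
    if ! git diff --cached --quiet; then
        git commit -m "Update pipeline outputs"
    fi
}
```

Without core.autostage, the equivalent manual step after dvc repro is git add dvc.lock followed by git commit.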
@pmrowla I think you are correct that I just need to use git add dvc.lock. I haven’t had the issue occur since doing it.
However, since I stopped tracking some files in Git, I can’t seem to use git revert to undo that. Not a huge deal; I can always start over, as this is a side project.
Ok, so I thought I had sorted it out, but I ended up hitting the same error: ERROR: failed to reproduce stage <file> is already tracked by SCM (e.g. Git). This happened after trying to run dvc repro scrap_listings a second time after having modified some code.
I also found that whenever I run a script which writes an output to intermediates/, those files are automatically added to .gitignore. Since these are dependencies, I’d like to track these files. intermediates/.gitignore was automatically added when I ran dvc run -n -d <file> -o <file> python ...
It looks like intermediates/saved_hyperlinks.txt is still tracked by Git. You will have to git rm (and git commit) that file if you want to track it as a DVC file instead.
Also note that stage outputs don’t have to be tracked as DVC data. You can also just continue tracking an output with Git if you prefer; you just need to mark that output as cache: false in this case.
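A sketch of what such a stage could look like in dvc.yaml, written here with a shell heredoc into a scratch directory so it does not overwrite an existing file. The stage name and output path are the ones mentioned in this thread; scrape.py is a made-up script name, so substitute your own command and dependencies.

```shell
# Write an example dvc.yaml into a scratch directory.
demo_dir="$(mktemp -d)"

cat > "$demo_dir/dvc.yaml" <<'EOF'
stages:
  scrap_listings:
    # scrape.py is a hypothetical script name.
    cmd: python scrape.py
    deps:
      - scrape.py
    outs:
      # cache: false means DVC does not store this output as DVC data,
      # so the file can keep being tracked by Git.
      - intermediates/saved_hyperlinks.txt:
          cache: false
EOF

# Show the result.
cat "$demo_dir/dvc.yaml"
```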