I have created a DVC stage that builds a local raw data file from a query against a GBQ database.
Now I want to add this file to DVC for tracking, so that my work stays fully reproducible even if the database is no longer available in the future.
When I run dvc add, I get the following error:
data.csv is specified as an output in more than one stage:
raw_data.dvc
This is not allowed. Consider using a different output name.
What am I doing wrong?
Hi @hanan-vian!
Could you please share what your raw_data.dvc looks like and how you generated it?
If you used dvc run -o data.csv ... then data.csv is already tracked by DVC, and you don’t need to add it explicitly with dvc add. In your case, this looks like the most likely explanation for the error.
But how do I make the csv reach the remote storage?
Will I be able to repro without accessing the database?
The .dvc file was generated as you said, with an explicit -o.
But how do I make the csv reach the remote storage?
That’s what the dvc push command is for (similar to git push, but it handles DVC-tracked data). You need to set up remote storage first. Do you use Google Cloud Storage?
Usually dvc remote add -d myremote gs://bucket/path should be enough to be able to run dvc push.
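Putting the remote setup and the push together, here is a minimal sketch rather than a definitive recipe. It assumes the placeholder remote name myremote and the placeholder bucket gs://bucket/path from above, and that raw_data.dvc is already committed to Git. The run helper and the DRY_RUN variable are hypothetical additions for illustration: by default the script only prints the commands, so nothing changes until you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: configure a default Google Cloud Storage remote and upload DVC-tracked data.
# "myremote" and gs://bucket/path are placeholders; replace them with your own values.
set -eu

# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

push_to_remote() {
    # Register the bucket as the default (-d) remote; this is written to .dvc/config.
    run dvc remote add -d myremote gs://bucket/path
    # Commit the remote configuration so collaborators get it with the repository.
    run git add .dvc/config
    run git commit -m "Configure DVC remote"
    # Upload the cached outputs (such as data.csv) to the remote.
    run dvc push
}

push_to_remote
```

After the push, data.csv lives in the bucket, and raw_data.dvc in Git records which version of the file belongs to each commit.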
Will I be able to repro without accessing the database?
Yes! The workflow usually looks like this:
git clone <your-repo>
dvc pull
(after that you will see your data.csv)
Then:
dvc repro
if you’d like to reproduce. But if all outputs (including models) are saved in DVC, this command should show “Nothing to reproduce” after a successful dvc pull. There are explicit options to force reproduction anyway (for example, dvc repro --force).
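The workflow above can be sketched as one script. This is an illustration under assumptions: REPO_URL and REPO_DIR are hypothetical variables standing in for your repository and a local directory name, raw_data.dvc is taken from the error message earlier in the thread as the stage to reproduce, and the run helper with DRY_RUN is a hypothetical wrapper that only prints the commands unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: restore DVC-tracked data on a fresh machine without touching the database.
set -eu

# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
REPO_URL="${REPO_URL:-<your-repo>}"   # placeholder: URL of your Git repository
REPO_DIR="${REPO_DIR:-project}"       # hypothetical local directory name

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restore_and_repro() {
    # Fetch the code and the .dvc files from Git.
    run git clone "$REPO_URL" "$REPO_DIR"
    run cd "$REPO_DIR"
    # Download data.csv (and any other tracked outputs) from the DVC remote.
    run dvc pull
    # Check the stage; with up-to-date outputs DVC should have nothing to rerun,
    # so the query against the database is not executed.
    run dvc repro raw_data.dvc
}

restore_and_repro
```

Because dvc pull restores data.csv exactly as it was recorded, the stage is considered up to date and the database is only needed again if you force a rerun.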