Raw data from google big query

So I have created a DVC stage that builds a local raw data file from a query against a Google BigQuery (GBQ) database.
Now I want to add this file to DVC for tracking, so that if the database becomes unavailable in the future I am still fully reproducible.
When I do dvc add I get the following error:
data.csv is specified as an output in more than one stage:
raw_data.dvc
This is not allowed. Consider using a different output name.

What am I doing wrong?


Hi @hanan-vian!

Could you please share what your raw_data.dvc file looks like and how you generated it?

If you generated it with dvc run -o data.csv ..., then data.csv is already tracked by DVC and you don’t need to add it explicitly with dvc add. In your case that looks like the most reasonable option.
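For illustration, a minimal sketch of such a stage (the script name query_gbq.py and the stage file name are hypothetical; the -o flag is what tells DVC to track data.csv as an output, so no separate dvc add is needed):

```shell
# Hypothetical stage that queries GBQ and writes data.csv.
# -f names the stage file, -d declares a dependency, -o declares the output.
dvc run -f raw_data.dvc \
        -d query_gbq.py \
        -o data.csv \
        python query_gbq.py
```

After this, data.csv is already under DVC control via raw_data.dvc, which is why a later dvc add on the same file fails with the "specified as an output in more than one stage" error.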

But how do I make the csv reach the remote storage?
Will I be able to repro without accessing the database?

The .dvc file was generated as you said, with an explicit -o.


But how do I make the csv reach the remote storage?

That’s what the dvc push command is for (similar to git push, but it handles DVC-tracked data). You need to set up a remote storage first. Do you use Google Cloud Storage?

Usually dvc remote add -d myremote gs://bucket/path should be enough to be able to do dvc push.
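A sketch of the full setup, assuming Google Cloud Storage (the bucket name is hypothetical):

```shell
# -d marks this remote as the default, so `dvc push` uses it automatically.
dvc remote add -d myremote gs://my-bucket/dvc-storage
git add .dvc/config              # the remote configuration lives in .dvc/config
git commit -m "Configure DVC remote"
dvc push                         # upload data.csv and any other tracked outputs
```

Committing .dvc/config means anyone cloning the repo gets the remote configuration along with the stage files.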

Will I be able to repro without accessing the database?

Yes! The workflow usually looks like:

git clone <your-repo>
dvc pull
(after that you will see your data.csv)

Then:

dvc repro

if you’d like to reproduce. But if all outputs (including models) are saved in DVC, this command should report “Nothing to reproduce” after a successful dvc pull. There are explicit options to force reproduction anyway.
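For example, a sketch of forcing re-execution (note this re-runs the stage, so it will query the database again rather than use the pulled copy):

```shell
# --force re-runs the stage even if DVC considers everything up to date.
dvc repro --force raw_data.dvc
```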