I am currently evaluating DVC for my team.
I have been trying it out and like some of its features. However, there are a few things I am a bit lost with.
We work on S3 and have a daily pipeline that downloads data from the internet, processes it, and writes it out in another format.
So far I have managed to configure a remote S3 repo, push a local file to it, and play with run and repro. However, this does not fit 80% of our workflow. Can you help me understand whether the following things are possible:
- To push data to DVC, do I necessarily need to have the data on my local machine and then push it to the configured remote location? Can't I directly track a remote S3 parquet file, e.g. dvc add s3://mybucket/folder-a/myfile.parquet -r myRemote?
- If the first point is possible and someone manually overwrites s3://mybucket/folder-a/myfile.parquet, how will that affect the integrity of DVC?
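To make the second question concrete, here is the kind of integrity check I have in mind. This is only a minimal Python sketch, assuming the tool records a content hash when a file is first tracked. SHA-256 is used purely for illustration, the function names are hypothetical, and none of this is meant to represent DVC's actual API or behaviour.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest of the file contents (SHA-256, chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(recorded_hash: str, current_data: bytes) -> bool:
    """Compare the hash recorded at tracking time with the file's current contents."""
    return content_hash(current_data) == recorded_hash


# Hypothetical stand-ins for the parquet file before and after a manual overwrite.
original = b"parquet bytes at tracking time"
overwritten = b"parquet bytes after a manual overwrite"

# Hash that would have been recorded when the file was first tracked.
recorded = content_hash(original)

print(is_unchanged(recorded, original))     # contents still match the recorded hash
print(is_unchanged(recorded, overwritten))  # contents no longer match
```

In other words, if the file is overwritten in place, I would like to know whether DVC notices the mismatch and what happens to the previously tracked version.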
Thanks for the help.