Tracking files stored in S3 without adding them to local storage

With the introduction of DVC 3.0, files stored in AWS S3 remote storage can no longer be tracked directly. Is there any workaround for this?

I have tried adding a file for tracking but I get the error below. If anybody has a solution for this problem, please let me know. Thanks in advance.

(py38_env) PS C:\dvc_research\snowflake_pvc> dvc add s3://dvc-demo-bucket-1/dataset/experimented.csv
ERROR: Cached output(s) outside of DVC project: s3://dvc-demo-bucket-1/dataset/experimented.csv. See <https://dvc.org/doc/user-guide/pipelines/external-dependencies-and-outputs> for more info.

Hi @Shubham477, have you taken a look at External Data?
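As a rough sketch of what the External Data approach can look like: `dvc import-url` tracks a file at an external location such as S3. The destination path `data/experimented.csv` below is illustrative; the bucket path is taken from the error message above. (Recent DVC versions also offer a `--no-download` flag if you want to track the file without pulling a local copy, though exact flag availability depends on your DVC version.)

```shell
# Track the S3 object; DVC creates data/experimented.csv.dvc
# recording the exact version it saw.
dvc import-url s3://dvc-demo-bucket-1/dataset/experimented.csv data/experimented.csv

# Commit the .dvc file so Git records which version is tracked.
git add data/experimented.csv.dvc .gitignore
git commit -m "Track experimented.csv from S3"
```

After this, `dvc update data/experimented.csv.dvc` picks up newer versions of the S3 object when you choose to.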

Thanks @dberenbaum. One question, though: changes to the dataset can be tracked, but can we go back to an old dataset without versioning enabled in AWS S3? For example, if we overwrite the tracked dataset with a new one, we should still be able to return to the old dataset, the same way git checkout works.

From the link I shared above:

> During dvc push, DVC will upload the version of the data tracked by data.xml.dvc to the DVC remote so that it is backed up in case you need to recover it.

That means you can go back to the old dataset later by doing a git checkout of the commit that tracked the old dataset, then a dvc pull to download it and check it out locally.
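A minimal sketch of that recovery workflow, assuming the dataset is tracked by `data.xml.dvc` as in the quoted docs (`<old-commit>` is a placeholder for the commit that tracked the old dataset):

```shell
# Restore the .dvc file from the commit that tracked the old dataset.
git checkout <old-commit> -- data.xml.dvc

# Download that version from the DVC remote and place it in the workspace.
dvc pull data.xml.dvc
```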

However, this will not change what is on S3. It would be up to you to manually upload the old dataset back to the original S3 location if that's what you want. The idea is that DVC can recover any old version without touching the original dataset, which is safer in case you didn't mean to overwrite it, or other people were depending on it, etc.
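Once the old version is restored locally (e.g. via `dvc pull`), manually putting it back at the original S3 location could look like the following. The paths are illustrative and assume the AWS CLI is configured:

```shell
# Copy the locally restored file back to its original S3 location,
# overwriting the newer version that is currently there.
aws s3 cp data.xml s3://dvc-demo-bucket-1/dataset/data.xml
```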

I have suggested some changes to the docs to explain this better in link between imports and external deps/outputs by dberenbaum · Pull Request #4672 · iterative/dvc.org · GitHub. Please take a look and let me know whether you find it a helpful clarification. Thanks!
