Listing files in the remote storage

Hi everyone,

We have pushed artifacts to remote storage (an S3 bucket in our case). They got renamed and now appear in S3 in a way that makes it impossible to tell the artifacts apart.

We would like to delete some artifacts that are not used anymore, but the bucket is used by multiple projects, so we cannot figure out what is what.
How can we list the artifacts on the remote storage?

dvc remote list only lists the configured remotes, not the contents of the storage.


This is also connected to a question about best practices for organizing remote storage: is it better to have a separate S3 bucket per project, or one bucket with a separate folder for each artifact?

Thanks!

Another related question: how do we keep the storage from becoming a mess? After some time it looks like a bunch of MD5-named files/folders, many of which can be orphans. How should we approach this? Any tips/best practices?

Hi, @Versus. You could separate the remote storage and cache by namespacing them.

You could add a DVC remote as s3://my-bucket/dvc-remote, and then everything pushed to it will have that prefix. You can create multiple DVC remotes this way.
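
For example, a minimal sketch (the remote name and bucket/prefix are just placeholders):

# add a prefixed remote and make it the default for this project
dvc remote add -d dvc-remote s3://my-bucket/dvc-remote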

dvc remote list only lists the configured remotes, not the contents of the storage.

It is better to use remote-specific tools to inspect the contents; aws s3 ls can help you here.
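
For example, assuming the hypothetical bucket and prefix above:

# list the objects DVC has pushed under that prefix
aws s3 ls s3://my-bucket/dvc-remote/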

is it better to have a separate S3 bucket per project, or one bucket with a separate folder for each artifact?

It depends on your preference and requirements; DVC does not impose any particular layout.
You can use a single bucket or multiple buckets as you like, or use namespacing with prefixes and enable ACLs.
You can even share a single remote across all of the projects. In that case, DVC will deduplicate your artifacts and data, but you need to be careful when using dvc gc.
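
For example, a rough sketch of the prefix-per-project approach (project names are hypothetical):

# run inside project A's repository
dvc remote add -d storage s3://my-bucket/project-a
# run inside project B's repository
dvc remote add -d storage s3://my-bucket/project-b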

Thank you for your reply.
aws s3 ls shows MD5-named files and folders. If there are orphan files/folders there, we cannot figure out what can be deleted. Do you have a tip on how to deal with this problem?

There’s dvc gc --cloud that does that, but as the files are all mixed together in the remote cache, I’d suggest you push to a different remote first and then run something similar to the following on each of the MD5-prefixed directories to remove them manually:

aws s3 rm --recursive s3://my-bucket/d3/
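
For reference, a rough sketch of that workflow (the new remote name and bucket are hypothetical, adjust to your setup):

# add a fresh remote and push only what the current projects still reference
dvc remote add cleaned s3://my-new-bucket/dvc-remote
dvc push -r cleaned
# or, on the existing remote, drop objects not referenced by the current workspace
dvc gc --workspace --cloud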