Is it Possible to train data in s3 bucket without downloading to local machine with DVC?

If we have TBs of data, is it possible to train without downloading it to the local machine?

@jeeva what is the lifecycle of that data in the cloud (who updates the files? do you delete files?)? Do you train on the whole TB-scale dataset at once?

In general, it's a very common pattern, and I don't think you need DVC to literally version that data or move it around. Usually, people version a small file that lists file names (I call it a "filter" file) to ensure reproducibility, and version models, metrics, and experiments with DVC, but not the data lake itself.
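
For example, such a "filter" file can be as simple as a list of object keys that your training code streams directly from the bucket (the bucket and file names below are made up):

```
# train_files.txt - small, versioned with git/DVC; the raw objects stay in the bucket
s3://my-datalake/images/0001.jpg
s3://my-datalake/images/0042.jpg
s3://my-datalake/images/0137.jpg
```

Versioning this small file gives you a reproducible definition of the dataset without DVC ever copying the TBs of raw data.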

We are also developing a sister project at the moment that will be able to index metadata in the bucket (file names, annotations, etc.) and will provide a query interface to create logical datasets (filter files) out of it. Think of it as a database of metadata. If you want, we can chat and show you the product (the repo is not open yet, it's still early).

Hi! Currently we store our dataset in an S3 bucket and download it to the local cache with `dvc pull`. As the dataset grows, downloading it to a local machine/EC2 instance for training will become a problem, since we use the entire dataset at once. I'm not sure whether this falls within the scope of DVC, as I couldn't find any supporting documentation.

Hi @jeeva! What do you want to do with DVC? Do you want it to version that data on the cloud so you can recover it later? Do you want to track it as part of a pipeline? Or maybe something else?

I want to version the data in the cloud that I'm using for my pipeline.

Thanks @jeeva! Do you have an existing dvc.yaml file?

You can specify a data dependency anywhere, including a cloud path.
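
For example (the bucket and script names here are placeholders), a stage in dvc.yaml can list an S3 path directly among its dependencies:

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - s3://my-bucket/raw-data    # cloud path tracked as a dependency
    outs:
      - model.pkl
```

DVC tracks the cloud dependency's hash, so `dvc repro` knows to rerun the stage when the data changes.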

Do you also need to track outputs on the cloud? By default, DVC will try to cache any outputs. That way, you can check out a copy of any version of your pipeline and recover all the outputs. This means DVC needs somewhere to cache the data, which is usually local, but there is some support for external caches on the cloud, and we are working on using the built-in versioning capabilities of cloud providers where they are available.
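
If you do want cached outputs on the cloud, one option (a sketch with a hypothetical bucket; see the DVC docs on managing external data for details) is to point DVC's S3 cache at a bucket location:

```
dvc remote add s3cache s3://my-bucket/dvc-cache
dvc config cache.s3 s3cache
```

After that, outputs at s3:// paths can be cached on S3 instead of locally.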

However, with TBs of data, it may be unrealistic to cache a copy of every version. In that case, you can set `cache: false` in dvc.yaml for your outputs and then specify cloud paths without worrying about where to cache them. The downsides are that you won't be able to check out old versions of uncached data, and if you work as a team, you have to be careful not to overwrite each other's changes on the cloud (unlike local paths, which are usually isolated per user).
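
Here is what that might look like in dvc.yaml (paths are hypothetical); the flag goes on the individual output:

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - s3://my-bucket/raw-data
    outs:
      - s3://my-bucket/models/model.pkl:
          cache: false    # track the hash, but don't store a copy in the cache
```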

Hi!
@dberenbaum I have a similar use case to OP so I’ll jump in if you don’t mind :smile:
I don't really get the second part of your answer. Does that mean that if the cache is not used, I can only track the hashes of the data versions but can't roll back to an older version?

Yup, that's correct @nael! You can tell what changed, but you can't roll back to an older version of the data since there's no copy of it cached anywhere.

Thanks @dberenbaum !
Makes sense, but how can I know what changed if there are no older versions to compare/diff with?

Sorry, I wasn’t clear. You can tell whether data changed (based on the hashes). You cannot tell what changed.

Hi @shcheklein, is this project already available? Sounds very interesting!

@fst It's not released yet. Could you please reach out to us at support@iterative.ai? Share some basic details about your team, data type, and size. We'll get back to you asap. It might still be a good opportunity to connect early, show you how it looks, and maybe get you on board as an early user.