Is it possible to train on data in an S3 bucket without downloading it to a local machine with DVC?

jeeva · December 22, 2022, 2:35am

Is it possible to train on data in an S3 bucket without downloading it to a local machine with DVC?
If we have TBs of data, is it possible to train without copying it to a local machine?

shcheklein · December 22, 2022, 3:09am

@jeeva what is the lifecycle of that data in the cloud (who updates the files? do you delete files?)? Do you train on the whole “TB”-size dataset at once?

In general, this is a very common pattern, and I don’t think you need DVC to literally version that data or move it around. Usually, people version a small file that lists the file names in the dataset (I call it a “filter” file) to ensure reproducibility, and they also version models, metrics, and experiments with DVC, but not the data lake itself.
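As a rough illustration of the “filter” file idea, here is a minimal Python sketch. The bucket name, object keys, and the file name train_files.txt are hypothetical, and the helper functions are not part of DVC. It writes a deterministic list of object keys to a small text file and fingerprints it, so the small file can be versioned instead of the data itself.

```python
import hashlib
from pathlib import Path


def write_filter_file(keys, path):
    """Write unique object keys, one per line, in sorted order (deterministic output)."""
    lines = sorted(set(keys))
    Path(path).write_bytes(("\n".join(lines) + "\n").encode("utf-8"))
    return lines


def file_sha256(path):
    """Return the hex SHA-256 digest of a file's bytes, used as a content fingerprint."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Hypothetical object keys; in practice they would come from listing the bucket.
keys = [
    "s3://example-bucket/images/0002.jpg",
    "s3://example-bucket/images/0001.jpg",
]

write_filter_file(keys, "train_files.txt")
print(file_sha256("train_files.txt"))
```

The resulting train_files.txt is small enough to commit to Git or track with dvc add, while the objects it names stay in the bucket.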

We are also developing a sister project at the moment that will index metadata in the bucket (file names, annotations, etc.) and provide a query interface for creating logical datasets (filter files) out of it. Think of it as a database of metadata. If you want, we can chat and show you the product (the repo is not open yet; it’s still early).

jeeva · December 24, 2022, 5:56am

Hi. Currently we store the dataset in an S3 bucket and use dvc pull to download it to the local cache. As the dataset grows, downloading it to a local machine or EC2 instance for training will become a problem, since we train on the entire dataset at once. I am not sure whether this falls within the scope of DVC, as I couldn’t find any supporting documentation.

dberenbaum · December 27, 2022, 12:52pm

Hi @jeeva! What do you want to do with DVC? Do you want it to version that data on the cloud so you can recover it later? Do you want to track it as part of a pipeline? Or maybe something else?

jeeva · December 29, 2022, 12:55am

I want to version the data in the cloud that I use in my pipeline.

dberenbaum · December 29, 2022, 1:12pm

Thanks @jeeva! Do you have an existing dvc.yaml file?

You can specify a data dependency anywhere, including a cloud path.

Do you also need to track outputs on the cloud? By default, DVC will try to cache any outputs. That way, you can check out a copy of any version of your pipeline and recover all the outputs. This means DVC needs somewhere to cache the data, which is usually local, but there is some support for external caches on the cloud, and we are working on using the built-in versioning capabilities of cloud providers where they are enabled.

However, with TBs of data, it may be unrealistic to cache a copy of every version. In that case, you can set cache: false in dvc.yaml for your outputs and then specify cloud paths without worrying about where to cache them. The downsides are that you won’t be able to check out old versions of uncached data, and if you work as a team, you have to be careful not to overwrite each other’s changes on the cloud (unlike local paths, which are usually isolated by user).
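To make the two suggestions above concrete (a cloud path as a dependency, and an uncached cloud output), here is a minimal Python sketch that renders what such a dvc.yaml stage could look like. The bucket, paths, stage name, and train.py are all hypothetical, and the sketch assumes train.py reads from and writes to S3 on its own, since DVC is not expected to download an external dependency for you.

```python
def render_dvc_yaml(data_url: str, model_url: str) -> str:
    """Render a one-stage dvc.yaml with a cloud dependency and an uncached cloud output."""
    lines = [
        "stages:",
        "  train:",
        "    cmd: python train.py",   # hypothetical script that talks to S3 itself
        "    deps:",
        "      - train.py",
        f"      - {data_url}",         # data dependency given as a cloud path
        "    outs:",
        f"      - {model_url}:",
        "          cache: false",      # do not keep a cached copy of this output
    ]
    return "\n".join(lines) + "\n"


# Hypothetical locations; replace with your own bucket and prefixes.
DATA_URL = "s3://example-bucket/datasets/train"
MODEL_URL = "s3://example-bucket/models/model.pkl"

print(render_dvc_yaml(DATA_URL, MODEL_URL))
```

With a stage like this, DVC can still detect that the dependency or output changed, but, as noted above, it keeps no copy of the output to restore.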

nael · January 3, 2023, 4:40pm

Hi!
@dberenbaum I have a similar use case to OP, so I’ll jump in if you don’t mind.
I don’t really get the second part of your answer. Does that mean that if the cache is not used, I can only track the hashes of the data versions but can’t roll back to an older version?

dberenbaum · January 4, 2023, 4:59pm

Yup, that’s correct @nael! You can tell what changed, but you can’t roll back to an older version of the data since there’s no copy of it cached anywhere.

nael · January 4, 2023, 6:25pm

Thanks @dberenbaum!
Makes sense, but how can I know what changed if there are no older versions to compare or diff against?

dberenbaum · January 4, 2023, 6:27pm

Sorry, I wasn’t clear. You can tell whether data changed (based on the hashes). You cannot tell what changed.
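The distinction can be sketched in a few lines of Python. This is only an illustration with made-up data, and it uses SHA-256 as a stand-in for whatever checksum DVC records. Comparing a stored hash with a fresh one answers “did it change?”, but without the old bytes there is nothing to diff against.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def has_changed(recorded_hash: str, current_data: bytes) -> bool:
    """True if the current data no longer matches the recorded hash."""
    return digest(current_data) != recorded_hash


# Made-up example data standing in for two versions of a file in the bucket.
old_version = b"id,label\n1,cat\n2,dog\n"
new_version = b"id,label\n1,cat\n2,bird\n"

# Only the hash of the old version is kept, not the old bytes themselves.
recorded = digest(old_version)

print(has_changed(recorded, old_version))  # False: same content, same hash
print(has_changed(recorded, new_version))  # True: content differs, but the hash alone cannot say where
```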

fst · March 22, 2023, 2:04pm

Hi @shcheklein, is this project already available? Sounds very interesting!

shcheklein · March 22, 2023, 6:48pm

@fst It’s not released yet. Could you please reach out to us at support@iterative.ai? Share some very basic details about your team, data type, and data size. We’ll get back to you ASAP. It might still be a good opportunity to connect early, show you what it looks like, and maybe get you on board as an early user.
