1T dataset with distributed deep learning

Hello,

I’m currently evaluating DVC for a very large scale Deep Learning project.

We have an ever-growing dataset of over 1M video files totaling more than 1 TB.
From this we do:

  1. Preprocess each video file.
  2. Subsample a subset of the original dataset, which becomes the new training/val/test dataset. This subset currently contains over 400k video files totaling 800 GB.
  3. Train a distributed deep learning video classification model on AWS with PyTorch Distributed.

I have already used DVC extensively, but only in a local-first approach where I could run the whole pipeline on my machine and then simply rerun the same pipeline on more powerful servers.

In my new use case, I would like to scale the whole process on AWS, where all the data lives in S3 buckets and the steps of the pipeline are parallelized on lightweight AWS Lambdas as much as possible.
I would also like to be able to launch the whole pipeline from my local machine (which would run all the processing on AWS), but also from a CI job.

Has anyone ever had such a challenging setup?
I feel like the distributed parallel execution is the part that would be the most challenging to put in place with DVC.

Hello @kwon-young_veesion!
This is indeed an advanced use case. While I am unable to find an example that covers this exact use case, I can try to give you some pointers on where to start looking.

  1. Data versioning:
    I believe you could use two approaches here:
    a. Using external dependencies/outputs - https://dvc.org/doc/user-guide/data-management/managing-external-data - the problem with external outputs is that when you run dvc checkout you overwrite the existing remote files, so parallelization is problematic: it would need some kind of synchronization to make sure that your Lambda job has the proper data before starting an experiment that uses a different version of the data.
    b. Using a “normal” repo with dvc add --to-remote to version the data, and on the Lambda using get_url() to obtain the data and process it the way you want (see the sketch below).
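
To illustrate option (b), here is a rough, hypothetical sketch of the Lambda side. It assumes the files were versioned with dvc add --to-remote and pushed to an S3 remote; the repo URL, revision, file path, and the preprocess() call are placeholders, not something tested for your setup:

```python
from urllib.parse import urlparse

import boto3
import dvc.api


def handler(event, context):
    # All of the values below are placeholders.
    repo = "https://github.com/your-org/your-dataset-repo"
    rev = "v1.0"                          # Git tag/commit of the dataset version
    path = "data/videos/clip_000001.mp4"  # file tracked with `dvc add --to-remote`

    # get_url() resolves where this exact version of the file lives on the DVC
    # remote (e.g. an s3://bucket/files/md5/... URL) without pulling it locally.
    s3_url = dvc.api.get_url(path, repo=repo, rev=rev)
    parsed = urlparse(s3_url)

    # Download just that one object into the Lambda's /tmp and process it.
    local_path = "/tmp/input.mp4"
    boto3.client("s3").download_file(parsed.netloc, parsed.path.lstrip("/"), local_path)

    # preprocess(local_path)  # placeholder for the actual preprocessing step
    return {"status": "ok", "source": s3_url}
```

The idea is that each Lambda only resolves and downloads the single object it needs, so no worker ever has to dvc pull the whole dataset.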

Regarding the parallelization:
I am not sure I understand - do you want to parallelize the steps of a single pipeline, or by parallelization do you mean running, for example, a few experiments simultaneously?

Hello,

Thank you very much for your reply.

the problem with external outputs is that when you run dvc checkout you overwrite the existing remote files

It took me some time to understand this, but now I get it.

I believe your second solution would be the most elegant.
Use DVC normally, but never pull the whole dataset locally.
It is a very interesting idea and I’ll try to build a PoC with it. Thanks!

I am not sure I understand - do you want to parallelize the steps of a single pipeline, or by parallelization do you mean running, for example, a few experiments simultaneously?

Well, I would need to parallelize the processing of each video, but we are already evaluating ray.io, which can manage workers on AWS. I suppose with DVC we could wrap the execution of the scheduler that manages the parallelization on AWS.
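
To make that concrete, here is a rough sketch of the kind of Ray driver script a DVC stage command could wrap; the cluster address, the video keys, and the body of preprocess_video() are placeholders for illustration:

```python
import ray


@ray.remote
def preprocess_video(key: str) -> str:
    # Placeholder: download the video from S3, preprocess it, upload the
    # result, and return the key of the processed file.
    return f"processed/{key}"


def main(video_keys):
    # Connect to the Ray cluster that is already running on AWS
    # (ray.init() with no address falls back to a local cluster).
    ray.init(address="auto")

    # One task per video; Ray schedules them across the available workers.
    futures = [preprocess_video.remote(key) for key in video_keys]
    results = ray.get(futures)
    print(f"Processed {len(results)} videos")


if __name__ == "__main__":
    main(["videos/clip_000001.mp4", "videos/clip_000002.mp4"])  # placeholder keys
```

DVC would then only care about the stage's declared dependencies and outputs, while Ray handles the fan-out across workers.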

As for experiments, I already know how to use DVC experiments to run them in parallel.

Thank you very much for your thoughts!


Sure thing!
Feel free to ping us if any more questions arise!