1T dataset with distributed deep learning

Hello,

I’m currently evaluating DVC for a very large scale Deep Learning project.

We have an ever-growing dataset of over 1M video files totaling more than 1 TB.
From this we do:

  1. Preprocess each video file.
  2. Subsample a subset of the original dataset, which becomes the new training/val/test dataset. This subset currently contains over 400k video files totaling 800 GB.
  3. Train a distributed deep learning video classification model on AWS with PyTorch Distributed.

I have already used DVC extensively, but only in a local-first approach where I could run the whole pipeline on my machine and then simply rerun the same pipeline on more powerful servers.

In my new use case, I would like to scale the whole process on AWS, where all the data lives in S3 buckets and the steps of the pipeline are parallelized on lightweight AWS Lambdas as much as possible.
I would also like to be able to launch the whole pipeline from my local machine (which would run all the processing on AWS), but also from a CI job.

Has anyone ever had such a challenging setup?
I feel like the distributed parallel execution is the part that would be the most challenging to put in place with DVC.

Hello @kwon-young_veesion!
This is indeed an advanced use case. While I am unable to find an example that covers this exact use case, I can try to give you some pointers on where to start looking.

  1. Data versioning:
    I believe you could use two approaches here:
    a. Using external dependencies/outputs - https://dvc.org/doc/user-guide/data-management/managing-external-data - the problem with external outputs is that when you run dvc checkout you overwrite the existing remote files, so parallelization is problematic: it would need some kind of synchronization to make sure that your Lambda job has the proper data before starting an experiment that uses a different version of the data.
    b. Using a “normal” repo with dvc add --to-remote to version the data, and on the Lambda using get_url() to obtain the data and process it the way you want (see the sketch below).
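
To illustrate option (b), here is a rough, hypothetical sketch of the Lambda side. It assumes the files were versioned with dvc add --to-remote and pushed to an S3 remote; the repo URL, revision, file path, and the preprocess() call are placeholders, not something tested for your setup:

```python
from urllib.parse import urlparse

import boto3
import dvc.api


def handler(event, context):
    # All of the values below are placeholders.
    repo = "https://github.com/your-org/your-dataset-repo"
    rev = "v1.0"                          # Git tag/commit of the dataset version
    path = "data/videos/clip_000001.mp4"  # file tracked with `dvc add --to-remote`

    # get_url() resolves where this exact version of the file lives on the DVC
    # remote (e.g. an s3://bucket/files/md5/... URL) without pulling it locally.
    s3_url = dvc.api.get_url(path, repo=repo, rev=rev)
    parsed = urlparse(s3_url)

    # Download just that one object into the Lambda's /tmp and process it.
    local_path = "/tmp/input.mp4"
    boto3.client("s3").download_file(parsed.netloc, parsed.path.lstrip("/"), local_path)

    # preprocess(local_path)  # placeholder for the actual preprocessing step
    return {"status": "ok", "source": s3_url}
```

The idea is that each Lambda only resolves and downloads the single object it needs, so no worker ever has to dvc pull the whole dataset.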

Regarding the parallelization:
I am not sure I understand - do you want to parallelize the steps of a single pipeline, or by parallelization do you mean running, for example, a few experiments simultaneously?

Hello,

Thank you very much for your reply.

the problem with external outputs is that when you run dvc checkout you overwrite the existing remote files

It took me some time to understand this, but now I get it.

I believe your second solution would be the most elegant.
Use DVC normally, but never pull the whole dataset locally.
It is a very interesting idea and I’ll try to build a PoC with it. Thanks!

I am not sure I understand - do you want to parallelize the steps of a single pipeline, or by parallelization do you mean running, for example, a few experiments simultaneously?

Well, I would need to parallelize the processing of each video, but we are already evaluating ray.io, which can manage workers on AWS. I suppose with DVC we could wrap the execution of the scheduler that manages the parallelization on AWS.
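
To make that concrete, here is a rough sketch of the kind of Ray driver script a DVC stage command could wrap; the cluster address, the video keys, and the body of preprocess_video() are placeholders for illustration:

```python
import ray


@ray.remote
def preprocess_video(key: str) -> str:
    # Placeholder: download the video from S3, preprocess it, upload the
    # result, and return the key of the processed file.
    return f"processed/{key}"


def main(video_keys):
    # Connect to the Ray cluster that is already running on AWS
    # (ray.init() with no address falls back to a local cluster).
    ray.init(address="auto")

    # One task per video; Ray schedules them across the available workers.
    futures = [preprocess_video.remote(key) for key in video_keys]
    results = ray.get(futures)
    print(f"Processed {len(results)} videos")


if __name__ == "__main__":
    main(["videos/clip_000001.mp4", "videos/clip_000002.mp4"])  # placeholder keys
```

DVC would then only care about the stage's declared dependencies and outputs, while Ray handles the fan-out across workers.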

As for experiments, I already know how to use DVC experiments to run them in parallel.

Thank you very much for your thoughts!


Sure thing!
Feel free to ping us if any more questions arise!