Hi folks,
I have a data analysis problem that I'd like to use DVC for. I have many different datasets, each of which needs the same analysis stage applied to it. Each dataset is a folder containing a specific set of (large) files, and the same Python analysis code needs to run on each one.
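For concreteness, my layout looks roughly like this (the dataset names here are just placeholders):

```
data/
├── dataset_01/   # one dataset = one folder of large files
├── dataset_02/
└── ...
```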
Rather than running one monolithic script that loops over all the datasets, I'd like to set up a pipeline that applies the analysis script to each dataset as a separate stage (and possibly runs those stages in parallel). I've sketched what I have in mind below.
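In dvc.yaml terms, I'm imagining something like the following. This is only a sketch of what I want, not syntax I know to be valid; `analyze.py`, `data/`, and `results/` stand in for my actual script and folders:

```yaml
stages:
  analyze:
    foreach:            # one stage instance per dataset (is something like this possible?)
      - dataset_01
      - dataset_02
    do:
      cmd: python analyze.py data/${item} --out results/${item}
      deps:
        - analyze.py    # rerun a dataset's stage if the code changes
        - data/${item}  # ...or if that dataset's files change
      outs:
        - results/${item}
```

The key point is that changing one dataset (or the code) should only rerun the affected stages, not the whole loop.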
Is something like this even possible with DVC?
Thanks,
Steve