Best practice for applying pipelines to many datasets?

Hi folks,

I have a data analysis problem that I'd like to use DVC for. I have many different datasets, each of which needs the same analysis stage applied to it. Each dataset is a folder containing a specific set of (large) files, and the same Python analysis code needs to run against each one.

Rather than just running one monolithic script that loops over all the datasets, I would like to build a pipeline to handle this. I'm imagining something where the script gets applied to each dataset as a separate stage (and possibly in parallel).

Is something like this even possible with DVC?



Actually, I just found out about "foreach" stages in the documentation.

I am pretty sure that I can use this to solve my problem.
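For anyone finding this later, here is a minimal sketch of what a `foreach` stage in `dvc.yaml` could look like for this kind of setup. The dataset names, the `analyze.py` script, and the `data/`/`results/` paths are just placeholders for illustration:

```yaml
stages:
  analyze:
    # One stage instance is generated per list item; ${item} expands to the value.
    foreach:
      - dataset_a
      - dataset_b
      - dataset_c
    do:
      cmd: python analyze.py data/${item} results/${item}
      deps:
        - analyze.py
        - data/${item}
      outs:
        - results/${item}
```

Running `dvc repro` would then execute each generated stage (e.g. `analyze@dataset_a`), and only the stages whose inputs changed get re-run.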



One thing to note is that foreach does not currently support parallelization. There is an open issue discussing it: dvc.yaml: future of foreach stages · Issue #5440 · iterative/dvc · GitHub

Thanks for pointing that out. While parallelization would be great for my use case, even having the foreach loop capability run serially looks like it will work for me.