Best Practice - Test pipeline with smaller dataset?

w.jacob.ward · February 8, 2022, 2:44pm

I currently have a DVC pipeline that takes about 12 hours to run. I would like a way to test the entire pipeline on a small subset of my data, so that I can quickly verify each stage after a code change. Ideally, this would be an option on dvc repro. This leads me to think that I could configure my pipeline to have separate “test” versions of each stage, using the same commands as their full counterparts but different parameters to reduce the time to reproduce.

Is this the correct approach? Is there a cleaner way to do this? How have other people solved this problem?

dberenbaum · February 8, 2022, 4:53pm

Hi! There’s no single correct approach, but I and others have used a separate Git branch with a smaller dataset for debugging. If that’s similar to what you are thinking, then it should not be a problem. Good luck!
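
For anyone who wants a starting point for producing the smaller dataset, here is a minimal sketch in Python. It assumes the raw data is a single CSV file with a header row, which may not match your setup. The function names (take_subset, write_subset_csv), the fraction and seed arguments, and any file paths are hypothetical and not part of DVC. The fixed seed keeps the subset identical between runs, so the subset file does not change unless the source data does.

```python
import csv
import random
from pathlib import Path


def take_subset(rows, fraction, seed=0):
    """Return a reproducible random subset of rows, keeping the original order."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rows = list(rows)
    if not rows:
        return []
    # Keep at least one row so downstream stages always have some input.
    k = max(1, round(len(rows) * fraction))
    rng = random.Random(seed)  # fixed seed: same subset on every run
    keep = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in keep]


def write_subset_csv(src, dst, fraction, seed=0):
    """Copy the header and a sampled subset of data rows from src to dst."""
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    subset = take_subset(rows, fraction, seed)
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(subset)
    return len(subset)
```

On a debugging branch, the file written by write_subset_csv could stand in for the full dataset as the pipeline's input, and dvc repro would then run the same stages against the smaller file. The same function could also be driven by a parameter if you go with separate "test" stages instead.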
