Pipeline to process only new files

Say I have a million files in the directory ./data/pre.

I have a Python script, process_dir.py, which goes over each file in ./data/pre, processes it, and creates a file with the same name in the directory ./data/post (if such a file already exists, it skips processing it).
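For context, process_dir.py is assumed to behave roughly like the sketch below; the actual per-file transformation is a placeholder, and only the skip-if-output-exists logic matters here.

```python
# Minimal sketch of the assumed process_dir.py behavior (not the real script).
from pathlib import Path

PRE = Path("data/pre")
POST = Path("data/post")

def process(src: Path, dst: Path) -> None:
    # Placeholder for the real per-file processing.
    dst.write_bytes(src.read_bytes())

def main() -> None:
    POST.mkdir(parents=True, exist_ok=True)
    for src in sorted(PRE.iterdir()):
        dst = POST / src.name
        if dst.exists():
            continue  # output already there, skip reprocessing
        process(src, dst)

if __name__ == "__main__":
    main()
```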

I defined a pipeline:
dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py

Now let’s say I removed one file from data/pre.

When I run dvc repro, it will still process all 999,999 remaining files again, because DVC removes the entire content of the ./data/post directory before running the process stage. Is there any way to manage the pipeline so that process_dir.py will not process the same file twice?

The only way I could think of is creating a process_file.py which handles a single file, and executing one million commands like this:
dvc run -n process_1 -d process_file.py -d data/pre/1.txt -o data/post/1.txt python process_file.py 1.txt
dvc run -n process_2 -d process_file.py -d data/pre/2.txt -o data/post/2.txt python process_file.py 2.txt
...
dvc run -n process_1000000 -d process_file.py -d data/pre/1000000.txt -o data/post/1000000.txt python process_file.py 1000000.txt
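Those stages would not have to be typed by hand; a loop could emit them. A minimal sketch, assuming the inputs really are named 1.txt through 1000000.txt:

```python
# Sketch: create one DVC stage per input file via "dvc run".
# Assumes files are named 1.txt ... 1000000.txt (hypothetical naming).
import subprocess

for i in range(1, 1_000_001):
    name = f"{i}.txt"
    subprocess.run(
        [
            "dvc", "run", "-n", f"process_{i}",
            "-d", "process_file.py", "-d", f"data/pre/{name}",
            "-o", f"data/post/{name}",
            "python", "process_file.py", name,
        ],
        check=True,
    )
```

With per-file stages, dvc repro only re-executes stages whose dependencies changed, but that still means maintaining a million stage definitions.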

Is there a more elegant way?


> Say I have a million files in the directory ./data/pre.
> I defined a pipeline:
> dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py
> When I run dvc repro it will still process all the 999,999 files again, because it will remove the entire content of the ./data/post directory

Correct. run and repro remove the outputs from the workspace (they are still in .dvc/cache, though). There are reasons for that, but I guess it would be nice to have an option to enable or disable this behavior. Would you mind opening a feature request or discussion in the iterative/dvc issue tracker on GitHub, @jonilaserson?

Since I didn’t find a related feature request on GitHub, I created one: https://github.com/iterative/dvc/issues/4279

Let me know if it’s a duplicate.
