Say I have a million files in the directory ./data/pre. I have a Python script, process_dir.py, which goes over each file in ./data/pre, processes it, and creates a file with the same name in a directory ./data/post (if such a file already exists, it skips processing it).
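For concreteness, the script does something like this sketch (transform() is just a stand-in for the real per-file processing):

```python
# Sketch of process_dir.py; transform() is a placeholder
# for the real per-file processing.
from pathlib import Path

SRC = Path("data/pre")
DST = Path("data/post")

def transform(text: str) -> str:
    return text.upper()  # placeholder

def main() -> None:
    DST.mkdir(parents=True, exist_ok=True)
    for src in SRC.iterdir():
        if not src.is_file():
            continue
        dst = DST / src.name
        if dst.exists():
            continue  # output already exists: skip this file
        dst.write_text(transform(src.read_text()))

if __name__ == "__main__":
    main()
```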
I defined a pipeline:

```
dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py
```
Now let's say I removed one file from data/pre.
When I run dvc repro, it will still process all 999,999 remaining files again, because DVC removes the entire content of the ./data/post directory before running the process stage. Is there any way to manage the pipeline so that process_dir.py will not process the same file twice?
The only way I could think of is creating a script process_file.py, which handles a single file, and executing a million commands like this:

```
dvc run -n process_1 -d process_file.py -d data/pre/1.txt -o data/post/1.txt python process_file.py 1.txt
dvc run -n process_2 -d process_file.py -d data/pre/2.txt -o data/post/2.txt python process_file.py 2.txt
...
dvc run -n process_1000000 -d process_file.py -d data/pre/1000000.txt -o data/post/1000000.txt python process_file.py 1000000.txt
```
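To avoid typing these by hand, I could generate the stages with a small driver script, something like this sketch (it assumes the numbered .txt naming above and that dvc is on the PATH):

```python
# Hypothetical driver that registers one DVC stage per input file.
import subprocess
from pathlib import Path

for src in sorted(Path("data/pre").glob("*.txt")):
    subprocess.run(
        [
            "dvc", "run", "-n", f"process_{src.stem}",
            "-d", "process_file.py", "-d", str(src),
            "-o", f"data/post/{src.name}",
            f"python process_file.py {src.name}",
        ],
        check=True,
    )
```

But that still leaves a million stages in dvc.yaml, each of which dvc repro has to check.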
Is there a more elegant way?