I have tried to list a whole directory as input or output to a stage but this doesn’t seem to work as I would hope. I would like, ideally, to have the directory contents monitored for changes when reproducing the pipeline as the files within could change and are too many to list individually as inputs or outputs.
Is this already possible (if so how?) or something that may be implemented in future? Or is there some other process to deal with this kind of situation? Am I trying to use the system in the wrong way?
Hi @tania !
Could you show examples of the commands that you were using?
The following is an example of what I did.
- create train_split.dvc with train_split directory as output
dvc run -f train_split.dvc \
    -d src/config/train_split.json \
    -d data/raw/train.csv.zip \
    -o data/processed/train_split \
    python src/data/train_split.py
- create build_features_target_enc with train_split directory as input
dvc run -f build_features_target_enc.dvc \
    -d src/config/build_features_target_enc_smooth.json \
    -d data/processed/train_split \
    -d data/raw/test.csv.zip \
    -o data/processed/target_enc_train_features.csv.gz \
    -o data/processed/target_enc_test_features.csv.gz \
    python src/features/build_features_target_enc_smooth.py
- After doing some other work the files inside the train_split directory had changed. I tried to rerun the pipeline expecting it to pick up on those changes and run itself from the build_features_target_enc step.
dvc repro build_features_target_enc.dvc
but it said that there were no changes.
So you’ve modified train_split by hand, right?
So dvc does track this directory as a whole. But the issue here is that you ran
dvc repro build_features_target_enc.dvc, which went on reproducing (or trying to) every stage in the pipeline. It reached train_split.dvc, saw that the train_split directory had changed, and so reproduced that stage from scratch, which regenerated the original
train_split directory. It then moved on to
build_features_target_enc.dvc, saw that the
train_split dependency was the same as the last time
build_features_target_enc.dvc ran, and decided there was nothing to do, since its dependencies had not changed.
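Those "changed / not changed" decisions come down to checksums: dvc hashes a directory's contents and compares the result with what it recorded the last time the stage ran. A minimal plain-shell sketch of that idea (using md5sum directly; this is not DVC's actual hashing scheme, and demo_dir is a made-up name):

```shell
# Sketch of directory-level change detection (not DVC's real implementation):
# hash every file under the directory, then hash the sorted list of hashes.
mkdir -p demo_dir
echo "v1" > demo_dir/part.csv

dir_hash() { find "$1" -type f -exec md5sum {} + | sort | md5sum | cut -d' ' -f1; }

hash1=$(dir_hash demo_dir)
echo "v2" > demo_dir/part.csv      # another script modifies a file in place
hash2=$(dir_hash demo_dir)

# The directory-level checksum differs, so the directory counts as "changed".
[ "$hash1" != "$hash2" ] && echo "demo_dir changed"
```

This is why listing the directory itself as a dependency or output is enough: any change to any file inside it changes the directory-level checksum.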
So what you could’ve done here is either:
dvc repro build_features_target_enc.dvc --single-item
which reproduces only that one stage, ignoring its upstream dependencies, or:
dvc repro build_features_target_enc.dvc --downstream
which starts from that stage and re-runs it and everything after it, without touching the upstream stages. But this is a workaround: it would let you experiment quickly, but you need to be aware that you are breaking the reproducibility of your pipeline this way (by modifying intermediate results by hand), since
train_split.dvc is no longer the stage that could produce your hand-modified train_split directory.
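To see why the full repro undid your change while --single-item would not, here is a plain-shell sketch; the two functions are hypothetical stand-ins for the stage commands, not the real scripts:

```shell
# Stand-ins for the two stages (hypothetical, not the real pipeline scripts).
run_train_split() { mkdir -p train_split; echo "generated" > train_split/data.csv; }
run_features()    { cp train_split/data.csv features.csv; }

run_train_split && run_features            # initial pipeline run

echo "hand edit" > train_split/data.csv    # modify the intermediate output

# Full repro: the upstream stage re-runs first and regenerates its output,
# so the downstream stage sees the original data again.
run_train_split && run_features
cat features.csv                           # prints "generated"

echo "hand edit" > train_split/data.csv    # modify it again

# --single-item style: only the requested stage re-runs,
# so the hand edit is actually picked up.
run_features
cat features.csv                           # prints "hand edit"
```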
What you should really consider doing, if possible, is modifying your
train_split with a new dvc stage that saves the result as
train_split_new and then using that in the next pipeline stages. Or, if you absolutely need to modify it by hand, copy
train_split to train_split_new, modify it, add it to dvc with
dvc add train_split_new, and then base your next stages on it. That way, the next time you modify it, you can indeed do so in place and
dvc repro will simply pick up the changes. So the difference here is that
dvc add data is considered “given data” while
dvc run outputs are considered “generated artifacts”. The former are fine to hand-modify, and
repro will treat that as a change to the source data; but hand-modifying the latter is an intrusion into the intermediate pipeline artifacts, which will simply be re-generated by running your dvcfile command again.
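The “given data” side of that distinction can be sketched the same way: for data tracked with dvc add there is no stage command that could regenerate it, so a hand edit just becomes a new checksum that downstream stages react to. (The file name is hypothetical, and md5sum stands in for DVC's hashing.)

```shell
# train_split_new.csv stands in for data registered with "dvc add";
# no pipeline command exists that could overwrite a hand edit to it.
echo "v1" > train_split_new.csv
before=$(md5sum train_split_new.csv | cut -d' ' -f1)

echo "v2" > train_split_new.csv      # edit the source data in place
after=$(md5sum train_split_new.csv | cut -d' ' -f1)

# The checksum changed and nothing regenerates the file, so on the next
# repro the downstream stages simply re-run against the new data.
[ "$before" != "$after" ] && echo "source data changed"
```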
I didn’t modify it by hand but with another script, but I guess for dvc it is the same thing.
Thanks for the explanation, it makes sense. I guess --single-item is exactly what I needed to do here. I will also take your suggestion and try not to break reproducibility next time. Thank you!