I have tried to list a whole directory as input or output to a stage but this doesn’t seem to work as I would hope. I would like, ideally, to have the directory contents monitored for changes when reproducing the pipeline as the files within could change and are too many to list individually as inputs or outputs.
Is this already possible (if so how?) or something that may be implemented in future? Or is there some other process to deal with this kind of situation? Am I trying to use the system in the wrong way?
After doing some other work the files inside the train_split directory had changed. I tried to rerun the pipeline expecting it to pick up on those changes and run itself from the build_features_target_enc step.
So dvc does track this dir as a whole. But the issue here is that you’ve run dvc repro build_features_target_enc.dvc, which went on to reproduce (or try to reproduce) every stage in the pipeline. It got to train_split.dvc and saw that the train_split directory had changed, so it reproduced it from scratch, which produced the original train_split directory. Then it went on to build_features_target_enc.dvc and saw that the train_split dependency was the same as it was the last time build_features_target_enc.dvc ran, so it decided that there was nothing to do, as the dependencies had not changed.
but this would only be a workaround that lets you experiment quickly. You need to be aware that you are breaking the reproducibility of your pipeline this way (by modifying intermediate results by hand), as train_split.dvc can no longer produce your hand-modified train_split.
What you should really consider doing is, if possible, modifying your train_split with a new dvc stage that saves the result as train_split_new, and then using that in the next pipeline stages. Or, if you absolutely need to modify it by hand, copy train_split into train_split_new, modify the copy, add it to dvc with dvc add train_split_new, and base your next stages on it. The next time you need to modify it, you can do that in place and dvc repro will simply pick up the changes. The difference is that anything added with dvc add is considered “given data”, while dvc run outputs are considered “generated artifacts”. The former are OK to hand-modify, and repro will treat that as a change to the source data. Modifying the latter is an intrusion into the intermediate pipeline artifacts, and they will be regenerated by running your dvcfile command again.
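To make the difference between the two cases concrete, here is a small toy model in Python. It is only a sketch of the behaviour described above, not how dvc is actually implemented: directories are modelled as plain dicts, and the checksum and repro functions, the "raw" input, and the file names are all made up for the illustration.

```python
import hashlib


def checksum(directory):
    """Hash a directory, modelled as {filename: text}, as one unit."""
    digest = hashlib.md5()
    for name in sorted(directory):
        digest.update(f"{name}\0{directory[name]}\0".encode())
    return digest.hexdigest()


def repro(stages, ws, saved):
    """Walk stages in pipeline order; rerun any whose dep or output changed.

    Each stage is (name, cmd, dep, out); `saved` holds the checksums
    recorded when the stage last ran. This is a toy stand-in for dvc repro.
    """
    ran = []
    for name, cmd, dep, out in stages:
        current = (checksum(ws[dep]), checksum(ws.get(out, {})))
        if saved.get(name) != current:
            ws[out] = cmd(ws[dep])  # regenerate the output from scratch
            saved[name] = (checksum(ws[dep]), checksum(ws[out]))
            ran.append(name)
    return ran


def split(raw):
    # Hypothetical split command: keep the first part of the raw data.
    return {"part0.csv": raw["data.csv"][:3]}


def features(train):
    # Hypothetical feature command: upper-case the split contents.
    return {"feat.csv": train["part0.csv"].upper()}


# Case 1: train_split is a generated artifact (the output of train_split.dvc).
ws, saved = {"raw": {"data.csv": "a,b,c,d"}}, {}
pipeline = [
    ("train_split.dvc", split, "raw", "train_split"),
    ("build_features_target_enc.dvc", features, "train_split", "features"),
]
repro(pipeline, ws, saved)                    # initial run of both stages
ws["train_split"]["part0.csv"] = "x,y"        # edit the intermediate result
generated_case = repro(pipeline, ws, saved)   # only train_split.dvc reruns
edit_survived = ws["train_split"]["part0.csv"] == "x,y"

# Case 2: train_split_new has no producing stage, so it plays the role of
# "given data" registered with dvc add.
ws2, saved2 = {"train_split_new": {"part0.csv": "a,b"}}, {}
pipeline2 = [
    ("build_features_target_enc.dvc", features, "train_split_new", "features"),
]
repro(pipeline2, ws2, saved2)                 # initial run
ws2["train_split_new"]["part0.csv"] = "x,y"   # edit the source data in place
given_case = repro(pipeline2, ws2, saved2)    # the features stage reruns
```

In the first case the edit is overwritten when train_split.dvc regenerates its output, so the feature stage sees an unchanged dependency and does nothing. In the second case there is no stage to undo the edit, so the change reaches the feature stage.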
I didn’t modify it by hand but with another script, but I guess for dvc it is the same thing.
Thanks for the explanation, it makes sense. I guess the --single-item is exactly what I needed to do here. I will also take your suggestion and try to not break the reproducibility next time. Thank you!