So you’ve modified train_split by hand, right?
So dvc does track this dir as a whole. But the issue here is that you’ve run
dvc repro build_features_target_enc.dvc, which went on to reproduce (or try to reproduce) every stage in the pipeline. It got to train_split.dvc, saw that the train_split directory had changed, and reproduced it from scratch, which restored the original
train_split directory. Then it went on to
build_features_target_enc.dvc, and saw that
the train_split dependency was the same as it was the last time
build_features_target_enc.dvc ran, so it decided that there was nothing to do, as no dependencies had changed.
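To make that sequence concrete, here is a toy model of it in Python. It is only a sketch of the behaviour described above, not DVC's actual implementation: the `repro` function, the stage dictionaries and the in-memory `workspace` are all made up for illustration, and only the stage and directory names come from your pipeline.

```python
import hashlib


def checksum(data: str) -> str:
    # Any hash works for this sketch; it stands in for the checksums a dvcfile records.
    return hashlib.sha256(data.encode()).hexdigest()


# Hypothetical in-memory stand-in for the files and directories on disk.
workspace = {"raw": "a,b,c"}


def split(ws):
    ws["train_split"] = ws["raw"].upper()


def build_features(ws):
    ws["features"] = "enc(" + ws["train_split"] + ")"


# Each toy stage: a command, its deps and outs, and the checksums saved at its last run.
stages = [
    {"name": "train_split.dvc", "cmd": split,
     "deps": ["raw"], "outs": ["train_split"], "saved": {}},
    {"name": "build_features_target_enc.dvc", "cmd": build_features,
     "deps": ["train_split"], "outs": ["features"], "saved": {}},
]


def repro(ws, stages):
    """Walk the stages in order; rerun one when any dep or out differs from its saved checksums."""
    ran = []
    for st in stages:
        names = st["deps"] + st["outs"]
        current = {n: checksum(ws[n]) for n in names if n in ws}
        if current != st["saved"]:
            st["cmd"](ws)  # regenerates the outputs from scratch
            st["saved"] = {n: checksum(ws[n]) for n in names}
            ran.append(st["name"])
    return ran


first_run = repro(workspace, stages)      # both stages run
workspace["train_split"] = "hand-edited"  # modify an intermediate output by hand
second_run = repro(workspace, stages)     # only train_split.dvc reruns

print(second_run)
print(workspace["train_split"])
```

In this model the hand edit is overwritten when the first stage reruns, so by the time the second stage is checked its dependency looks exactly like it did on the previous run.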
So what you could’ve done here is:
dvc repro build_features_target_enc.dvc --single-item
dvc repro build_features_target_enc.dvc --downstream
This is only a workaround that lets you experiment quickly, though. Be aware that you are breaking the reproducibility of your pipeline this way (by modifying intermediate results by hand), as
train_split.dvc can no longer produce your hand-modified train_split.
What you should really consider doing, if possible, is modifying your
train_split with a new dvc stage that saves the result as
train_split_new, and then using that in the next pipeline stages. Or, if you absolutely need to modify it by hand, copy it to
train_split_new, modify the copy, add it to dvc with
dvc add train_split_new and then base your next stages on it, so that the next time you modify it, you can indeed do that in place and
dvc repro will simply pick up the changes. So the difference here is that
whatever you add with dvc add is considered “given data”, while
dvc run outputs are considered “generated artifacts”. The former are ok to hand-modify, and
repro will treat that as a change to the source data. Hand-modifying the latter is an intrusion into the intermediate pipeline artifacts, and they will be re-generated by running your dvcfile command again.
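The “given data” case can be sketched with the same kind of toy model. Again, this is a simplified illustration under my own assumptions and not how DVC is implemented: a stage with `cmd` set to `None` stands in for something added with dvc add, and the `repro` helper and `workspace` dictionary are hypothetical.

```python
import hashlib


def checksum(data: str) -> str:
    # Any hash works for this sketch.
    return hashlib.sha256(data.encode()).hexdigest()


def build_features(ws):
    ws["features"] = "enc(" + ws["train_split_new"] + ")"


# Hypothetical in-memory stand-in for the workspace on disk.
workspace = {"train_split_new": "copied split"}

stages = [
    # cmd=None models data added with `dvc add`: there is nothing to regenerate it from.
    {"name": "train_split_new.dvc", "cmd": None,
     "deps": [], "outs": ["train_split_new"], "saved": {}},
    {"name": "build_features_target_enc.dvc", "cmd": build_features,
     "deps": ["train_split_new"], "outs": ["features"], "saved": {}},
]


def repro(ws, stages):
    """Return the names of the stages whose command was actually executed."""
    executed = []
    for st in stages:
        names = st["deps"] + st["outs"]
        current = {n: checksum(ws[n]) for n in names if n in ws}
        if current == st["saved"]:
            continue  # nothing changed for this stage
        if st["cmd"] is not None:
            st["cmd"](ws)  # generated artifact: rebuild it
            executed.append(st["name"])
        # Given data falls through: the new state is simply accepted and saved.
        st["saved"] = {n: checksum(ws[n]) for n in names}
    return executed


first_run = repro(workspace, stages)
workspace["train_split_new"] = "hand-edited split"  # in-place edit of given data
after_edit = repro(workspace, stages)

print(after_edit)
print(workspace["features"])
```

Here the hand edit survives, and the stage that depends on it reruns because its dependency really did change.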