I set up a DVC data pipeline and used the dvc repro command to generate some .pkl files. Later on I decided to regenerate the files (because of a code change whose dependency wasn't captured in the dvc.yaml file), so I deleted the file manually from the directory. This was probably my first mistake.
Now when I run dvc repro, DVC refuses to re-run the stage and reports that nothing has changed.
I thought dvc repro --force-downstream [target] would work, but it still reports "everything is up to date" and does not reproduce the stage.
How do I force DVC to re-run the stage and regenerate the .pkl file?
Short answer is yes. Please see the train and test stages from my dvc.yaml below.
training:
  cmd: python3 code/train.py
  params:
    - train
  deps:
    - ./data/SWT_transform/Hisar_test_data_firstOrder.npy
    - ./data/SWT_transform/Hisar_test_data_secondOrder.npy
    - ./data/SWT_transform/Hisar_test_data_total.npy
    - ./data/SWT_transform/Hisar_train_data_firstOrder.npy
    - ./data/SWT_transform/Hisar_train_data_secondOrder.npy
    - ./data/SWT_transform/Hisar_train_data_total.npy
  outs:
    - ./model/rbf_svm_3_.pkl  # The model ID needs to be updated for each new model generated
    - ./model/scaler3.bin
Thanks! The pipeline itself looks fine, but you might want to add code/train.py to the stage's dependencies (deps). That way, when you change train.py and run dvc repro, the training stage will be re-run because one of its dependencies has changed.
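As a rough sketch of that suggestion, the training stage with the script listed as a dependency could look like the fragment below. Everything except the added code/train.py entry is copied from the stage you posted, and I'm assuming it sits under the usual top-level stages: key in your dvc.yaml.

```yaml
training:
  cmd: python3 code/train.py
  params:
    - train
  deps:
    # Added: the training script itself, so that edits to it invalidate the stage
    - code/train.py
    - ./data/SWT_transform/Hisar_test_data_firstOrder.npy
    - ./data/SWT_transform/Hisar_test_data_secondOrder.npy
    - ./data/SWT_transform/Hisar_test_data_total.npy
    - ./data/SWT_transform/Hisar_train_data_firstOrder.npy
    - ./data/SWT_transform/Hisar_train_data_secondOrder.npy
    - ./data/SWT_transform/Hisar_train_data_total.npy
  outs:
    - ./model/rbf_svm_3_.pkl  # The model ID needs to be updated for each new model generated
    - ./model/scaler3.bin
```

With the script tracked like this, DVC compares its hash against the one recorded in dvc.lock, so a code change alone should be enough for dvc repro to consider the stage out of date.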
To force the stage to be reproduced without adding train.py, you could use dvc repro -f.