My working directory contains the following folders:
- .dvc
- .git
- data
- models
- src
And files:
- .gitignore
- data.dvc
The data.dvc file currently points to all the contents of the data folder. The src folder contains a prepare_data.py file that takes in images from both the data/train and data/test folders as inputs and then outputs four files:
- data/train/imgs_train.npy
- data/train/train_labels.pkl
- data/test/imgs_test.npy
- data/test/test_labels.pkl
Now, I want to create a reproducible stage for src/prepare_data.py. To do this, I ran the following command:
dvc run -f prepare_data.dvc \
-d src/prepare_data.py -d data/train -d data/test \
-o data/train/imgs_train.npy -o data/train/train_labels.pkl \
-o data/test/imgs_test.npy -o data/test/test_labels.pkl \
python src/prepare_data.py
However, I received the following error message:
ERROR: failed to run command - Paths for outs:
'data'('data.dvc')
'data\train\imgs_train.npy'('prepare_data.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
I can see that the problem comes from declaring outputs under data/train and data/test when the whole data folder is already tracked as an output of the data.dvc file. However, I still want to create the reproducible stage. Any suggestions on how to do this?
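To make the conflict concrete, here is a minimal sketch of how I understand the overlap check. This is my own illustration, not DVC's actual code: paths_overlap is a hypothetical helper, and I am assuming that two outputs "overlap" when one path equals the other or is a parent directory of it.

```python
from pathlib import PureWindowsPath


def paths_overlap(a: str, b: str) -> bool:
    """Return True if one path equals the other or is a parent directory of it.

    Hypothetical helper illustrating the assumed rule; not part of DVC.
    PureWindowsPath accepts both '/' and '\\' separators, which matches the
    backslash paths shown in the error message.
    """
    pa, pb = PureWindowsPath(a), PureWindowsPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


# Output already tracked by data.dvc
tracked = "data"

# Outputs I tried to declare for prepare_data.dvc
new_outs = [
    "data/train/imgs_train.npy",
    "data/train/train_labels.pkl",
    "data/test/imgs_test.npy",
    "data/test/test_labels.pkl",
]

for out in new_outs:
    print(out, "overlaps with", tracked, ":", paths_overlap(tracked, out))
```

Under that assumption, every one of the four outputs sits inside the already-tracked data folder, which would explain why the command is rejected.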