Non-overlapping outs paths

My working directory contains the following folders:

  • .dvc
  • .git
  • data
  • models
  • src

And files:

  • .gitignore
  • data.dvc

The data.dvc file currently tracks the entire contents of the data folder. The src folder contains a prepare_data.py file that reads images from both the data/train and data/test folders and writes four output files:

  • data/train/imgs_train.npy
  • data/train/train_labels.pkl
  • data/test/imgs_test.npy
  • data/test/test_labels.pkl

Now, I want to create a reproducible stage for src/prepare_data.py. To do this, I ran the following command:

dvc run -f prepare_data.dvc \
        -d src/prepare_data.py -d data/train -d data/test \
        -o data/train/imgs_train.npy -o data/train/train_labels.pkl \
        -o data/test/imgs_test.npy -o data/test/test_labels.pkl \
        python src/prepare_data.py

However, I received the following error message:

ERROR: failed to run command - Paths for outs:
'data'('data.dvc')
'data\train\imgs_train.npy'('prepare_data.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.

I can see that the problem comes from declaring outputs under data/train and data/test when the whole data folder is already tracked by the data.dvc file. I still want to create the reproducible stage, though. Any suggestions?

For the record, this was answered on Discord.

TLDR:

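You’ll need to separate those somehow. The stage's outputs live inside the data folder, which data.dvc already tracks as a single output, so the two overlap. It looks like you should not have run dvc add data on the whole folder, but probably something more granular.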
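To make "non-overlapping" concrete, here is a small Python sketch of the kind of check the error message describes: two output paths overlap when they are equal or one sits inside the other. This is only an illustration, not DVC's actual implementation. The helper names (paths_overlap, find_overlaps) and the data/train/images and data/test/images folders are hypothetical, and the granular layout assumes the raw images could be moved into their own subfolders.

```python
from pathlib import PurePosixPath


def paths_overlap(a: str, b: str) -> bool:
    """Return True if the paths are equal or one lies inside the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_overlaps(tracked, stage_outs):
    """List every (tracked path, stage output) pair that overlaps."""
    return [(t, o) for t in tracked for o in stage_outs if paths_overlap(t, o)]


# Outputs declared with -o in the dvc run command from the question.
stage_outs = [
    "data/train/imgs_train.npy",
    "data/train/train_labels.pkl",
    "data/test/imgs_test.npy",
    "data/test/test_labels.pkl",
]

# Current setup: data.dvc tracks the whole data folder.
current = find_overlaps(["data"], stage_outs)

# Hypothetical granular setup: only the raw image folders are tracked
# (these folder names are assumptions, not taken from the question).
granular = find_overlaps(["data/train/images", "data/test/images"], stage_outs)

print("overlaps with 'data' tracked as a whole:", len(current))
print("overlaps with the granular layout:", len(granular))
```

With the whole data folder tracked, every stage output conflicts with it. With the hypothetical granular layout, none do, which is the situation the error message asks for.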
