Processing data in place

nikste · April 23, 2020, 11:55am

Hi, I’m new to DVC (nice project!).
We have a large dataset and several preprocessing scripts.
These scripts alter the data in place.
When I try to register one of them with dvc run, it complains about cyclic dependencies (the input is the same as the output).
I would assume this is a very common use case.
What is the best practice here?
I tried to google around but did not see any solution to this (besides creating another folder for the output).

Paffciu · April 23, 2020, 12:52pm

Hi @nikste!
The usual workflow our users follow is to create an output folder and write the processed data there. The reason is that replacing existing data might prevent the user from reproducing the results, and DVC was created to make sure that users are able to reproduce their results.
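A minimal sketch of that workflow, assuming a hypothetical script preprocess.py that reads data/raw and writes data/processed (none of these names come from the thread). It uses the dvc run flags -d (dependency) and -o (output) as available in the DVC version discussed here; newer DVC releases changed this interface, so treat the exact invocation as an assumption. Because the input and output are different paths, no cycle arises.

```shell
#!/bin/sh
# Hypothetical names: preprocess.py, data/raw and data/processed are
# placeholders, not paths from the original question.
register_preprocess_stage() {
    raw_dir="${1:-data/raw}"
    out_dir="${2:-data/processed}"
    # -d declares dependencies, -o declares the output DVC should track.
    dvc run \
        -d "$raw_dir" \
        -d preprocess.py \
        -o "$out_dir" \
        python preprocess.py "$raw_dir" "$out_dir"
}
```

Calling register_preprocess_stage from the repository root would register the stage with the default paths.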
Why would you like to overwrite your data?

nikste · April 23, 2020, 1:23pm

Hi @Paffciu, thanks for your quick answer!
Would you then delete the input folder again when following the input/output folder approach?

Well, thinking about Git, I would not store separate input and output files for a source file when the code changes, either. There is only one file in use, and Git keeps track of the changes.
I would expect a dataset to work the same way, with DVC giving me the ability to roll back to past states without needing to change any folder names.

Consider the case where I have multiple downstream algorithms that want to work with the dataset.
They expect a certain structure for the data.
I have some other process that continuously improves the data (different scripts that improve annotations in some automatic way / maybe some humans labelling pixel-wise segmentations).
Using a new folder after each change seems a bit counterintuitive to me. It seems I’d need to manually tell the downstream algorithms that the folder structure changed.

Maybe I’m missing something here, or misusing dvc here?

nikste · April 23, 2020, 1:40pm

If I wanted to reproduce previous results, I’d just roll back the state of the dataset to a certain hash, much like in Git, and then rerun any algorithm with the resulting data. If that makes sense.

dmitry · April 23, 2020, 2:24pm

@nikste good question. Could you please explain the motivation for removing old files?

Is it something that you do only once in your project or a repeating action?
Why don’t you just keep an old version in a separate dir?
I understand that you need to version data - the old version and the new one. How about pipeline? This is not something you will be repeating - I guess. Do you need to track this single operation in a pipeline (it might be done outside of the p)?

nikste · April 23, 2020, 8:52pm

So the use case we currently have is that we want to create a new dataset more or less from scratch.
We have some class-agnostic bounding boxes available.
These are converted from a different format (using a script).
We then have a classifier that can give us an automatic classification for each bounding box (using a script).
We then validate some of the data with humans and change some labels (these are copied again using a script).
In the future there might be additional automatically extracted labels / extra data we want to add to this dataset.

So for me, conceptually, each of these steps would operate on the same data and replace (improve) it.
Running a classifier that loads the input annotations, adds some information and saves them in place will result in cyclic dependencies when registered with dvc run on the same folder.
Keeping different versions in different folders is certainly possible, but reminds me a lot of my version control system during early student times (final_report_copy_copy.doc).
I figure dvc run is intended more for keeping track of something like converting original data to TensorFlow records, where a separate folder makes a lot of sense.
Maybe there is a good way and I just did not see it yet!

shcheklein · April 24, 2020, 12:25am

Hi @nikste!

It sounds like active learning, right? You are applying previously trained models to the dataset to improve labels and train a new model … and so on?

I got your point about data versioning. The simplest solution in this case would be to just use dvc add data, where data contains everything: model + dataset + labels.

In this case you run your script yourself, which alters the data as you want, and then you just run again:

$ dvc add data
$ git commit data.dvc -m "my new iteration"

The question is: how much reproducibility do you expect from this project?

To completely reproduce active learning you would need to start from the very first iteration and apply one step at a time, like running a cycle 1000 times. Do you need that? Or would it be enough for you to just capture versions and be able to reproduce it somewhat manually (just check out the previous commit and run the script again)?
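The "capture versions and roll back manually" option can be sketched as two small helpers. This is only a sketch under assumptions: the function names are hypothetical, the tracked folder is assumed to be called data with its pointer file data.dvc next to it, and the old version of the data is assumed to still be available in the DVC cache (or pulled from a remote beforehand).

```shell
#!/bin/sh
# snapshot_data: record the current state of data/ as a new version.
snapshot_data() {
    message="${1:-my new iteration}"
    dvc add data &&
    git add data.dvc &&
    git commit -m "$message" -- data.dvc
}

# restore_data: bring data/ back to the state recorded at a Git revision.
restore_data() {
    rev="$1"
    # Fetch the old pointer file, then let DVC sync the workspace to it.
    git checkout "$rev" -- data.dvc &&
    dvc checkout data.dvc
}
```

After restore_data with an earlier commit, rerunning the script on the restored folder reproduces that iteration by hand.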

nikste · April 28, 2020, 8:29pm

Thanks @shcheklein,
the proposed method seems a bit nicer.
Indeed, active learning would be another use case where cyclic dependencies would be cool!
In our case it’s really just a script that adds some information to XML files (the classes of the bounding boxes).

I see that dvc run records md5 hashes of the inputs and outputs, and that those hashes are used to traverse from the end to the start to see where something has changed and rerun from there.
With a cyclic dependency it could still check whether the folder contents have changed.
We would indeed not be able to know whether the “recursive” script has already been run, but in this case we could assume that it needs to be run again by default (of course the user needs to be made aware that it will be run by default).
It could be an additional parameter for dvc run.
Does that make sense?
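The "has the folder changed" check can be sketched outside of DVC as follows. This is an illustration only: it is not DVC's own directory-hash format, the function names are hypothetical, and it assumes GNU find, sort, xargs and md5sum.

```shell
#!/bin/sh
# dir_digest: print one MD5 summarising every file under a directory.
# Illustration only; this is not DVC's own directory-hash format.
dir_digest() {
    (
        cd "$1" || exit 1
        # Hash files in a stable order so the result is deterministic.
        find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r md5sum | md5sum | cut -d' ' -f1
    )
}

# needs_rerun: succeed (exit 0) when the directory no longer matches
# the digest recorded in the given file, or when no digest was recorded.
needs_rerun() {
    dir="$1"
    recorded_file="$2"
    [ -f "$recorded_file" ] || return 0
    [ "$(dir_digest "$dir")" != "$(cat "$recorded_file")" ]
}
```

A self-updating stage would always fail such a check after it runs, which is why the proposal above falls back to "run again by default".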

shcheklein · April 29, 2020, 12:26am

@nikste

Good question!

If we talk about cycles in general, in the workflow graph itself, when you have multiple stages, I’m not sure it’s possible to come up with good semantics, order of execution, etc. It’s painful even to think about this.
If we take only these self-updating “loop” stages, I think we can already do this with DVC, and they can be part of the bigger “DAG”:

Let’s take a script that appends a new item every time you run it:

$ cat test.sh
#!/bin/bash

echo "new entry" >> data.lst

Run commands:

$ chmod +x test.sh
$ dvc run --outs-persist data.lst ./test.sh

It will create an “always changed” stage that also persists its output between runs.

Try running dvc repro and you will see new entries in the file.

The only caveat here is that we don’t keep the information (hash) of the previous version of the dataset. It could potentially be a feature request for DVC, but there are easy workarounds for that (for example, the script itself can save the stage file as part of the output, or something like that).
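One possible form of such a workaround, as a sketch only: a hypothetical variant of test.sh that records the MD5 of the previous version of the data file in a side file before appending, rather than saving the stage file itself. The function name and the .prev.md5 suffix are assumptions, and md5sum is assumed to be available.

```shell
#!/bin/bash
# Hypothetical variant of test.sh: before appending, remember the MD5 of
# the previous version of the data file in a side file (<data>.prev.md5).
append_entry() {
    data_file="$1"
    prev_file="$data_file.prev.md5"
    if [ -f "$data_file" ]; then
        # Read via stdin so only the hash, not the file name, is printed.
        md5sum < "$data_file" | cut -d' ' -f1 > "$prev_file"
    fi
    echo "new entry" >> "$data_file"
}
```

Calling append_entry data.lst at the end of the script and tracking data.lst.prev.md5 as an additional output would keep the previous hash next to the data.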

The biggest question would be: if we store the previous hash in the stage file, how should that affect, or be supported in, other DVC commands? E.g. dvc repro clearly always runs the stage on the latest data; it doesn’t care about previous versions of it. Or should there be some special “dvc advance” command for such stages?
