Need to build a non-ML data pipeline, is DVC a good fit?

Noted. Thanks.

The idea is that I would have two directories, call them IN/ and OUT/. Execution would proceed as follows:

  1. An inventory step that produces a list of the files in IN/, written to an inventory file.
  2. A work step with the inventory file as its dependency: read the new files in IN/ and produce new files in OUT/.
  3. A second inventory step that produces an updated file list in a second inventory file.

As you can see, there would be no overlapping outputs.

For the purposes of inducing a topological ordering on the process steps, this would not be a circular dependency. Such a state file is only viewed or updated by a single step, so it does not change the order in which the work should be done; the sketch below shows the wiring I have in mind.
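To make the shape concrete, here is a minimal dvc.yaml sketch of that wiring. The script names (inventory.py, work.py) and file names are hypothetical placeholders of mine, not something DVC prescribes:

```yaml
stages:
  inventory:
    cmd: python inventory.py IN/ inventory-1.txt   # hypothetical: list the files in IN/
    deps:
    - IN
    outs:
    - inventory-1.txt
  work:
    # Deliberately depends only on the inventory file, not on IN/ itself,
    # so rework is triggered by inventory changes alone.
    cmd: python work.py inventory-1.txt OUT/
    deps:
    - inventory-1.txt
    outs:
    - OUT
  inventory_next:
    cmd: python inventory.py IN/ inventory-2.txt
    deps:
    - OUT                # depending on OUT/ is what induces the ordering
    outs:
    - inventory-2.txt
```

No path appears as an output of more than one stage, and the inventory written by one step is only read by the following step, so `dvc repro` can order the stages topologically without a cycle.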

Moreover, since the state file is versioned, the work step is idempotent as well: given the dependencies and the state file from a particular version, the work step will produce identical next state and outputs.
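If I understand the mechanics correctly (an assumption on my part; the hashes below are made-up placeholders), this determinism is exactly what dvc.lock records: each stage pins the checksums of its deps and outs, so a given version of the inventory file pins the exact outputs that go with it:

```yaml
# dvc.lock (abridged; hashes are hypothetical placeholders)
schema: '2.0'
stages:
  work:
    cmd: python work.py inventory-1.txt OUT/
    deps:
    - path: inventory-1.txt
      md5: 3f2a91c0e…        # checksum of this version of the state file
    outs:
    - path: OUT
      md5: 9bc14d7a2….dir    # directory checksum of the produced outputs
```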

Ah… that is good news. (I have a new reading assignment to learn how the run-cache is indexed).

Well, actually, I have used this exact strategy before, and the results were excellent: it gave me the file-level incrementality that I needed most, while also letting me reason about the state of all of my data. What I was missing back then was the version control and run-cache, which are both primary features of DVC. I think the combination of these ideas does not compromise the core ideas of DVC, but extends them to an important use case.

The reason I think this is an important use case for machine learning is that, very commonly, I do not have a static, academic sort of machine learning problem where I iterate on learning parameters and features by monitoring a leaderboard. What I have always had in industrial settings is a continuously growing set of training data and unlabeled data. I needed to evaluate old models on newly labeled data and build new models that I would evaluate on new and old versions of the data. Scoring the unlabeled data would drive new labeling efforts via stratified sampling.

As such, extracting and extending transaction histories from log files, and scoring and training models on the new data, were critical capabilities. There was no bright line between “data engineering” and “real data science”. There was also no clear line between data engineering for analytics and data engineering for monitoring and BI.

So, from my own experience of a quarter century of building industrial machine learning systems, incremental processing falls squarely within what I would see as DVC’s mission. The folks building DVC might differ, but there isn’t much of a way for me to use DVC otherwise, since my entire (data) world is incremental.
