Need to build a non-ML data pipeline, is DVC a good fit?

Noted. Thanks.

The idea is that I would have two directories, call them IN/ and OUT/. Execution would proceed as follows:

  1. An inventory step that produces a list of the files in IN/, written to an inventory file.
  2. A work step with the inventory file as its dependency: read the new files in IN/ and produce new files in OUT/.
  3. A second inventory step that produces an updated file list in a second inventory file.

As you can see, there would be no overlapping outputs.

For the purposes of inducing a topological ordering on the process steps, this would not be a circular dependency. Such a state file is only viewed or updated by a single step, so it does not change the order in which the work should be done; the sketch below shows the wiring I have in mind.
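To make the shape concrete, here is a minimal dvc.yaml sketch of that wiring. The script names (inventory.py, work.py) and file names are hypothetical placeholders of mine, not something DVC prescribes:

```yaml
stages:
  inventory:
    cmd: python inventory.py IN/ inventory-1.txt   # hypothetical: list the files in IN/
    deps:
    - IN
    outs:
    - inventory-1.txt
  work:
    # Deliberately depends only on the inventory file, not on IN/ itself,
    # so rework is triggered by inventory changes alone.
    cmd: python work.py inventory-1.txt OUT/
    deps:
    - inventory-1.txt
    outs:
    - OUT
  inventory_next:
    cmd: python inventory.py IN/ inventory-2.txt
    deps:
    - OUT                # depending on OUT/ is what induces the ordering
    outs:
    - inventory-2.txt
```

No path appears as an output of more than one stage, and the inventory written by one step is only read by the following step, so `dvc repro` can order the stages topologically without a cycle.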

Moreover, since the state file is versioned, the work step is idempotent as well: given the dependencies and the state file from a particular version, the work step will produce identical next state and outputs.
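If I understand the mechanics correctly (an assumption on my part; the hashes below are made-up placeholders), this determinism is exactly what dvc.lock records: each stage pins the checksums of its deps and outs, so a given version of the inventory file pins the exact outputs that go with it:

```yaml
# dvc.lock (abridged; hashes are hypothetical placeholders)
schema: '2.0'
stages:
  work:
    cmd: python work.py inventory-1.txt OUT/
    deps:
    - path: inventory-1.txt
      md5: 3f2a91c0e…        # checksum of this version of the state file
    outs:
    - path: OUT
      md5: 9bc14d7a2….dir    # directory checksum of the produced outputs
```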

Ah… that is good news. (I have a new reading assignment to learn how the run-cache is indexed).

Well, actually, I have used this exact strategy before, and the results were excellent: it gave me the file-level incrementality that I needed most, while also letting me reason about the state of all of my data. What I was missing back then was the version control and run-cache, which are both primary features of DVC. I think the combination of these ideas does not compromise the core ideas of DVC, but extends them to an important use case.

The reason I think this is an important use case for machine learning is that, very commonly, I do not have a static, academic sort of machine learning problem where I iterate on learning parameters and features by monitoring a leaderboard. What I have always had in industrial settings is a continuously growing set of training data and unlabeled data. I needed to evaluate old models on newly labeled data and build new models that I would evaluate on new and old versions of the data. Scoring the unlabeled data would drive new labeling efforts via stratified sampling.

As such, extracting and extending transaction histories from log files, and scoring and training models on the new data, were critical capabilities. There was no bright line between “data engineering” and “real data science”. There was also no clear line between data engineering for analytics and data engineering for monitoring and BI.

So, from my own experience of a quarter century of building industrial machine learning systems, incremental processing falls squarely within what I would see as DVC’s mission. The folks building DVC might differ, but there isn’t much of a way for me to use DVC otherwise, since my entire (data) world is incremental.
