Hi all! I’m quite new to DVC and still in the process of fully understanding its capabilities and benefits. Similar to the discussion “Need to build non-ML data pipeline, is DVC good fit?”, I am still struggling to figure out whether DVC is a good fit for our data processing workflow or not.
Obviously, DVC is a perfect fit for straightforward ML projects, where the inputs are data and a set of hyperparameters and the output is a trained model that can then be deployed somewhere. However, things seem to be a bit more complex for other types of use cases.
Use Case + Example Pipeline
Consider this example.
We’re building a data processing pipeline to analyze geo- and time-referenced images and also want to deploy it to a production server, where it then gets triggered on-demand or on a regular basis.
On a high level, the inputs to that process are images plus some additional metadata. The output is the processing result, e.g. in the form of a small snippet of JSON data.
The pipeline would approximately consist of these steps:
- Download data: Given two parameters, `location` (an MGRS code) and `dates` (a comma-separated list), this step downloads all images for that location and day from a web service and saves them to a network share in a file structure like this:
  - `image_data`
    - `4QFJ` (the location code)
      - `2022`
        - `02`
          - `16`
            - `image.jpg`
            - `...` (some auxiliary data, skipped for brevity)
- Filter data: Given a parameter `filtering_method`, this step selects those images that satisfy certain criteria. The result could, for instance, be a JSON list of image paths. The method parameter would probably also go inside `params.yaml`, I presume.
- Process data: Takes the set of valid images from the previous step, loads them from the network drive (see the first step), and analyzes and processes them. Depending on the `filtering_method`, this list is longer or shorter. In reality, this step actually consists of multiple separate steps, each of which has in- and outputs itself, but for simplicity that part is skipped here. The output is a JSON file that is saved to the network drive again, e.g. at `results/4QFJ_2022-02-16.json`.
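To make this more concrete, here is how I would naively translate those steps into a `dvc.yaml`. This is only an untested sketch; the script names, parameter values and output paths are made up:

```yaml
# dvc.yaml -- naive, untested sketch; script names and paths are placeholders
stages:
  download:
    cmd: python download.py --location 4QFJ --dates 2022-02-16
    outs:
      - image_data              # the downloaded images on the network share
  filter:
    cmd: python filter.py --method some_method --out filtered_images.json
    deps:
      - image_data
    outs:
      - filtered_images.json    # JSON list of image paths that passed the filter
  process:
    cmd: python process.py filtered_images.json --out results/4QFJ_2022-02-16.json
    deps:
      - filtered_images.json
    outs:
      - results/4QFJ_2022-02-16.json
```

The hard-coded location, date and output file name in this sketch are exactly the parts I don’t know how to handle properly, which brings me to my questions.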
Issues
As you can see, this is quite different from a classical ML preprocess → train → evaluate flow. Specifically, what makes me doubt whether DVC is easily applicable here is the following:
- The pipeline depends on “runtime” parameters (`location`, `date`) as input, which likely change for each run and are passed in by the user (or rather by an automated script). How do I model these? Update `params.yaml` each time?
  - Additional question for general understanding: say `params.yaml` contained only one key, say `foo`. If I ran the pipeline with `foo: 4`, then with `foo: 5`, and once again with `foo: 4`, would the last run be skipped, because it had already been run with the same set of dependencies earlier?
- The files to be used as dependencies are not a static set, but vary depending on the input parameters. It’s not always `data/train.csv` or the like; instead, the set of dependencies (the file names) for stage 2 varies with the input parameters to step 1. How can DVC help with this? Maybe using template variables in `dvc.yaml`? (I sketched what I mean below, after this list.)
- Preferably, we’d also like to send a notification for each pipeline run by calling an online REST API. In other words, some of the pipeline steps have side effects that can’t really be tracked with files. This is probably also a point where DVC falls short for us, because if I ran the pipeline twice with the same parameters and data, only one notification would be sent. However, this point is not too crucial.
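For reference, the kind of workaround I have been considering for the first two points looks roughly like this (again an untested sketch; I mainly want to know whether this is the intended way to use DVC). The parameters would live in `params.yaml` and be interpolated into `dvc.yaml` via templating, and the filter stage would write a manifest file so that the “dynamic” set of images becomes a single static dependency for the processing stage:

```yaml
# params.yaml -- would these have to be rewritten (or overridden) before every run?
location: 4QFJ
dates: "2022-02-16"
filtering_method: some_method
```

```yaml
# dvc.yaml -- templated sketch; values are interpolated from params.yaml
stages:
  download:
    cmd: python download.py --location ${location} --dates ${dates}
    params:
      - location
      - dates
    outs:
      - image_data/${location}
  filter:
    cmd: python filter.py --method ${filtering_method} --out filtered_images.json
    params:
      - filtering_method
    deps:
      - image_data/${location}
    outs:
      - filtered_images.json        # manifest turns the varying file set into one static dep
  process:
    cmd: python process.py filtered_images.json --out results/${location}_${dates}.json
    deps:
      - filtered_images.json
      - image_data/${location}
    outs:
      - results/${location}_${dates}.json
  notify:
    cmd: python notify.py results/${location}_${dates}.json
    deps:
      - results/${location}_${dates}.json
    always_changed: true            # side-effect stage that should run on every repro
```

Individual runs could then presumably be triggered with something like `dvc exp run -S location=4QFJ -S dates=2022-03-01`, or by editing `params.yaml` and calling `dvc repro`, but I am not sure whether that is how DVC is meant to handle such “runtime” inputs.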
What do you think: is DVC a good fit for us, and if so, how do I work around these “issues”?