DVC for analytics pipeline with runtime parameters and variable dependencies

Hi all! I’m quite new to DVC and still in the process of fully understanding its capabilities and benefits. Similar to the discussion “Need to build non-ML data pipeline, is DVC good fit?”, I am struggling to find out whether DVC is a good fit for our data processing workflow.

Obviously, DVC is a perfect fit for straightforward ML projects, where the inputs are data and a set of hyperparameters and the output is a trained model that can then be deployed somewhere. However, things seem to be a bit more complex for other types of use cases.

Use Case + Example Pipeline

Consider this example.

We’re building a data processing pipeline to analyze geo- and time-referenced images and also want to deploy it to a production server, where it then gets triggered on-demand or on a regular basis.

At a high level, the input to that process is images plus some additional metadata. The output is the processing result, e.g. in the form of a small snippet of JSON data.

The pipeline would approximately consist of these steps:

  1. Download data: Given two parameters, location (an MGRS code) and dates (a comma-separated list), downloads all images for that location and those dates from a web service and saves them to a network share in a file structure like this:
    • image_data
      • 4QFJ (the location code)
        • 2022
          • 02
            • 16
              • image.jpg
              • ... (some auxiliary data, skipped for brevity)
  2. Filter data: Given a parameter filtering_method, selects those images that satisfy certain criteria. The result could, for instance, be a JSON list of image paths. The method parameter would probably also go inside params.yaml, I presume.
  3. Process data: Takes the set of valid images from the previous step, loads them from the network drive (see first step), analyzes and processes them. Depending on the filtering_method this list is longer or shorter. In reality, this step is actually multiple separate steps, each of which has in- and outputs itself, but for simplicity that part is skipped. Output is a JSON file that is then saved to the network drive again, e.g. at results/4QFJ_2022-02-16.json.

Issues

As you can see, this is quite different from a classical ML preprocess - train - evaluate flow. Specifically, what makes me doubt if DVC is easily applicable here is:

  1. The pipeline depends on “runtime” parameters (location, date) as input, that likely change for each run and are passed by the user (or an automated script, rather). How to model these? Update params.yaml each time?
    • Additional question for general understanding: say params.yaml contained only one key, foo. If I run the pipeline with foo: 4, then with foo: 5 and once again with foo: 4, would the last run be skipped, because it had earlier run with the same set of dependencies already?
  2. The files to be used as dependencies are not a static set, but rather vary, depending on input parameters. It’s not always data/train.csv or so, but instead the set of dependencies (the filenames) to stage 2 vary with varying input parameters to step 1. How to help with this? Maybe using template variables in dvc.yaml?
  3. Preferably, we’d also like to send a notification for each pipeline run by calling an online REST API. In other words, some of the pipeline steps have side effects, that can’t really be tracked with files. This is probably also a point where DVC falls short for us, because, if I ran the pipeline twice with same parameters and data, only one notification would be sent. However, this point is not too crucial.

What do you think, is DVC a good fit for us and if yes, how do I work around these “issues”?

The pipeline depends on “runtime” parameters (location, date) as input, that likely change for each run and are passed by the user (or an automated script, rather). How to model these? Update params.yaml each time?

Updating params.yaml would be the way to do this in DVC pipelines. If you are already generating the location and date via an automated script, what about just having the script write the appropriate values into params.yaml? The automated script would then be separate from your pipeline: it would update params.yaml and then trigger dvc repro.
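A minimal sketch of such a wrapper script, assuming params.yaml holds only the two runtime parameters from the question (the key names location and dates, and the helper names render_params and trigger_run, are illustrative). If params.yaml also carries other keys such as filtering_method, the script would need to preserve them instead of overwriting the file.

```python
import json
import subprocess
from pathlib import Path


def render_params(location: str, dates: list[str]) -> str:
    """Render the params.yaml content for one pipeline run.

    json.dumps produces double-quoted strings, which are also valid YAML
    scalars, so a value such as "2022-02-16" stays a string.
    """
    return (
        f"location: {json.dumps(location)}\n"
        f"dates: {json.dumps(','.join(dates))}\n"
    )


def trigger_run(location, dates, repo_dir=".", reproduce=True):
    """Write the runtime parameters, then hand over to DVC."""
    params_file = Path(repo_dir) / "params.yaml"
    # Assumption: overwriting the whole file is acceptable here.
    params_file.write_text(render_params(location, dates), encoding="utf-8")
    if reproduce:
        # Requires the dvc CLI on PATH and a dvc.yaml in repo_dir.
        subprocess.run(["dvc", "repro"], cwd=repo_dir, check=True)
    return params_file
```

A scheduler or request handler could then call something like trigger_run("4QFJ", ["2022-02-16"]) for each requested location and date.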

  • Additional question for general understanding: say params.yaml contained only one key, foo. If I run the pipeline with foo: 4, then with foo: 5 and once again with foo: 4, would the last run be skipped, because it had earlier run with the same set of dependencies already?

Yes, in this case DVC would check out the outputs generated by the previous foo: 4 run instead of re-executing the stage. If your stage is non-deterministic (and you expect different outputs each time even if the parameter value is unchanged), you can force DVC to always execute the stage by setting the always_changed flag.
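To make the foo: 4, foo: 5, foo: 4 behaviour concrete, here is a toy model of the idea. It is not DVC's actual implementation: ToyRunCache and square are made-up names, and the cache key is simply a hash of the parameters. The always_changed argument mimics the effect of the flag mentioned above.

```python
import hashlib
import json


class ToyRunCache:
    """Toy model of stage-level caching; not how DVC is implemented."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0

    def run(self, params, stage, always_changed=False):
        # Identify a run by a hash of its inputs (here: only the params).
        key = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()
        if not always_changed and key in self._outputs:
            return self._outputs[key]  # reuse the earlier result
        self.executions += 1
        result = stage(params)
        self._outputs[key] = result
        return result


def square(params):
    """Stand-in for a deterministic pipeline stage."""
    return params["foo"] ** 2


cache = ToyRunCache()
for foo in (4, 5, 4):
    cache.run({"foo": foo}, square)
print(cache.executions)  # 2: the second foo=4 run is served from the cache

forced = ToyRunCache()
for foo in (4, 5, 4):
    forced.run({"foo": foo}, square, always_changed=True)
print(forced.executions)  # 3: every run executes
```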

The files to be used as dependencies are not a static set, but rather vary, depending on input parameters. It’s not always data/train.csv or so, but instead the set of dependencies (the filenames) to stage 2 vary with varying input parameters to step 1. How to help with this? Maybe using template variables in dvc.yaml?

Are the files consistent based on the input parameters, meaning for a given location, date combination, the set of files will always be the same (but the files/filenames will be different for other location, date combinations)?

In this case, you can just tell DVC to track a single output directory path, and have your “download data” stage download all of the files into that directory. You don’t need to tell DVC each individual path to track. You would then use that directory path as the dependency for the next (filtering) stage in your pipeline.

Preferably, we’d also like to send a notification for each pipeline run by calling an online REST API. In other words, some of the pipeline steps have side effects, that can’t really be tracked with files. This is probably also a point where DVC falls short for us, because, if I ran the pipeline twice with same parameters and data, only one notification would be sent. However, this point is not too crucial.

This could also potentially be accomplished using the always_changed flag to force DVC to run stages with these types of side effects even when dependencies haven’t changed.

Thanks a lot for your comprehensive answer and for helping me understand DVC better!

Are the files consistent based on the input parameters, meaning for a given location, date combination, the set of files will always be the same

Yes. Their paths will include both parameters, as well as another component that is determined by the script at runtime. Although it’s not known upfront, it’s always the same for the same input parameters.

You would then use that directory path as the dependency for the next (filtering) stage in your pipeline.

But then I couldn’t really benefit from the dependency tracking / hashing feature in its entirety anymore, could I?

Let’s say I specify ~/data as a dependency. Let the pipeline then run for some date d1, resulting, e.g. in ~/data/d1/* being created. Let it then run for d2, resulting in ~/data/d2/* and once again for d1. Since contents of ~/data have changed, the third run would still be performed, although the data relevant for d1 hasn’t changed, right?

In addition, I have one more question. I’d like to know whether it’s a good practice to use DVC with a shared file system (e.g. NFS / CIFS mount, WebDAV, …)? Can I run my DVC pipeline on two different hosts that both access the same file system and still get DVC’s benefits? Should dvc.lock then also reside on the shared drive?

But then I couldn’t really benefit from the dependency tracking / hashing feature in its entirety anymore, could I?

Let’s say I specify ~/data as a dependency. Let the pipeline then run for some date d1, resulting, e.g. in ~/data/d1/* being created. Let it then run for d2, resulting in ~/data/d2/* and once again for d1. Since contents of ~/data have changed, the third run would still be performed, although the data relevant for d1 hasn’t changed, right?

My suggestion was to just use data/ to store a single dependency’s dataset. So rather than have data/d1/*, data/d2/*, data/d3/* your fetch stage would just clear the entire contents of data/ and then download the files directly into data/* (rather than appending each fetch dataset on top of the previous datasets). After the first run, data/ would only contain the d1 files, after the second run data/ would only contain the d2 files, and so on.
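A rough sketch of such a fetch step, assuming the actual download is wrapped in a callable that returns file contents keyed by relative path (refresh_data_dir and the fetch callable are hypothetical names; a real stage would call your web service instead):

```python
import shutil
from pathlib import Path
from typing import Callable, Dict


def refresh_data_dir(data_dir: str, fetch: Callable[[], Dict[str, bytes]]) -> Path:
    """Replace the contents of data_dir with exactly one fetched dataset."""
    target = Path(data_dir)
    # Drop whatever the previous run left behind, so datasets never pile up.
    shutil.rmtree(target, ignore_errors=True)
    target.mkdir(parents=True)
    for relative_name, content in fetch().items():
        path = target / relative_name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    return target
```

After each call, data/ holds only the files of the most recent location and date combination.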

Doing it this way means that the dependency tracking/hashing in DVC will still work, since DVC will be aware of the multiple different possible “dep states” that it has seen for data/ (since data will have a different possible hash for each of d1, d2, d3, …). So the next time you fetch d1, DVC would see that it has executed a run for this data/ hash (via the DVC run-cache) and then the subsequent (non-fetch) stages in your pipeline would not be re-run.
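The idea that one directory has one hash per dataset can be illustrated with a toy directory hash. This is only a sketch of the principle and not DVC's real hashing scheme; dir_hash and replace_contents are illustrative helpers.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def dir_hash(root) -> str:
    """Hash a directory from its relative file paths and file contents."""
    base = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        digest.update(path.relative_to(base).as_posix().encode())
        digest.update(b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


def replace_contents(root, files) -> None:
    """Clear root and write one dataset into it (a stand-in for the fetch stage)."""
    shutil.rmtree(root, ignore_errors=True)
    Path(root).mkdir(parents=True)
    for name, content in files.items():
        (Path(root) / name).write_bytes(content)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    replace_contents(data, {"image.jpg": b"pixels for d1"})
    hash_d1 = dir_hash(data)
    replace_contents(data, {"image.jpg": b"pixels for d2"})
    hash_d2 = dir_hash(data)
    replace_contents(data, {"image.jpg": b"pixels for d1"})
    hash_d1_again = dir_hash(data)

print(hash_d1 == hash_d1_again)  # True: same dataset, same directory hash
print(hash_d1 == hash_d2)        # False: a different dataset gives a different hash
```

Because fetching d1 a second time reproduces the first hash, a cache keyed on that hash can recognise the state it has already processed.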

I’d like to know whether it’s a good practice to use DVC with a shared file system (e.g. NFS / CIFS mount, WebDAV, …)?

The recommended way of doing this would be to keep your DVC cache directory on the network mount, and configure DVC to use symlinks to that cache directory (see https://dvc.org/doc/user-guide/how-to/share-a-dvc-cache#how-to-share-a-dvc-cache). This way, all of your DVC-tracked data would remain on the network mount, but your main repo workspace would still be on a local drive. This also allows you to re-use one shared cache directory across multiple DVC repos if that is something you are looking for. (Note that the shared cache scenario works for shared local directories as well; using network storage isn’t required.)
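As a sketch, the two settings described above could be applied from a small setup script. The helper names and the example cache path are assumptions for illustration; the linked guide is the authoritative reference and lists further options.

```python
import subprocess


def shared_cache_commands(cache_dir: str) -> list[list[str]]:
    """Build the dvc CLI calls for a cache on a shared mount."""
    return [
        # Point this repo's cache at the shared directory.
        ["dvc", "cache", "dir", cache_dir],
        # Link workspace files to the cache instead of copying them.
        ["dvc", "config", "cache.type", "symlink"],
    ]


def configure_shared_cache(cache_dir: str, repo_dir: str = ".", dry_run: bool = True) -> None:
    """Print the commands, or run them inside repo_dir when dry_run is False."""
    for command in shared_cache_commands(cache_dir):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, cwd=repo_dir, check=True)


# Hypothetical mount point; replace with your own network share.
configure_shared_cache("/mnt/shared/dvc-cache")
```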

It is also possible to use the network mount as your main repo workspace directory, but this comes with some caveats. In general, DVC is not optimized for use on network filesystems, so doing this will likely come with a significant performance hit (compared to keeping your repo workspace on local storage).

There are also a few DVC config options that may need to be set in this case, depending on your use case and specific network filesystem:

  • core.hardlink_lock - specifically for NFS
  • state.dir, index.dir - DVC uses SQLite for caching certain internal lookups, but SQLite does not work properly on certain network filesystems, so you will likely need to set these options to point to a local filesystem directory (like /tmp)