I need to build a data pipeline that basically goes like this:
- download new data hourly, occasionally creating a new file in a raw-data directory
- extract the data into a usable form, creating a new file in an output directory
- munch on all the data for any day that has changed hourly data
- place the changed output files into a directory for an online query service
As I see it, having an entire directory as the output of steps 1, 2, and 3, and entire directories as dependencies for steps 2, 3, and 4, could work, but the processing steps would then need to examine their input directories to see what has changed since they last ran. I would prefer to push that logic into the workflow manager.
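For concreteness, here is a minimal sketch of this first option as I imagine it in a dvc.yaml, with hypothetical script and directory names (and I believe `always_changed` is the right flag to make the download stage re-run every time, though I may be misremembering the field):

```yaml
stages:
  download:                       # hypothetical scripts and directory names throughout
    cmd: python download.py data/raw/
    always_changed: true          # assume this forces the download to run on every repro
    outs:
      - data/raw
  extract:
    cmd: python extract.py data/raw/ data/extracted/
    deps:
      - data/raw
    outs:
      - data/extracted
  daily:
    cmd: python daily_munch.py data/extracted/ data/daily/
    deps:
      - data/extracted
    outs:
      - data/daily
  publish:
    cmd: python publish.py data/daily/ serve/
    deps:
      - data/daily
    outs:
      - serve
```

Directory-level deps and outs like this seem to work in DVC, but deciding what inside each directory actually changed since the last run is still left to my scripts, which is exactly the bookkeeping I would rather not write.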
An alternative would be to add dependencies or outputs each time a new file appears, but that seems like just as much work as managing the workflow manually, and it leads to a fairly ugly DAG. At least I could version the workflow steps intelligently this way, unlike with the third option below.
A third option would be to create an entire dependency chain for each new file downloaded by step 1. This leads to a really furry processing DAG that will be hard to understand and makes it nearly impossible to change the processing steps consistently.
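Concretely, every downloaded file would get its own chain of stages, something like this (the dates, file names, and scripts are just placeholders):

```yaml
stages:
  # one extract stage per downloaded hourly file, for every hour of every day
  extract_2021-06-01-00:
    cmd: python extract.py data/raw/2021-06-01-00.json data/extracted/2021-06-01-00.parquet
    deps:
      - data/raw/2021-06-01-00.json
    outs:
      - data/extracted/2021-06-01-00.parquet
  # each daily stage then has to list every hourly output of that day as a dep
  daily_2021-06-01:
    cmd: python daily_munch.py data/extracted/ data/daily/2021-06-01.parquet
    deps:
      - data/extracted/2021-06-01-00.parquet
      # ... one line per hour ...
    outs:
      - data/daily/2021-06-01.parquet
```

Multiply that by 24 hours and many days and the stage definitions become unreadable, and any change to extract.py's interface means regenerating every one of these stages.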
Other workflow engines like Airflow, Luigi, or Pachyderm might be better fits if I am trying to force-fit a square peg (DVC) into a round hole, but I really like DVC's general philosophy.
What is the best way to handle this kind of problem? Is DVC appropriate?