The data versioning feature seems to be coupled with the DVC pipeline; I already have my own pipeline tool (e.g. Kedro / Airflow).
@nok, many of our users do use DVC just for data versioning. I am not that familiar with Kedro/Airflow, but you could run dvc add in your scripts, call it through our Python API, or use something integrated with those tools (Airflow's BashOperator, for example).
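As a minimal sketch of the BashOperator route (assuming Airflow 2.x; the DAG id, schedule and file paths below are made-up placeholders, not something DVC ships):

```python
# Sketch: tracking a pipeline output with `dvc add` from an Airflow task.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="version_data_with_dvc",   # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # After an upstream task writes the artifact, register it with DVC.
    dvc_add = BashOperator(
        task_id="dvc_add_output",
        bash_command="cd /path/to/repo && dvc add data/processed.csv",  # hypothetical paths
    )
```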
Is there any use case where people use another DAG library while using DVC?
DVC can still be used to get the appropriate data (using dvc get/checkout in scripts, or dvc.api in a Python codebase) or to save it. That is, use DVC only for versioning and a different tool for pipelines; you do have to put that into your scripts/pipeline steps yourself, though, it's not automatic/integrated for sure.
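For example, a small sketch of pulling a versioned file through dvc.api inside a pipeline step (the repo URL, path and revision are hypothetical placeholders):

```python
import dvc.api

# Stream the file as tracked at a specific Git revision, without a full checkout.
with dvc.api.open(
    "data/features.csv",                    # hypothetical tracked path
    repo="https://github.com/org/project",  # hypothetical repo
    rev="v1.0",                             # tag/branch/commit pinning the version
) as f:
    for line in f:
        ...  # feed the data into your Kedro/Airflow step

# Or resolve the remote storage URL of that exact version instead of reading it here.
url = dvc.api.get_url(
    "data/features.csv",
    repo="https://github.com/org/project",
    rev="v1.0",
)
```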
I am running more than one experiment; DVC doesn't fit nicely with this situation.
As you mentioned, experiments is in a beta state, though you could start using it today (and it would help us shape the feature if you share your feedback). I don't have much to add here, as you seem to be aware of this, unless I got the question wrong?
I do like DVC handling the caching for me; the downside of having a separate folder for artifacts (my current approach) is duplicate files (if the data artifacts are the same, ideally it should just point to a reference instead of writing a new file).
Just to clarify: you mean .dvc files having entries for each output location where it's been checked out? We could do that, but it might be too difficult to keep the whole repo in sync (the other entry might later point to a different version of the output, which would mean adjusting the entry elsewhere as well).
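On the duplicate-files point: DVC's cache is content-addressed, so identical artifacts end up as a single cached copy that workspace files merely reference. The snippet below is only a toy illustration of that idea, not DVC's actual implementation; the cache directory, md5-based naming and symlink strategy are all assumptions for the sake of the example.

```python
# Toy illustration (not DVC's code) of content-addressed caching:
# identical artifacts resolve to one cached copy, so a second "save"
# only creates a reference instead of writing the bytes again.
import hashlib
import os
import shutil


def cache_artifact(path: str, cache_dir: str = ".toy_cache") -> str:
    """Store `path` under its content hash and replace it with a symlink."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    cached = os.path.join(cache_dir, digest)
    if not os.path.exists(cached):
        # First artifact with this content: copy it into the cache once.
        shutil.copy2(path, cached)
    # Identical content later: no new copy, just a reference into the cache.
    os.remove(path)
    os.symlink(os.path.abspath(cached), path)
    return digest
```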