DVC with video data

Is there any example of using DVC with video data?

Is DVC capable of managing this kind of large data in a reasonable time?

I would very much like to see some kind of example.

@Aliaga we have had users / customers with a 70TB DVC cache (videos + images). It’s possible to set up in general, though I would not necessarily recommend using DVC directly to manage large amounts of unstructured data. We have a new version of DVC, DVCx, which will be released soon. If you are interested, let’s jump on a call to discuss what can be done with DVC, discuss DVCx, and talk about how to manage large amounts of data. Let me know (I’m the CTO and co-founder) if you have time this week or next, and if it’s fine for me to reach out to you directly.

Going forward, is DVC not meant to be used for large datasets, and is that what DVCx will address?

@gregstarr DVC will operate in the same way as it does now (with all its pros and cons), and we’ll keep doing optimizations, etc. There are ways to make it efficient with large datasets (e.g. an attached shared cache + symlinks is a very powerful scenario). DVCx operates under different assumptions (e.g. it doesn’t move data) and provides a way to slice and dice data using metadata. I would be happy to catch up and show / discuss this.
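
To make the "shared cache + symlinks" idea a bit more concrete, here is a minimal Python sketch of the general mechanism. It is not DVC's actual implementation: the function names (`file_md5`, `add_to_shared_cache`) and the two-character directory layout are assumptions made for illustration only. In DVC itself this behavior is configured rather than coded, as far as I know through `dvc cache dir` (to point at the shared location) and the `cache.type` and `cache.shared` config options. The sketch assumes a platform where creating symlinks is permitted.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large video never sits fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_shared_cache(workspace_file: Path, cache_dir: Path) -> Path:
    """Store a file once in a content-addressed cache and leave a symlink behind.

    Hypothetical helper: the cache layout (<first 2 hex chars>/<rest>) is an
    illustrative assumption, not a statement about DVC's on-disk format.
    """
    checksum = file_md5(workspace_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        # Same content is already cached: drop the duplicate copy.
        workspace_file.unlink()
    else:
        # First time this content is seen: move it into the cache.
        shutil.move(str(workspace_file), str(cached))
    # The workspace now holds only a link, so no data is duplicated.
    os.symlink(cached, workspace_file)
    return cached


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        clip = root / "project" / "clip.mp4"
        clip.parent.mkdir(parents=True)
        clip.write_bytes(b"placeholder bytes standing in for a video")
        target = add_to_shared_cache(clip, root / "shared_cache")
        print(f"{clip} -> {target}")
```

The point of the design is that several projects (or several users on one machine) can reference the same multi-terabyte dataset while the bytes are stored once, and checking data out into a workspace costs a link creation rather than a copy.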

Ok, interesting. Will DVCx maintain the same level of traceability and reproducibility of DVC, e.g. by hashing data products and source code? Is DVCx meant to be a replacement for DVC or is it a different tool?

@gregstarr yes, it has traceability, reproducibility, and immutability built in, though not necessarily by calculating hashes in this case. (It would be better to show and discuss it, since it operates in a different mode.)

It’s not a replacement - it’s rather an upstream tool that helps ML engineers curate, iterate on, enrich, version, etc. their unstructured data. DVC is a downstream tool in this case - it takes a DVCx dataset, runs the pipeline, collects metrics, etc.

Let me know, happy to do a session and show how it works.

I am interested but I’m not sure it will be worth your time to do a full session. Do you have any slides or documents I could look at?

Hey, sure, I’ve sent an email, @gregstarr.