Using DuckDB to query DVC-versioned files directly in an object storage remote?

petebachant · March 14, 2025, 2:31pm

With DuckDB you can easily query an S3 bucket based on a glob pattern, taking advantage of Hive partitioning and Parquet column pruning to avoid downloading every file and column, e.g.,

select some_col from 'mydata/*.parquet' where some_other_col = 5

Since DVC stores files in object storage under their MD5 hash rather than their workspace path, glob patterns that would work locally match nothing in S3.
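
To make the mismatch concrete, here is a minimal sketch of how a file's content maps to an object key in the remote. It assumes the DVC 3.x layout of `files/md5/<first two hex chars>/<remaining chars>`; the function name `remote_object_key` and the `prefix` parameter are hypothetical, for illustration only, and are not part of DVC's API.

```python
import fnmatch
import hashlib
from pathlib import PurePosixPath


def remote_object_key(content: bytes, prefix: str = "files/md5") -> str:
    """Return the key a content-addressed store would use for `content`.

    Assumption: the remote uses the DVC 3.x layout, where the first two
    hex characters of the MD5 digest form a directory and the rest form
    the object name.
    """
    digest = hashlib.md5(content).hexdigest()
    return str(PurePosixPath(prefix) / digest[:2] / digest[2:])


if __name__ == "__main__":
    key = remote_object_key(b"hello")
    print(key)
    # The workspace-style glob from the query cannot match a hashed key,
    # because the key carries neither the directory name nor the extension.
    print(fnmatch.fnmatch(key, "mydata/*.parquet"))
```

The workspace path survives only in the `.dvc` or `dvc.lock` metadata kept in Git, which is why a glob over the bucket has nothing to match against.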

Has anyone dealt with a similar use case, where you don’t want to dvc pull all files to run the query locally? The only solution I can think of at the moment is to duplicate the working directory structure at a given commit somewhere else in object storage and query that.

petebachant · March 14, 2025, 2:36pm

I just realized DuckDB is compatible with fsspec filesystems, and DVC provides one. Would it be possible to adapt the example in the docs to use a DVC filesystem?
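
A rough, untested sketch of what that might look like, using `DVCFileSystem` from `dvc.api` and DuckDB's `register_filesystem`. The helper names `fs_scheme` and `query_dvc_tracked`, the repository URL, and the revision are hypothetical placeholders. Whether DuckDB's globbing and predicate pushdown behave well through this filesystem is an assumption that would need checking; the URL scheme is read from the filesystem object rather than hard-coded because I am not certain which protocol name `DVCFileSystem` advertises.

```python
def fs_scheme(fs) -> str:
    """Return the URL scheme an fsspec filesystem advertises.

    fsspec allows `protocol` to be a string or a tuple of aliases;
    when it is a tuple, the first alias is used.
    """
    protocol = fs.protocol
    return protocol if isinstance(protocol, str) else protocol[0]


def query_dvc_tracked(repo_url: str, rev: str, pattern: str, sql_template: str):
    """Run a DuckDB query over DVC-tracked files without `dvc pull`.

    `sql_template` must contain a `{source}` placeholder for the file glob.
    """
    # Imported lazily because duckdb and dvc are third-party packages.
    import duckdb
    from dvc.api import DVCFileSystem

    # Resolve workspace paths at a given Git revision to remote objects.
    fs = DVCFileSystem(repo_url, rev=rev)

    con = duckdb.connect()
    con.register_filesystem(fs)

    source = f"{fs_scheme(fs)}://{pattern}"
    return con.execute(sql_template.format(source=source)).fetchall()


if __name__ == "__main__":
    rows = query_dvc_tracked(
        repo_url="https://example.com/org/repo.git",  # hypothetical repository
        rev="main",  # hypothetical revision
        pattern="mydata/*.parquet",
        sql_template=(
            "select some_col from read_parquet('{source}') "
            "where some_other_col = 5"
        ),
    )
    print(rows)
```

If this works, the glob is resolved against the workspace layout recorded in the repository at `rev`, so there is no need to duplicate the directory structure elsewhere in object storage.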
