Does DVC deal well with partitioned Parquet files in Hive format?

Parquet files in Hive format partitioned by col1 are, in fact, directories in the format:


/myfile.parquet
    /_common_metadata
    /_metadata
    /col1=val1
        /part0.parquet
        /part1.parquet
        /...
    /col1=val2
        /part10.parquet
        /part12.parquet
        /...
    /col1=...
        /part20.parquet
        /part21.parquet
        /...


The tree structure can go even deeper if the Parquet file is partitioned by several columns rather than just col1.

Will DVC deal well with this type of file, both as a dependency and as an output?

Example, where myfile1.parquet and myfile2.parquet are Parquet files in Hive format, partitioned by some column(s):

dvc add myfile1.parquet

# run script.py which outputs myfile2.parquet
dvc run -d myfile1.parquet -d script.py -o myfile2.parquet python script.py

Hi @andrethrill !

Will DVC deal well with this type of file, both as a dependency and as an output?

To DVC, a Parquet file in this format is just a directory, and directories are indeed supported both as dependencies and as outputs.
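
For example (a minimal sketch using the layout from your question, assuming a DVC remote is already configured):

dvc add myfile1.parquet                  # hashes every partition file and tracks the directory as one artifact
git add myfile1.parquet.dvc .gitignore   # commit the pointer file, not the data
dvc push                                 # uploads the partition files to the remote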

Thanks,
Ruslan


I have the same issue: I want to read a Parquet file that is in fact a directory containing more files, following Parquet's partitioning scheme, which is how services like Databricks store it. How can I read it through the DVC API? When I point it at the directory I get this error: IsADirectoryError: 'data/tests_dvc.parquet/' is a directory. Here is my code:

import dvc.api
import pandas as pd

with dvc.api.open(
    path='data/tests_dvc.parquet/',
    repo='link_to_my_repo',
    remote_config=remote_config,
    rev="my/dvc",
    mode='rb'
) as f:
    df = pd.read_parquet(f)
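
Since dvc.api.open expects a single file, one way to read a partitioned directory is to go through DVCFileSystem instead: download the whole directory locally and let pandas/pyarrow walk the partitions. This is a sketch reusing the repo URL, revision, and path from the snippet above; whether remote_config can be forwarded to DVCFileSystem the same way as with dvc.api.open may depend on your DVC version.

import pandas as pd
from dvc.api import DVCFileSystem

# Filesystem view of the DVC repo at the given revision
# (remote_config is assumed to work like in dvc.api.open)
fs = DVCFileSystem('link_to_my_repo', rev='my/dvc', remote_config=remote_config)

# Pull the whole partitioned directory into a local folder,
# then read it as a regular partitioned Parquet dataset
fs.get('data/tests_dvc.parquet', 'tests_dvc_local', recursive=True)
df = pd.read_parquet('tests_dvc_local')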