Does DVC deal well with partitioned Parquet files in Hive format?

andrethrill · July 6, 2018, 10:21am

Parquet files in Hive format partitioned by col1 are, in fact, directories in the format:


/myfile.parquet
    /_common_metadata
    /_metadata
    /col1=val1
        /part0.parquet
        /part1.parquet
        /...
    /col1=val2
        /part10.parquet
        /part12.parquet
        /...
    /col1=...
        /part20.parquet
        /part21.parquet
        /...

The tree structure can go even deeper if the Parquet file is partitioned by several columns rather than just col1.
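
To make the layout concrete, here is a minimal sketch of how the partition values can be read back from a part file's path. It uses only the Python standard library; the helper name partition_values and the second partition column col2 are hypothetical, introduced just for illustration.

```python
from pathlib import PurePosixPath


def partition_values(relative_path):
    """Return the key=value partition components found in a part file's path.

    Hypothetical helper for illustration; it only inspects directory names.
    """
    # Drop the final component (the file name) and look at the directories.
    directories = PurePosixPath(relative_path).parts[:-1]
    values = {}
    for component in directories:
        key, separator, value = component.partition("=")
        if separator:  # keep only components of the form key=value
            values[key] = value
    return values


# One partition column, as in the tree above.
single = partition_values("myfile.parquet/col1=val1/part0.parquet")

# Two partition columns (col2 is an assumed extra column): one more level.
nested = partition_values("myfile.parquet/col1=val1/col2=a/part0.parquet")

print(single)
print(nested)
```

Each additional partition column adds one more directory level, but the whole thing is still a single directory tree rooted at myfile.parquet.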

Will DVC deal well with this type of file, both as a dependency and as an output?

Example, where myfile1.parquet and myfile2.parquet are Parquet files in Hive format partitioned by some col(s):

dvc add myfile1.parquet

# run script.py, which outputs myfile2.parquet
dvc run -d myfile1.parquet -d script.py -o myfile2.parquet python script.py

kupruser · July 6, 2018, 2:36pm

Hi @andrethrill !

Will DVC deal well with this type of file, both as a dependency and as an output?

To DVC, a Parquet file in this format is just a directory, and directories are indeed supported both as outputs and as inputs (dependencies).

Thanks,
Ruslan
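
One way to picture what "just a directory" means for change tracking is the sketch below: hash every file under the directory, then derive one digest for the whole tree, so that a change to any part file changes the directory's digest. This is an illustration only, not DVC's actual implementation; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_manifest(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def directory_digest(root):
    """Combine the per-file hashes into a single digest for the directory."""
    digest = hashlib.sha256()
    for relative, file_hash in sorted(directory_manifest(root).items()):
        digest.update(f"{relative}:{file_hash}\n".encode())
    return digest.hexdigest()


# Build a tiny stand-in for the Hive layout (placeholder bytes, not real Parquet).
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "myfile.parquet"
    for partition, part in [("col1=val1", "part0.parquet"), ("col1=val2", "part10.parquet")]:
        (dataset / partition).mkdir(parents=True)
        (dataset / partition / part).write_bytes(b"placeholder")

    before = directory_digest(dataset)
    # Rewriting a single part file changes the digest of the whole directory.
    (dataset / "col1=val2" / "part10.parquet").write_bytes(b"changed")
    after = directory_digest(dataset)

print(before != after)
```

Under this view, the number of partition levels does not matter: the dataset is tracked as one unit, and any added, removed, or modified part file counts as a change to it.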
