Does DVC deal well with partitioned Parquet files in Hive format?

Parquet files in Hive format partitioned by col1 are, in fact, directories with the following layout:


/myfile.parquet
    /_common_metadata
    /_metadata
    /col1=val1
        /part0.parquet
        /part1.parquet
        /...
    /col1=val2
        /part10.parquet
        /part12.parquet
        /...
    /col1=...
        /part20.parquet
        /part21.parquet
        /...


The tree structure can go even deeper if the Parquet file is partitioned by several columns rather than just col1.
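To make the layout concrete, here is a minimal sketch in plain Python that builds an empty stand-in for such a directory and reads the partition values back from the folder names. It only illustrates the directory structure; it does not write real Parquet data and does not use DVC. The second partition column col2, its values, and the helper names are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def make_hive_layout(root: Path) -> None:
    """Create an empty stand-in for a dataset partitioned by col1, then col2."""
    for col1 in ("val1", "val2"):
        for col2 in ("a", "b"):  # col2 and its values are hypothetical
            part_dir = root / f"col1={col1}" / f"col2={col2}"
            part_dir.mkdir(parents=True)
            # Placeholder file; a real writer would put Parquet data here.
            (part_dir / "part0.parquet").touch()
    (root / "_metadata").touch()
    (root / "_common_metadata").touch()


def partition_values(root: Path, data_file: Path) -> dict:
    """Parse the key=value directory names between root and a data file."""
    parts = data_file.relative_to(root).parent.parts
    return dict(p.split("=", 1) for p in parts)


def list_partitions(root: Path) -> list:
    """Return (partition values, file name) for every data file under root."""
    files = sorted(f for f in root.rglob("*.parquet") if f.is_file())
    return [(partition_values(root, f), f.name) for f in files]


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "myfile.parquet"  # the "file" is really a directory
    make_hive_layout(dataset)
    partitions = list_partitions(dataset)
    for values, name in partitions:
        print(values, name)
```

Each extra partition column adds one more level of key=value folders, so the whole "file" remains an ordinary nested directory.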

Will DVC deal well with this type of file, both as a dependency and as an output?

For example, where myfile1.parquet and myfile2.parquet are Parquet files in Hive format partitioned by one or more columns:

dvc add myfile1.parquet

# run script.py, which outputs myfile2.parquet
dvc run -d myfile1.parquet -d script.py -o myfile2.parquet python script.py

Hi @andrethrill !

Will DVC deal well with this type of file, both as a dependency and as an output?

For DVC, a Parquet file in this format is just a directory, and directories are indeed supported both as outputs and as dependencies.

Thanks,
Ruslan
