Does DVC deal well with partitioned Parquet files in Hive format?

Parquet files in Hive format partitioned by col1 are, in fact, directories in the format:


/myfile.parquet
    /_common_metadata
    /_metadata
    /col1=val1
        /part0.parquet
        /part1.parquet
        /...
    /col1=val2
        /part10.parquet
        /part12.parquet
        /...
    /col1=...
        /part20.parquet
        /part21.parquet
        /...


The tree structure can go even deeper if the Parquet file is partitioned by several columns rather than just col1.

Will DVC deal well with this type of file, both as a dependency and as an output?

Example, where myfile1.parquet and myfile2.parquet are Parquet files in Hive format, partitioned by some column(s):

dvc add myfile1.parquet

# run script.py which outputs myfile2.parquet
dvc run -d myfile1.parquet -d script.py -o myfile2.parquet python script.py

Hi @andrethrill !

Will DVC deal well with this type of file, both as a dependency and as an output?

To DVC, a Parquet file in this format is just a directory, and directories are indeed supported both as dependencies and as outputs.
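
For example (a minimal sketch using the layout from your question, assuming a DVC remote is already configured):

dvc add myfile1.parquet                  # hashes every partition file and tracks the directory as one artifact
git add myfile1.parquet.dvc .gitignore   # commit the pointer file, not the data
dvc push                                 # uploads the partition files to the remote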

Thanks,
Ruslan


I have the same issue: I want to read a Parquet file that is in fact a directory containing more files, following Parquet's partitioning scheme, which is how services like Databricks store it. How can I read it through the DVC API? When I point it at the directory I get this error: IsADirectoryError: 'data/tests_dvc.parquet/' is a directory. Here is my code:

import dvc.api
import pandas as pd

with dvc.api.open(
    path='data/tests_dvc.parquet/',
    repo='link_to_my_repo',
    remote_config=remote_config,
    rev="my/dvc",
    mode='rb'
) as f:
    df = pd.read_parquet(f)
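
Since dvc.api.open expects a single file, one way to read a partitioned directory is to go through DVCFileSystem instead: download the whole directory locally and let pandas/pyarrow walk the partitions. This is a sketch reusing the repo URL, revision, and path from the snippet above; whether remote_config can be forwarded to DVCFileSystem the same way as with dvc.api.open may depend on your DVC version.

import pandas as pd
from dvc.api import DVCFileSystem

# Filesystem view of the DVC repo at the given revision
# (remote_config is assumed to work like in dvc.api.open)
fs = DVCFileSystem('link_to_my_repo', rev='my/dvc', remote_config=remote_config)

# Pull the whole partitioned directory into a local folder,
# then read it as a regular partitioned Parquet dataset
fs.get('data/tests_dvc.parquet', 'tests_dvc_local', recursive=True)
df = pd.read_parquet('tests_dvc_local')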