Parquet files in Hive format partitioned by col1 are, in fact, directories in the format:
/myfile.parquet
    /_common_metadata
    /_metadata
    /col1=val1
        /part0.parquet
        /part1.parquet
        /...
    /col1=val2
        /part10.parquet
        /part12.parquet
        /...
    /col1=...
        /part20.parquet
        /part21.parquet
        /...
The tree structure can go even deeper if the Parquet file is partitioned by several columns rather than just col1.
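To make the deeper layout concrete, here is a minimal Python sketch that builds a two-level tree with empty placeholder files and reads the partition values back from the directory names. The second column col2, its values a and b, and the helper names build_demo_dataset and partition_values are hypothetical and only serve the illustration; no real Parquet data is written.

```python
import tempfile
from pathlib import Path


def partition_values(part_file: Path, dataset_root: Path) -> dict:
    """Parse {column: value} from the key=value directories above a part file."""
    relative = part_file.relative_to(dataset_root)
    values = {}
    for segment in relative.parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep:  # only directories named key=value are partition levels
            values[key] = value
    return values


def build_demo_dataset(parent: Path) -> Path:
    """Create a Hive-style tree partitioned by col1 and a hypothetical col2."""
    root = parent / "myfile.parquet"  # a directory, despite the extension
    for col1 in ("val1", "val2"):
        for col2 in ("a", "b"):
            leaf = root / f"col1={col1}" / f"col2={col2}"
            leaf.mkdir(parents=True)
            (leaf / "part0.parquet").touch()  # empty placeholder, not real data
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = build_demo_dataset(Path(tmp))
        for part in sorted(root.rglob("part*.parquet")):
            print(part.relative_to(root), partition_values(part, root))
```

Each extra partition column adds one more directory level, so whatever tracks the dataset has to treat myfile.parquet as a nested directory rather than a single file.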
Will DVC handle this type of file well, both as a dependency and as an output?
For example, where myfile1.parquet and myfile2.parquet are Parquet files in Hive format partitioned by some column(s):
dvc add myfile1.parquet
# run script.py, which outputs myfile2.parquet
dvc run -d myfile1.parquet -d script.py -o myfile2.parquet python script.py