DVC compatibility with parquet files (partitioned directory)

I want to read a partitioned parquet dataset, i.e. a directory that in turn contains more parquet files, as services like Databricks store them. How can I read it through the DVC API? When the path is a directory, I get this error: IsADirectoryError: 'data/tests_dvc.parquet/' is a directory. Here is my code:

import dvc.api
import pandas as pd

with dvc.api.open(
    path='data/tests_dvc.parquet/',
    repo='link_to_my_repo',
    remote_config=remote_config,
    rev="my/dvc",
    mode='rb'
) as f:
    df = pd.read_parquet(f)

Hi, dvc.api.open, similar to builtins.open, can only open a single file.

You can, however, use DVCFileSystem, which provides an fsspec-compatible filesystem interface that you can use with pd.read_parquet.

import dvc.api
import pandas as pd

filesystem = dvc.api.DVCFileSystem(
    'link_to_my_repo',
    remote_config=remote_config,
    rev="my/dvc",
)
df = pd.read_parquet(path='/data/tests_dvc.parquet/', filesystem=filesystem)
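
If you want to sanity-check the path first, the fsspec-compatible interface also lets you list the partition files. A minimal sketch, reusing the filesystem object above:

# find() walks the directory recursively (standard fsspec behavior)
# and returns the paths of all partition files inside it.
print(filesystem.find('/data/tests_dvc.parquet/'))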

I tried the alternative you mentioned, but it gives me an error: AttributeError: 'str' object has no attribute 'root_dir'

Can you please share the script that you tried, and the traceback?

Also, please make sure that you are using the latest dvc version (3.59.1 at the time of writing).

And please double-check that you are not passing the repo= keyword argument to DVCFileSystem; unlike dvc.api.open, it takes the repository as either the url= keyword argument or the first positional argument.
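
To illustrate the difference, here is a quick sketch using the repo URL from your example (the comment about repo= is an inference from your traceback, which suggests it expects a Repo object rather than a string):

import dvc.api

# dvc.api.open identifies the repository via the repo= keyword:
# dvc.api.open(path='...', repo='link_to_my_repo', ...)

# DVCFileSystem instead takes the URL positionally or via url=:
fs = dvc.api.DVCFileSystem('link_to_my_repo', rev="my/dvc")
fs = dvc.api.DVCFileSystem(url='link_to_my_repo', rev="my/dvc")

# Passing repo='link_to_my_repo' here would raise the AttributeError
# you saw, because the string has no root_dir attribute.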

Looks like I also made a mistake in the code above: pd.read_parquet takes a filesystem kwarg, not fs. I have corrected that above.
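
One more caveat: the filesystem= keyword was only added to pd.read_parquet in pandas 2.1, so make sure your pandas is recent enough. If you are stuck on an older version, a workaround (just a sketch, reusing the names from the example above) is to download the partitioned directory locally via the fsspec get method and read it from disk:

import tempfile

import dvc.api
import pandas as pd

filesystem = dvc.api.DVCFileSystem(
    'link_to_my_repo',
    remote_config=remote_config,
    rev="my/dvc",
)

# Download the whole partitioned dataset into a temporary directory,
# then let pandas/pyarrow discover the partition files on local disk.
with tempfile.TemporaryDirectory() as tmpdir:
    filesystem.get('/data/tests_dvc.parquet/', tmpdir, recursive=True)
    df = pd.read_parquet(tmpdir)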

Thank you very much for the help with this error; the code you provided has worked for me with the correction you mention.
