Working on remote data

I want to use DVC to version a large dataset folder (too large to fit in local storage) that is stored in an S3-like object store.
In my Git repo, I used the following command to track my remote dataset:

dvc import-url --no-download remote://minio/dataset_v1.0 data

So, I have a “dataset_v1.0.dvc” in my local “data” folder.

Now, I want to write a Python script that loads and transforms this dataset (several subdirs containing files).

I am a bit confused about which path I should use in my Python code to open these files.

e.g. using open("dataset/dataset_v1.0.dvc/some_subdir/some_file.csv", "r") does not seem to be a good idea.

What should I do?

Have you looked at DVCFileSystem? That API should give you enough flexibility to work with the files however you need.
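
For what it's worth, here is a rough sketch of how that could look. It rests on a few assumptions that may not hold for your setup: the script runs from the root of the Git/DVC repo, the import is tracked as data/dataset_v1.0 (the path of the .dvc file without the ".dvc" suffix), and your DVC version can actually fetch the contents of a --no-download import from the MinIO remote. The helper names iter_csv_files and print_csv_summary are made up for illustration; only DVCFileSystem and its fsspec-style find() and open() methods come from DVC.

```python
import csv


def iter_csv_files(fs, root):
    """Yield (path, rows) for every .csv file found under `root`.

    `fs` can be any fsspec-style filesystem object that exposes
    find() and open(); DVCFileSystem is one such object.
    """
    for path in sorted(fs.find(root)):
        if not path.endswith(".csv"):
            continue
        # Text mode, so csv.reader receives str lines.
        with fs.open(path, "r") as handle:
            yield path, list(csv.reader(handle))


def print_csv_summary(repo=".", root="data/dataset_v1.0"):
    """Print the row count of each CSV file tracked under `root`.

    Assumptions: `repo` is the root of the Git/DVC repo, and `root`
    is the tracked path (no ".dvc" suffix).
    """
    # Imported lazily so iter_csv_files stays usable without DVC installed.
    from dvc.api import DVCFileSystem

    fs = DVCFileSystem(repo)
    for path, rows in iter_csv_files(fs, root):
        print(path, len(rows))
```

Calling print_csv_summary() from the repo root should list each CSV with its row count. The point is that you address files by their tracked path inside the repo, e.g. data/dataset_v1.0/some_subdir/some_file.csv, rather than going through the .dvc file itself.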