Accessing experiments in any arbitrary branch / commit

signorec · February 29, 2024, 12:06pm

Hi there. Like many, I’m still on the learning curve with DVC experiments, trying to get my head around what persists and what doesn’t before and after Git-committing the current workspace, and what remains accessible at different points in the process.

What I would like to know, if it’s possible, is how to use an experiment name to dig into the DVC and/or Git space and access all input data, output data and artefacts associated with a particular experiment run. This could be at an arbitrary point in time, after several Git commits and experiments have been made since the experiment in question.

Note that to run experiments I’m using the command dvc exp run -S <param-1> -S <param-2> ...

Thanks.

dberenbaum · February 29, 2024, 2:14pm

You can use dvc exp apply or dvc exp branch to recreate the entire state of any experiment. If you want to retrieve specific artifacts without changing your entire workspace, you can use dvc get or the Python API and pass the SHA associated with the experiment.

signorec · February 29, 2024, 5:16pm

Thank you. I tried both dvc get and the Python API. Both approaches get the requested file if I don’t specify the experiment SHA (654c41d) / name (tangy-inks), but fail when I do.

Below are some commands and responses when using dvc get:

# These 2 commands get the file and store it in the --out location.
dvc get data/preprocessed encoded_pat_data.parquet --out encoded_pat_data_tangy_inks.parquet
dvc get . data/preprocessed/encoded_pat_data.parquet --out encoded_pat_data_tangy_inks.parquet

# The following 2 commands give: ERROR: unexpected error - [Errno 2] No such file or directory: 'encoded_pat_data.parquet': ('encoded_pat_data.parquet',) 
dvc get data/preprocessed encoded_pat_data.parquet --rev tangy-inks --out encoded_pat_data_tangy_inks.parquet
dvc get data/preprocessed encoded_pat_data.parquet --rev 654c41d --out encoded_pat_data_tangy_inks.parquet

When using the API here’s what happens:

Code:

import dvc.api
import pandas as pd

REV = 'tangy-inks' 
# REV = '654c41d' # 'tangy-inks' 

with dvc.api.open(
    path="data/preprocessed/encoded_pat_data.parquet",
    mode="rb",
    # rev=REV
) as f:
    print(pd.read_parquet(f))

Output when commenting out the rev=REV line: Success: shows the expected DataFrame.

Error when uncommenting it:
PathMissingError: The path 'data/preprocessed/encoded_pat_data.parquet' does not exist in the target repository 'data/preprocessed/encoded_pat_data.parquet' neither as a DVC output nor as a Git-tracked file.

dberenbaum · February 29, 2024, 6:13pm

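It looks like there is a typo in your dvc get command: the first argument should be the repository (. for the current one), followed by the full path to the file. It should look like: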

dvc get . data/preprocessed/encoded_pat_data.parquet --rev tangy-inks --out encoded_pat_data_tangy_inks.parquet

However, if dvc.api.open is giving a similar error, it may mean that this path was not tracked as a DVC output in that experiment, as the message suggests. You could try dvc exp apply to restore the state of that experiment and check whether data/preprocessed/encoded_pat_data.parquet is tracked as a DVC output there.
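
To make the argument order explicit, here is a minimal sketch that assembles the corrected command from Python. The helper name build_dvc_get is hypothetical (it is not part of DVC); it only builds the argument list, with the repository first and the full path second, and prints it without running anything.

```python
import shlex


def build_dvc_get(path, rev=None, out=None, repo="."):
    """Return the argv for `dvc get <repo> <path> [--rev REV] [--out OUT]`.

    Hypothetical helper for illustration; it does not call DVC itself.
    """
    # The repository comes first ("." means the current repo), then the
    # full path of the file inside that repository.
    argv = ["dvc", "get", repo, path]
    if rev:
        argv += ["--rev", rev]  # experiment name or commit SHA
    if out:
        argv += ["--out", out]
    return argv


cmd = build_dvc_get(
    "data/preprocessed/encoded_pat_data.parquet",
    rev="tangy-inks",
    out="encoded_pat_data_tangy_inks.parquet",
)
print(shlex.join(cmd))
```

To actually execute it, the list can be passed to subprocess.run(cmd, check=True) from the root of the repository.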

daniel.castillo99 · March 3, 2025, 3:28pm

I have the same issue. I want to read a parquet dataset that is stored as a directory containing several parquet files, following the partitioning scheme that parquet uses. How can I read it through the DVC API? Services like Databricks store the dataset as a directory, so I get this error: IsADirectoryError: ‘data/tests_dvc.parquet/’ is a directory. Here is my code:

import dvc.api
import pandas as pd

with dvc.api.open(
    path='data/tests_dvc.parquet/',
    repo='link_to_my_repo',
    remote_config=remote_config,
    rev="my/dvc",
    mode='rb'
) as f:
    df = pd.read_parquet(f)
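
One possible approach, sketched under assumptions rather than confirmed in this thread: dvc.api.open yields a single file object, so a directory-style dataset may need to be downloaded first and then read locally. The sketch below assumes DVCFileSystem from dvc.api and its get(..., recursive=True) method; the helper name fetch_dvc_directory and the downloads folder are hypothetical, and the exact local layout may depend on the DVC version. Any remote_config used in the snippet above would still need to be supplied to the filesystem.

```python
from pathlib import Path


def fetch_dvc_directory(repo, path, rev, local_dir, fs=None):
    """Download a DVC-tracked directory at `rev` and return its local path.

    Hypothetical helper; `fs` can be injected, otherwise a DVCFileSystem
    is created for `repo` at revision `rev`.
    """
    if fs is None:
        # Imported lazily so the helper can be defined without DVC installed.
        from dvc.api import DVCFileSystem

        fs = DVCFileSystem(repo, rev=rev)
    # Drop the trailing slash so the path names the directory itself.
    remote_path = path.rstrip("/")
    target = Path(local_dir) / Path(remote_path).name
    # Copy the whole directory tree (all partition files) to `target`.
    fs.get(remote_path, str(target), recursive=True)
    return target


# Usage (assumes DVC and pandas with a parquet engine are installed):
# local = fetch_dvc_directory(
#     "link_to_my_repo", "data/tests_dvc.parquet/", "my/dvc", "downloads"
# )
# df = pd.read_parquet(local)  # pandas can read a directory of parquet files
```

Downloading first trades some disk space for simplicity: pandas then sees an ordinary local directory and handles the partition files itself.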
