Hi there. Like many, I’m still on the learning curve with DVC experiments, and I’m trying to get my head around what persists and what doesn’t before and after Git-committing the current workspace, and what remains accessible at different points in the process.
What I would like to know is whether it’s possible to use an experiment name to dig into the DVC and/or Git space and access all the input data, output data and artefacts associated with a particular experiment run. This could be at an arbitrary point in time, after several Git commits and experiments have been made since the experiment in question.
Note that to run experiments I’m using the command dvc exp run -S <param-1> -S <param-2> ...
Thanks.
You can use dvc exp apply or dvc exp branch to recreate the entire state of any experiment. If you want to retrieve specific artifacts without changing your entire workspace, you can use dvc get or the Python API and pass the SHA associated with the experiment.
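As a rough sketch of the Python API route, something like the following could copy a single file out of an experiment without touching the workspace. The helper names output_name_for and fetch_experiment_file are made up for illustration, and it assumes the path is tracked in that experiment and that rev accepts the experiment name or SHA (as the dvc.api documentation describes).

```python
from pathlib import Path


def output_name_for(path: str, rev: str) -> str:
    """Build a local file name that records which experiment the copy came from."""
    p = Path(path)
    tag = rev.replace("-", "_")
    return f"{p.stem}_{tag}{p.suffix}"


def fetch_experiment_file(path: str, rev: str, repo: str = ".") -> str:
    """Copy one tracked file from the given revision into the current directory.

    Hypothetical helper: `rev` is assumed to be an experiment name or SHA that
    DVC can resolve, and `repo` the path or URL of the DVC project.
    """
    # Imported lazily so output_name_for can be used without DVC installed.
    import dvc.api

    out = output_name_for(path, rev)
    # mode="rb" makes dvc.api.read return the raw bytes of the file.
    data = dvc.api.read(path, repo=repo, rev=rev, mode="rb")
    Path(out).write_bytes(data)
    return out


# Example call (needs a DVC project containing that experiment):
# fetch_experiment_file("data/preprocessed/encoded_pat_data.parquet", "tangy-inks")
```

Note that path is always given relative to the project root, while repo points at the project itself.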
Thank you. I tried both dvc get and the Python API. Both approaches get the requested file if I don’t specify the experiment SHA (654c41d) / name (tangy-inks), but fail when I do.
Below are some commands and responses when using dvc get:
# These 2 commands get the file and store it in the --out location.
dvc get data/preprocessed encoded_pat_data.parquet --out encoded_pat_data_tangy_inks.parquet
dvc get . data/preprocessed/encoded_pat_data.parquet --out encoded_pat_data_tangy_inks.parquet
# The following 2 commands give: ERROR: unexpected error - [Errno 2] No such file or directory: 'encoded_pat_data.parquet': ('encoded_pat_data.parquet',)
dvc get data/preprocessed encoded_pat_data.parquet --rev tangy-inks --out encoded_pat_data_tangy_inks.parquet
dvc get data/preprocessed encoded_pat_data.parquet --rev 654c41d --out encoded_pat_data_tangy_inks.parquet
When using the API here’s what happens:
Code:
import dvc.api
import pandas as pd
REV = 'tangy-inks'
# REV = '654c41d' # 'tangy-inks'
with dvc.api.open(
    path="data/preprocessed/encoded_pat_data.parquet",
    mode="rb",
    # rev=REV
) as f:
    print(pd.read_parquet(f))
Output when commenting out the rev=REV line: Success: shows the expected DataFrame.
Error when uncommenting it:
PathMissingError: The path 'data/preprocessed/encoded_pat_data.parquet' does not exist in the target repository 'data/preprocessed/encoded_pat_data.parquet' neither as a DVC output nor as a Git-tracked file.
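It looks like there is a typo in your dvc get command: the first argument is the repository (here . for the current one) and the second is the full path of the file within it. It should look like: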
dvc get . data/preprocessed/encoded_pat_data.parquet --rev tangy-inks --out encoded_pat_data_tangy_inks.parquet
However, if dvc.api.open is giving a similar error, it may mean that this path was not tracked as a DVC output in that experiment, as the message suggests. You could try dvc exp apply to restore the state of that experiment and check whether data/preprocessed/encoded_pat_data.parquet is tracked as a DVC output there.
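If you would rather check this without applying the experiment, a sketch along these lines might help. It relies on dvc.api.DVCFileSystem, and assumes it can resolve the experiment name or SHA passed as rev. The helper names open_experiment_fs and describe_path are made up for illustration.

```python
import posixpath


def open_experiment_fs(rev, repo="."):
    """Open a read-only view of the project at `rev` (assumed to be resolvable by DVC)."""
    # Imported lazily so describe_path can be exercised without DVC installed.
    from dvc.api import DVCFileSystem

    return DVCFileSystem(repo, rev=rev)


def describe_path(fs, path):
    """Report whether `path` exists in `fs`; if not, list what its parent directory holds.

    `fs` can be any fsspec-style filesystem exposing exists() and ls().
    """
    if fs.exists(path):
        return {"exists": True, "siblings": []}
    parent = posixpath.dirname(path)
    if parent and fs.exists(parent):
        siblings = sorted(fs.ls(parent, detail=False))
    else:
        siblings = []
    return {"exists": False, "siblings": siblings}


# Example call (needs a DVC project containing that experiment):
# fs = open_experiment_fs("tangy-inks")
# print(describe_path(fs, "data/preprocessed/encoded_pat_data.parquet"))
```

If the file is reported as missing, the siblings list shows what is actually present in data/preprocessed for that experiment, which may reveal that the output had a different name or location at the time.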