Workflow for editing code while using `dvc exp apply` to get data

I would like to use DVC to only get the data (from the cache) from a past experiment, without changing my (carefully hacked) source code to the state it was in when I ran the experiment. dvc exp apply affects both code and data and as far as I can see cannot do anything else. Is there an alternative way to achieve this idealised workflow (which does not exist):

  • Run an experiment dvc exp run -n expt-1
  • Do lots more development work, including running more experiments
  • dvc exp apply expt-1 --just_the_data_thanks
  • dvc exp run -n expt-1-data-with-new-code

Thanks :)

Interesting use case. How is this data tracked?

If it’s inside .dvc files, you can check out that specific file from the Git revision, and then run dvc checkout.

E.g.:

git checkout <rev> -- /path/to/data.dvc
dvc checkout
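
If you would rather drive those two commands from a script or notebook, here is a rough Python sketch. The function names are made up for illustration, and it assumes `rev` is something Git itself can resolve (a commit hash, branch or tag) and that the data is tracked by a single .dvc file.

```python
import subprocess


def data_only_checkout_commands(dvc_file, rev):
    """Build the two commands: restore one .dvc file from `rev`, then sync its data."""
    return [
        # Take only the .dvc pointer file from the old revision; source code is untouched.
        ["git", "checkout", rev, "--", dvc_file],
        # Ask DVC to make the workspace data match that pointer file.
        ["dvc", "checkout", dvc_file],
    ]


def run_data_only_checkout(dvc_file, rev):
    """Run the commands in order, stopping at the first failure."""
    for cmd in data_only_checkout_commands(dvc_file, rev):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical path and revision, for illustration only.
    for cmd in data_only_checkout_commands("path/to/data.dvc", "abc1234"):
        print(" ".join(cmd))
```

Afterwards Git will show the .dvc file as modified. Running `git checkout HEAD -- <file>` followed by `dvc checkout` should put the current version back.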

Alternatively, you can use dvc get to download from your other revisions.

dvc get . /relative/path/to/your/data/from/root -o out_dir --rev <rev>

E.g.:

dvc get . data --rev expt-1

The data is stored on a server and copied locally to work/data.raw at the start of the experiment. My processing code works on work/data.raw.

DVC caches it, so when I do dvc exp apply, it updates the content of work (but also the code).

Extracts from various files in case it helps. I am very new to DVC so it is quite possible I have come up with some kind of anti-pattern!

dvc.yaml:

params:
  - params.yaml

vars:
  - work_path: ./work/
  - raw_file: data.raw

stages:

  get:
    cmd: python -m expt.copy_data_file

    params: 
      - server_path
      - source_file
    deps:
      - ${server_path}${source_file}
    outs:
      - ${work_path}${raw_file}

params.yaml:

server_path: z:/path/path
source_file: filename.txt

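copy_data_file.py:

import os
import shutil

# dvc_utils: helper module not shown in this extract
dvc_vars = dvc_utils.get_vars()
params = dvc_utils.get_params()
# Source is the file on the server; destination is the local working copy
srcpath = os.path.join(params['server_path'], params['source_file'])
dstpath = os.path.join(dvc_vars['work_path'], dvc_vars['raw_file'])
shutil.copy2(srcpath, dstpath)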

If it’s an output from stages/pipelines, I’d suggest using dvc get.

dvc get . work/data.raw --rev expt-1 -o mydata.raw
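
If it helps, the same call can be wrapped in a few lines of Python. This is only a sketch: `dvc_get_command` and `fetch` are hypothetical names, and it assumes `dvc` is on your PATH and that you run it from inside the repository.

```python
import subprocess


def dvc_get_command(path, rev, out, repo="."):
    """Build a `dvc get` command that copies `path` as it was at `rev` into `out`."""
    return ["dvc", "get", repo, path, "--rev", rev, "-o", out]


def fetch(path, rev, out, repo="."):
    """Run `dvc get`; the workspace copy of `path` and the source code are not changed."""
    subprocess.run(dvc_get_command(path, rev, out, repo), check=True)


if __name__ == "__main__":
    # Same arguments as the command line above.
    print(" ".join(dvc_get_command("work/data.raw", "expt-1", "mydata.raw")))
```

Because the file is written to a separate output path, nothing under work/ is overwritten.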

Sorry, I’m not sure how this helps, and that is probably because I have oversimplified and not communicated my setup well enough.

My stages are more complex in reality, of this form:

stages:
  get:
    cmd: python -m expt.copy_data_file
    params: 
      - server_path
      - source_file
    deps:
      - ${server_path}${source_file}
    outs:
      - ${work_path}${raw_file}
  
  step1:
    cmd: python -m expt.step1
    deps:
      - ${work_path}${raw_file}
    outs:
      - ${work_path}${step1_file}

  step2:
    cmd: python -m expt.step2
    deps:
      - ${work_path}${step1_file}
    outs:
      - ${work_path}${step2_file}


I am debugging the step 2 code in a notebook, because I can iterate much more quickly there and dig into the sub-sections of step 2.

So, I want to easily retrieve various past step1_file outputs to use as inputs for step2’s subprocesses, without replacing the step2 code that I am currently debugging.
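
One possible way to do that from a notebook is DVC's Python API, which reads a tracked file at a given revision without touching the workspace. The sketch below is untested against this setup: `load_revisions` is a made-up helper, work/step1.out is a hypothetical stand-in for ${work_path}${step1_file}, and it assumes `dvc.api.open` accepts experiment names such as expt-1 as `rev` in your DVC version.

```python
def load_revisions(path, revs, opener=None):
    """Return {rev: bytes} holding the content of `path` at each revision."""
    if opener is None:
        # Imported lazily so this module still loads where DVC is not installed.
        import dvc.api
        opener = dvc.api.open
    contents = {}
    for rev in revs:
        # mode="rb" yields raw bytes; parsing is left to the step 2 code.
        with opener(path, rev=rev, mode="rb") as f:
            contents[rev] = f.read()
    return contents


# Example notebook usage (hypothetical path and experiment names):
# step1_inputs = load_revisions("work/step1.out", ["expt-1", "expt-2"])
# step2_subprocess(step1_inputs["expt-1"])
```

The `opener` argument is there only so the helper can be exercised with a fake file source. In normal use you would leave it out.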