Parameter-like dependencies

Hi,
I have a question about the following scenario:
Let data1 and data2 be two data folders and script.py a script that is run as python script.py --datadir ./df, where df is either data1 or data2.
What is a good way to set up the stage now? I thought about these options:

  1. one stage + make datadir a parameter -> DVC will not rerun after changes in data1/data2, since the folder itself is not a tracked dependency
  2. one stage + just one data folder -> I have to check out the right data all the time (for example for tests), and maybe another script needs both folders at the same time.
  3. two stages, one for each data folder -> I would have to add an extra argument “stage” to script.py to load the right parameters (a rough sketch below).
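
For illustration, option 3 would look roughly like this in dvc.yaml (stage names, outputs and the extra --stage argument are just placeholders):

stages:
  run_data1:
    cmd: python script.py --datadir ./data1 --stage data1  # extra argument so the script loads the right parameters
    deps:
    - data1
    - script.py
    outs:
    - out_data1
  run_data2:
    cmd: python script.py --datadir ./data2 --stage data2
    deps:
    - data2
    - script.py
    outs:
    - out_data2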

Is there a better way to do this?

Thanks for your great work by the way : )


Hi @xyz
I’d appreciate it if you could clarify the question a bit - what is the motivation for having two stages?

  1. Do you need two “parallel” stages in the pipeline or a sequence of the stages?
  2. Will you need two sets of parameters for each of the two runs?

PS: More advanced pipeline parametrization functionality is coming in DVC 2.0 in a month or so. For now it is only available in an experimental mode: https://github.com/iterative/dvc/wiki/Parametrization
The initial feature request: https://github.com/iterative/dvc/issues/3633


Thank you for your fast reply. So the original motivation was to have a simple and fast way to test the script. I have a normal stage, which is part of a pipeline, and a test stage for the script. The test stage has its own small dataset and passes “short-running” parameters to the script.

Then I wanted to introduce parameters to the normal stage, but this would break the test stage. I think the answers to your questions are: parallel rather than sequential, and yes.
I think I will look into the new parametrization API.

The new functionality should help you with that. Please let me know if it does not - we will try to make it work.

Hello again ; ),
I tried the new parametrization tool and have a question now. My (simplified) dvc.yaml looks like this:

stages:
  Submission:
    foreach: ${submission}
    do:
      cmd: >-
        python submit.py ... 
      deps:
      - ${item.datapath}
      - submit.py
      - ${item.model}
      outs:
      - ${item.outdir}:
          cache: false
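
The ${submission} entries come from params.yaml, which (simplified, with placeholder paths) looks roughly like this:

# params.yaml (placeholder paths)
submission:
  test:
    datapath: data/test
    model: models/test_model.pt
    outdir: out/test
  normal:
    datapath: data/full
    model: models/model.pt
    outdir: out/normal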

I have just one stage now, and submission is either test or normal. It looks like DVC is still caching the outdir. Is it correct that disabling the cache is not supported yet? Will it be?

Best wishes

Hi @xyz!

Could you please clarify what you mean by caching? Don’t you want to have this file in your cache and storage?

cache: false in the outputs means the output is not tracked by DVC and should be committed to Git. It was mostly created for small files that you’d prefer to commit. It feels like your goal is a bit different. I’d appreciate your clarification.

datapath is a folder of about 7 GB and the script produces a folder outdir of another 7 GB. I want DVC to track the hash(es) of outdir but not the files themselves, like the -O option of dvc run, as far as I understand it. But it seems that the files are copied to the cache, as .dvc/cache constantly grows by several GB.

Thank you. Now I understand your use case better.

In your yaml file you clearly say cache: false - it should not go to the cache dir. Have you checked whether that growth is caused by these stages and not the downstream ones? You can check by looking for the data files in .dvc/cache/ using the md5 hashes from dvc.lock.
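
For a foreach stage, dvc.lock contains entries roughly along these lines (the stage name, path and hash here are illustrative):

Submission@test:
  cmd: python submit.py ...
  outs:
  - path: out/test
    md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir

A hash ending in .dir marks a directory listing; its cache entry would sit under .dvc/cache/68/36f797f3924fb46fcfd6b9f6aa6416.dir (the first two hash characters become the subdirectory), which gives you a starting point for checking which stage the cached data belongs to.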

:thinking: It might just be a bug in the new pipelines or in the run-cache. @skshetry & @kupruser I’d appreciate it if you could take a look.

cache: false is a bit misleading, as DVC still tries to cache the results for the run-cache. Right now it only prevents pushing those artifacts. There is a --no-run-cache option that will skip using the run-cache, but DVC still tries to save to it.

Thank you for your answer as well. This explains a lot of the behavior, but I’m still confused. I always thought it worked like this:

  • dvc run -o file ...: Git-like versioning of file (I would call this strong versioning). file can be restored from the cache/remote storage.
  • dvc run -O file ...: I would call this weak versioning. If the pipeline is deterministic, file can be restored by running the pipeline again. Changes are detected by storing checksums. A typical trade-off between storage space and time (see the sketch below).
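
In dvc.yaml terms, I would expect the two options to map to something like this (stage and file names made up):

stages:
  strong:
    cmd: python script.py
    outs:
    - result_a            # dvc run -o result_a: I expect the file to be copied into .dvc/cache
  weak:
    cmd: python script.py
    outs:
    - result_b:
        cache: false      # dvc run -O result_b: I expect only the checksum to be recorded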

Did this change in 1.0?

I ran dvc run -n test -O hello.txt --no-run-cache "echo hello > hello.txt". Is it correct that the corresponding stage in dvc.yaml doesn’t change at all compared to running without --no-run-cache? Will this limit the disk usage now?

  test:
    cmd: echo hello > hello.txt
    outs:
    - hello.txt:
        cache: false

I have to say that it never crossed my mind that this option (-O) is meant for small Git-versioned files.

@xyz

Changes are detected by storing checksums. A typical trade-off between storage space and time.

This did not change in 1.0. Checksums are now stored in dvc.lock. Regretfully, as mentioned in the issue posted by @skshetry, DVC still tries to make use of the run-cache, a feature introduced in 1.0.
Can I ask you to comment there that you are running into this too? It seems like a bug to me. I already wrote a comment there.

Hi,
with my last answer I just wanted to describe how I think things should work. Just to clarify: the checksums are calculated and saved correctly. Basically, all I want is to limit the disk space used by stages that are deterministic and produce big data folders. (I know that I can use the garbage collection command.) Do you want me to leave a comment in the mentioned issue?
Thanks again :grinning:

@xyz sure! We usually discuss the validity of such issues, their importance and how severe they are, and if we know the particular use cases it is much easier to estimate.