Hi,
I have a question about the following scenario:
Let data1 and data2 be two data folders and script.py a script that is run as python script.py --datadir ./df, where df is either data1 or data2.
What is a good way to set up the stage now? I thought about these options:
one stage + make datadir a parameter -> DVC will not re-run the stage after changes in data1/data2
one stage + just one data folder -> I would have to check out the right data every time (for example, for tests), and another script might need both folders at the same time.
two stages, one for each data folder -> I would have to add an extra argument "stage" to script.py to load the right parameters.
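For the third option, here is a rough sketch of what the two stages could look like in dvc.yaml. The stage names, the --stage argument, and the output folder names are my assumptions for illustration, not something DVC prescribes:

```yaml
stages:
  run_data1:
    cmd: python script.py --datadir ./data1 --stage run_data1
    deps:
      - ./data1
      - script.py
    outs:
      - outdir1
  run_data2:
    cmd: python script.py --datadir ./data2 --stage run_data2
    deps:
      - ./data2
      - script.py
    outs:
      - outdir2
```

Each stage declares its own data folder as a dependency, so DVC detects changes in data1 and data2 independently.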
Thank you for your fast reply. So the original motivation was to have a simple and fast way to test the script. I have a normal stage that is part of a pipeline and a test stage for the script. The test stage has its own small dataset and passes "short-running" parameters to the script.
Then I wanted to introduce parameters to the normal stage, but this would break the test stage. I think the answers to your questions are: rather parallel, and yes.
I think I will look into the new parameterization api.
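As a sketch of what the templating/parametrization syntax can look like in later DVC releases (the foreach/do form; folder and output names here are illustrative, not from this thread):

```yaml
stages:
  process:
    foreach:
      - data1
      - data2
    do:
      cmd: python script.py --datadir ./${item}
      deps:
        - ./${item}
        - script.py
      outs:
        - outdir_${item}
```

This generates one stage per list item, so both data folders stay tracked without duplicating the stage definition by hand.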
I have just one stage now, and submission is either test or normal. It looks like DVC is still caching the outdir. Is it correct that disabling caching is not supported yet? Will it be?
Could you please clarify what you mean by caching? Don't you want to have this file in your cache and storage?
cache: false in the outputs means the output is not tracked by DVC and should be committed to Git. It was mostly created for small files that you'd prefer to commit. It feels like your goal is a bit different, so I'd appreciate your clarification.
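For reference, this is roughly how cache: false is attached to an output in dvc.yaml (stage and output names are illustrative):

```yaml
stages:
  mystage:
    cmd: python script.py --datadir ./data1
    deps:
      - ./data1
    outs:
      - outdir:
          cache: false
```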
datapath is a 7 GB folder, and script.py produces an output folder outdir that is also 7 GB. I want DVC to track the checksum(s) of outdir but not the files themselves, like the -O option of dvc run, as far as I understand. But it seems that the files are copied to the cache, as it constantly grows by several GB.
In your YAML file you clearly say cache: false, so it should not go to the cache directory. Have you checked whether that is the result of these stages and not the downstream ones? You can check by searching for the data files in .dvc/cache/ using the md5 hashes from dvc.lock.
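To make that check concrete: DVC's cache is content-addressed, with the first two hex characters of the md5 used as a subdirectory and the rest as the file name. A minimal sketch (assuming the default .dvc/cache location; the helper names are mine) of mapping a hash from dvc.lock to its expected cache path:

```python
import os

def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Map an md5 hash from dvc.lock to its expected path in DVC's cache.

    DVC stores objects content-addressed: the first two hex characters
    of the hash form a subdirectory, the remainder is the file name.
    """
    return os.path.join(cache_dir, md5[:2], md5[2:])

def is_cached(md5: str, cache_dir: str = ".dvc/cache") -> bool:
    """Return True if an object with this hash exists in the cache."""
    return os.path.exists(cache_path(md5, cache_dir))
```

So for a hash like d3b07384d113edec49eaa6238ad5ff00 you would look for .dvc/cache/d3/b07384d113edec49eaa6238ad5ff00; if outputs with cache: false still show up there, the cache growth is coming from those stages.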
It might be a bug in the new pipelines or in the run-cache. @skshetry & @kupruser, I'd appreciate it if you guys could take a look.
cache: false is a bit misleading, as DVC still tries to cache the results for the run-cache. Right now it only prevents pushing those artifacts. There is a --no-run-cache option that skips using the run-cache, but DVC still tries to save to it.
Thank you for your answer as well. This explains a lot of the behavior, but I'm still confused. I always thought it worked like this:
dvc run -o file ...: Git-like versioning of file (I would call this strong versioning); file can be restored from the archive/cache.
dvc run -O file ...: I would call this weak versioning. If the pipeline is deterministic, file can be restored by running the pipeline again. Changes are detected by storing checksums. A typical tradeoff between disk space and time.
Did this change in 1.0?
I ran dvc run -n test -O hello.txt --no-run-cache "echo hello > hello.txt". Is it correct that the corresponding stage in dvc.yaml doesn't change at all? Will this limit the disk usage now?
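If I read the docs right, the stage that -O generates should look something like the following in dvc.yaml (my reading, not verified output), while --no-run-cache is only a runtime flag and leaves the file unchanged:

```yaml
stages:
  test:
    cmd: echo hello > hello.txt
    outs:
      - hello.txt:
          cache: false
```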
Changes are detected by storing checksums. A typical tradeoff between disk space and time.
This did not change in 1.0. Checksums are now stored in dvc.lock. Regretfully, as mentioned in the issue posted by @skshetry, DVC still tries to make use of the run-cache, a feature introduced in 1.0.
Can I ask you to comment there that you are running into this too? It seems like a bug to me. I have already left a comment there.
Hi,
With my last answer I just wanted to explain how I think things should work. Just to clarify: the checksums are calculated and saved correctly. Basically, all I want is to limit the disk space used by stages that are deterministic and produce big data folders. (I know that I can use the garbage collection command.) Do you want me to leave a comment in the mentioned issue?
Thanks again