DVC run and add: store command and data


I would like to know whether we can store the run information for reproducibility and also add the output data to storage from which it can then be pulled.
For example, I have a raw dataset and a cleanup script:

$ ls 
raw cleanup.py

I can run the cleanup

$ dvc run -d raw -o clean python cleanup.py raw clean
$ cat clean.dvc
cmd: python cleanup.py
deps:
- md5: xxxx
  path: raw
md5: yyyyy
outs:
- cache: true
  md5: zzzz
  path: clean

and I can see how the clean folder is produced. However, if I then add the clean folder with dvc in order to share it, the information on how to produce it is lost:

$ dvc add clean
$ cat clean.dvc
md5: xxxx
outs:
- cache: true
  md5: xxxxx

So, can we have both features: the stored command describing how the dataset is produced, and the dataset itself stored in the cache?



Hi @vfdev.5 !

When you run dvc run -d raw -o clean python cleanup.py raw clean, the output directory clean is automatically saved to the cache (i.e. the same as if you had run dvc add on it), and clean.dvc contains the information for both reproducibility and storage. So you shouldn’t run dvc add clean after that dvc run command.
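
To make the sharing part concrete, here is a rough sketch of the whole round trip, wrapped in two shell functions. It assumes the same DVC version as in this thread (one where dvc run writes clean.dvc). The remote name myremote, the storage path /tmp/dvc-storage, the commit message, and the function names are placeholders made up for illustration, not anything DVC requires.

```shell
# Sketch only: "myremote" and /tmp/dvc-storage are placeholder values.

# Author's side: produce clean/ and publish it.
publish_clean() {
    # One-time setup: register a default remote to push cached data to.
    dvc remote add -d myremote /tmp/dvc-storage

    # Runs the script, saves clean/ to the cache, and writes clean.dvc
    # (command, dependency checksum, output checksum). No "dvc add" needed.
    dvc run -d raw -o clean python cleanup.py raw clean

    # Version the small stage file and the remote config with git.
    git add clean.dvc .dvc/config
    git commit -m "Add cleanup stage"

    # Upload the cached output to the remote storage.
    dvc push
}

# Collaborator's side, inside a clone of the same git repository.
fetch_clean() {
    # Download the outputs referenced by the .dvc files from the remote.
    dvc pull
}
```

After fetch_clean, a collaborator who also has the raw directory should be able to regenerate the output with dvc repro clean.dvc, since the command is still recorded in clean.dvc.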



Got it, thank you Ruslan !