[Question] DVC run and add


#1

Hi,

I would like to know whether we can store run information for reproducibility and also add the output data to storage from which it can then be pulled.
For example, I have a raw dataset and a cleanup script:

$ ls 
raw cleanup.py

I can run the cleanup

$ dvc run -d raw -o clean python cleanup.py raw clean
$ cat clean.dvc
cmd: python cleanup.py raw clean
deps:
- md5: xxxx
  path: raw
md5: yyyyy
outs:
- cache: true
  md5: zzzz
  path: clean

and I can see that the clean folder is produced. However, if I then add the clean folder with dvc in order to share it, the information on how to produce the clean folder is overwritten:

$ dvc add clean
$ cat clean.dvc
md5: xxxx
outs:
- cache: true
  md5: xxxxx

So, can we have both features: the command that produces the dataset stored in the .dvc file, and the dataset itself stored in the cache?

Thanks


#2

Hi @vfdev.5 !

When you run dvc run -d raw -o clean python cleanup.py raw clean, the output directory clean is automatically saved to the cache (i.e. the same as if you had run dvc add on it), and clean.dvc contains the information for both reproducibility and storage. So you shouldn't run dvc add clean after that dvc run command.
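
To make the sharing part concrete, here is a rough sketch of the workflow, wrapped in two shell functions. The function names publish_clean and fetch_clean are hypothetical and only for illustration. The sketch assumes the project is a Git repository with DVC initialized and that a default DVC remote is already configured.

```shell
# Hypothetical helper names; assumes a default DVC remote is already configured.

# Run on the machine that produced the data.
publish_clean() {
    # Record the stage: the command, the dependency checksum and the output
    # checksum go into clean.dvc, and the clean directory is saved to the cache.
    dvc run -d raw -o clean python cleanup.py raw clean &&
    # Version the small stage file with Git; the data itself stays out of Git.
    git add clean.dvc &&
    git commit -m "Add cleanup stage" &&
    # Upload the cached output to the configured remote.
    dvc push
}

# Run in a fresh clone of the Git repository.
fetch_clean() {
    # Download the outputs referenced by the .dvc files, including clean.
    dvc pull
}
```

Calling publish_clean once and then fetch_clean in another clone should give that clone the clean directory without rerunning cleanup.py. DVC may also update .gitignore when it starts tracking clean, in which case that file should be committed as well.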

Thanks,
Ruslan


#3

Got it, thank you Ruslan !