Shared cache directory

Hi

I was wondering how to set up a shared cache directory, if that is possible?

Thanks
Matthias

Hi Matthias!

Thank you for the question! It is pretty simple, actually: just specify your desired cache directory in the config, i.e. dvc config cache.dir /path/to/cache. Unfortunately, we’ve recently discovered a regression in that code, and 0.9.7 currently throws an error if you try to use it. The fix is already merged and will be released in 0.9.8 (probably early June).
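A minimal sketch of what that looks like (the project name and cache path are just placeholders):

$ cd myproject                          # an existing DVC project
$ dvc config cache.dir /path/to/cache   # point DVC's cache at the shared directory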

Thanks,
Ruslan

Hi Ruslan,

I see “Support for configurable and shareable cache dirs” in the release notes for 0.9.7. This post has the only mention of the cache.dir config I can find. And while helpful, I can’t make it work (even with dvc installed from github). Has any documentation been created? If not, could you perhaps provide an example of how you would modify the tutorial’s code to use a shared cache dir?

Thanks,
Ian

Hi Ian,

That feature is unfortunately broken in 0.9.7, but it is fixed in master and will be released in 0.9.8.
We wouldn’t recommend installing dvc directly from the master branch, since that is where active development happens and we can’t guarantee the stability of the code. That being said, if you do have dvc installed from the current master, all you need to do is run dvc config cache.dir /path/to/cache/dir right after dvc init in the tutorial. There are no more specific docs about it right now, but we are working on it. If you experience any problems, please feel free to report them (here or in github issues) so we can help.
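For the tutorial, the sequence would look roughly like this (just a sketch; the shared cache path is a placeholder):

$ git init
$ dvc init
$ dvc config cache.dir /path/to/shared/cache   # set the shared cache right after dvc init
# ...then continue with the rest of the tutorial (dvc add, dvc run, etc.)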

Thanks,
Ruslan

Yes, of course. Due caution on master.

I see now that the cache.dir config works, just not as I expected. I gather I have to get each data file into my project directory before running dvc add, which then (in my case) moves the file into the shared cache directory and replaces it with a symlink. I was expecting that the data would never have to pass (even briefly) through the project directory.
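Roughly what I see, with a hypothetical data.csv:

$ cp /some/other/location/data.csv .   # the data still has to enter the project dir
$ dvc add data.csv                     # moves the file into the shared cache
$ ls -l data.csv                       # data.csv is now a symlink into the cache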

Ah, I see. What you want is not an “external cache” but rather an “external dependency/output”. We have added initial support for s3/gs/hdfs dependencies/outputs in master (going to be released in 0.9.8), but I didn’t get to supporting your case (local deps/outs) just yet. I’ve added https://github.com/iterative/dvc/issues/764 to the 0.9.8 TODO so we can keep track of it. The patch will be in master in a few days or so.
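Just as a sketch of the idea (the bucket name, paths, and script below are made up, and the exact syntax may still change before the release), external deps/outs would let you reference data that never enters the project directory:

$ dvc run -d s3://mybucket/raw/data.csv -o s3://mybucket/clean/data.csv python clean.py
# and, once issue 764 is done, the same with plain local paths outside the repo:
$ dvc run -d /mnt/shared/raw/data.csv -o /mnt/shared/clean/data.csv python clean.py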

Thanks,
Ruslan

Hi,

First of all, I want to say that I’m discovering DVC now and I think it can become a very useful tool! Thanks for your contribution to the community.

I’m still assessing if it can be a good fit for our workflow.

Reading this thread I started to wonder:

Currently, I share a working machine (via ssh session) with other colleague(s). The way we do it is:

  • each one of us has their own working dir, into which we clone the same repo.
  • the data lives in a separate dir that is shared between us. This way we avoid data duplication.
  • when I make changes to the code, I commit and push them and my colleagues pull them (and vice versa). The data is implicitly already up to date, because the data directory is shared.

Could we use DVC with this setup? I guess that instead of having a shared data dir we would have a shared cache dir.

Do you see any problem with this setup?

Any plans for the v0.9.8 release?


Hi @andrethrill !

I think DVC with a ‘shared cache dir’ would be a perfect fit for your setup :slight_smile: All the required patches are already merged into master and will be included in the next release.
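A rough sketch of what this would look like for two users on one machine (the paths and repo URL are placeholders):

# user A
$ git clone https://example.com/team/project.git ~/a/project
$ cd ~/a/project && dvc config cache.dir /data/shared-dvc-cache

# user B
$ git clone https://example.com/team/project.git ~/b/project
$ cd ~/b/project && dvc config cache.dir /data/shared-dvc-cache

Both working copies then reference the same cache, so each data file is stored only once on the machine.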

We’ve decided to change our versioning scheme and thus v0.9.8 became v0.10.0. It is going to be released around Jul 5.

Thanks,
Ruslan


Thanks for the swift reply @kupruser

Looking forward to its release! :slight_smile:

@kupruser thank you for the answer!

@andrethrill we got a few requests for a shared cache directory on a single server, but people have different motivations for this setup.

Could you please share more details about your workflow and issues? It can help us cover your scenario better in the new version.

What are the issues you have with the current approach?

  1. Code and data file inconsistency - you update the dataset and someone is still working with old code.
  2. Data file size - you cannot keep more than 1-2 copies of the input dataset.
  3. Something else?

A couple of additional questions:
4. With the current design, DVC keeps all versions of data files (unless you garbage collect them explicitly). Is it okay for you to have many versions of the dataset?
5. Do you have a notion of an ML pipeline: Raw dataset --> (clean.py) --> dataset --> (model.py) --> a model?

Thanks,
Dmitry

Hi @dmitry,

First, a disclaimer: we are not yet using DVC in our current workflow. But we may want to start using it, hence my questions and why I’m currently researching it as a possible tool to add to our toolbox.

(sorry for the looooong answer)

Regarding our current workflow, as I mentioned before and now with a bit more detail, what we do is:

  • I share a working machine (via ssh session) with one or more colleagues. When I say machine, it can in fact be a single machine, or it can be a cluster where each of us has ssh access to a pod but the data drive is shared among the different pods.
  • each one of us has their own working dir, into which we clone the same repo.
  • the data lives in a separate dir that is shared between us. This way we avoid data duplication. Usually, no two programmers write to the same data files, so there are no concurrency problems. (By the way, the shared cache is protected against concurrency problems, right?)
  • when I make changes to the code, I commit and push them, and my colleagues pull them to update their code (and vice versa). The data modified by such code changes is implicitly already updated, because the data directory is shared.

This is, of course, during an exploration phase of our projects where we are performing data wrangling and developing models. This is not a production setup.

The main immediate issue where I see DVC helping us is:

  • we take some raw data, do an initial cleaning and write it to a v1 file; then we take v1, compute some operations (feature engineering, resampling, merging with other data files, etc.) and save it to v2, do some more operations and save to v3, and so on, ending up with data files whose dependencies look like:
    • v1 → v2 → v3 → v4 → v5 … → v20
  • in reality, it’s unlikely we reach v20, but bear with me :slight_smile: (Also, the DAG can have more branches with more than one dependency converging; the one above is just a simple linear case.)
  • When I’m coding the script that will generate v21 from v20, I may suddenly realise that I should have done something different in the script that generates v3. So I want to update that script, run dvc repro on v20, and let it recompute only the required dependencies (a rough sketch of what I mean is just below).
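A sketch of what I have in mind (assuming stage files are named after their outputs, which may not be exactly how DVC names them, and that each script is declared as a dependency of its stage):

$ vi make_v3.py       # fix the (hypothetical) script that produces v3
$ dvc repro v20.dvc   # recomputes v3 ... v20, but nothing upstream of v3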

Other DVC functionalities, such as being able to keep track of different data file versions per branch/commit, are a welcome bonus, of course, but not our immediate need (though they will be soon).

I’m aware the workflow above may be far from perfect, but it seems to be the best we found so far. If you have any input, it’s more than welcome :wink:

Now to finally answer your questions regarding the shared cache:

  • The main reason I was looking for a shared cache is that the data files we work with are big: dozens of GBs just for the raw files. As we start to save the processed versions as described above, it escalates quite quickly… If we can share the cache among the different users, it will save us a considerable amount of space.
  4. With the current design, DVC keeps all versions of data files (unless you garbage collect them explicitly). Is it okay for you to have many versions of the dataset?

I have a few questions regarding this point:

  • If we are using a shared cache and I run dvc gc, will it keep both the files that I need and the files that my colleagues sharing the cache need (even if they are working on a different branch than the one I am on at the moment)?

  • Considering the simple case where the cache is not shared, I would like to confirm the following: if we check out a branch that uses files we previously garbage collected, they will be recalculated when we run dvc repro, right? DVC will make sure all the required raw files (together with the respective scripts) are kept in the cache, correct?

Assuming the answer to both questions is yes, then I am okay with either behaviour: the current one, where unneeded files are cleaned only when we garbage collect them manually, or having it run automatically when we check out, for instance (maybe add this as a dvc config option?).

  5. Do you have a notion of an ML pipeline: Raw dataset → (clean.py) → dataset → (model.py) → a model?

Sorry, I’m not sure I understood this question; maybe you can rephrase it? But I think I do have that notion, yes. In fact, the different data versions I describe above usually correspond just to the first stage (clean.py), but in our case it can get complex and we divide it into different substages with intermediate datasets.

I hope this was not too confusing. Please feel free to come back to me with any input or questions,

André


Hey Andre, thank you for the details! It is very helpful!

Yes, the next release will cover your scenario with a single machine and many users:

  1. Yes, in the new release the shared cache is protected against concurrency problems.

  2. dvc supports pipelines v1 -> v2 -> v3 -> v4 -> v5 … -> v20, including non-linear graphs (no cycles though):

$ dvc run -d input.zip -o v1 python script1.py
$ dvc run -d v1 -o v2 python script2.py
$ dvc run -d v2 -o v3 python script3.py
...
$ dvc run -d v19 -o v20 python script20.py

So, after $ vi script15.py, dvc repro is going to reproduce v15 … v20.

  3. $ dvc gc removes everything except the current branch/commit, so some of your colleagues might lose some of their data files (if they are a couple of commits behind you). Advanced GC strategies are coming to avoid these kinds of issues, but unfortunately not in the current release: dvc gc --time month and an LRU auto-strategy for GC.

  4. You are right - dvc repro will reproduce the data files that you have garbage collected or removed manually.

  5. ML-pipeline is exactly what you describe with v1 --> v2 … --> v20.

For the shared drive and multiple-pods scenario, it depends on how you share (NFS, SFTP, etc.). It would be great to know about your current approach, issues, and expectations, so we can understand how DVC can help.

And thank you again! Based on your scenario, it looks like we should prioritize the advanced GC strategies. They will help you keep a reasonable cache size and minimize the need for dvc repro. We will include these in the next release.

Please don’t hesitate to share your feedback. It helps us create a better DVC :slight_smile:

Hi Dmitry,

Thanks again for your answer and for all your feedback and support.

My comments follow below:

This is exactly what we are looking for atm!

Just to be sure, because I’m still learning how to properly use DVC: dvc run -d v2 -o v3 python script3.py should be dvc run -d v2 -d script3.py -o v3 python script3.py, correct?

The ideal case (for us at least) would be to keep some state file in the cache, so that if I am working on branch1 and my colleague is working on branch2, running dvc gc would keep the files needed for both branch1 and branch2 in the cache (or at least raise some warnings before actually collecting anything).
That would be the ideal case, but I believe the solutions you propose are reasonable workarounds.

We use an internally developed tool that mixes GlusterFS with heketi. But I think the important point is that each machine sees the volume as a regular local data drive. Although I believe it’s not very relevant for this discussion, if you are interested in more details about our architecture you can watch the (slightly out-of-date) presentation we gave at the 5th Lisbon PyData meetup here (from the beginning of the video until ~20min).

That is great news! Thank you very much as well for your direct attention and support!

All in all, I’m really enjoying discovering DVC. There are still these details we are discussing here that worry me a bit, but we will probably start test-driving DVC soon, and I’m sure we will have better feedback by then.

Absolutely!

Could you please clarify what you mean by a state file? If you are talking about dvc files - we keep all of them in Git, no GC at all.
Another workaround is to call dvc push frequently to sync data to a data remote (like S3 or a raw server via rsync). Then, if something was GC-ed, you can get the result back from the data remote with dvc pull, without reproduction.
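For example (the remote name and URL are placeholders, and the exact remote-setup commands may differ a bit between versions):

$ dvc remote add upstream s3://mybucket/dvc-storage   # or an ssh/rsync location
$ dvc config core.remote upstream                     # make it the default remote
$ dvc push        # sync the local cache to the remote
# after a gc, or on another machine:
$ dvc pull        # fetch the needed files back instead of reproducing them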

Yeah, let’s not discuss the distributed-system stuff in this thread. Thanks for the video though.

Looking forward to your feedback!

I meant a way for the garbage collector to know which branches/commits the other users sharing the same cache are working on, so that it wouldn’t collect files they still need. But I understand that may be difficult to implement.
