Data (registry) and remote GPU cluster with local DVC repositories

I’ve been looking through the documentation and questions for the past few days and cannot wrap my head around how to use DVC with the following setup. Or maybe I just need confirmation that the way I want to do it is reasonable.

As I’m not 100% sure what is relevant to provide here at the moment, I will give a brief description, which I can expand on your request(s).

  1. I have a (remote) network drive (Windows network drive, NTFS) containing the data for my projects, mounted locally. I think it is good practice, in case we need the data later for a different project, to turn this into a data registry. This is where a first issue arises. What I did: 1. I initialized a git + DVC repo locally on my laptop, 2. I set up a remote cache on the network drive (see the sketch below).
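For reference, a minimal sketch of what that initial setup could look like, assuming the network drive is mounted at /mnt/networkdrive (the path and remote name are just examples):

    # initialize the data registry repository locally
    git init data-registry && cd data-registry
    dvc init

    # use storage on the mounted network drive as the default DVC remote
    dvc remote add -d nas-storage /mnt/networkdrive/dvc-storage
    git add .dvc/config && git commit -m "Configure NAS remote"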

Question: Can I add data that is not in the workspace to the data registry? The mounting point is the same for everybody within the company.

Now, assuming I manage to set up the data registry properly, i.e., it works, I have another issue.

  2. I have another local git + DVC project which will contain the code to train some ML models on the data. However, the training is not done on my laptop, but on a remote “GPU cluster”. I don’t need the data on my laptop; it should go directly to the GPU cluster, as the cluster has its own, faster storage.

Question: How do I copy the data directly from the data registry’s storage (cache or remote?) without losing the data versioning connection?

I am considering importing (dvc import) the data and mounting the GPU cluster storage to a folder within the local DVC project. The .dvc file would be stored locally and the data would be downloaded/uploaded to the GPU cluster.

Question: Is this a good approach? Or is there a better way?

Hope somebody can help. Please ask for more details if you need them.

Thank you!


@RiCk questions:

  • can you / do you mount the remote network drive to the GPU cluster? If not, how do you get data on it now?
  • do you have multiple projects (git repositories)? Do you use the same data across many projects?
  • what kind of data do you have in the storage, how do you add/delete it, and do you have annotations/labels?

Sorry for a lot of questions, but they might help us recommend a better setup.

@shcheklein No worries, I am happy to answer the questions.

  1. We don’t mount the remote network drive to the GPU cluster. It might be possible, but I think access to the data will be faster if it exists on the GPU cluster’s own storage. I think it is okay for us to have this copy if necessary. Currently, I don’t use DVC yet, so the data is just copied. I was thinking of doing it as I suggested in the original post (I added some clarification):

I am considering importing (dvc import) the data locally on my laptop, and mounting the GPU cluster storage (sshfs) to a folder within the local DVC project. The .dvc file would be stored locally (so not in the mounted folder), and the data would be downloaded/uploaded to the GPU cluster (because I store it in a folder in the local project which is actually connected to the cluster), as in the sketch below.
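Roughly like this (only a sketch; the mount point, registry URL, and dataset path are made up):

    # mount the GPU cluster's fast storage into the local project (sshfs)
    sshfs user@gpu-cluster:/fast-storage/my-project/data ./data

    # import the dataset from the data registry; the .dvc file stays in the
    # local Git repo, while the actual files land in ./data, i.e. on the cluster
    dvc import https://git.example.com/company/data-registry.git images -o data/images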

  2. Currently, we don’t have this, but I would like to set it up in such a way that data can be reused in multiple projects. I imagine there being multiple data registries, as a “project” from our perspective can consist of different (practical) problems with corresponding datasets, and we should be able to limit who has access to what.

  3. I’m currently just testing with some CSV files, but my data will be images with annotations. The images themselves likely won’t change; the annotations likely will. Currently there is no version control at all, so we just get new images and put them on the network drive, and the same goes for annotations. Note that all kinds of data will eventually be stored there.

Hope this clarifies things. Thank you!

Sounds a bit fragile and complicated to me. It can potentially be done. Let’s try to brainstorm a bit more.

How do you expect to run training on the cluster? Can you let people SSH directly to those machine(s), do git clone + dvc pull (to copy the data), and then run train.py, dvc exp run, or dvc repro?
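In other words, roughly something like this on the cluster (names and URLs are just placeholders):

    ssh user@gpu-cluster
    git clone https://git.example.com/company/training-project.git
    cd training-project
    dvc pull          # materialize the tracked data
    dvc exp run       # or: dvc repro, or simply: python train.py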

The way I would try to organize it (high level for now) will require an extra copy of the data on the NAS (NTFS). Can we tolerate that? (Or, if you decide that you are fine with the DVC data registry, the initial data can be removed.)

  • On the NAS we can allocate a separate directory for the cache
  • All DVC projects (including the data registry one, if we need it - TBD) will use this as the cache dir: e.g. dvc cache dir /mnt/nas/cache.
  • You can “bootstrap” the cache with dvc add /mnt/nas/data -o data in the project (set it up to use symlinks first to avoid an extra local copy)
  • On the GPU cluster, also mount /mnt/nas/cache and set the cache config there to use copy, so when you do dvc pull on that machine it will copy the files to instantiate them (see the sketch after this list).
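To make that concrete, a minimal sketch of the configuration, assuming the NAS is mounted at /mnt/nas on every machine (all paths are examples):

    # in a project on a machine that mounts the NAS (e.g. the data registry)
    dvc cache dir /mnt/nas/cache
    dvc config cache.type symlink      # avoid an extra local copy
    dvc config cache.shared group      # optional: keep cache files group-writable
    dvc add /mnt/nas/data -o data      # "bootstrap" the shared cache

    # in the training project on the GPU cluster (NAS cache mounted there too)
    dvc cache dir /mnt/nas/cache
    dvc config cache.type copy         # pull/checkout creates real copies on fast storage
    dvc pull                           # or dvc checkout if the cache already has everything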

I’m currently just testing with some CSV files, but my data will be images with annotations.

Usually, I know that DS/ML folks also need a way to query/instantiate specific subsets of this data using the annotations. Is that the case? Do you expect this case?

I imagine there to be multiple data registries, as a “project” from our perspective can consist of different (practical) problems with corresponding dataset and we should be able to limit who has access to what.

Can be done, yes, but by splitting the data into data1 … dataN and splitting the cache into cache1 … cacheN.
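Something along these lines, where the directory names are just placeholders:

    # each project points at its own cache directory on the NAS,
    # so access can be restricted per directory on the file-system level
    (cd project-1 && dvc cache dir /mnt/nas/cache1)
    (cd project-2 && dvc cache dir /mnt/nas/cache2)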

Let me know if that makes sense, and we can try to put some scripts/commands together to make it work.

@shcheklein

Sounds a bit fragile and complicated to me. It can potentially be done. Let’s try to brainstorm a bit more.

I felt the same, but I wanted to make it work without having to run any DVC commands on the GPU cluster itself. However, I think this is not strictly necessary. Our infrastructure is about to change, so the setup might be different anyway. So I now open an SSH connection through VS Code and work in a devcontainer so that I can install what I need (this was/is a problem with the GPU cluster).

Assuming this setup (so no longer running DVC locally, but directly on the GPU cluster), your suggestion with the cache on the NAS seems good to me. However, I need to discuss with the admins here whether this is possible. I think it shouldn’t be a problem.

Question: The experiments will generate models/checkpoints as well, which will also be tracked using DVC. With the suggested setup, would these checkpoints then also be written to the cache on the NAS? Or is there a way to separate the projects’ cache locations, i.e., the cache of the data registry and that of the experiments, without duplicating the data too much (I guess at most 2 times: NAS and GPU cluster)?

Usually, I know that DS/ML folks also need a way to query/instantiate specific subsets of this data using the annotations. Is that the case? Do you expect this case?

Not at the moment. How would one solve this problem? Create multiple (git) branches or tags with subsets of the data? Or is there a DVC-supported dynamic approach?

I might be getting a bit confused, please let me know if certain questions are not relevant and it is supposed to be done/used differently.

Thank you!

I might be getting a bit confused, please let me know if certain questions are not relevant and it is supposed to be done/used differently.

No, I believe you are asking all the right questions. And I know it can be a bit intimidating. DVC has multiple parts - cache, remote storage, etc. - that can be combined in different ways to achieve certain results. I feel the same when I’m trying to come up with a way to document or explain this - it is hard to generalize, and every case depends on the requirements.

Yes, with the suggested setup the checkpoints would also end up in the cache on the NAS, but I feel it should not be a problem. There is dvc gc that you could run from time to time to clean it up. The fact that you have data, models, and checkpoints in the same cache would not bother me much, though. There should not be much overhead compared to the data size anyway, right?
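For example (pick the scope flag that matches how much history you want to keep):

    # remove cached objects that are no longer referenced
    dvc gc --workspace                  # keep only what the current workspace uses
    # dvc gc --all-branches --all-tags  # keep everything referenced from branches/tags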

A better question is security, as you mentioned before. That’s where you would potentially want to split projects into multiple caches.

Not at the moment. How would one solve this problem? Create multiple (git) branches or tags with subsets of the data? Or is there a DVC-supported dynamic approach?

There is no good out-of-the-box approach in DVC for this at the moment. DVC doesn’t “index” annotations and doesn’t provide a way to say something like “please give me only cats and dogs” from this dataset. It can still be achieved with DVC, but you would need to provide some way to collect annotations and query them, e.g. at least put them into a single CSV/TSV file tracked by DVC, or into some database, so that you can get the file names that match a certain query. Then you can indeed create different branches, or even directories, with the different subsets of files in them.
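As a very rough sketch of what that could look like with a single annotations CSV (file names, labels, and paths are made up):

    # annotations.csv has lines like: <file name>,<label>
    grep -E ',(cat|dog)$' annotations.csv | cut -d, -f1 > subset.txt

    # copy the matching images into their own directory and track it as a dataset
    mkdir -p subsets/cats-and-dogs
    xargs -a subset.txt -I{} cp data/images/{} subsets/cats-and-dogs/
    dvc add subsets/cats-and-dogs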

Also, I would chat with @volkfox about this. We have a project in private alpha that might help you with managing labels / annotations.

Assuming this setup (so no longer running DVC locally, but directly on the GPU cluster), your suggestion with the cache on the NAS seems good to me. However, I need to discuss with the admins here whether this is possible. I think it shouldn’t be a problem.

Please let me know how it goes, happy to help you set everything up properly.

So I now open an SSH connection through VS Code and work in a devcontainer so that I can install what I need (this was/is a problem with the GPU cluster).

Have you seen the DVC VS Code extension, btw? Would be cool to know if it fits this workflow that you have in mind.

Thank you! I think it’s best I just try what we discussed and then see if it works for us. The questions and suggestions have been very useful already.

Regarding the project that helps with managing labels/annotations: currently, I do not need this. However, for other projects in my company this might be interesting! Are you looking for feedback? Then I could ask around. We work a lot with this kind of data.

I will let you know how it goes, or ask another question if I run into problems (for example: DVCLive + MMCV in a container).

I have seen the DVC VS Code extension! I think it’s awesome. I would like the experiments view to show all branches, though. I saw this already commented/requested somewhere.

Thanks for the help!