I’ve been looking through the documentation and existing questions for the past few days and cannot wrap my head around how to use DVC with the following setup. Or maybe I just need some confirmation that the way I want to do it is reasonable.
As I’m not 100% sure which details are relevant here, I will give a brief description and can add more on request.
- I have a (remote) Windows network drive (NTFS), mounted locally, that contains the data for my projects. I think it is good practice to turn this into a data registry, in case we need the data later for a different project. This is where the first issue arises. What I did so far:
  1. I initialized a git + dvc repo locally on my laptop.
  2. I set up a remote cache on the network drive.
Question: Can I add data that is not in the workspace to the data registry? The mount point is the same for everybody within the company.
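To make the setup above concrete, here is a minimal sketch of the two steps, written as a small Python script that only builds the command lists. The remote name `registry` and the path `Z:/dvc-storage` are placeholders, not values from my actual setup. The sketch also assumes that "remote cache" means a default DVC remote (`dvc remote add -d`); if a shared cache directory is the better fit, `dvc cache dir` would be the command to look at instead.

```python
import subprocess
from pathlib import Path


def registry_setup_commands(storage_path, remote_name="registry"):
    """Return the CLI calls for steps 1 and 2 as argument lists.

    storage_path and remote_name are placeholders for illustration.
    """
    return [
        ["git", "init"],
        ["dvc", "init"],
        # Assumption: "remote cache" = a default DVC remote on the mounted drive.
        ["dvc", "remote", "add", "-d", remote_name, str(storage_path)],
    ]


def run_all(commands, cwd):
    """Run each command inside cwd and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=Path(cwd), check=True)


commands = registry_setup_commands("Z:/dvc-storage")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_all(commands, cwd=...)` with the registry repo folder would actually execute the commands; this needs both git and dvc on the PATH.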
Now, assuming I manage to set up the data registry properly, I have another issue.
- I have another local git + dvc project which will contain the code to train some ML models on the data. However, the training is not done on my laptop, but on a remote “GPU cluster”. I don’t need the data on my laptop; it should go directly to the GPU cluster, which has its own faster storage.
Question: How do I copy the data directly from the data registry’s storage (cache or remote?) without losing the data versioning connection?
I am considering importing the data (dvc import) and mounting the GPU cluster storage to a folder within the local dvc project. That way, the .dvc file would be stored locally while the data itself would be written to the GPU cluster storage.
Question: Is this a good approach? Or is there a better way?
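A rough sketch of the import idea, again only building the commands in Python. The registry URL, the dataset path `datasets/train`, and the mount folder `cluster/` are hypothetical placeholders. I am also assuming that DVC writes the .dvc file next to the output, so the path passed to `dvc update` may need adjusting.

```python
def import_commands(registry_url, data_path, out_path):
    """Build the `dvc import` call and the matching `dvc update` call."""
    # -o sets where the imported data lands in the workspace; here that
    # would be the folder where the GPU cluster storage is mounted.
    import_cmd = ["dvc", "import", registry_url, data_path, "-o", out_path]
    # Assumption: the .dvc file is created next to the output as <out>.dvc.
    update_cmd = ["dvc", "update", out_path + ".dvc"]
    return import_cmd, update_cmd


import_cmd, update_cmd = import_commands(
    "git@example.com:company/data-registry.git",  # hypothetical registry repo
    "datasets/train",                             # hypothetical path in the registry
    "cluster/train",                              # hypothetical mounted cluster folder
)
print(" ".join(import_cmd))
print(" ".join(update_cmd))
```

The second command is what I would expect to run later to bring the import up to date with the registry.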
I hope somebody can help. Please ask if you need more details.
Thank you!