Multiple machines setup for one repo

simon · October 29, 2020, 6:52am

Hi,

First of all - thanks for developing such a great tool! We are currently incorporating it in our deep learning workflow for different computer vision projects.

I think we have a grasp of how it all should be set up and are able to run our pipelines properly on one machine. One big question that we still can’t figure out is the best approach to experimenting and version-controlling with DVC simultaneously on several machines.

Should we create a separate branch per machine and iterate there? But what about shared data (the very first dependency in our pipeline)? Do we just dvc add data and commit those changes in one branch (e.g. with a tag), and then run git checkout <tag> data.dvc and dvc checkout data.dvc on the other machines?

dmitry · October 29, 2020, 8:16am

Hi @simon, good question!

Yep. This will work if you check out the *.dvc file and then run dvc checkout.
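For example, a minimal sketch of that flow as a shell function. It assumes the data is tracked by data.dvc at the repository root, the tag has been pushed to the shared Git remote, and a DVC remote (or shared cache) holds the data. The function name sync_data and the tag name in the usage line are hypothetical.

```shell
# Hypothetical helper: take data.dvc from a tag and sync the workspace to it.
sync_data() {
    sync_tag="$1"
    [ -n "$sync_tag" ] || { echo "usage: sync_data <tag>" >&2; return 2; }

    # Make sure the tag is known locally.
    git fetch --tags || return 1
    # Check out only data.dvc from the tag; the rest of the branch is untouched.
    git checkout "$sync_tag" -- data.dvc || return 1
    # Download the data into the local cache if it is not there yet.
    dvc fetch data.dvc || return 1
    # Place the data described by data.dvc into the workspace.
    dvc checkout data.dvc
}
```

Usage on another machine, from inside the repository: sync_data data-v1 (assuming a tag named data-v1 exists). After this, data.dvc shows up as a change on the current branch and can be committed there.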

Another way is to set up a data registry in a separate repository. Users will be doing:

dvc import https://github.com/iterative/dataset-registry tutorial/nlp/Posts.xml.zip

# something changed
dvc update Posts.xml.zip.dvc

# get a specific version
dvc update --rev my-tag-2.3 Posts.xml.zip.dvc

If you need optimizations, you can investigate more advanced scenarios:

Shared development server scenario - all users on a single machine with no data duplication.
Data cache on a shared NFS drive with symlinks on the user side - users get a nice-looking workspace with all the files (symlinks, actually) but read the actual data from NFS.
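As a rough sketch, the cache side of both scenarios comes down to three settings in the project. The function name use_shared_cache and the path in the usage line are hypothetical; the directory is assumed to exist already and to be writable by every user's group.

```shell
# Hypothetical helper: point the current DVC project at a shared cache directory.
use_shared_cache() {
    shared_cache_dir="$1"
    [ -n "$shared_cache_dir" ] || { echo "usage: use_shared_cache <dir>" >&2; return 2; }

    # Store cached data in the shared location instead of .dvc/cache.
    dvc cache dir "$shared_cache_dir" || return 1
    # Make files written to the cache accessible to the whole group.
    dvc config cache.shared group || return 1
    # Link workspace files to the cache instead of copying them.
    dvc config cache.type symlink
}
```

Usage, from inside the project: use_shared_cache /mnt/nfs/dvc-cache (a made-up mount point). The settings are written to .dvc/config, so committing that file gives every clone on that machine the same cache setup.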

simon · October 29, 2020, 11:57pm

Thanks, @dmitry, for your response, much appreciated! Data registries make so much sense!

We did set up a shared cache on each of our GPU servers and experiment there as per the shared development scenario, thanks! We also have one remote for all projects on one of the servers.

All our servers can only be accessed through SSH due to security requirements, which complicated things a bit. We ended up creating a special dvcuser with very restricted permissions (just enough to push and pull data to the remote). That way we can include its SSH credentials for that remote in .dvc/config, so everyone can clone any DVC project, get those credentials with it, and start experimenting. Having each user access that SSH remote with their own credentials via .dvc/config.local was not considered safe enough.
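A sketch of what such a remote definition could look like, assuming key-based authentication. The function name add_ssh_remote, the remote name, host, storage path and key path are all hypothetical and not taken from our actual setup.

```shell
# Hypothetical helper: register an SSH remote that uses the restricted dvcuser account.
add_ssh_remote() {
    remote_name="$1"; remote_host="$2"; remote_path="$3"; remote_key="$4"
    [ -n "$remote_key" ] || { echo "usage: add_ssh_remote <name> <host> <abs-path> <keyfile>" >&2; return 2; }

    # Add the remote and make it the project default.
    dvc remote add -d "$remote_name" "ssh://$remote_host$remote_path" || return 1
    # Connect as the restricted account rather than the current user.
    dvc remote modify "$remote_name" user dvcuser || return 1
    # Private key to use; the same path has to exist on every machine.
    dvc remote modify "$remote_name" keyfile "$remote_key"
}
```

Usage: add_ssh_remote storage gpu-server.example.com /srv/dvc-storage /etc/dvc/dvcuser_key (all values made up). Without --local, dvc remote modify writes to .dvc/config, which is committed to Git and therefore shared with everyone who clones the project; with --local the same settings would go to .dvc/config.local instead.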

dmitry · October 30, 2020, 5:09am

The design looks good. I hope the data registry will complement your scenario.
Please let us know if you have any other questions. We are always happy to help.
