First of all - thanks for developing such a great tool! We are currently incorporating it in our deep learning workflow for different computer vision projects.
I think we have a grasp of how it all should be set up and are able to run our pipelines properly on one machine. One big question we still can't figure out is the best approach to experimenting and version controlling with DVC simultaneously on several machines.
Create separate branches for each machine and iterate there? But what about shared data (the very first dependency in our pipeline)? Do we just `dvc add data` and commit those changes in one branch (e.g. with a tag), and then on the other machines run:

```
git checkout <tag> data.dvc
dvc checkout data.dvc
```
Hi @simon, good question!
Yep, this will work if you checkout the `*.dvc` file and then run `dvc checkout` on it.
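A minimal sketch of that two-machine flow (the tag name `data-v1` and the assumption that a default DVC remote is already configured are illustrative):

```shell
# On machine A: version the data and publish it
dvc add data
git add data.dvc .gitignore
git commit -m "Track data with DVC"
git tag data-v1
dvc push            # upload data to the shared remote
git push --tags

# On machine B: get exactly that version of the data
git fetch --tags
git checkout data-v1 -- data.dvc
dvc pull data.dvc   # fetch the matching data from the remote and check it out
```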
Another way is to set up a data registry in a separate repository. Users will be doing:

```
dvc import https://github.com/iterative/dataset-registry tutorial/nlp/Posts.xml.zip

# something changed upstream
dvc update Posts.xml.zip.dvc

# get a specific version
dvc update --rev my-tag-2.3 Posts.xml.zip.dvc
```
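On the registry side, publishing a new dataset version could look roughly like this (the commit message and tag name are illustrative):

```shell
# In the dataset-registry repository, after updating the data file
dvc add tutorial/nlp/Posts.xml.zip
git add tutorial/nlp/Posts.xml.zip.dvc
git commit -m "Update Posts.xml.zip"
git tag my-tag-2.3
dvc push            # upload the new data version to the registry's remote
git push --tags
```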
If you need optimizations you can investigate more advanced scenarios:
- Shared development server - all users work on a single machine with no data duplication.
- Set up the data cache on a shared NFS drive with symlinks on the user side - users get a nice-looking workspace with all the files (actually symlinks) while reading the actual data from NFS.
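A rough sketch of that shared-cache setup (the NFS mount point is illustrative):

```shell
# One-time setup per project on the shared machine
dvc cache dir /nfs/dvc-cache    # point the cache at the shared NFS drive
dvc config cache.shared group   # make cache files group-writable for all users
dvc config cache.type symlink   # link files into the workspace instead of copying
git add .dvc/config
git commit -m "Use shared NFS cache"
```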
Thanks, @dmitry, for your response, much appreciated! Data registries make so much sense!
We did set up a shared cache on each of our GPU servers and experiment there as per the shared development scenario, thanks! We also host a single remote for all projects on one of the servers.
All our servers can only be accessed through SSH due to security requirements, which complicated things a bit. We ended up creating a special `dvcuser` with very restricted permissions (just enough to push and pull data to the remote). That way we can include its SSH credentials for that remote in `.dvc/config`, so everyone can clone any DVC project, get those credentials pulled along, and start experimenting. The scenario where users access that SSH remote with their own credentials by putting them in `.dvc/config.local` was not considered safe enough.
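For reference, the remote configuration we ended up with looks roughly like this (the host name, storage path, and key file location are illustrative):

```shell
# Configure the SSH remote with the shared dvcuser credentials
dvc remote add -d storage ssh://gpu-server/home/dvcuser/dvc-storage
dvc remote modify storage user dvcuser
dvc remote modify storage keyfile .dvc/dvcuser_key  # key shipped with the repo per our setup
git add .dvc/config
git commit -m "Add shared SSH remote"
```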
The design looks good. I hope the data registry will complement your scenario.
Please let us know if you have any other questions. We are always happy to help!