Does DVC work with a LAN infrastructure where the Git repos are not on the computing server?

I’m setting up a machine learning group with some colleagues. The local area network infrastructure will be:

  • Git server (one machine)
  • Computing-data server (another machine)
  • Local machines for users

The way of working will be:

  • Each user will have their own git repos on their local machine (with their authentication keys on that machine).
  • Data and scripts have to be transferred to the computing server (data can stay on the server temporarily, only while it is needed by the ML scripts).
  • Users launch the ML operations from their local machines on the computing server via SSH.

We are performing some tests, and for now we are doing the synchronization steps in the DVC chain by hand: first, code and data are synced between the local machine and the computing server via SSH. Then the training is launched via SSH. Next, the outputs are transferred back from the server to the local machine so that DVC can track the changes, and so on…
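For readers who want to picture this manual round trip, here is a minimal sketch of it. Everything specific in it is an assumption, not something stated in the post: the host user@compute-server, the remote directory work/project, the script name train.py and the outputs/ directory are hypothetical placeholders, and rsync is only one possible way to copy files over SSH. The script prints the commands by default and runs them only when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the manual sync/train/sync-back loop described above.
# SERVER, REMOTE_DIR, train.py and outputs/ are hypothetical placeholders.
DRY_RUN="${DRY_RUN:-1}"
SERVER="${SERVER:-user@compute-server}"
REMOTE_DIR="${REMOTE_DIR:-work/project}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

round_trip() {
    # 1. Send code and data to the computing server over SSH
    #    (the local git history stays on the user's machine).
    run rsync -az -e ssh --exclude .git ./ "$SERVER:$REMOTE_DIR/"
    # 2. Launch the training remotely.
    run ssh "$SERVER" "cd $REMOTE_DIR && python train.py"
    # 3. Copy the outputs back so the local DVC repo can track them.
    run rsync -az -e ssh "$SERVER:$REMOTE_DIR/outputs/" ./outputs/
}
```

Calling round_trip in dry-run mode lists the three steps without touching the network.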

Questions and issues (among others):

  1. Can the path for the *.dvc file be specified? (one path for each model)
  2. Can DVC live on the computing server while users keep their git repos on their local machines?

Thank you very much for your effort, we really appreciate your work.

Hi Ignatio!

Thank you for trying out dvc!

  1. Yes, it can. Please see the -c and -f options for dvc run: -c specifies the directory in which your command is run, and -f specifies the name of the dvcfile. For example, if you run dvc run -c dir -f my.dvc cp ../foo bar, then foo will be copied to dir/bar and the dvcfile dir/my.dvc will be created.

  2. Unfortunately, as of right now dvc doesn’t support remote execution by itself, but we do have plans to implement it in the future. However, I see that you are manually transferring the data back from your computing server, which is not optimal. Have you considered using dvc push/pull to sync your data? You could add an s3/gcp/sftp server that stores your data and run dvc push/pull whenever you run git push/pull, so you never have to transfer any data manually. For example:

    1. users work on their local machines, pushing changes to the git repo as usual
    2. when the computing server is done building (dvc repro), it commits the changes to git, then runs git push and dvc push to push the data to the cloud
    3. users on their local machines can now sync their local repos with git pull and their data with dvc pull
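The workflow above could be sketched on the command line roughly as follows. This is only an illustration under assumptions: the remote name storage, the ssh:// URL and the commit messages are hypothetical, the dvc run line is the example quoted in the answer (its flags may differ in other DVC versions), and the extra git pull / dvc pull on the server assumes users also ran dvc push for any data they added. The script prints the commands by default and runs them only when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the git + dvc push/pull workflow; dry-run by default.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One-time setup: register a default DVC remote and share it via git.
# "storage" and the URL are hypothetical placeholders.
setup_remote() {
    run dvc remote add -d storage ssh://user@data-server/srv/dvc-storage
    run git add .dvc/config
    run git commit -m "Configure DVC remote"
}

# Define a stage with its own working directory and dvcfile name
# (the example from the answer: creates dir/bar and dir/my.dvc).
define_stage() {
    run dvc run -c dir -f my.dvc cp ../foo bar
}

# On the computing server: fetch the users' changes, rebuild the given
# dvcfile, then publish the updated metadata (git) and data (dvc).
server_build() {
    run git pull
    run dvc pull
    run dvc repro "$1"
    run git commit -a -m "Update outputs"
    run git push
    run dvc push
}

# On a user's machine: sync code and metadata, then the data itself.
user_sync() {
    run git pull
    run dvc pull
}
```

For instance, server_build dir/my.dvc on the computing server followed by user_sync on each local machine covers steps 2 and 3, with no manual file transfers.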

Thank you very much for your fast response. We will carry on performing tests following your recommendations, and we’ll let you know our progress (if any).