DVClive + MMCV in Container

RiCk · July 4, 2022, 2:29pm

Hi, I’m running into problems using the DVCLive hook for MMCV when the training runs in a separate “training” container.

I’m running dvc exp run from within a devcontainer (on a remote host). The actual training is run on the host in another container (which has all the right dependencies installed). In my config.py I included the DvcliveLoggerHook as per the documentation. I noticed that the interval and by_epoch arguments were not being applied, i.e., I wasn’t seeing steps 1, 2, 3, etc. (up to max_epochs) in dvc exp show.
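For readers unfamiliar with the setup, a hook registration of the kind described above might look like the following sketch. It assumes the MMCV 1.x config convention in which interval is set once in log_config and handed to each logger hook; the values and the extra TextLoggerHook entry are illustrative and are not taken from the original config.py.

```python
# Minimal sketch of a log_config section in an MMCV-style config.py.
# Assumption: MMCV 1.x passes the top-level `interval` to every logger hook.
log_config = dict(
    interval=1,  # illustrative value: log at every interval unit
    hooks=[
        dict(type="TextLoggerHook"),  # illustrative: console logging
        # by_epoch=True asks the hook to count steps in epochs, not iterations
        dict(type="DvcliveLoggerHook", by_epoch=True),
    ],
)
```

The exact arguments accepted by DvcliveLoggerHook depend on the installed MMCV and DVCLive versions, so the hook's documentation for that version remains the authority.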

I realized, after some hours of reading the documentation and trying out different configs, that it might be because Live.set_step() is called from within a DvcliveLoggerHook running in one container while the DVC “instance” runs in another. So there is no DVC to notify in the container that is trying to notify DVC.

Could this be the problem? Is there a way to fix it?

One workaround I can imagine is to install everything in the devcontainer. This is doable for the container I currently need, but I would rather not.

Thank you!

daavoo · July 4, 2022, 3:08pm

Hi @RiCk ! Could you share some more details on your setup?

How is “The actual training is run on the host in another container” reflected in your dvc.yaml?

RiCk · July 5, 2022, 6:46am

Hi @daavoo! The stage uses cmd: /bin/bash train.sh, where train.sh contains a docker run command that links some folders so that data can be read and results can be written within the devcontainer. It also includes the command that is executed in this container to perform the actual training.

stages:
  train:
    cmd: /bin/bash src/train.sh
    deps:
    - datasets/dataset1
    - src
    params:
    - optimizer.type
    - optimizer.lr
    - optimizer.momentum
    - optimizer.weight_decay
    outs:
    - results/checkpoints/model.pth:
        checkpoint: true 
    metrics:
    - results/dvclive.json: 
        cache: false
    plots:
    - results/dvclive/scalars:
        cache: false

Thank you!

daavoo · July 5, 2022, 1:09pm

Thanks for the additional details. I think the issue is not that DVCLive fails to log, but that you are expecting DVC checkpoints to be active.

You should be able to verify that DVCLive is logging correctly by checking that the files logged in results/dvclive/scalars/ contain rows for each interval / epoch.
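The check described above can be sketched as a small script. It assumes the scalar files are tab-separated with a single header row; the function name count_logged_rows is hypothetical and not part of DVCLive.

```python
import csv
from pathlib import Path


def count_logged_rows(scalars_dir):
    """Return {relative file name: number of data rows} for each .tsv file.

    Assumption: every file has exactly one header row followed by one row
    per logged step.
    """
    base = Path(scalars_dir)
    counts = {}
    for tsv_path in sorted(base.rglob("*.tsv")):
        with tsv_path.open(newline="") as handle:
            rows = list(csv.reader(handle, delimiter="\t"))
        # Subtract the header row; an empty file counts as zero rows.
        counts[tsv_path.relative_to(base).as_posix()] = max(len(rows) - 1, 0)
    return counts
```

Calling count_logged_rows("results/dvclive/scalars") after a run should report one count per metric; if the counts match the expected number of intervals or epochs, DVCLive itself is logging as intended.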

Regarding DVC checkpoints, I think that there are 2 different issues.

First, DVCLive relies on DVC exposing an env var called DVC_CHECKPOINT:

github.com: iterative/dvc/blob/eed6a8483cdefc4e52dc65068f8ce3456ef81b12/dvc/stage/__init__.py#L289-L291

from dvc.env import DVC_CHECKPOINT

env.update({DVC_CHECKPOINT: "1"})

I suspect that this env var is not being passed to docker run, so DVCLive is not finding it.

Second, after DVC checkpoints have been activated, DVCLive creates a signal file inside the .dvc folder, so you would also need to mount that folder in order to allow DVCLive to trigger the DVC checkpoint:

github.com: iterative/dvclive/blob/b16f60a089a8bc40d9a620417a2b87f6cf19ce9e/dvclive/dvc.py#L35-L39

root_dir = _find_dvc_root()
if not root_dir:
    return

signal_file = os.path.join(root_dir, ".dvc", "tmp", env.DVC_CHECKPOINT)

RiCk · July 5, 2022, 1:26pm

Thank you! So I do the following:

1. Pass DVC_CHECKPOINT, as set in the devcontainer, as an env variable to the container that does the training? docker run -e DVC_CHECKPOINT=$DVC_CHECKPOINT etc.
2. Mount the .dvc folder. I’m not sure where to mount the .dvc folder though. docker run -v .dvc:???

EDIT: Sorry, I overlooked your first comment. The files in results/dvclive/scalars do contain rows for each interval/step. However, I tried to change the step to be per epoch (by_epoch=True) and that never worked, i.e., it ran but it didn’t change the output to per epoch.

daavoo · July 5, 2022, 1:30pm

1. Yes.
2. Ideally in the same working directory where the training script is launched.
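The two points confirmed above can be combined into one docker run invocation. The sketch below only assembles the argument list; the function name, the image name, and the /workspace default are hypothetical placeholders, while -e, -v, -w, and --rm are standard docker run flags.

```python
import os


def build_docker_run_args(image, train_cmd, host_repo_dir,
                          container_workdir="/workspace", env=None):
    """Assemble a `docker run` argument list (hypothetical helper).

    Forwards DVC_CHECKPOINT when it is set and mounts the repository's
    .dvc folder into the working directory used by the training command.
    """
    env = os.environ if env is None else env
    args = ["docker", "run", "--rm"]
    checkpoint_flag = env.get("DVC_CHECKPOINT")
    if checkpoint_flag is not None:
        # Let DVCLive inside the container see that checkpoints are enabled.
        args += ["-e", f"DVC_CHECKPOINT={checkpoint_flag}"]
    # Mount .dvc so that the signal file lands where DVC looks for it.
    args += ["-v", f"{host_repo_dir}/.dvc:{container_workdir}/.dvc"]
    # Start the training command in the directory that contains .dvc.
    args += ["-w", container_workdir, image]
    args += list(train_cmd)
    return args
```

The resulting list can be passed to subprocess.run, which also waits for the container to exit. Other mounts (datasets, results) would be added in the same way.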

daavoo · July 5, 2022, 1:34pm

Do you mind opening an issue in the iterative/dvclive GitHub repository?

RiCk · July 5, 2022, 1:36pm

I’ll test a bit, and if it is still happening I will open an issue. No problem!

RiCk · July 6, 2022, 11:44am

@daavoo

I pass DVC_CHECKPOINT using docker run -e DVC_CHECKPOINT=$DVC_CHECKPOINT. This works fine, but nothing changes.

Problem:

I add the .dvc folder to the “workspace”. The container opens in a folder called /workspace, and I mounted the .dvc folder like so: -v $LOCAL_WORKSPACE_FOLDER/.dvc:/workspace/.dvc. With this, the training hangs: the container is still running, but no output is generated, as it stops right before the training should start. From the log file (output of mmcv, mmdet):

INFO:mmdet:workflow: [('train', 1)], max: 2 epochs
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!

The DvcliveLoggerHook is registered:

before_run:
(VERY_HIGH   ) StepLrUpdaterHook                  
(LOW         ) EvalHook                           
(VERY_LOW    ) DvcliveLoggerHook

It is also registered in before_train_epoch, after_train_iter (which may be odd, since I specified that it should log per epoch), after_train_epoch, before_val_epoch, and after_val_epoch.

Maybe relevant: DVC does recognize that it is a checkpoint experiment, but it almost immediately says: “Updating lock file, To track changes etc., Experiment results have been applied to your workspace”. This happens before training has finished (or even started). I would expect something like the output shown in the DVC Checkpoints documentation.

Thanks for the help so far!

daavoo · July 6, 2022, 7:20pm

Hi @RiCk , thanks for the details. I will try to reproduce, debug and get back to you

RiCk · July 13, 2022, 8:15am

I made some progress on this issue and am updating the post here in case it is helpful to somebody. Specifically, it concerns this comment from my previous post:

Maybe relevant: …, but it almost immediately says the: “Updating lock file, To track changes etc., Experiment results have been applied to your workspace”. Before training has finished (or even started). …

/bin/bash train.sh finished immediately, signaling to DVC that the training was done. However, the training wasn’t done; it was still running in the other container. The script finished immediately because it was implemented that way. I have now changed it so that the script only finishes once the training has actually finished, and this fixed a lot of my problems.
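The fix described here amounts to keeping the stage command in the foreground until training ends. One way to sketch that, assuming the container is started without docker run's detach flag (-d), is a thin wrapper that blocks on the child process; run_and_wait is a hypothetical name.

```python
import subprocess


def run_and_wait(command):
    """Run `command` in the foreground and return its exit code.

    subprocess.run blocks until the child process exits, so a DVC stage
    that calls this only completes once the training command completes.
    """
    completed = subprocess.run(command)
    return completed.returncode
```

A stage script would call run_and_wait with the full docker run argument list and exit with the returned code, so that a failed training run also fails the DVC stage.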

I think I saw some features that might help with this in the future? Machine management/ Remote executors?

daavoo · July 22, 2022, 8:50am

Hi @RiCk ! sorry for the late reply.

“will update the post here such that it might be helpful to somebody”

Don’t hesitate to share.

“I changed it now such that the training is actually finished when the script finishes and this fixed a lot of my problems”

Glad to hear that.

“I think I saw some features that might help with this in the future? Machine management/ Remote executors?”

There was some initial development in DVC on this topic, but it is paused for now. Don’t hesitate to open an issue or discussion to share your use case / needs.
