Hi, I’m running into problems using the DVCLive hook for MMCV when the training runs in a separate “training” container.
I’m running my dvc exp run from within a devcontainer (on a remote host). The actual training runs on the host in another container (one that has all the right dependencies installed). In my config.py I included the DvcliveLoggerHook as per the documentation. I noticed that the interval and by_epoch arguments were not being applied, i.e., I wasn’t seeing any step 1, 2, 3, etc. (up to max_epochs) in my dvc exp show.
After some hours of reading the documentation and trying out different configs, I realized that it might be because Live.set_step() is called from within a DvcliveLoggerHook running in one container while the DVC “instance” runs in another. So there is no DVC to notify in the container that is trying to notify DVC.
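For readers unfamiliar with the setup, here is a minimal sketch of the kind of config.py fragment this refers to, assuming the MMCV 1.x convention of a log_config dict with a list of hooks. The values are illustrative and not taken from the config discussed in this thread. Depending on the MMCV version, DvcliveLoggerHook may take further arguments, and how interval and by_epoch affect the logged step may differ.

```python
# Sketch of an MMCV 1.x style config.py fragment (illustrative values only).
log_config = dict(
    interval=1,  # default logging interval handed to every logger hook
    hooks=[
        dict(type="TextLoggerHook"),
        # The two arguments discussed in this thread: interval and by_epoch.
        dict(type="DvcliveLoggerHook", interval=1, by_epoch=True),
    ],
)
```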
Could this be the problem? Is there a way to fix it?
One workaround I can imagine is to just install everything in the devcontainer. This is doable for the container I currently need, but I would rather not.
Hi @daavoo! The cmd is /bin/bash train.sh, where train.sh contains a docker run command that mounts some folders so that data can be read and results can be written within the devcontainer. It also includes the command that is then executed in this container to perform the actual training.
Pass the DVC_CHECKPOINT value set in the devcontainer as an environment variable to the container that does the training? E.g., docker run -e DVC_CHECKPOINT=$DVC_CHECKPOINT etc.
Mount the .dvc folder? I’m not sure where to mount it, though: docker run -v .dvc:???
EDIT: Sorry, I overlooked your first comment. The results/dvclive/scalars files do contain rows for each interval/step. However, I tried to change the step to be the epoch (by_epoch=True) and that never worked, i.e., it ran but it didn’t change the output to per epoch.
I pass DVC_CHECKPOINT using docker run -e DVC_CHECKPOINT=$DVC_CHECKPOINT. Passing it works fine, but nothing changes.
I add the .dvc folder to the “workspace”. The container opens in a folder called /workspace, and I added the .dvc folder like so: -v $LOCAL_WORKSPACE_FOLDER/.dvc:/workspace/.dvc. With that, the training hangs. The container is still running, but no output is generated, as it stops right before the training should start. From the log file (output of mmcv, mmdet):
INFO:mmdet:workflow: [('train', 1)], max: 2 epochs
loading annotations into memory...
loading annotations into memory...
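The two attempts above can be sketched as a small helper that assembles the docker run arguments. This is only an illustration of the flags mentioned in the thread (-e DVC_CHECKPOINT=... and -v .../.dvc:/workspace/.dvc). The function name build_docker_run, the image name training-image, and the training command are hypothetical and do not come from the actual train.sh.

```python
import os
import shlex


def build_docker_run(image, train_cmd, workspace, mount_dvc=False):
    """Return the argv for a docker run call that forwards DVC_CHECKPOINT."""
    argv = ["docker", "run"]
    # Attempt 1: forward the variable only if it is set in the calling shell.
    checkpoint = os.environ.get("DVC_CHECKPOINT")
    if checkpoint is not None:
        argv += ["-e", f"DVC_CHECKPOINT={checkpoint}"]
    if mount_dvc:
        # Attempt 2: bind-mount the repository's .dvc folder into /workspace.
        argv += ["-v", f"{workspace}/.dvc:/workspace/.dvc"]
    # Image name first, then the command to execute inside the container.
    argv += [image] + shlex.split(train_cmd)
    return argv


# Hypothetical image and command, shown only to illustrate the result.
example = build_docker_run(
    "training-image",
    "python tools/train.py config.py",
    os.environ.get("LOCAL_WORKSPACE_FOLDER", "."),
    mount_dvc=True,
)
print(shlex.join(example))
```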
It is also registered in before_train_epoch, after_train_iter (which is maybe odd, since I set it to log per epoch), after_train_epoch, before_val_epoch, and after_val_epoch.
Maybe relevant: DVC does recognize that it is a checkpoint experiment, but it almost immediately says: “Updating lock file, To track changes etc., Experiment results have been applied to your workspace”. This happens before training has finished (or even started). I would expect something like the output shown in Checkpoints | Data Version Control · DVC.
I made some progress on this issue and am updating the post here so that it might be helpful to somebody. Specifically, it concerns this comment from my previous post:
Maybe relevant: …, but it almost immediately says: “Updating lock file, To track changes etc., Experiment results have been applied to your workspace”. Before training has finished (or even started). …
The cause was that /bin/bash train.sh finished immediately, signaling to DVC that the training was done. However, the training wasn’t done; it was still running in the other container. The script finished immediately because it was implemented that way. I have now changed it so that the script only finishes once the training has actually finished, and this fixed a lot of my problems.
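The fix described above comes down to making the stage command block until the training process exits. Below is a minimal Python sketch of that idea. The helper run_and_wait is hypothetical, and the actual change made to train.sh is not shown in this thread. With docker, one common reason for an early return is starting the container detached (docker run -d); whether that was the cause here is an assumption.

```python
import subprocess
import sys


def run_and_wait(argv):
    """Run a command in the foreground and return its exit code."""
    # subprocess.run blocks until the child process exits, so the caller
    # (and therefore DVC) only continues once the command has really ended.
    return subprocess.run(argv).returncode


# Stand-in for the real training command, e.g. a docker run without -d.
demo = [sys.executable, "-c", "import time; time.sleep(0.2)"]
exit_code = run_and_wait(demo)
print("training command exited with", exit_code)
```

Returning the exit code lets the wrapper propagate a failed training run instead of reporting success.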
I think I saw some features that might help with this in the future? Machine management / remote executors?