In our project we sometimes have long repros (e.g., with hyperparameter optimization) that can take hours or days to finish. While such a repro is running, I can’t change branches because:
- The repro reads files as the stages are being executed, so changing branches mid-repro could make subsequent stages use file versions from a different branch;
- We have commit hooks that use dvc, which complains about the lockfile if I try to change branches during a repro.
Right now I see two ways to keep working while a repro is running:
1. Run the repro on another machine (e.g., in the cloud).
2. Create an entire copy of my repo in another folder, and work there.
Solution 1 is bad because I have to waste resources to keep the cloud instance running, and because running it remotely is cumbersome: I don’t currently have a system to notify me if the repro fails, so I have to keep checking, and I don’t have the same tools on the remote machine to investigate outputs, monitor resource usage, etc.
Solution 2 is bad because I need twice as much data stored on my computer (which I usually don’t have enough space for; it’s a big project), and because more than once I ended up working in the wrong copy of the repo, messing up the repro and having to move my changes to the other copy.
Are there any recommended approaches here? This seems like a common problem that most users with long-running repros would face.