I am running jobs on a cluster that has a login node and compute nodes. My DVC stages submit a job to the compute nodes and periodically check its status until it is done. These stages are run from the login node. The login node limits how much CPU a process can use, and a process that exceeds the limit is cancelled. Since the stages are mostly idle (sleeping), they don’t exceed the CPU threshold, but once a stage finishes, DVC computes the hashes of the outputs and exceeds the limit. To fix this, I want to manually trigger computation of the hashes on the compute nodes, i.e. as part of the job that is submitted. Is this possible? Is there a better way to get around this issue?
Have you tried adjusting the `checksum_jobs` config option, like turning it down to 1 for example?
It looks like I have a hard 10-minute limit on CPU time, which I’m pretty sure is multiplied by usage, e.g. using 100% of 4 cores for 1 minute counts as 4 minutes. Will reducing the number of checksum_jobs help with this?
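To make the arithmetic concrete, here is a minimal sketch of how such a limit would be consumed, assuming CPU time is simply summed across cores as described above. The function names are hypothetical and the accounting rule is an assumption about this cluster, not something confirmed by its documentation.

```python
def cpu_minutes(cores: int, utilization: float, wall_minutes: float) -> float:
    """CPU time charged, assuming it is summed across all busy cores."""
    return cores * utilization * wall_minutes


def wall_minutes_until_limit(limit_minutes: float, cores: int, utilization: float) -> float:
    """Wall-clock minutes a process can run before hitting the CPU-time limit."""
    return limit_minutes / (cores * utilization)


print(cpu_minutes(4, 1.0, 1))                # 4.0 (the example above)
print(wall_minutes_until_limit(10, 4, 1.0))  # 2.5
print(wall_minutes_until_limit(10, 1, 1.0))  # 10.0
```

Under this assumption, limiting the work to one core does not increase the total budget. It only stretches the same 10 CPU-minutes over more wall-clock time.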
I forgot to include a link: Configuration.
You may have to play with the values, but some users find that increasing or decreasing it can impact compute time. This only adjusts the number of threads, not processes/cores, so it should only be using a single core no matter the value you set.
I can try that, thanks.
If there is too much data to checksum on a single core in 10 minutes, is there any other way for me to perform this computation on a different machine?
Thinking about it more though, why/how do you submit and run the job from the login node?
We are using PBS / qsub to manage the cluster. The typical workflow has been to ssh into the login node and submit jobs in the form of bash scripts to the scheduler via qsub. I have written a Python script that templates the job’s bash script, submits it to the scheduler, then waits until the job finishes before returning. This way I can replace `python <script>` with `python submit_job.py <script>` in my stages and everything works fine. This was okay until recently, when we got a new dataset that requires too much computation for the checksums.
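For readers who want to picture this kind of wrapper, the following is a rough sketch, not the actual `submit_job.py`. The template contents, function names, and default walltime are all hypothetical. It also assumes that `qsub` prints the job id when given a script on stdin, and that `qstat <job_id>` exits non-zero once the scheduler no longer knows the job, which may differ between PBS variants.

```python
import subprocess
import time
from string import Template

# Hypothetical job template; "$$" is an escaped "$" so the shell sees $PBS_O_WORKDIR.
JOB_TEMPLATE = Template("""\
#!/bin/bash
#PBS -N $name
#PBS -l walltime=$walltime
cd "$$PBS_O_WORKDIR"
python $script
""")


def render_job(script: str, name: str = "dvc-stage", walltime: str = "01:00:00") -> str:
    """Fill the template for one stage script."""
    return JOB_TEMPLATE.substitute(name=name, walltime=walltime, script=script)


def submit(job_text: str) -> str:
    """Pass the job script to qsub on stdin and return the printed job id."""
    result = subprocess.run(
        ["qsub"], input=job_text, text=True, capture_output=True, check=True
    )
    return result.stdout.strip()


def is_active(job_id: str) -> bool:
    # Assumption: qstat fails once the job has left the queue.
    return subprocess.run(["qstat", job_id], capture_output=True).returncode == 0


def wait_for(job_id: str, poll_seconds: int = 60, is_active=is_active, sleep=time.sleep) -> None:
    """Sleep between status checks so the login-node process stays mostly idle."""
    while is_active(job_id):
        sleep(poll_seconds)
```

A stage would then call something like `wait_for(submit(render_job("<script>")))`. This sketch does not propagate the job's exit status, so a failed job would still look like a successful stage to DVC.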
I am currently looking into submitting the individual stage jobs from a compute node.
I only have familiarity with SLURM, but from what I understand, they are similar at a high level. A successful pattern I have seen is to submit the entire pipeline as a job. In other words, run `dvc exp run` inside your job script and keep all the individual stages as normal code like `python <script>` (this also makes it easier to debug and run anywhere). You can template this pattern like you have done if you don’t want to manually edit the job submission script each time.
One way to share the repo state with the worker nodes is to commit and push changes to Git and DVC remotes, and your job can check out that revision before running. After the experiment completes, you can run `dvc exp push` to save the results back to your remote storage (or commit and push the results manually).
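A templated version of this whole-pipeline pattern might look roughly like the sketch below. The job name, default walltime, Git remote name `origin`, and function name are assumptions for illustration. The `dvc pull` step may be unnecessary if the workspace lives on a shared filesystem, and the exact arguments `dvc exp push` expects can vary between DVC versions.

```python
from string import Template

# Hypothetical PBS job that runs the full pipeline on a compute node.
PIPELINE_JOB = Template("""\
#!/bin/bash
#PBS -N dvc-pipeline
#PBS -l walltime=$walltime
cd "$$PBS_O_WORKDIR"
git fetch $git_remote
git checkout $rev
dvc pull
dvc exp run
dvc exp push $git_remote
""")


def render_pipeline_job(rev: str, walltime: str = "04:00:00", git_remote: str = "origin") -> str:
    """Render a job script that checks out `rev`, runs the pipeline, and pushes results."""
    return PIPELINE_JOB.substitute(rev=rev, walltime=walltime, git_remote=git_remote)


print(render_pipeline_job("main"))
```

Because DVC itself runs inside the job, the hashing of outputs happens on the compute node rather than on the login node.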
The solution that is currently working for me is submitting the `dvc repro <pipeline>` command in a bash script as a job. I just make sure to request enough time for the entire pipeline to run.
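One way to pick the requested time is to add up rough per-stage estimates and apply a safety margin. This is only a sketch: the stage durations and the 1.5x margin below are made-up illustrative values, and the helper name is hypothetical.

```python
def walltime(stage_minutes, margin=1.5):
    """Format the summed stage estimates, scaled by a margin, as HH:MM:SS."""
    total_seconds = int(sum(stage_minutes) * 60 * margin)
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Hypothetical estimates (in minutes) for three stages of the pipeline.
print(f"#PBS -l walltime={walltime([20, 45, 90])}")  # #PBS -l walltime=03:52:30
```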
Thanks for the help