Manually recompute hashes

gregstarr · October 31, 2023, 7:10pm

Hello,

I am running jobs on a cluster that has a login node and compute nodes. My DVC stages submit a job to the compute nodes and periodically check its status until it is done. These stages are run from the login node. The login node limits how much CPU a process can use, and a process that exceeds the limit is cancelled. Since the stages are mostly idle, sleeping between status checks, they don’t exceed the CPU threshold. Once a stage finishes, however, DVC computes the hashes of the outputs and exceeds the CPU limit. To fix this, I want to manually trigger computation of the hashes on the compute nodes, i.e. as part of the job that is submitted. Is this possible? Is there a better way to get around this issue?

Thanks

dberenbaum · October 31, 2023, 7:21pm

Have you tried to adjust the core.checksum_jobs value?

gregstarr · October 31, 2023, 7:24pm

Like turning it down to 1, for example?

gregstarr · October 31, 2023, 7:49pm

It looks like I have a hard 10-minute limit on CPU time, which I’m pretty sure is multiplied by usage, e.g. using 100% of 4 cores for 1 minute counts as 4 minutes. Will reducing the number of checksum_jobs help with this?
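
The accounting described above can be sketched as follows. This assumes the limit counts CPU time as wall-clock minutes multiplied by the number of fully busy cores, which is the poster's guess and not a confirmed cluster policy. The function names are illustrative only.

```python
CPU_LIMIT_MINUTES = 10  # the hard limit mentioned in the post


def cpu_minutes(busy_cores: float, wall_minutes: float) -> float:
    """CPU time consumed, assuming it scales with the number of busy cores."""
    return busy_cores * wall_minutes


def wall_minutes_until_limit(busy_cores: float, limit: float = CPU_LIMIT_MINUTES) -> float:
    """Wall-clock minutes a process can run before reaching the CPU limit."""
    return limit / busy_cores


print(cpu_minutes(4, 1))            # 4 cores at 100% for 1 minute
print(wall_minutes_until_limit(4))  # budget with 4 busy cores
print(wall_minutes_until_limit(1))  # budget on a single core
```

Under this assumption, using fewer cores stretches the wall-clock budget but does not reduce the total CPU time the hashing needs.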

dberenbaum · October 31, 2023, 7:56pm

I forgot to include a link: Configuration.

You may have to play with the value, but some users find that increasing or decreasing it affects compute time. This only adjusts the number of threads, not processes/cores, so it should only be using a single core no matter the value you set.

gregstarr · October 31, 2023, 8:01pm

I can try that, thanks.

If there is too much data to checksum on a single core in 10 minutes, is there any other way for me to perform this computation on a different machine?

dberenbaum · October 31, 2023, 8:02pm

Thinking about it more though, why/how do you submit and run the job from the login node?

gregstarr · October 31, 2023, 8:20pm

We are using PBS / qsub to manage the cluster. The typical workflow has been to ssh into the login node and submit jobs in the form of bash scripts to the scheduler via qsub. I have written a Python script that templates the job bash script, submits it to the scheduler, then waits until the job finishes before returning. This way I can replace python <script> with python submit_job.py <script> in my stages and everything works fine. This was okay until recently, when we got a new dataset that requires too much computation for the checksums.
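
A wrapper of the kind described above might look roughly like this. It is a minimal sketch, not the poster's actual submit_job.py: the template contents, function names, default walltime, and polling interval are all assumptions, and it assumes that qstat exits with a non-zero status once the job has left the queue.

```python
import subprocess
import sys
import tempfile
import time

# Hypothetical template; resource requests depend on the cluster.
JOB_TEMPLATE = """#!/bin/bash
#PBS -N {name}
#PBS -l walltime={walltime}
cd "$PBS_O_WORKDIR"
python {script}
"""


def render_job(script: str, name: str = "dvc-stage", walltime: str = "01:00:00") -> str:
    """Fill in the job template for one stage script."""
    return JOB_TEMPLATE.format(name=name, walltime=walltime, script=script)


def submit(job_text: str) -> str:
    """Write the job script to a file, submit it with qsub, return the job id."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(job_text)
    result = subprocess.run(["qsub", f.name], check=True, capture_output=True, text=True)
    return result.stdout.strip()


def wait_for(job_id: str, poll_seconds: int = 60) -> None:
    """Poll the scheduler, sleeping in between so the wrapper stays nearly idle.

    Assumption: qstat returns non-zero once the job is no longer queued or running.
    """
    while subprocess.run(["qstat", job_id], capture_output=True).returncode == 0:
        time.sleep(poll_seconds)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python submit_job.py <script>
    wait_for(submit(render_job(sys.argv[1])))
```

This sketch does not inspect the job's exit status, so a failed job would still look like a successful stage. Because the wrapper returns on the login node, DVC hashes the stage outputs there, which is the step that runs into the CPU limit.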

I am currently looking into submitting the individual stage jobs from a compute node.

dberenbaum · November 1, 2023, 4:29pm

I am only familiar with SLURM, but from what I understand, it and PBS are similar at a high level. A successful pattern I have seen is to submit the entire pipeline as a job. In other words, run dvc exp run inside your job script and keep all the individual stages as normal code like python <script> (this also makes it easier to debug and run anywhere). You can template this pattern like you have done if you don’t want to manually edit the job submission script each time.

One way to share the repo state with the worker nodes is to commit and push changes to Git and DVC remotes, and your job can check out that revision before running. After the experiment completes, you can run dvc exp push to save the results back to your remote storage (or commit and push the results manually).

gregstarr · November 3, 2023, 9:55pm

The solution that is currently working for me is submitting the dvc repro <pipeline> command in a bash script as a job. I just make sure to request enough time for the entire pipeline to run.
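
For illustration, a job script along these lines could be generated with a small template. This is a sketch under assumptions, not the poster's actual script: the job name, default walltime, default target, and function name are hypothetical.

```python
# Hypothetical template for a job that runs the whole pipeline on a compute node.
PIPELINE_JOB = """#!/bin/bash
#PBS -N dvc-pipeline
#PBS -l walltime={walltime}
cd "$PBS_O_WORKDIR"
dvc repro {target}
"""


def render_pipeline_job(target: str = "dvc.yaml", walltime: str = "12:00:00") -> str:
    """Return a PBS job script that reproduces the given DVC target.

    The walltime must cover every stage, since the whole pipeline runs in one job.
    """
    return PIPELINE_JOB.format(target=target, walltime=walltime)


if __name__ == "__main__":
    # Print the script so it can be redirected to a file and passed to qsub.
    print(render_pipeline_job())
```

The printed script can be saved to a file and submitted with qsub. Since dvc repro itself runs inside the job, the output hashing also happens on the compute node and not on the login node.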

Thanks for the help
