dvc exp show / dvc queue status execution time

Hello,

How long should dvc queue status and dvc exp show take to run? dvc queue status has been running for ten minutes now for around 20 experiments. Earlier, dvc exp show took 5 minutes to finish. These seem like they should be quick commands, like git status, which I run constantly.

Any tips for how to speed this up?

Thanks,
Greg

dvc queue logs has the same issue.

Unfortunately this is a known issue: the performance of all three commands is affected by how we collect and store information about the queued experiments. It is being actively worked on and will hopefully be resolved soon, but there aren’t any workarounds at the moment.

You can follow these issues for updates:


Is there an older version of DVC which doesn’t have this issue? On a different project I have been using 1.16 or 1.11 (I don’t remember which), and it seems quick. Is there a 2.x version of DVC without the issue?

You could use 2.13.0 or earlier, but those releases predate the dvc queue commands entirely. In 2.13.0 and earlier there was a naive implementation of dvc exp run --queue/--run-all, which let you run multiple experiments but was not really a functional task queue.

(There is no dvc queue command in those releases, so dvc queue status and dvc queue logs are unavailable.)
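For reference, the old flow looked roughly like this (a sketch only; the parameter name train.lr is illustrative, and the snippet is guarded so it degrades gracefully where dvc is not installed):

```shell
# Sketch of the pre-2.13.0 flow; train.lr is an illustrative parameter name,
# not taken from this thread.
if command -v dvc >/dev/null 2>&1; then
    dvc exp run --queue -S train.lr=0.001   # stage one experiment
    dvc exp run --queue -S train.lr=0.01    # stage a second one
    dvc exp run --run-all                   # execute all staged runs sequentially
else
    echo "dvc is not installed"
fi
```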

Okay, thanks for the advice. That makes sense, because I started seeing the issue after upgrading from 2.9.2 to >= 2.30.

Hello,

I just ran a few experiments and I had to cancel some of them. Now when I run any command related to experiments, it just sits there and does nothing. This includes dvc exp show, dvc exp list, and dvc queue status. I am on v3.33.4.

I was about to post the output of dvc doctor but now it seems stuck too.

(almds) <login01>~/code/almds_prototype[smaller-ecc]$ dvc doctor -vv
2023-12-26 12:31:25,454 DEBUG: v3.33.4 (pip), CPython 3.10.13 on Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
2023-12-26 12:31:25,454 DEBUG: command: /home/starrgw1/.conda/envs/almds/bin/dvc doctor -vv
2023-12-26 12:31:25,454 TRACE: Namespace(quiet=0, verbose=2, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='doctor', func=<class 'dvc.commands.version.CmdVersion'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-12-26 12:31:26,011 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-12-26 12:31:26,012 DEBUG: Removing '/home/starrgw1/code/.ksb2t94cSqwG95Rtj4TEeh.tmp'
2023-12-26 12:31:26,012 DEBUG: link type hardlink is not available ([Errno 95] no more link types left to try out)
2023-12-26 12:31:26,012 DEBUG: Removing '/home/starrgw1/code/.ksb2t94cSqwG95Rtj4TEeh.tmp'
2023-12-26 12:31:26,012 DEBUG: Removing '/home/starrgw1/code/.ksb2t94cSqwG95Rtj4TEeh.tmp'
2023-12-26 12:31:26,024 DEBUG: Removing '/scratch/tmp/starrgw1/almds/dvc_cache/files/md5/.6kKKYQ3VnP43w9wAHHKzdg.tmp'

and it has been stuck here for 5 minutes or so.

It’s hard to tell what the issue might be here. What command did you use to cancel your experiment runs?

There may still be a hanging DVC (or DVC-related Python) process left over from the experiments you cancelled. You will probably need to check for this (e.g. with ps), and if any remain, force-quit them with kill or kill -9. Assuming you were using queued experiments, you can also try dvc queue stop --kill to forcefully stop any leftover queue processes.
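Concretely, the check could look something like this (a sketch; the PID shown is a placeholder, not a value from this thread):

```shell
# List leftover DVC processes; the [d]vc pattern keeps grep from matching itself.
ps aux | grep '[d]vc' || echo "no leftover dvc processes"
# Force-quit a hung process by PID (12345 is a placeholder):
#   kill 12345        # polite first attempt
#   kill -9 12345     # last resort
# For queued experiments, also stop the queue workers forcefully:
#   dvc queue stop --kill
```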

I think this may have been an issue with our cluster or NAS. I will update if the issue returns.