How long should dvc queue status and dvc exp show take to run? dvc queue status has been running for about ten minutes now with 20 or so experiments. Earlier, dvc exp show took 5 minutes. These seem like they should be quick commands, like git status, which I spam constantly.
Unfortunately this is a known issue. The performance of all 3 commands is affected by how we collect and store information about the queued experiments. This is being actively worked on and will hopefully be resolved soon, but there aren’t any workarounds at the moment.
Is there an older version of DVC which doesn’t have this issue? On a different project I have been using 1.16 or 1.11 (I don’t remember) which seems to be quick. Is there a 2.x version of DVC which doesn’t have the issue?
You could use 2.13.0 or earlier, but those releases predate the dvc queue commands. In 2.13.0 and earlier there was a naive implementation of dvc exp run --queue/--run-all which would let you run multiple experiments, but it was not really a functional task queue.
(There is no dvc queue command in those releases, so dvc queue status and dvc queue logs are unavailable.)
I just ran a few experiments and I had to cancel some of them. Now when I run any command relating to experiments, it just sits there and does nothing. This includes dvc exp show, dvc exp list, dvc queue status. I am on v3.33.4.
I was about to post the output of dvc doctor but now it seems stuck too.
(almds) <login01>~/code/almds_prototype[smaller-ecc]$ dvc doctor -vv
2023-12-26 12:31:25,454 DEBUG: v3.33.4 (pip), CPython 3.10.13 on Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
2023-12-26 12:31:25,454 DEBUG: command: /home/starrgw1/.conda/envs/almds/bin/dvc doctor -vv
2023-12-26 12:31:25,454 TRACE: Namespace(quiet=0, verbose=2, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='doctor', func=<class 'dvc.commands.version.CmdVersion'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-12-26 12:31:26,011 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-12-26 12:31:26,012 DEBUG: Removing '/home/starrgw1/code/.ksb2t94cSqwG95Rtj4TEeh.tmp'
2023-12-26 12:31:26,012 DEBUG: link type hardlink is not available ([Errno 95] no more link types left to try out)
2023-12-26 12:31:26,012 DEBUG: Removing '/home/starrgw1/code/.ksb2t94cSqwG95Rtj4TEeh.tmp'
2023-12-26 12:31:26,012 DEBUG: Removing '/home/starrgw1/code/.ksb2t94cSqwG95Rtj4TEeh.tmp'
2023-12-26 12:31:26,024 DEBUG: Removing '/scratch/tmp/starrgw1/almds/dvc_cache/files/md5/.6kKKYQ3VnP43w9wAHHKzdg.tmp'
It’s hard to tell what the issue might be here. What was the command you used to cancel your experiment runs?
There may still be a hanging DVC or DVC-related Python process left over from the experiments you cancelled. You will probably need to check for this (e.g. with ps), and if you find any you may need to force quit them with kill or kill -9. Assuming you were using queued experiments, you can also retry dvc queue stop --kill to forcefully stop any leftover queue processes.
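As a rough sketch of that check, something like the following could work in a POSIX shell. The helper names filter_dvc and list_dvc_procs are made up for illustration, and matching on the substring "dvc" is an assumption: it may also catch unrelated processes (an editor with a dvc.yaml open, for example), so read the list before killing anything.

```shell
#!/bin/sh
# Hypothetical helpers, not part of DVC.

# Keep only lines that mention "dvc". Writing the pattern as [d]vc stops
# grep from matching its own entry in the process list.
filter_dvc() {
    grep '[d]vc'
}

# Print the PID and full command line of every process that mentions dvc.
list_dvc_procs() {
    ps -A -o pid= -o args= | filter_dvc
}

list_dvc_procs || echo "no leftover dvc processes found"

# If you were using queued experiments, let DVC stop its own workers first:
#   dvc queue stop --kill
# For any PID that is still listed afterwards, try a normal kill first and
# only force it if the process does not exit:
#   kill <PID>
#   kill -9 <PID>
```

The kill lines are left as comments on purpose, so that running the snippet only lists processes and never terminates anything by itself.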