initial issue (dead experiments, missing names)
├── 2f2a178 Feb 18, 2023 Running - 0.70166 0.83538 0.42834 46791 0.15652 185381
├── 9dc97e9 Feb 18, 2023 Running - 0.70166 0.83538 0.42834 46791 0.15652 185381
├── 65a6585 Feb 18, 2023 Running dvc-task 0.70166 0.83538 0.42834 46791 0.15652 185381
├── e5c1060 Feb 18, 2023 Running - 0.70166 0.83538 0.42834 46791 0.15652 185381
├── 9aa34d4 [exp_50] Feb 18, 2023 Queued - - - - - - -
└── f4c9181 [exp_51] Feb 18, 2023 Queued -
Above is what dvc exp show looks like. When I call dvc queue kill 2f2a178, for example, I get:
ERROR: '2f2a178' is not a valid queued experiment name
I ran the experiments from VS Code and set the number of jobs to 4. At this point it looks like DVC thinks only 1 is running? However, none of them are actually running. dvc queue status is too slow to even run on this computer; it gets cancelled by the OS or something (see "Dvc exp show, dvc queue status execution time - #6 by gregstarr"). Where did the experiment names go? When I check for running processes, there is only a single process related to dvc:
starrgw1 256549 0.0 0.0 568940 21844 ? SNl 11:37 0:00 [...]/.vscode-server/bin/[...]/node [...]/.vscode-server/extensions/iterative.dvc-0.6.10/dist/node_modules/dvc-vscode-lsp/dist/server.js --node-ipc --clientProcessId=256403
Usually when multiple experiments are running there are a bunch of dvc-related processes.
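For anyone who wants to repeat the process check, here is a rough sketch of how it could be scripted. It assumes a Linux machine where ps accepts -e and -o, and it simply treats any command line containing "dvc" as dvc-related, which also matches the VS Code extension's language server shown above. The function names are my own, not part of DVC.

```python
import subprocess


def parse_ps_line(line):
    """Split one 'PID ARGS' line into (pid, args); return None if malformed."""
    parts = line.strip().split(None, 1)
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    return int(parts[0]), parts[1]


def filter_dvc_processes(ps_lines, needle="dvc"):
    """Keep only the processes whose command line mentions `needle`."""
    matches = []
    for line in ps_lines:
        parsed = parse_ps_line(line)
        if parsed is not None and needle in parsed[1].lower():
            matches.append(parsed)
    return matches


def list_dvc_processes():
    """Ask ps for every process (no header row) and filter the result."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return filter_dvc_processes(result.stdout.splitlines())


if __name__ == "__main__":
    try:
        for pid, args in list_dvc_processes():
            print(pid, args)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("could not run ps:", exc)
```

If the only hit is the extension's server.js, as in my case, then no experiment workers are alive regardless of what dvc exp show reports.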
Here is my dvc doctor:
DVC version: 2.41.1 (pip)
---------------------------------
Platform: Python 3.9.11 on Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Subprojects:
	dvc_data = 0.29.0
	dvc_objects = 0.14.1
	dvc_render = 0.0.17
	dvc_task = 0.1.9
	dvclive = 1.3.2
	scmrepo = 0.1.5
Supports:
	http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: symlink
Cache directory: lustre on 192.168.199.212@o2ib:192.168.199.213@o2ib:/scratch
Caches: local
Remotes: local
Workspace directory: nfs on master:/home
Repo: dvc, git
other issue (dvc slowness)
I tried to downgrade to dvc 2.9.2 to avoid the slowness and got this error when running dvc exp show:
Traceback (most recent call last):
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/main.py", line 54, in main
    cmd = args.func(args)
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/command/base.py", line 35, in __init__
    from dvc.repo import Repo
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/repo/__init__.py", line 13, in <module>
    from dvc.ignore import DvcIgnoreFilter
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/ignore.py", line 10, in <module>
    from dvc.fs.base import FileSystem
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/fs/__init__.py", line 5, in <module>
    from .azure import AzureFileSystem
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/fs/azure.py", line 5, in <module>
    from fsspec.asyn import fsspec_loop
ImportError: cannot import name 'fsspec_loop' from 'fsspec.asyn' (/home/starrgw1/.local/lib/python3.8/site-packages/fsspec/asyn.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/starrgw1/.local/bin/dvc", line 8, in <module>
    sys.exit(main())
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/main.py", line 84, in main
    from dvc.info import get_dvc_info
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/info.py", line 10, in <module>
    from dvc.fs import FS_MAP, get_fs_cls, get_fs_config
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/fs/__init__.py", line 5, in <module>
    from .azure import AzureFileSystem
  File "/home/starrgw1/.local/lib/python3.8/site-packages/dvc/fs/azure.py", line 5, in <module>
    from fsspec.asyn import fsspec_loop
ImportError: cannot import name 'fsspec_loop' from 'fsspec.asyn' (/home/starrgw1/.local/lib/python3.8/site-packages/fsspec/asyn.py)
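The traceback paths are under python3.8, while the dvc doctor output earlier reports Python 3.9.11, so the downgrade may have landed in a different environment than the one I normally use. Below is a small sketch for checking which interpreter is active, which dvc and fsspec versions it sees, and whether fsspec.asyn exposes the fsspec_loop name that dvc 2.9.2 tries to import. It only uses the standard library; the helper names are mine, and it does not say which fsspec version would be compatible.

```python
import importlib
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def has_attribute(module_name, attr):
    """Return True if the module imports cleanly and defines `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("python:", sys.version.split()[0])
    for dist in ("dvc", "fsspec"):
        print(dist, "version:", installed_version(dist))
    # This is the import that fails in the traceback above.
    print(
        "fsspec.asyn has fsspec_loop:",
        has_attribute("fsspec.asyn", "fsspec_loop"),
    )
```

Running this with the same interpreter that the dvc entry point uses (the first line of the dvc script names it) should show whether the installed fsspec and dvc versions are the mismatched pair.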