I’m using vscode with DVC, and I’m confused by what’s being displayed in the SOURCE CONTROL sidebar.
There’s the standard git section, which I understand, and below that there’s a section for DVC.
It’s showing all my model artifacts in a section labelled Committed, with D (deleted) beside them, after I’ve run an experiment and done a dvc push. The same files also show as deleted in the explorer sidebar (even though they’re there as symlinks to the cache).
This is the vscode file explorer view
Do those files exist locally? Could you please post the dvc doctor output and the output of dvc data status --granular --untracked-files --unchanged -vv? Thanks.
Yes, they exist locally, i.e. they can be opened in vscode and are valid symlinks to files in the dvc cache.
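For anyone who wants to confirm the same thing from a script rather than by opening each file, here is a minimal sketch. It walks a folder and reports any symlink whose target is missing. The folder name models comes from this thread; the function name find_broken_symlinks is made up for illustration.

```python
import os


def find_broken_symlinks(folder):
    """Return paths of symlinks under `folder` whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(folder):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists follows symlinks, so it is False for a dangling link.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


if __name__ == "__main__":
    # "models" is the output folder mentioned in this thread.
    for path in find_broken_symlinks("models"):
        print("broken symlink:", path)
```

An empty result means every symlink in the folder resolves to a real file, which matches what is described above.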
dvc status is showing them as deleted. This happened around the time I updated dvc (via apt) to the latest version and added a stage to my dvc.yaml. Originally I only had one stage, producing models/model.pt, models/text_transform.pt and models/index_to_name.json. I then added a preceding stage to create models/tokenizer.json and models/char_tokenizer.json in the same folder.
> dvc data status --granular --untracked-files --unchanged -vv
DVC committed changes:
(git commit the corresponding dvc files to update the repo)
DVC unchanged files:
> dvc doctor
DVC version: 2.44.0 (deb)
Platform: Python 3.10.8 on Linux-5.15.0-60-generic-x86_64-with-glibc2.35
azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
gdrive (pydrive2 = 1.15.0),
gs (gcsfs = 2023.1.0),
hdfs (fsspec = 2023.1.0, pyarrow = 11.0.0),
http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
oss (ossfs = 2021.8.0),
s3 (s3fs = 2023.1.0, boto3 = 1.24.59),
ssh (sshfs = 2023.1.0),
webdav (webdav4 = 0.9.8),
webdavs (webdav4 = 0.9.8),
webhdfs (fsspec = 2023.1.0)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Also, on my second machine I did a git sync and then a dvc pull, and it’s now showing the same thing, i.e.
$ dvc pull
Of note: it pulled the latest model files (they’re now on the local machine; in particular, the tokenizer files were never on this machine as I only just added that stage) but then marked them as deleted. Note: the deletion of plots is correct, as I did remove that folder on the other machine.
I’m not sure how these files are tracked. They’re model artifacts, and there is no models.dvc file like there is for the input dataset etc.
Where are they tracked? Are they tracked in .dvc files or in the dvc.lock file? This may happen if dvc.lock files are deleted. Is that the case?
What does git status show?
These files are not tracked by a .dvc file, they’re in the
One thing I noticed is that when I tried to commit these files it says "ERROR: failed to commit - unable to commit changed stage: ‘train’."
When I use the force option, I noticed that it modified one of the deps for the train stage, specifically the folder that contains a python package.
- path: src/multiclass_classifier
- path: src/multiclass_classifier
I’m not sure which file(s) have changed. It doesn’t seem to be a source file, or git would show a change.
Is there a better way of setting a stage dependency on a package (installed using pip install -e .)?
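One possible explanation, and it is only an assumption since nothing in this thread confirms it: a directory dependency is hashed over every file it contains, including files that git ignores, such as the __pycache__ byte-code that Python writes when the package is imported. The sketch below is not DVC's hashing code. It is a simplified stand-in (the function name dir_digest and the skip_dirs parameter are hypothetical) that shows how a directory-level digest can change when an ignored file appears, even though no source file changed.

```python
import hashlib
import os


def dir_digest(folder, skip_dirs=()):
    """Hash relative paths and contents of all files under `folder`.

    Simplified illustration only; this is not how DVC computes its hashes.
    """
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(folder):
        # Prune skipped directories in place and fix the traversal order.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, folder).encode())
            with open(path, "rb") as fh:
                digest.update(fh.read())
    return digest.hexdigest()


if __name__ == "__main__":
    pkg = "src/multiclass_classifier"  # dependency path from this thread
    print("all files:          ", dir_digest(pkg))
    print("without __pycache__:", dir_digest(pkg, skip_dirs=("__pycache__",)))
```

If the first digest changes between runs while the second stays the same, ignored byte-code is a likely culprit. In that case, listing __pycache__ in a .dvcignore file, or depending on specific source files instead of the whole folder, may be worth trying.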
Also, I decided to recreate my dvc.yaml file using dvc stage add.
One thing I noticed when I tried to recreate the first stage:
$ dvc stage add -n tokenizer -d data/datasets/raw/dataset.csv -p tokenizer.dataset -p tokenizer.vocab_size -o models/tokenizer.json -o models/char_tokenizer.json src/scripts/tokenizer.py
ERROR: Output(s) outside of DVC project: models/tokenizer.json, models/char_tokenizer.json. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.
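The error says DVC considers those paths to lie outside the project, which is surprising for relative paths like models/tokenizer.json. One thing that can be checked, as a guess rather than a diagnosis, is whether models is a symlink that resolves to somewhere else on disk. The sketch below tests that with pathlib. Whether this matches the check DVC performs internally is an assumption, and the function name is_inside_project is hypothetical.

```python
from pathlib import Path


def is_inside_project(path, project_root):
    """True if `path`, after resolving symlinks, lies under `project_root`."""
    resolved = Path(path).resolve()
    root = Path(project_root).resolve()
    return resolved == root or root in resolved.parents


if __name__ == "__main__":
    # Output paths from the failing command; "." is assumed to be the project root.
    for out in ("models/tokenizer.json", "models/char_tokenizer.json"):
        print(out, is_inside_project(out, "."))
```

Run from the project root, a False result would suggest the folder points outside the repository.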
I ended up creating a new folder called model instead. I then got a warning that metrics.json couldn’t be added to the train stage because it was tracked by git. I fixed this.
I’m still seeing messages that files are deleted, but it’s not the same files now. It’s the outputs from stage 1 (which are dependencies of stage 2). All of these files are being saved into the same folder (models).
It’s also now reporting that metrics.json is deleted.
I decided to start again: I removed the .dvc folder, created a new s3 bucket for the remote, re-initialised the repo and ran dvc add again.
I did notice that at least one .dvc file changed when I ran dvc add, implying that they were somehow out of sync. In particular, the number of tracked files differed between the old and new .dvc file.
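For anyone wanting to make the same comparison: a .dvc file that tracks a directory records an nfiles count, so the old and new versions can be compared directly. This is a minimal sketch under that assumption. The function name tracked_file_count is hypothetical, and the file contents below are invented placeholders, not the real files from this repo.

```python
import re


def tracked_file_count(dvc_file_text):
    """Return the first `nfiles` value in the text of a .dvc file, or None."""
    match = re.search(r"^\s*nfiles:\s*(\d+)\s*$", dvc_file_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


# Hypothetical before/after contents; hashes, sizes and counts are made up.
OLD = """outs:
- md5: 0123456789abcdef0123456789abcdef.dir
  size: 1000
  nfiles: 10
  path: dataset
"""
NEW = OLD.replace("nfiles: 10", "nfiles: 12")

if __name__ == "__main__":
    print(tracked_file_count(OLD), "->", tracked_file_count(NEW))
```

A .dvc file that tracks a single file has no nfiles entry, so the function returns None for it.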
Seems to be working now.