DVC Status in vscode

I’m using vscode with DVC. I’m confused by what’s being displayed in the SOURCE CONTROL sidebar.

There’s the standard git section, which I understand, and below that there’s a section for DVC.

It’s showing all my model artifacts under a section labelled Committed, with D (deleted) beside them, after I’ve run an experiment and done a dvc push. The same files also show as deleted in the explorer sidebar, even though they’re there as symlinks to the cache.

image

This is the vscode file explorer view

image

Do those files exist locally? Could you please post dvc doctor output and dvc data status --granular --untracked-files --unchanged -vv? Thanks.

Yes, they exist locally, i.e. they can be opened in vscode and are valid symlinks to files in the dvc cache.
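For anyone who wants to confirm the "valid symlink" part without opening each file, here is a minimal Python sketch. The function name `check_symlinks` is made up for illustration and is not part of DVC; the paths in the usage section are the ones from the status output in this thread.

```python
from pathlib import Path


def check_symlinks(paths):
    """Map each path to 'ok', 'broken', 'not a symlink', or 'missing'."""
    report = {}
    for p in map(Path, paths):
        if p.is_symlink():
            # exists() follows the link, so it is False for a dangling link
            report[str(p)] = "ok" if p.exists() else "broken"
        elif p.exists():
            report[str(p)] = "not a symlink"
        else:
            report[str(p)] = "missing"
    return report


if __name__ == "__main__":
    # Output paths as listed in this thread; run from the repository root
    outputs = [
        "models/model.pt",
        "models/tokenizer.json",
        "models/char_tokenizer.json",
        "models/text_transform.pt",
        "models/index_to_name.json",
    ]
    for path, state in check_symlinks(outputs).items():
        print(f"{state:14} {path}")
```

If every entry comes back as ok, the files are present in the workspace and the "deleted" status is coming from how the outputs are tracked rather than from missing data.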

dvc status is showing them as deleted. This happened around the time I updated dvc (via apt) to the latest version and added a stage to my dvc.yaml. Originally I had only one stage, producing models/model.pt, models/text_transform.pt and models/index_to_name.json. I then added a preceding stage to create models/tokenizer.json and models/char_tokenizer.json in the same folder.

> dvc data status --granular --untracked-files --unchanged -vv

...
DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        deleted: models/model.pt
        deleted: models/tokenizer.json
        deleted: models/char_tokenizer.json
        deleted: models/text_transform.pt
        deleted: models/index_to_name.json

DVC unchanged files:
   ...
> dvc doctor

DVC version: 2.44.0 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-5.15.0-60-generic-x86_64-with-glibc2.35
Subprojects:

Supports:
        azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
        gdrive (pydrive2 = 1.15.0),
        gs (gcsfs = 2023.1.0),
        hdfs (fsspec = 2023.1.0, pyarrow = 11.0.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = 2023.1.0, boto3 = 1.24.59),
        ssh (sshfs = 2023.1.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2023.1.0)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git

Also, on my second machine I did a git sync and then dvc pull, and it’s now showing the same thing, i.e.

$ dvc pull

M data/datasets/examples/
A models/tokenizer.json
A models/char_tokenizer.json
A models/model.pt
A models/text_transform.pt
A models/index_to_name.json
A models/checkpoints/
D plots/
D models/

Of note: it pulled the latest model files (they’re now on the local machine; in particular, the tokenizer files were never on this machine, as I only just added that stage) but then marked them as deleted. Note: the deletion of plots is correct, as I did remove that folder on the other machine.

I’m not sure how these files are tracked. They’re model artifacts; there is no models.dvc file like there is for the input dataset etc.

Where are they tracked? Are they tracked in .dvc files or in the dvc.yaml/dvc.lock files? This may happen if the .dvc or dvc.lock files are deleted. Is that the case?

What does git status show?

These files are not tracked by a .dvc file; they’re in the dvc.yaml/dvc.lock files.

One thing I noticed is that when I tried to commit these files, it said "ERROR: failed to commit - unable to commit changed stage: ‘train’".

When I used the force option, I noticed that it modified one of the deps for the train stage, specifically the folder that contains a python package.

- path: src/multiclass_classifier
  md5: 6e7721a0f99f31a34f8ff87509039e37.dir
  size: 53640
  nfiles: 39

- path: src/multiclass_classifier
  md5: d81406ab21bfec100d1669268bc31122.dir
  size: 52541
  nfiles: 38

I’m not sure what file(s) have changed. It doesn’t seem to be a source file, or git would show a change.

Is there a better way of setting a stage dependency to a package (installed using pip install -e .)?

Also, I decided to recreate my dvc.yaml file using dvc stage add.

One thing I noticed was when I tried to recreate the first stage:
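One way to find out which file changed is to snapshot the directory before and after and diff the two snapshots. The lock entries record a file count (nfiles) and total size, so the hash covers whatever files are in the folder, which can include files git is configured to ignore. Compiled bytecode caches are a common example, though nothing in this thread confirms that is the cause here. The sketch below is plain Python and does not reproduce DVC's own hashing; `build_manifest` and `diff_manifests` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Return {relative_path: md5_hex} for every regular file under root."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(old, new):
    """Return (added, removed, modified) relative paths between two manifests."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, modified


# Usage idea: call build_manifest("src/multiclass_classifier") before running
# the stage, call it again when the dependency is reported as changed, and
# pass both results to diff_manifests to see which paths differ.
```

Any path that shows up as added or removed but is not listed by git is a candidate for the file-count change between the two lock entries.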

$ dvc stage add -n tokenizer -d data/datasets/raw/dataset.csv -p tokenizer.dataset -p tokenizer.vocab_size -o models/tokenizer.json -o models/char_tokenizer.json src/scripts/tokenizer.py

ERROR: Output(s) outside of DVC project: models/tokenizer.json, models/char_tokenizer.json. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.
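The thread does not establish why DVC treated these paths as external. One possibility worth ruling out, and this is an assumption rather than a confirmed cause, is that the models folder itself resolves to somewhere outside the repository, for example if it is a symlink (the cache type here includes symlink). A minimal Python check, with `resolves_inside` as a hypothetical helper name:

```python
from pathlib import Path


def resolves_inside(project_root, candidate):
    """True if candidate, after following symlinks, lies under project_root."""
    root = Path(project_root).resolve()
    target = Path(candidate).resolve()
    try:
        # relative_to raises ValueError when target is not under root
        target.relative_to(root)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # Run from the repository root; paths are the ones from the error message
    for out in ["models/tokenizer.json", "models/char_tokenizer.json"]:
        print(out, "inside project:", resolves_inside(".", out))
```

If this prints False for a path, some component of it points outside the project directory, which would be consistent with the error; if it prints True, the cause lies elsewhere.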

I ended up creating a new folder called model instead. I then got warned that metrics.json couldn’t be added to the train stage as it was tracked by git. I fixed this.

I’m still seeing messages that files are deleted - it’s not the same files now. It’s the outputs from stage 1 (which are dependencies of stage 2). All of these files are being saved into the same folder (models).

It’s also now reporting that metrics.json is deleted.

I decided to start again: I removed the .dvc folder, created a new s3 bucket for the remote, re-initialised the repo and ran dvc add again.

I did notice that at least one .dvc file changed when I ran dvc add, implying that they were somehow out of sync. In particular, comparing the old and new .dvc files showed that the number of tracked files had changed.

Seems to be working now.