Since DVC uses a checksum to determine whether files have changed, is it possible to have DVC leave files’ modified times unchanged when they are added with dvc add?
Is there a reason I’m unaware of that the modified times are updated?
Thanks
On dvc add, DVC does the equivalent of moving the file into the DVC cache directory and then creates a link (based on the DVC cache link type) from the original file location to the DVC cache location. The exact behavior depends on the filesystem and the cache link type you are using (i.e., copy/reflink/symlink/hardlink), but in most cases this will result in a changed file modification time.
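To make the link-type dependence concrete, here is a minimal Python sketch that imitates the move-then-link sequence in a throwaway directory. It is not DVC's code: add_like, demo, and the cache directory layout are hypothetical names chosen for illustration, and only the copy and hardlink cases are simulated, using standard-library calls.

```python
import os
import shutil
import tempfile


def add_like(workdir, name, link_type):
    """Hypothetical stand-in for 'move to cache, then link back'.

    Returns the mtime of the file at its original location afterwards.
    """
    src = os.path.join(workdir, name)
    cache_dir = os.path.join(workdir, "cache")  # assumed layout, not DVC's
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, name)

    # A rename on the same filesystem keeps the inode and its mtime.
    os.replace(src, cached)

    if link_type == "hardlink":
        # Second name for the same inode: the mtime is shared.
        os.link(cached, src)
    elif link_type == "copy":
        # Writes a brand-new file (contents only), so its mtime is "now".
        shutil.copyfile(cached, src)
    else:
        raise ValueError(f"unsupported link type in this sketch: {link_type}")

    return os.stat(src).st_mtime


def demo(link_type):
    """Return True if the simulated add preserved the original mtime."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "model.pkl")
        with open(path, "wb") as fh:
            fh.write(b"weights")
        old_mtime = 1_600_000_000  # fixed timestamp well in the past
        os.utime(path, (old_mtime, old_mtime))
        return add_like(workdir, "model.pkl", link_type) == old_mtime
```

Under these assumptions, demo("hardlink") reports the mtime as preserved, while demo("copy") reports it as changed, because the copy is a newly written file. Whether DVC itself behaves this way for a given link type depends on the DVC version and the filesystem.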
@pmrowla does it preserve it for certain link types, though? I think I’ve seen an issue on GH / Discord saying that it was not preserved for copy, for example (what would be the reason to modify it in that case?). Overall, even for copy/hardlink/reflink, it’s not clear why it has to be modified, tbh (unless hardlinks / reflinks modify it when you create a new link to an existing file).
Does it depend on the order of the file operations that we do on add and other operations? Can we change that order?
For certain link types it will be preserved. In particular, it should not change the modification time on most filesystems that support reflinks.
On my machine (macOS):
$ ls -l --full-time
total 1820
-rw-r--r-- 1 pmrowla staff 84 2023-08-07 18:37:12.016928390 +0900 dvc.yaml
-rw-r--r-- 1 pmrowla staff 1854891 2023-08-07 18:36:22.735606668 +0900 model.pkl
-rw------- 1 pmrowla staff 332 2023-08-07 18:38:12.811219246 +0900 tags
$ dvc add model.pkl
...
$ ls -l --full-time
total 1824
-rw-r--r-- 1 pmrowla staff 84 2023-08-07 18:37:12.016928390 +0900 dvc.yaml
-rw-r--r-- 1 pmrowla staff 1854891 2023-08-07 18:36:22.735606668 +0900 model.pkl
-rw-r--r-- 1 pmrowla staff 92 2023-08-11 12:35:09.304201246 +0900 model.pkl.dvc
-rw------- 1 pmrowla staff 332 2023-08-07 18:38:12.811219246 +0900 tags
(model.pkl mtime has not changed)
I’m using Debian. These files are stored on a zfs filesystem, and I’ve seen the same behavior when I was using btrfs.
Hm. Just realized I’m adding the folder (for convenience, since there are often hundreds of files changed/added/deleted), rather than individual files. Could that be the difference?
(I also noticed when I dvc add data, it says “Checking out …/data”, while it enumerates all the files. The “Checking out” always seems odd, as the files are already there, and I’m adding.)
Here’s a before and after:
Sleep on cnn-rnn [!]
❯ ll data/stored_models/{model*,chkpt*,best*} -d
drwxr-xr-x 4 john john 7 Aug 11 09:14 data/stored_models/best_mc_model
-rw-r--r-- 1 john john 865 Aug 11 09:15 data/stored_models/best_mc_scaler.joblib
drwxr-xr-x 4 john john 7 Aug 11 08:52 data/stored_models/best_n2_model
-rw-r--r-- 1 john john 865 Aug 11 08:45 data/stored_models/best_n2_scaler.joblib
drwxr-xr-x 4 john john 7 Aug 11 08:53 data/stored_models/chkpt_n2_1
drwxr-xr-x 4 john john 7 Aug 11 08:57 data/stored_models/chkpt_n2_2
drwxr-xr-x 4 john john 7 Aug 11 08:55 data/stored_models/chkpt_n2_3
...
❯ dvc add data
100% Adding...|█████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [15:02, 902.64s/file]
Sleep on cnn-rnn [!?⇡] took 15m3s
❯ ll data/stored_models/{model*,chkpt*,best*} -d
drwxr-xr-x 4 john john 7 Aug 11 10:01 data/stored_models/best_mc_model
-rw-r--r-- 1 john john 865 Aug 11 09:49 data/stored_models/best_mc_scaler.joblib
drwxr-xr-x 4 john john 7 Aug 11 10:00 data/stored_models/best_n2_model
-rw-r--r-- 1 john john 865 Aug 11 09:47 data/stored_models/best_n2_scaler.joblib
drwxr-xr-x 4 john john 7 Aug 11 09:56 data/stored_models/chkpt_n2_1
drwxr-xr-x 4 john john 7 Aug 11 09:59 data/stored_models/chkpt_n2_2
drwxr-xr-x 4 john john 7 Aug 11 09:56 data/stored_models/chkpt_n2_3
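A before/after comparison like the one above can be made exact with a small script instead of reading minute-resolution listings. The following is a minimal sketch using only the Python standard library; snapshot_mtimes and changed_mtimes are hypothetical helper names, not part of DVC.

```python
import os


def snapshot_mtimes(root):
    """Map each path under root (relative to root) to its mtime in nanoseconds."""
    mtimes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat: record the link's own mtime rather than its target's.
            mtimes[os.path.relpath(path, root)] = os.lstat(path).st_mtime_ns
    return mtimes


def changed_mtimes(before, after):
    """Return the sorted paths present in both snapshots whose mtime differs."""
    return sorted(p for p in before if p in after and before[p] != after[p])
```

One way to use it: call snapshot_mtimes("data") before running dvc add data, call it again afterwards, and pass both results to changed_mtimes to list exactly which files and directories were touched.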
ZFS on Linux does not support reflinks, so I would not expect unmodified mtimes; see “COW cp (--reflink) support”, Issue #405 in openzfs/zfs on GitHub. Support was finally merged last month, but it has not yet been released, and I would assume it will be a while until it is available in the standard Debian package repos.
If you have the option of using either btrfs or ZFS on Linux with DVC, it would be better to go with btrfs at the moment.
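If it is unclear whether a given directory sits on a filesystem with reflink support, a quick probe can help. The sketch below is an assumption-laden illustration, not something DVC provides: supports_reflink is a hypothetical helper, it only probes Linux (via the FICLONE ioctl, whose request number is hard-coded here), and it returns None on other platforms rather than guessing.

```python
import os
import sys
import tempfile

# Linux FICLONE ioctl request number, _IOW(0x94, 9, int).
FICLONE = 0x40049409


def supports_reflink(directory):
    """Try to clone a tiny file inside `directory`.

    Returns True or False on Linux, None where this probe does not apply.
    """
    if not sys.platform.startswith("linux"):
        return None
    import fcntl  # POSIX-only module, imported lazily

    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        src = os.path.join(tmp, "src")
        dst = os.path.join(tmp, "dst")
        with open(src, "wb") as fh:
            fh.write(b"probe")
        try:
            with open(src, "rb") as s, open(dst, "wb") as d:
                # Ask the kernel to make dst share src's data blocks.
                fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError:
            # The filesystem (or kernel) refused the clone request.
            return False
```

Calling supports_reflink on the directory that holds the data (and on the one that holds the DVC cache) gives a rough indication of whether a reflink link type could work there.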
With regard to directories, in older releases DVC would remove the entire directory and then re-link/re-checkout all of the files in that dir (causing the mtime change). In the latest release DVC should no longer be doing this, so unmodified files will not get re-linked.
(I also noticed when I dvc add data, it says “Checking out …/data”, while it enumerates all the files. The “Checking out” always seems odd, as the files are already there, and I’m adding.)
As noted earlier, dvc add moves added files to the cache and creates a link from the original file location to the cache; the checkout is the “link creation” step. So internally in DVC, add is considered a move to cache followed by a checkout. (And the “move” actually depends on the cache link type: it may be a hardlink or reflink operation and not necessarily a filesystem move or copy.)