Hi,
First of all, I love “dvc”! It is so useful for my project!
Thank you very much for providing it to me.
I have some questions about temporary files created by the “dvc add --out” command. I would appreciate your help.
I ran the following command to add /outside/project/directory/data:
$ dvc add --out mydata /outside/project/directory/data
This command seems to upload the files in /outside/project/directory/data to the dvc cache directory and leave temporary files like these behind (here, /my/dvc_cache/ is my dvc cache directory):
/my/dvc_cache/files/md5/.XIt3dn31aCwR83UzvYTYiQ.tmp
/my/dvc_cache/files/md5/.mJmybe_ATBDMHb51Gt-51w.tmp
/my/dvc_cache/files/md5/.oRPlNy5uS-vKYXyAiXl6TQ.tmp
/my/dvc_cache/files/md5/.zbcU6YNlS4piRJMv-EmE4w.tmp
Here are my questions:
Is there any way to clean up these files automatically when “dvc add” is completed?
If not, is there any way to clean up these files safely? Can the “dvc gc” command be used for this purpose?
Please let me know.
Thank you very much in advance for your cooperation.
Best Regards,
Kaz DEGUCHI
dvc is supposed to clean that up when the operation is completed. I see that it’s a bug in add --out, here:
    from_path: "AnyFSPath",
    fs: "FileSystem",
    odb: "HashFileDB",
    upload_odb: "ObjectDB",
    callback: Optional[Callback] = None,
) -> tuple[Meta, HashFile]:
    from dvc_objects.fs.utils import tmp_fname

    from .hash import HashStreamFile

    tmp_info = upload_odb.fs.join(upload_odb.path, tmp_fname())
    with fs.open(from_path, mode="rb") as stream:
        hashed_stream = HashStreamFile(stream)
        size = fs.size(from_path)
        cb = callback or TqdmCallback(
            desc=upload_odb.fs.name(from_path),
            bytes=True,
            size=size,
        )
        with cb:
            fileobj = cast("BinaryIO", hashed_stream)
I’ll try to fix it soon. (I’d still encourage you to open an issue in the iterative/dvc repository, though.)
It’s safe to remove those files. Unfortunately, the dvc gc command does not clean those up.
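Until a fix lands, the leftovers can be removed by hand. The following is only a minimal sketch of one way to do that, not something DVC provides: the function names, the one-hour age threshold, and the dry-run default are all assumptions made for illustration. It also assumes the leftover files always look like the ones listed above, that is, hidden `.<name>.tmp` files sitting directly in the `files/md5` directory of the cache. Skipping recently modified files is a precaution against deleting a temporary file that a running `dvc add` is still writing.

```python
import time
from pathlib import Path


def find_stale_tmp_files(cache_dir, min_age_seconds=3600.0):
    """Return leftover '.<name>.tmp' files directly under <cache_dir>/files/md5.

    Hypothetical helper: the layout and naming pattern are inferred from the
    paths reported in this thread, not from DVC documentation.
    """
    md5_dir = Path(cache_dir) / "files" / "md5"
    if not md5_dir.is_dir():
        return []
    # Only files last modified before this moment count as stale.
    cutoff = time.time() - min_age_seconds
    stale = []
    for path in md5_dir.glob(".*.tmp"):
        if path.is_file() and path.stat().st_mtime <= cutoff:
            stale.append(path)
    return sorted(stale)


def remove_stale_tmp_files(cache_dir, min_age_seconds=3600.0, dry_run=True):
    """Delete stale temporary files; with dry_run=True only report them."""
    stale = find_stale_tmp_files(cache_dir, min_age_seconds)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale


if __name__ == "__main__":
    # /my/dvc_cache is the example cache directory used in the question.
    for leftover in remove_stale_tmp_files("/my/dvc_cache", dry_run=True):
        print(leftover)
```

Running it as shown only prints the candidates. Passing `dry_run=False` deletes them, so it is worth checking the printed list first and making sure no `dvc add` is running against the same cache.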
I’ve just created an issue about this:
opened 08:12AM - 19 Jun 25 UTC
# Bug Report
add: doesn't clean up temporary files when --out is given
## Description
When the following command is run, the "add" command uploads the files in /outside/project/directory/data to the dvc cache area.
```
$ dvc add --out mydata /outside/project/directory/data
```
"dvc add --out" should clean up the temporary files it creates during this upload when the operation is completed, but it doesn't.
### Reproduce
1. Create some files in /outside/project/directory/data
2. dvc add --out mydata /outside/project/directory/data
3. Check your dvc cache directory.
### Expected
According to the forum discussion, dvc add --out should clean them up when the operation is completed, but it doesn't.
Please also see the following discussion.
* https://discuss.dvc.org/t/how-to-remove-tmp-files-uploaded-by-dvc-add-out/2543/2
### Environment information
**Output of `dvc doctor`:**
```console
$ dvc doctor
DVC version: 3.60.1 (pip)
-------------------------
Platform: Python 3.10.18 on Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.17
Subprojects:
dvc_data = 3.16.10
dvc_objects = 5.0.0
dvc_render = 1.0.1
dvc_task = 0.3.0
scmrepo = 3.3.11
Supports:
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3)
Config:
Global: /h/ca1/kdeguchi/.config/dvc
System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: nfs on fsbeci09_1:/vol_harpstar_lib01
Caches: local
Remotes: local
Workspace directory: nfs on fsbeci09_1:/vol_harpstar_lib01
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/6cc7d6bb3d29e3e134638eec821654d8
```
**Additional Information (if any):**
None
Thank you very much for your help in advance.
Best Regards,
Kaz DEGUCHI