How to add data from "scratch" filesystem directly to "scratch" cache dir

My scenario is that I’m on a cluster with a very limited home directory quota (40 GB) but terabytes of “scratch” space on a separate filesystem. I’d like to keep my Git repo on the home filesystem (say at ~/myrepo/), but the DVC cache on the scratch filesystem (say at /scratch/dvc).

It’s easy enough to configure the cache on the scratch filesystem with e.g. dvc cache dir /scratch/dvc and the symlink cache type, so that files in the workspace just symlink into it.
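Something like this, I believe (using the example paths above):

$ cd ~/myrepo
$ dvc cache dir /scratch/dvc       # point the DVC cache at the scratch filesystem
$ dvc config cache.type symlink    # workspace files become symlinks into the cache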

But now, suppose I’ve generated a large file at /scratch/newfile.dat. How can I add it to the repo at a chosen path without it ever touching the home filesystem?

I would have thought:

dvc add -o ~/myrepo/newfile.dat /scratch/newfile.dat

would do it but that still seems to go through the home filesystem.

It seems that if I add --to-remote it works as expected, but then it also (obviously) pushes the data to the remote, which I don’t want.
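That is, the same command with the extra flag, which then uploads the data to whatever remote is configured as the default:

$ dvc add -o ~/myrepo/newfile.dat /scratch/newfile.dat --to-remote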

Thanks for any suggestions.

Hi @marius,

Glad you were able to find that info already :+1: That’s the first step indeed.

If you’re using DVC 2.x, then that exact command is what you need. It should transfer the data to the cache without needing much space available on the local system (it uses file chunking as an intermediate step).

See also the dvc add command reference.

Thanks.


I’m not sure about the absolute path you’re providing to -o, though; please use a relative path from your project directory if needed, i.e.

$ cd ~/myrepo
$ dvc add /scratch/newfile.dat -o newfile.dat
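With the symlink cache type you configured, you can also verify afterwards that the workspace file is just a link into the scratch cache (a quick sanity check, using the paths from your example):

$ ls -l newfile.dat    # with cache.type = symlink this should point into /scratch/dvc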

Ahhh I see, that piece of info is key, thanks (and I am on 2.x). The reason I thought the home filesystem was being touched was that the write speed was kind of slow, indicative of having gone scratch → home → scratch, as compared to a straight scratch → scratch copy, which on this system is much faster. So I suppose that’s just because the data does pass through the home filesystem, but it does so in chunks, so it shouldn’t overwhelm my quota. A question I still have is why it isn’t a direct copy, but in any case this is very helpful and solves my major worry about the quota. Thanks!


Yes that would be ideal. I’ll check with engineering on this… Thanks for the feedback!

Another quick clarification, BTW: the transfer doesn’t actually use any local disk space at all, as the chunks are stored locally in memory only.

Finally, I did check, and for now DVC can’t support direct transfers from external data locations to remote storage (without chunking through local memory), because we need to calculate the MD5 hash of the data (which can only be done locally) in order to dvc add it properly. Maybe for DVC 3.0 though, we’ll see! And we still welcome feature requests on our GitHub repo.
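Just to illustrate the principle (this is not DVC’s actual internals): hashing forces the data to stream through local memory, even though nothing needs to be written to local disk, e.g.:

$ dd if=/scratch/newfile.dat bs=4M status=none | md5sum    # streams through memory in 4 MiB chunks; no local disk writes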

Thanks again!

Hmm, then I suppose there’s something I’m not understanding about what was happening; I’ll look into it more, but in any case the info here is very helpful, thanks!

Well, it still needs to download and upload each chunk (into memory) so the transfer still gets bottlenecked locally :slightly_smiling_face:

To clarify, the “scratch” and “home” filesystems I’m referring to here are two different network filesystems, both mounted on the same machine. As far as DVC sees, there are only file copies, no upload/download. It happens that scratch <-> scratch is fast while scratch <-> home is slow, the bottleneck I think being the home disk and/or network speed, so I don’t think it’s an issue that the data passes through memory (which in this configuration seems to me unavoidable when copying from one filesystem to another).


Ah yes, good point. I guess in this case the chunked transfer with Python-level I/O implemented in DVC may still be significantly slower than a native filesystem copy, i.e. cp.
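A rough way to see that difference on your system (just an illustration; the copy destinations and the 1 MiB chunk size are made up, and this is not how DVC performs the copy internally):

$ time cp /scratch/newfile.dat /scratch/copy_native.dat
$ time python3 -c "import shutil; shutil.copyfileobj(open('/scratch/newfile.dat', 'rb'), open('/scratch/copy_chunked.dat', 'wb'), 1024*1024)"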