How to add data from "scratch" filesystem directly to "scratch" cache dir

My scenario is that I’m on a cluster with a very limited home directory quota (40 GB) but terabytes of “scratch” space on a separate filesystem. I’d like to keep my Git repo on the home filesystem (say at ~/myrepo/), but the DVC cache on the scratch filesystem (say at /scratch/dvc).

It’s easy enough to configure the cache on the scratch filesystem with e.g. dvc cache dir /scratch/dvc and the symlink cache type, so that files in the workspace just symlink into it.
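Something like this, I believe (using the example paths above):

$ cd ~/myrepo
$ dvc cache dir /scratch/dvc       # point the DVC cache at the scratch filesystem
$ dvc config cache.type symlink    # workspace files become symlinks into the cache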

But now, suppose I’ve generated a large file at /scratch/newfile.dat. How can I add it to the repo at a chosen path without it ever touching the home filesystem?

I would have thought:

dvc add -o ~/myrepo/newfile.dat /scratch/newfile.dat

would do it but that still seems to go through the home filesystem.

It seems that if I add --to-remote it works as expected, but then it also (obviously) pushes the data to the remote, which I don’t want.
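That is, the same command with the extra flag, which then uploads the data to whatever remote is configured as the default:

$ dvc add -o ~/myrepo/newfile.dat /scratch/newfile.dat --to-remote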

Thanks for any suggestions.

Hi @marius,

Glad you were able to find that info already :+1: That’s the first step indeed.

If you’re using DVC 2.x, then that exact command is what you need. It should transfer the data to the cache without needing much space available on the local system (it uses file chunking as an intermediate step).

See also the dvc add command reference.

Thanks.


I’m not sure about the absolute path you’re providing to -o, though; please use a relative path from your project directory if needed, i.e.

$ cd ~/myrepo
$ dvc add /scratch/newfile.dat -o newfile.dat
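With the symlink cache type you configured, you can also verify afterwards that the workspace file is just a link into the scratch cache (a quick sanity check, using the paths from your example):

$ ls -l newfile.dat    # with cache.type = symlink this should point into /scratch/dvc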

Ahhh I see, that piece of info is key, thanks (and I am on 2.x). The reason I thought the home filesystem was being touched was that the write speed was kind of slow, indicative of having gone scratch → home → scratch, as compared to a straight scratch → scratch copy, which on this system is much faster. So I suppose that’s just because the data does pass through the home filesystem, but it does so in chunks, so it shouldn’t overwhelm my quota. A question I still have is why it isn’t a direct copy, but in any case this is very helpful and solves my major worry about the quota. Thanks!


Yes that would be ideal. I’ll check with engineering on this… Thanks for the feedback!

Another quick clarification, BTW: the transfer doesn’t actually use any local disk space at all, as the chunks are stored locally in memory only.

Finally, I did check, and for now DVC can’t support direct transfers from external data locations to remote storage (without chunking through local memory), because we need to calculate the MD5 hash of the data (which can only be done locally) in order to dvc add it properly. Maybe for DVC 3.0 though, we’ll see! And we still welcome feature requests on our GitHub repo.
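Just to illustrate the principle (this is not DVC’s actual internals): hashing forces the data to stream through local memory, even though nothing needs to be written to local disk, e.g.:

$ dd if=/scratch/newfile.dat bs=4M status=none | md5sum    # streams through memory in 4 MiB chunks; no local disk writes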

Thanks again!

Hmm, then I suppose there’s something I’m not understanding about what was happening; I’ll look into it more, but in any case the info here is very helpful, thanks!

Well, it still needs to download and upload each chunk (into memory) so the transfer still gets bottlenecked locally :slightly_smiling_face:

To clarify, the “scratch” and “home” filesystems I’m referring to here are two different network filesystems, both mounted on the same machine. As far as DVC sees, there are only file copies, no upload/download. It happens that scratch <-> scratch is fast while scratch <-> home is slow, the bottleneck I think being the home disk and/or network speed, so I don’t think it’s an issue that the data passes through memory (which in this configuration seems to me unavoidable when copying from one filesystem to another).


Ah yes, good point. I guess in this case the chunked transfer with Python-level I/O implemented in DVC may still be significantly slower than a native filesystem copy, i.e. cp.
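A rough way to see that difference on your system (just an illustration; the copy destinations and the 1 MiB chunk size are made up, and this is not how DVC performs the copy internally):

$ time cp /scratch/newfile.dat /scratch/copy_native.dat
$ time python3 -c "import shutil; shutil.copyfileobj(open('/scratch/newfile.dat', 'rb'), open('/scratch/copy_chunked.dat', 'wb'), 1024*1024)"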