Add a remote directory without adding to the cache

We have a corpus licensed from a third party that is stored in our S3 bucket. The corpus is fixed and should never change, so versioning is not much of a concern. I’d like to integrate it into my repository so that when someone checks it out from git, they can run a simple command like dvc checkout and it will download the corpus. However, since it is already stored in S3 and should never change, I would prefer that it doesn’t get copied into our remote cache. But, I’m not sure this is really possible with dvc.

One solution might be to use something like dvc run -n get-corpus -O [local-corpus-path] dvc get-url [remote-corpus-path] [local-corpus-path]. Then (I think) someone can just run dvc repro get-corpus or any downstream tasks, and dvc will get the corpus. But, will this work? And is this the best/recommended way?

Hi @nimrand

By corpus, I assume you mean a data set (probably related to NLP). Some sort of ground truth data source.

Seems like you need a download tool? Would aws s3 cp work for you? You could add a shell script to your repo that wraps the exact download command, and have users install the AWS CLI.
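For illustration, such a wrapper might look roughly like this. It is only a minimal sketch, assuming the AWS CLI is installed and configured; the function name fetch_corpus and the bucket path in the usage line are hypothetical. It uses aws s3 sync rather than aws s3 cp, since sync only copies files that are missing or differ locally, which keeps reruns cheap.

```shell
#!/bin/bash
# Hypothetical wrapper: download the corpus from S3 into a local directory.
# Usage: fetch_corpus <s3-url> <local-dir>
fetch_corpus() {
    local src="$1" dest="$2"
    if [ -z "$src" ] || [ -z "$dest" ]; then
        echo "usage: fetch_corpus <s3-url> <local-dir>" >&2
        return 2
    fi
    mkdir -p "$dest" || return 1
    # sync copies only files that are missing or changed at the destination
    aws s3 sync "$src" "$dest"
}
```

Someone checking out the repo would then source the script and run something like fetch_corpus s3://example-bucket/corpus corpus.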

Wrapping it with DVC is possible, but I’m not sure I understand the use case. Can you share more context?

Thanks

The corpus is a large directory of text documents that are representative of various domains, from which we want to develop language models.

The missing context, I guess, is that I’m already using dvc for other projects, and I’d like to standardize our repos on using dvc for large data dependencies in all our projects across the organization. In previous projects, I used dvc add [large data file] and pushed to a remote cache on S3. So, when a person checks out the repo, they just need to run dvc pull to get the data files they need to run the code.

In this case, however, the large data file already exists in a centralized location and should never change, so I’d like to avoid making a copy of the data in the remote cache just for the sake of integrating it with dvc. Ideally, when someone downloads the repo and does a dvc pull, it would download the corpus from the existing S3 source, but from reading the docs I’m not sure this is possible. The alternative I came up with was to wrap it in a task, as I proposed in my original post.

As for aws s3 cp, is there any advantage to that rather than using dvc get-url if we can already assume that users of the repo have dvc installed?


Hi @nimrand!
I think there is no way to keep this data out of the remote cache and still be able to dvc pull it, though I think there is a workaround for your use case.

What you can do is create a stage that has no dependencies (a callback stage).
In your case it would look like this:
dvc run -n download_data --outs-no-cache data aws s3 cp s3://s3_data_path data
That way we preserve the knowledge of how to get the data. If you just run dvc get-url, dvc will not record where the data came from.

Now there is one problem with this approach: callback stages are always considered changed, so if you later dvc repro a stage that depends on the callback stage’s output, dvc will rerun the callback stage and copy the data all over again.
To prevent that, you can use dvc freeze download_data; dvc won’t try to rerun this stage until you unfreeze it.
So when you share your project with download_data frozen, someone who reproduces it will get an error that data does not exist. But if they run dvc unfreeze download_data && dvc repro some_later_stage, dvc will download the data and reproduce the pipeline.

To illustrate this, I created the following example:

#!/bin/bash

# data to import
rm -f big_data  # -f: no error if the file does not exist yet
echo data >> big_data
main=$(pwd)

rm -rf repo
mkdir repo

set -x
pushd repo
git init --quiet
dvc init --quiet

# download data
dvc run -n download_data --outs-no-cache data "cp $main/big_data data"
dvc freeze download_data # so that dvc will not try to download each time repro is called
dvc run -n process -d data -o result "cat data >> result && echo processed >> result"

#everything is present, no problem
dvc repro process

#remove data, repro fails
rm data
dvc repro process

#unfreeze stage, be able to reproduce pipeline
dvc unfreeze download_data
dvc repro process

I’m surprised that dvc would always try to rerun that task, even when the data it produces already exists locally. Is this always the case when a stage depends on an --outs-no-cache of another stage? You called it a “callback stage”. I’m not familiar with that term.

Also, how is dvc run -n download_data --outs-no-cache data aws s3 cp s3://s3_data_path data different from dvc run -n download_data --outs-no-cache data dvc get-url s3://s3_data_path data? Don’t aws s3 cp s3://s3_data_path data and dvc get-url s3://s3_data_path data do the same thing?

@nimrand

Well it is possible to do that too, I missed this possibility. :slight_smile:

So to answer your first question:

  • callback stage: stage that does not have dependencies

dvc will always rerun callback stages.
--outs-no-cache does not matter in this case.
The reason for this behavior is that callback stages were introduced to let the user execute code that checks whether something has changed and feed that information into the pipeline.
Example: checking if there are new entries in log storage to trigger importing and processing them.

As to your project: you might also just use data as a dependency in your first pipeline. Dependencies are not put under DVC control (outputs are), so it won’t go into the cache. If you reproduce the pipeline when data is missing, you will get a message that it does not exist.

Interesting ideas about using a stage with no dependencies, or an external dependency in a 2-stage pipeline, Pawel. But if the goal is to download the data with dvc pull like in the other repos @nimrand has, that wouldn’t help, as dvc repro is needed instead.

The problem is that for dvc pull to work, the data would need to be pushed first, meaning added manually to the workspace, tracked by DVC (with dvc add, for example), and pushed with dvc push to some remote storage. Another way to put it in the workspace and track it in a single step would be dvc import-url. But both of these methods duplicate the remote storage of the data set.

Another possible workaround is to add the data to the project without moving it, as an external output. This implies setting up an external cache in the S3 location first, and then doing something like dvc add s3://s3_data_path. I’m not sure what happens on S3 at that point though; the data may be duplicated anyway :confused: and dvc pull still wouldn’t download it to the workspace, as it’s added externally (never pushed in the first place).

dvc will always rerun callback stages.
--outs-no-cache does not matter in this case.
The reason for this behavior is that callback stages were introduced to let the user execute code that checks whether something has changed and feed that information into the pipeline.
Example: checking if there are new entries in log storage to trigger importing and processing them.

I would expect the default to be to not rerun stages that don’t have dependencies, and that a callback stage could be made by using --always-changed. I suppose I could add a dummy dependency to the get-corpus task, but that’s a bit clunky.

It seems that my use case won’t work with dvc pull without duplicating the data, but maybe making it work via dvc repro is the next best thing.


OK. And like I mentioned, please consider https://dvc.org/doc/command-reference/import-url as another alternative. It will download the data and create a .dvc file for it, as if it was tracked with dvc add, so you can use it as a dependency for further stages and have repro download it automatically if/when needed.
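As a rough sketch of that workflow (assuming a DVC CLI that provides import-url and status, as documented on the page above; the helper name import_corpus and the URL and path in the usage line are placeholders, not values from this thread):

```shell
# Hypothetical helper: import a remote data set, then report its state.
# Usage: import_corpus <remote-url> <local-path>
import_corpus() {
    local url="$1" out="$2"
    # Downloads the data and writes a .dvc file that records the source URL
    dvc import-url "$url" "$out" || return 1
    # Shows whether tracked data or its recorded source has changed
    dvc status
}
```

For example, import_corpus s3://example-bucket/corpus corpus. The generated .dvc file is what you would commit to Git, and dvc update on that file can refresh the data if the source ever changes.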

Just re-read that page…

In it, they give an example that says import-url is equivalent to

dvc run -n download_data -d https://data.dvc.org/get-started/data.xml -o data.xml wget https://data.dvc.org/get-started/data.xml -O data.xml

I think that’s almost what I want, except we’d change the first -o parameter to -O to avoid copying the file into the local cache.

But, since it declares a dependency on a remote location, how does it know when it’s changed other than re-downloading the files every time?

This is not such a big deal, really. import-url will download it into the cache directly, and then link the file to the workspace (as long as it’s supported by the file system, see this doc for more info), so there’s no duplication. -O is meant more for small files you want to track with Git or experimental files you don’t want to track at all.

DVC checks the ETag of the original S3 data source when it needs to determine this, e.g. at dvc status, dvc repro, etc. (You can use dvc update to bring it up to date, BTW.) But I thought your data set is not expected to ever change anyway?
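If you are curious, you can look at the ETag that S3 reports for a single object yourself with the AWS CLI. This is just a sketch, assuming the AWS CLI is installed and configured; the helper name s3_etag and the bucket and key in the usage line are placeholders. A directory has no single ETag, since each object has its own.

```shell
# Hypothetical helper: print the ETag S3 reports for one object.
# Usage: s3_etag <bucket> <key>
s3_etag() {
    local bucket="$1" key="$2"
    # head-object fetches object metadata only; --query picks the ETag field
    aws s3api head-object --bucket "$bucket" --key "$key" --query ETag --output text
}
```

For example, s3_etag example-bucket corpus/doc.txt. If the printed value stays the same between two calls, S3 considers the object unchanged.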

Except it will then copy the corpus to the remote cache when I do a dvc push, which is what I was trying to avoid, right? Or, am I missing something?

Also, when I try running dvc import-url [s3-url] [local-path], I get this error: Current operation was unsuccessful because 's3://duolingo-research-data/det/COCA' requires existing cache on 's3' remote. It is not clear to me what setting up a “cache on ‘s3’ remote” requires.

No, it won’t be pushed to any remote as it’s technically an external output in the .dvc file.

:face_with_hand_over_mouth: That sounds like a bug… Mind opening a report to https://github.com/iterative/dvc/issues/new?template=bug_report.md ? Sorry about that.

Done: https://github.com/iterative/dvc/issues/4261

I’m not sure exactly what an “external” output is, but based on what you’ve said it sounds like import-url (when it’s fixed) gives me exactly what I want.

I think the documentation about how external resources work is a bit confusing, and then the error about requiring an “existing cache” really threw me for a loop. Good to know it’s a bug.

You are right about our external data docs, we should probably review them. We actually have an old issue about this… https://github.com/iterative/dvc.org/issues/520 feel free to upvote (like) and/or comment there.

As for the bug you bumped into, please follow https://github.com/iterative/dvc/issues/4144 instead, as the one I opened was found to be a duplicate.

Best