Best practice for handling large data

I am working with a 25 GB raw dataset which, when processed and prepared for training, produces another 10+ GB file. Using dvc push and dvc pull with S3 takes a significant amount of time to move the data over the network. Is there a way to keep the data only in S3? Or a flag that chooses when to pull or push, but that knows a stage has already run and does not need rerunning?

What would be the recommended way of working with the data both locally and using a remote?

Hey @nsorros,

Can you please clarify why you need to push/pull data constantly? Remote storage is an “optional” (though very common) feature, meaning that you can run your entire pipeline locally and only sync to/from the remote once (in a while).

Is there a way to keep the data only in s3?

DVC can track data externally without ever downloading it, yes. But it’s a very advanced/tricky feature that we need to redesign, so it’s not something I would necessarily recommend. See https://dvc.org/doc/user-guide/managing-external-data.

or use a flag that chooses when to pull or push but that knows that a stage has run

This is the part I’m not getting. When do you want push/pull to happen? If you can share more details on your workflow we can probably give you a better answer.

recommended way of working with the data both locally and using a remote?

To recap, remote storage is meant for sharing and backing up raw data, results, models, etc. But if you’re syncing it over a network connection, then it’s probably not a good idea to rely on it for day-to-day pipeline executions involving very large files. You may be interested in things like https://dvc.org/doc/use-cases/shared-development-server or https://dvc.org/doc/command-reference/add#example-transfer-to-remote-storage, but again, it depends on your goals.
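As a sketch of the second link above (assuming DVC 2.x, where `--to-remote` is available; the file path here is hypothetical), you can track a large file and transfer it straight to the default remote without keeping a copy in the local cache:

```shell
# Hypothetical example: track a big file and send it directly to the
# default remote, skipping the local cache (requires DVC 2.x).
dvc add data/raw.csv --to-remote

# Later, on a machine that actually needs the file:
dvc pull data/raw.csv
```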

Thanks

Sorry for the confusion. What I am thinking of is a flow where the code and data live in two places, an EC2 instance and locally, kept in sync through GitHub and DVC. This is our current setup for most projects. We use an EC2 instance to train when needed; otherwise we train locally.

So in this setup I sometimes develop locally but want to train on the EC2 instance. Ideally I do not want to run dvc pull and get all the data locally; I want to be able to exclude some files due to their size. The same goes for other members of the team: if they open my project and want to test that a pipeline works, I do not want them to have to download 25-100 GB of data with dvc pull. Ideally I would include a sample dataset that they can use in a sample pipeline or something. I might use that for local development as well.

If that flow does not make sense, and a better solution is to never get the data through dvc pull (either locally or on the EC2 instance) and always treat it as external data, then I guess that is my answer. I would assume this problem only becomes more prominent as data and models get larger.

Hope this makes sense, I understand it might be confusing, very curious to hear more about how you would think about it and I am happy to provide more context.


Got it.

You can give specific targets to dvc pull. With the --glob option you can even use wildcards and other patterns. See https://dvc.org/doc/command-reference/pull. (The same applies to dvc push.)
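For instance (the paths and stage name below are hypothetical, just to illustrate the shapes a target can take):

```shell
# Pull only one tracked file instead of everything:
dvc pull data/sample.csv

# Pull only the outputs of one stage by naming it:
dvc pull prepare

# With --glob, DVC expands the pattern itself
# (quote it so the shell doesn't expand it first):
dvc pull --glob 'data/sample*'
```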

You can make a separate Git branch in your repo, put the sample there, run dvc commit/repro to update the DVC metafiles, and push it with git push and dvc push. Teammates can then check out that branch, and dvc pull will download only the smaller sample.
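A rough sketch of that branch workflow (all file names and the branch name are made up; how you produce the sample is up to you):

```shell
# Create a branch that carries only a small sample dataset.
git checkout -b sample-data
head -n 1000 data/full.csv > data/sample.csv   # create the sample somehow
dvc add data/sample.csv                        # or `dvc repro` if it's a stage output
git add data/sample.csv.dvc data/.gitignore
git commit -m "Add sample dataset for quick pipeline testing"
git push -u origin sample-data
dvc push data/sample.csv

# A teammate then downloads only the sample:
git checkout sample-data
dvc pull
```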

Does that make sense or am I misunderstanding the situation? Thanks

Hi Jorge,

That makes sense. I guess I missed the --glob option, which seems to do what I am looking for.

Interesting idea about the separate branch, thanks for the suggestion.

Super helpful and again thanks for the quick response.


Glad I could help!
