Best practice for handling large data

I am working with a 25 GB raw dataset which, when processed and prepared for training, produces another 10+ GB file. Using dvc push and dvc pull with S3 takes a significant amount of time to move the data over the network. Is there a way to keep the data only in S3? Or a flag that chooses when to pull or push, but that knows a stage has already run and does not need rerunning?

What would be the recommended way of working with the data both locally and using a remote?

Hey @nsorros,

Can you please clarify why you need to push/pull data constantly? Remote storage is an “optional” (though very common) feature, meaning that you can run your entire pipeline locally and only sync to/from the remote once (in a while).

Is there a way to keep the data only in s3?

DVC can track data externally without ever downloading it, yes. But it’s a very advanced/tricky feature that we need to redesign, so it’s not something I would necessarily recommend. See https://dvc.org/doc/user-guide/managing-external-data.

or use a flag that chooses when to pull or push but that knows that a stage has run

This is the part I’m not getting. When do you want push/pull to happen? If you can share more details on your workflow we can probably give you a better answer.

recommended way of working with the data both locally and using a remote?

To recap, remote storage is meant for sharing and backing up raw data, results, models, etc. But if you’re syncing it over a network connection, then it’s probably not a good idea to rely on it for day-to-day pipeline executions involving very large files. You may be interested in things like https://dvc.org/doc/use-cases/shared-development-server or https://dvc.org/doc/command-reference/add#example-transfer-to-remote-storage, but again, it depends on your goals.
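As a sketch of the second link above (assuming DVC 2.x, where `--to-remote` is available; the file path here is hypothetical), you can track a large file and transfer it straight to the default remote without keeping a copy in the local cache:

```shell
# Hypothetical example: track a big file and send it directly to the
# default remote, skipping the local cache (requires DVC 2.x).
dvc add data/raw.csv --to-remote

# Later, on a machine that actually needs the file:
dvc pull data/raw.csv
```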

Thanks

Sorry for the confusion. What I am thinking of is a flow where the code and data live in two places, an EC2 instance and locally, kept in sync through GitHub and DVC. This is our current setup for most projects. We use an EC2 instance to train when needed; otherwise we train locally.

So in this setup I sometimes develop locally but want to train on the EC2 instance. Ideally I do not want to run dvc pull and get all the data locally; I want to be able to exclude some files due to their size. The same goes for other members of the team: if they open my project and want to test that a pipeline works, I do not want them to have to download 25-100 GB of data with dvc pull. Ideally I would include a sample dataset that they can use in a sample pipeline or something. I might use that for local development as well.

If that flow does not make sense, and a better solution is to never get the data through dvc pull (either locally or on the EC2 instance) and always treat it as external data, then I guess that is my answer. I would assume this problem only becomes more prominent as data and models get larger.

Hope this makes sense, I understand it might be confusing, very curious to hear more about how you would think about it and I am happy to provide more context.


Got it.

You can give specific targets to dvc pull. With the --glob option you can even use wildcards and other patterns. See https://dvc.org/doc/command-reference/pull. (The same applies to dvc push.)
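For instance (the paths and stage name below are hypothetical, just to illustrate the shapes a target can take):

```shell
# Pull only one tracked file instead of everything:
dvc pull data/sample.csv

# Pull only the outputs of one stage by naming it:
dvc pull prepare

# With --glob, DVC expands the pattern itself
# (quote it so the shell doesn't expand it first):
dvc pull --glob 'data/sample*'
```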

You can make a separate Git branch in your repo, put the sample there, run dvc commit/repro to update the DVC metafiles, and push it with git push and dvc push. Teammates can then check out that branch, and dvc pull will download only the smaller sample.
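A rough sketch of that branch workflow (all file names and the branch name are made up; how you produce the sample is up to you):

```shell
# Create a branch that carries only a small sample dataset.
git checkout -b sample-data
head -n 1000 data/full.csv > data/sample.csv   # create the sample somehow
dvc add data/sample.csv                        # or `dvc repro` if it's a stage output
git add data/sample.csv.dvc data/.gitignore
git commit -m "Add sample dataset for quick pipeline testing"
git push -u origin sample-data
dvc push data/sample.csv

# A teammate then downloads only the sample:
git checkout sample-data
dvc pull
```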

Does that make sense or am I misunderstanding the situation? Thanks

Hi Jorge,

That makes sense. I guess I missed the --glob option, which seems to do what I am looking for.

Interesting idea about the separate branch, thanks for the suggestion.

Super helpful and again thanks for the quick response.


Glad I could help!
