Sorry for the confusion. What I am thinking of is a flow where the code and data live in two places, an EC2 instance and locally, synced through GitHub and DVC. This is our current setup for most projects: we use an EC2 instance to train when needed; otherwise we train locally.
In this setup I sometimes develop locally but want to train on the EC2 instance. Ideally I do not want to run dvc pull and fetch all of the data locally; I want to be able to exclude some of it because of the file sizes. The same goes for other members of the team: if they open my project and want to check that a pipeline works, I do not want them to have to download 25-100 GB of data with dvc pull. Ideally I would include a sample dataset that they can run through a sample pipeline, and I might use that for local development as well.
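For concreteness, this is roughly the kind of granular pull I have in mind (the data/sample and data/full paths are just placeholders for illustration, not our actual layout):

```bash
# Teammate or local development: fetch only the small sample dataset
# instead of the full 25-100 GB, using dvc pull with a specific target.
dvc pull data/sample

# EC2 instance, before a full training run: fetch everything.
dvc pull
```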
If that flow does not make sense and a better solution is to never fetch the data through dvc pull, either locally or on the EC2 instance, and to always treat it as external data, then I guess that is my answer. I would assume this problem only becomes more prominent as the data and the models get larger.
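If the external-data route is the recommended one, I imagine it would look something like the sketch below, with the full dataset referenced as an external dependency so nobody has to pull it (the bucket name, paths, and script are made up just to illustrate the idea):

```bash
# Hypothetical stage: the training step depends on data that stays in S3,
# so only the code and the model output are tracked/pulled locally.
dvc stage add -n train \
    -d s3://my-bucket/datasets/full \
    -d src/train.py \
    -o models/model.pt \
    python src/train.py --data s3://my-bucket/datasets/full
```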
Hope this makes sense; I understand it might be confusing. I am very curious to hear how you would think about it, and I am happy to provide more context.