I am thinking of switching to DVC, but I have a few issues to resolve:
I have a huge dataset. Most of the time I was doing something like this:
I directly downloaded the files from GCP, or in Colab I just mounted the drive and read the data from it.
What should I do if I use DVC? In Colab, if I mount the storage, the dataset is stored in DVC's internal repo format.
Hi @nukich74!
So, when you are mounting GCP in Colab, I assume it is treated as a “local” directory, right?
If you would like to start using DVC, you could create a Git repository for your dataset, use DVC to version it (`dvc add {dataset}`), make your GCP bucket a so-called remote, and use DVC in your Colab to check out your data via the `dvc checkout` or `dvc pull` commands.
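A minimal sketch of that flow, assuming a GCS bucket `gs://my-bucket` and a dataset directory `data/` (both names are hypothetical, adjust to your setup):

```bash
# one-time setup, run from your dataset's Git repository
git init
dvc init
dvc add data                                   # track the dataset with DVC
dvc remote add -d storage gs://my-bucket/dvc   # make your GCP bucket the remote
dvc push                                       # upload the data to the remote
git add data.dvc .gitignore .dvc/config
git commit -m "Track dataset with DVC"

# later, in Colab (or any other environment)
git clone <your-repo-url> && cd <your-repo>
dvc pull                                       # fetch the data from the remote
```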
I recommend going through our Get Started tutorial to grasp how DVC works: https://dvc.org/doc/start
Hi! This will lead to double copying of data. If I mount the drive, I can just read the dataset asynchronously. With DVC I have to copy it to the local environment, which is not efficient. Can I init a repo from the mounted drive without copying the data?
@nukich74
That depends on how mounting works on Colab. DVC provides a way to use links instead of copies, so that the user does not duplicate data. If Colab is able to create a symlink or hardlink, you could avoid data duplication.
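For example, the relevant setting is `cache.type` (described in the doc below); something like:

```bash
# prefer links over copies when checking data out of the cache;
# DVC falls back through the list until one link type works
dvc config cache.type "reflink,symlink,hardlink,copy"
dvc checkout --relink   # re-link files that were already checked out as copies
```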
Take a look at: https://dvc.org/doc/user-guide/large-dataset-optimization