I am thinking of switching to DVC, but I have a few issues to resolve:
I have a huge dataset. Most of the time I was doing something like this:
I directly downloaded the files from GCP, or in Colab I just mounted the drive and read the data from it.
What should I do if I use DVC? In Colab, if I mount the storage, the dataset is stored in DVC's internal repo format.
Hi @nukich74!
So, when you are mounting GCP in Colab, I assume it is treated as a “local” directory, right?
If you would like to start using DVC, you could create a Git repository for your dataset, use DVC to version it (`dvc add {dataset}`), make your GCP bucket a so-called remote, and use DVC in your Colab to check out your data via the `dvc checkout` or `dvc pull` commands.
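A minimal sketch of that flow, assuming a GCS bucket `gs://my-bucket` and a dataset directory `data/` (both names are hypothetical, adjust to your setup):

```bash
# one-time setup, run from your dataset's Git repository
git init
dvc init
dvc add data                                   # track the dataset with DVC
dvc remote add -d storage gs://my-bucket/dvc   # make your GCP bucket the remote
dvc push                                       # upload the data to the remote
git add data.dvc .gitignore .dvc/config
git commit -m "Track dataset with DVC"

# later, in Colab (or any other environment)
git clone <your-repo-url> && cd <your-repo>
dvc pull                                       # fetch the data from the remote
```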
I recommend going through our Get Started tutorial to grasp how DVC works: https://dvc.org/doc/start
Hi! This will lead to double copying of data. If I mount the drive, I can just read the dataset asynchronously. With DVC I have to copy it to the local environment, which is not efficient. Can I init a repo from the mounted drive without copying the data?
@nukich74
That depends on how mounting works on Colab. DVC provides a way to use links instead of copies, so that the user does not duplicate data. If Colab is able to create a symlink or hardlink, you could avoid data duplication.
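For example, the relevant setting is `cache.type` (described in the doc below); something like:

```bash
# prefer links over copies when checking data out of the cache;
# DVC falls back through the list until one link type works
dvc config cache.type "reflink,symlink,hardlink,copy"
dvc checkout --relink   # re-link files that were already checked out as copies
```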
Take a look at: https://dvc.org/doc/user-guide/large-dataset-optimization