We are considering starting to use DVC and integrating it into our current git repo, and I have some questions about best practices for introducing DVC when you already have an established workflow.
Our current data workflow, before using DVC, is as follows:
- All data is stored in an S3 bucket under a specific folder (we can call it the “resources folder”). This folder sits inside the git repo's folder tree, but the whole resources folder is listed in the repo's .gitignore file.
- We have a configuration yaml file in which we specify which data in the resources folder on S3 we want to use.
- The configuration yaml file is read by the code we use to train the model.
- Once training starts, the code first checks whether the data specified in the configuration yaml file is already present locally; if a file is missing, the code downloads it from the resources folder in the S3 bucket to the local machine.
- The model is trained.
- Output model files (e.g., weight files) created when training finishes are saved locally and then uploaded (by the code, not manually) to the resources folder in S3.
- "Locally" means either the local computer (if the amount of data is small) or something like an EC2 instance (if the data is too large to train on the local computer).
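For context, the download-if-missing step above looks roughly like this (a minimal sketch; the function name `ensure_local` and the injected `fetch` callable are illustrative, not our actual code — in practice `fetch` would wrap a boto3 `download_file` call):

```python
from pathlib import Path
from typing import Callable

def ensure_local(local_path: Path, fetch: Callable[[Path], None]) -> Path:
    """Download-if-missing: call fetch only when the file is not already cached locally."""
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        # In our real code this would be something like:
        # s3.download_file(bucket, key, str(local_path))
        fetch(local_path)
    return local_path
```

Calling `ensure_local` a second time for the same path is a no-op, which is the behavior DVC's cache-plus-`dvc pull` model would replace.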
My questions are:
a. We will need to remove the resources folder entry from the .gitignore file.
b. Then add one file at a time to DVC tracking using `dvc add`?
c. Then add the `.dvc` file created by `dvc add` (and the raw file's new `.gitignore` entry) to git and commit.
d. Run `dvc push` to push the data to remote storage (which will also be on S3, but in a different bucket).
e. Once we have registered all the files, we will no longer need the files in the original resources folder we started from, right?
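The migration steps in b–d can be sketched as follows (bucket names and file paths are placeholders, not our actual layout):

```shell
# One-time setup: point DVC at the new remote (different bucket than the resources folder)
dvc remote add -d storage s3://my-dvc-bucket/dvc-store

# For each data file currently in the resources folder:
dvc add resources/train.csv
# dvc add creates resources/train.csv.dvc and appends /train.csv
# to resources/.gitignore automatically

git add resources/train.csv.dvc resources/.gitignore
git commit -m "Track train.csv with DVC"

# Upload the locally cached data to the DVC remote
dvc push
```

Note that `dvc add` manages the per-file `.gitignore` entries itself, which is why the blanket resources-folder entry from question (a) should be removed first.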
Any advice on how to best integrate DVC into the current workflow, and any other best practices, will be greatly appreciated.