We are considering starting to use DVC and integrating it into our current git repo, and I have some questions about best practices for introducing DVC when you already have an established workflow.
Our current data workflow, before using DVC, is as follows:
- All data is stored in an S3 bucket under a specific folder (we can call it the “resources folder”). This folder sits inside the git repo's folder tree, but the whole resources folder is listed in the repo's .gitignore file.
- We have a configuration yaml file in which we specify which data in the resources folder on S3 we want to use.
- The configuration yaml file is read by the code we use to train the model.
- Once training starts, the code first checks whether the data specified in the configuration yaml file is already present locally; if a file is missing, the code downloads it from the resources folder in the S3 bucket to the local machine.
- The model is trained.
- Output model files (e.g., weight files) created when training finishes are saved locally and then uploaded (by the code, not manually) to the resources folder in S3.
- "Locally" means either the local computer (if the amount of data is small) or something like an EC2 instance (if the data is too large to train on the local computer).
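For context, the download-if-missing step above looks roughly like this (a minimal sketch; the function name `ensure_local` and the injected `fetch` callable are illustrative, not our actual code — in practice `fetch` would wrap a boto3 `download_file` call):

```python
from pathlib import Path
from typing import Callable

def ensure_local(local_path: Path, fetch: Callable[[Path], None]) -> Path:
    """Download-if-missing: call fetch only when the file is not already cached locally."""
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        # In our real code this would be something like:
        # s3.download_file(bucket, key, str(local_path))
        fetch(local_path)
    return local_path
```

Calling `ensure_local` a second time for the same path is a no-op, which is the behavior DVC's cache-plus-`dvc pull` model would replace.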
My questions are:
a. We will need to remove the resources folder entry from the .gitignore file.
b. Then add one file at a time to DVC tracking using `dvc add`?
c. Then add the `.dvc` file created by `dvc add` (and the raw file's new `.gitignore` entry) to git and commit.
d. Run `dvc push` to push the data to remote storage (which will also be on S3, but in a different bucket).
e. Once we have registered all the files, we will no longer need the files in the original resources folder we started from, right?
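The migration steps in b–d can be sketched as follows (bucket names and file paths are placeholders, not our actual layout):

```shell
# One-time setup: point DVC at the new remote (different bucket than the resources folder)
dvc remote add -d storage s3://my-dvc-bucket/dvc-store

# For each data file currently in the resources folder:
dvc add resources/train.csv
# dvc add creates resources/train.csv.dvc and appends /train.csv
# to resources/.gitignore automatically

git add resources/train.csv.dvc resources/.gitignore
git commit -m "Track train.csv with DVC"

# Upload the locally cached data to the DVC remote
dvc push
```

Note that `dvc add` manages the per-file `.gitignore` entries itself, which is why the blanket resources-folder entry from question (a) should be removed first.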
Any advice on how to best integrate DVC into the current workflow, and any other best practices, will be greatly appreciated.