Pull data from Gdrive in GH actions?

Hi, I was wondering how to set up the GitHub Actions YAML file so that it pulls the data and the model from Google Drive. Maybe I need to make the Gdrive folder public, or add my Gdrive credentials? Right now, the sanity-check step fails with the workflow file from the video.

test.yaml:

name: auto-testing
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: sanity-check
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Your ML workflow goes here
          pip install -r requirements.txt
          python test.py

If you want to combine DVC with GitHub Actions for some CI automation, you may want to take a look at our other product, CML. Its documentation has some info on how to set this up. The key part is exposing your Google Drive credentials to the job through an environment variable:

env:
  GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
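For context, a minimal workflow sketch using that variable could look like this (step names and the pull target are placeholders, not a verified setup):

name: pull-data
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: pull-data
        env:
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          pip install "dvc[gdrive]"
          dvc pull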

I followed the steps from this guide:

https://help.talend.com/r/en-US/7.2/google-drive/how-to-access-google-drive-using-client-secret-json-file-the

  • First, you have to enable the Google Drive API.
  • Then add a service account and create credentials as a JSON file.
  • That file can then be placed in the .dvc/tmp/ folder.
  • Then you need to modify the storage remote (full sequence sketched below):
dvc remote modify storage --local gdrive_user_credentials_file .dvc/tmp/gdrive-credentials.json
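Putting those steps together, the setup was roughly the following (the folder ID is a placeholder):

# replace <folder-id> with the ID of the Google Drive folder
dvc remote add -d storage gdrive://<folder-id>
dvc remote modify storage --local gdrive_user_credentials_file .dvc/tmp/gdrive-credentials.json
dvc pull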

However, in my case, that led to the following error:

dvc pull
ERROR: configuration error - GDrive remote auth failed with credentials in '.../.dvc/tmp/gdrive-credentials.json'.
Backup first, remove or fix them, and run again.
It should do auth again and refresh the credentials.

Details:: '_module'
ERROR: GDrive remote auth failed with credentials in '.../.dvc/tmp/gdrive-credentials.json'.
Backup first, remove or fix them, and run again.
It should do auth again and refresh the credentials.

Details:
Learn more about configuration settings at <https://man.dvc.org/remote/modify>.

Excuse me, does this credentials file work on a local computer?

Hi, could you please test it locally, to see whether it fails because of GitHub Actions or because of a credential problem?

@rmbzmb if you are using a service account you need to follow the service account instructions in the DVC docs.

Namely, you would not use gdrive_user_credentials_file; instead, you should specify:

dvc remote modify myremote gdrive_use_service_account true
dvc remote modify myremote --local \
              gdrive_service_account_json_file_path path/to/file.json

On CI you can then set the GDRIVE_CREDENTIALS_DATA to the content of the JSON file with the service account credentials.
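For example, if you use the GitHub CLI, the secret can be created straight from that file (just one way to do it):

# create the GitHub secret from the service account key file
# (assumes the gh CLI is installed and authenticated)
gh secret set GDRIVE_CREDENTIALS_DATA < path/to/file.json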

Please let us know if this still doesn’t work.

That means I have to use the JSON file (or its content) in both the local version and the GitHub version? I cannot use the JSON file locally and GDRIVE_CREDENTIALS_DATA on GitHub at the same time?

I tried it now. Locally the JSON file works; on GitHub, I get the same error message as before.
Is there an example yaml file available somewhere?

name: auto-testing
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: sanity-check
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          # Your ML workflow goes here
          pip install -r requirements.txt
          dvc pull data
          dvc repro

The data pull still fails with the same error message.

WARNING: You are using pip version 21.1; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
/usr/local/lib/python3.6/dist-packages/pycaret/loggers/mlflow_logger.py:14: FutureWarning: MLflow support for Python 3.6 is deprecated and will be dropped in an upcoming release. At that point, existing Python 3.6 workflows that use MLflow will continue to work without modification, but Python 3.6 users will no longer get access to the latest MLflow features and bugfixes. We recommend that you upgrade to Python 3.7 or newer.
  import mlflow
Pycaret: 2.3.10
Traceback (most recent call last):
  File "src/test.py", line 55, in <module>
    (x_train, y_train), (x_test, y_test)  = mdl.load_data()
  File "src/test.py", line 44, in load_data
    x_train, y_train = self.read_images_labels(self.training_images_filepath, self.training_labels_filepath)
  File "src/test.py", line 22, in read_images_labels
    with open(labels_filepath, 'rb') as file:
FileNotFoundError: [Errno 2] No such file or directory: 'data/MINST/train/train-labels-idx1-ubyte'
Error: Process completed with exit code 1.

Is it enough to pull the parent directory of all the data files, or do I need to pull them all individually, specifying every file name?

Hi @rmbzmb You don’t need to use a JSON file on GitHub; you just need to copy the contents of the JSON file into the GDRIVE_CREDENTIALS_DATA secret.

Yes, I am using the JSON file only on my local computer and on a remote workstation. On GitHub I copy-pasted the content of the JSON file into a GitHub secret named GDRIVE_CREDENTIALS_DATA. But it is still not working.

    ...
    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
    ...

It seems dvc on GitHub is ignoring GDRIVE_CREDENTIALS_DATA, even though it is set.

ERROR: failed to pull data from the cloud - To use service account, set gdrive_service_account_json_file_path, and optionally gdrive_service_account_user_email in DVC config
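To double-check that the secret actually reaches the job, a throwaway debug step like this can be used (illustrative only; it prints whether the variable is non-empty, never its value):

      - name: check-env
        env:
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          # print only whether the secret is set, not its contents
          test -n "$GDRIVE_CREDENTIALS_DATA" && echo "GDRIVE_CREDENTIALS_DATA is set" || echo "GDRIVE_CREDENTIALS_DATA is empty"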

It seems there is a bug in dvc currently: gdrive: raises unexpected error - name: drive version: v2 (again) · Issue #7949 · iterative/dvc · GitHub