Pull data from Gdrive in GH actions?

Hi, I was wondering how to set up the GitHub Actions YAML file so that it pulls the data and the model from Google Drive. Maybe I need to make the Gdrive folder public, or add my Gdrive credentials? Right now, the sanity-check step fails with the workflow file from the video.

test.yaml:

name: auto-testing
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: sanity-check
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Your ML workflow goes here
          pip install -r requirements.txt
          python test.py

If you want to combine DVC with GitHub Actions for some CI automation, you may want to take a look at our other product, CML. Its documentation has some info on how to set this up. The key part is exposing your Google Drive credentials to the job through an environment variable:

env:
  GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
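For context, a minimal workflow sketch using that variable could look like this (step names and the pull target are placeholders, not a verified setup):

name: pull-data
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: pull-data
        env:
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          pip install "dvc[gdrive]"
          dvc pull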

I followed the steps from this guide:

https://help.talend.com/r/en-US/7.2/google-drive/how-to-access-google-drive-using-client-secret-json-file-the

  • First, you have to enable the Google Drive API.
  • Then add a service account and create credentials as a JSON file.
  • That file can then be placed in the .dvc/tmp/ folder.
  • Then you need to modify the storage remote (full sequence sketched below):
dvc remote modify storage --local gdrive_user_credentials_file .dvc/tmp/gdrive-credentials.json
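Putting those steps together, the setup was roughly the following (the folder ID is a placeholder):

# replace <folder-id> with the ID of the Google Drive folder
dvc remote add -d storage gdrive://<folder-id>
dvc remote modify storage --local gdrive_user_credentials_file .dvc/tmp/gdrive-credentials.json
dvc pull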

However, in my case, that led to the following error:

dvc pull
ERROR: configuration error - GDrive remote auth failed with credentials in '.../.dvc/tmp/gdrive-credentials.json'.
Backup first, remove or fix them, and run again.
It should do auth again and refresh the credentials.

Details:: '_module'
ERROR: GDrive remote auth failed with credentials in '.../.dvc/tmp/gdrive-credentials.json'.
Backup first, remove or fix them, and run again.
It should do auth again and refresh the credentials.

Details:
Learn more about configuration settings at <https://man.dvc.org/remote/modify>.

Excuse me, does this credentials file work on a local computer?

Hi, could you please test it locally, to see whether it fails because of GitHub Actions or because of a credential problem?

@rmbzmb if you are using a service account you need to follow the service account instructions in the DVC docs.

Namely, you would not use gdrive_user_credentials_file; instead, you should specify:

dvc remote modify myremote gdrive_use_service_account true
dvc remote modify myremote --local \
              gdrive_service_account_json_file_path path/to/file.json

On CI you can then set the GDRIVE_CREDENTIALS_DATA to the content of the JSON file with the service account credentials.
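For example, if you use the GitHub CLI, the secret can be created straight from that file (just one way to do it):

# create the GitHub secret from the service account key file
# (assumes the gh CLI is installed and authenticated)
gh secret set GDRIVE_CREDENTIALS_DATA < path/to/file.json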

Please let us know if this still doesn’t work.

That means I have to use the JSON file (or its content) in both the local version and the GitHub version? I cannot use the JSON file locally and GDRIVE_CREDENTIALS_DATA on GitHub at the same time?

I tried it now. Locally the JSON file works; on GitHub, I get the same error message as before.
Is there an example yaml file available somewhere?

name: auto-testing
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: sanity-check
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          # Your ML workflow goes here
          pip install -r requirements.txt
          dvc pull data
          dvc repro

The data pull still fails with the same error message.

WARNING: You are using pip version 21.1; however, version 21.3.1 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
/usr/local/lib/python3.6/dist-packages/pycaret/loggers/mlflow_logger.py:14: FutureWarning: MLflow support for Python 3.6 is deprecated and will be dropped in an upcoming release. At that point, existing Python 3.6 workflows that use MLflow will continue to work without modification, but Python 3.6 users will no longer get access to the latest MLflow features and bugfixes. We recommend that you upgrade to Python 3.7 or newer.
  import mlflow
Pycaret: 2.3.10
Traceback (most recent call last):
  File "src/test.py", line 55, in <module>
    (x_train, y_train), (x_test, y_test)  = mdl.load_data()
  File "src/test.py", line 44, in load_data
    x_train, y_train = self.read_images_labels(self.training_images_filepath, self.training_labels_filepath)
  File "src/test.py", line 22, in read_images_labels
    with open(labels_filepath, 'rb') as file:
FileNotFoundError: [Errno 2] No such file or directory: 'data/MINST/train/train-labels-idx1-ubyte'
Error: Process completed with exit code 1.

Is it enough to pull the parent directory of all the data files, or do I need to pull them all individually, specifying every file name?

Hi @rmbzmb You don’t need to use a JSON file on GitHub; you just need to copy the contents of the JSON file into the GDRIVE_CREDENTIALS_DATA secret.

Yes, I am using the JSON file only on my local computer and on a remote workstation. On GitHub I copy-pasted the content of the JSON file into a GitHub secret named GDRIVE_CREDENTIALS_DATA. But it is still not working.

    ...
    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
    ...

It seems dvc on GitHub is ignoring GDRIVE_CREDENTIALS_DATA, even though it is set.

ERROR: failed to pull data from the cloud - To use service account, set gdrive_service_account_json_file_path, and optionally gdrive_service_account_user_email in DVC config
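To double-check that the secret actually reaches the job, a throwaway debug step like this can be used (illustrative only; it prints whether the variable is non-empty, never its value):

      - name: check-env
        env:
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          # print only whether the secret is set, not its contents
          test -n "$GDRIVE_CREDENTIALS_DATA" && echo "GDRIVE_CREDENTIALS_DATA is set" || echo "GDRIVE_CREDENTIALS_DATA is empty"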

It seems there is a bug in dvc currently: gdrive: raises unexpected error - name: drive version: v2 (again) · Issue #7949 · iterative/dvc · GitHub