Train using SageMaker but track output using DVC

I wonder what the best way is to use SageMaker with DVC, in particular for running a train step that is part of a DVC pipeline. The problem I am running into is that SageMaker creates a training job and writes its output somewhere different from the machine that invokes the job.

Prior to using SageMaker, our flow was to SSH into the relevant EC2 instance and run dvc repro. Now we have a command that invokes the same train job from our local machine, which essentially kicks off training on an EC2 instance of our choice and stores the output in an S3 bucket.
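For context, the invocation looks roughly like this via the SageMaker Python SDK (a simplified sketch; the image URI, role ARN and bucket names are placeholders, not our real values):

```python
# Hypothetical sketch of our local "launch training" command
# (image URI, role ARN and bucket are placeholders, not our real values).
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/our-train-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    # SageMaker writes the model artifacts here, not on the machine that launched the job
    output_path="s3://our-bucket/training-output/",
)

# Starts the training job on a SageMaker-managed EC2 instance;
# the local machine only triggers it and streams the logs.
estimator.fit({"train": "s3://our-bucket/data/train/"})
```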

Is there a way to run the training job, sync the output locally and inform DVC that this is the output of running a particular step in the pipeline? If that is complicated, is there a better way of working with SageMaker and DVC?


Hello @nsorros!
I don’t have any experience with Sagemaker, so please feel free to correct me if my assumptions are wrong.
So, the command that invokes your run is some kind of aws cli command?
After you trigger the job, I presume you are not waiting for the result, but the job starts on SageMaker and will finish “at some point in the future”? The terminal session you used to trigger it is not blocked until the run is finished?

Yes to all. The terminal does hang streaming the logs, but there is no harm in closing it and reading the logs later with the AWS CLI. Also, I am not invoking the job through the AWS CLI but through the Python SDK; the end result is the same though. Looking forward to hearing your thoughts.
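For example, getting back to the logs after closing the terminal looks something like this with the SDK (the job name is a placeholder), or the equivalent AWS CLI calls:

```python
# Hypothetical sketch: re-attach to a running/finished training job and read its logs.
from sagemaker.estimator import Estimator

attached = Estimator.attach("our-training-job-2021-01-01-00-00-00-000")  # placeholder name
attached.logs()  # streams the job's CloudWatch logs again
```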

The fact that the training can be detached can be problematic. DVC is built with the assumption that after dvc run or dvc repro there will be an output as defined in dvc.yaml. The fact that SageMaker does not do that can potentially limit DVC's capabilities.

One more question: what does experiment execution look like from SageMaker's point of view? Does it clone your project repository and run dvc repro there, or does it just call the training code? What does that look like?

It does not clone the project; it packages the project in a docker container, either one that AWS has created or one that we have created (we chose the latter). Our docker container contains all dependencies, and we package the code separately in a tar and send it separately; essentially SageMaker has a bit of code that copies the code into the container before it executes. We do that so we can experiment quickly, i.e. change the code and quickly launch an experiment instead of building and pushing another big container.
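In SageMaker Python SDK terms it looks roughly like the entry_point/source_dir mechanism, something like this (the framework estimator and all names here are only illustrative, not our exact setup):

```python
# Rough sketch of the "custom image + separately packaged code" setup; the framework
# estimator, image URI, role and paths are illustrative placeholders only.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",   # the script that gets run inside the container
    source_dir="src/",        # tarred up, uploaded, and unpacked into the container at start-up
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/our-train-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    hyperparameters={"epochs": 10},  # forwarded to train.py as command-line arguments
)
```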

We actually do not run dvc repro inside the container; we run train.py with some args, similar to the command that we have instructed dvc repro to run at that stage. We could trigger dvc repro TRAIN_STAGE instead, but would that work? What would need to be present for DVC to work properly? Because at the moment the container does not have any DVC-related files.

Generally I am keen to hear ideas on how we can work around this, or to learn what DVC requires/assumes, so that suggestions can be made. From what I understand, there is no way of telling DVC that an output was created by running a dvc repro command somewhere else.


@nsorros
Sorry for so many questions, but I am trying to understand how SageMaker works:

How does the model file get created? Is it created inside the docker container and then SageMaker takes care of moving it to S3?

How do you run training? Do you define some script for SageMaker with the steps that need to be taken to create the model?

I am not an expert on SageMaker, but I am trying to use it with DVC.
Has anyone found a solution to use it with the dvc command?

There are two input modes: File (a local copy of the S3 bucket inside the docker container) or Pipe (streaming).
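Roughly, with the Python SDK (the bucket path is a placeholder):

```python
# Illustration of the two input modes (the bucket path is a placeholder).
from sagemaker.inputs import TrainingInput

# File mode: the channel is downloaded to the container's local disk
# under /opt/ml/input/data/<channel_name> before training starts.
file_input = TrainingInput("s3://our-bucket/data/train/", input_mode="File")

# Pipe mode: the data is streamed to the container instead of being copied first.
pipe_input = TrainingInput("s3://our-bucket/data/train/", input_mode="Pipe")
```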

Hi all.
What are the benefits of using SageMaker?

  • spot instances
  • sync intermediate data stored in /opt/ml/checkpoints, /opt/ml/models… to S3 to be able to resume

Syncing intermediate data is configurable through the SageMaker estimators (plus hardware settings, etc.), so you could change the tracked folders to match your project structure, including the .dvc and dvc.yaml outputs (the files tracked by DVC).
Set up this way, your data will survive across spot instances, provided you have implemented a resume mechanism.
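Roughly, with the SageMaker Python SDK estimator settings (the image URI, role, bucket and instance type below are placeholders):

```python
# Rough sketch of the spot + checkpoint-sync settings
# (image URI, role, bucket and instance type are placeholders).
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/our-train-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    use_spot_instances=True,       # run on spot capacity
    max_run=4 * 3600,              # max training time in seconds
    max_wait=8 * 3600,             # training time plus time spent waiting for spot capacity
    # SageMaker keeps this local folder continuously synced to S3, so it survives interruptions.
    # You could point it at your project/DVC outputs instead of the default checkpoints folder.
    checkpoint_s3_uri="s3://our-bucket/checkpoints/",
    checkpoint_local_path="/opt/ml/checkpoints",
)
```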

Optionally, you could use the Task resource of our terraform-provider-iterative.
We designed a new resource in our Terraform provider that essentially mimics SageMaker's behaviour (or @nsorros's own implementation), without any vendor lock-in.
It's a task (a bash script) that will try to run from beginning to end, surviving spot instances and syncing all the intermediate data to a bucket.
It works the same in Azure, GCP or AWS.

We are looking for beta testers, and I will be personally involved with them in setting it up successfully. Please ping me if interested!