
Amazon SageMaker


Many DVC features rely on Git. To work with DVC in Amazon SageMaker, first set up your Git repo:

  1. Clone a repository:

    git clone ...

  2. Launch a terminal or notebook and configure the Git user name and email:

    git config --global user.name ...
    git config --global user.email ...
  3. Don't forget to install DVC and any other requirements in your environment!

    pip install dvc dvclive


After completing the setup, you can work with DVC in SageMaker notebooks like you would in any other environment. Take a look at DVC experiments for how to get started with DVC in notebooks (if you have set up code-server on SageMaker, you can also install the DVC extension for VS Code).
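
For instance, a minimal experiment loop with DVCLive in a notebook cell might look like this (the parameter and metric names and values are illustrative, and the actual training step is left to you):

    from dvclive import Live

    # Log one experiment; DVCLive writes params, metrics, and step data
    # that DVC and DVC Studio can pick up.
    with Live() as live:
        live.log_param("epochs", 3)  # hypothetical hyperparameter

        for epoch in range(3):
            # ... train your model for one epoch here ...
            live.log_metric("accuracy", 0.9)  # placeholder value
            live.next_step()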

If you would like to see live experiment updates in DVC Studio, get your token and save it in your DVC config or in the DVC_STUDIO_TOKEN environment variable. For example, to set it globally for all of a user's projects:

$ dvc config --global studio.token ***
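
Alternatively, export it as an environment variable in your terminal or notebook session:

$ export DVC_STUDIO_TOKEN=***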

While the experiment runs, you will see live updates like this in DVC Studio:



You can run SageMaker jobs in DVC pipelines or convert existing SageMaker pipelines into DVC pipelines. This combines the benefits of SageMaker jobs, like running each stage on its own EC2 instance and enabling other data input modes, with the benefits of DVC pipelines, like skipping unchanged stages and tracking the inputs and outputs of each run. SageMaker expects all inputs and outputs to be stored in S3, so the easiest way to integrate with DVC is to use S3 storage, and utilize external dependencies and outputs.
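
Concretely, a stage can point at S3 URIs directly in its deps and outs. A minimal sketch (the bucket and paths below are hypothetical):

    stages:
      train:
        cmd: python train.py
        deps:
          - s3://my-bucket/data/input
        outs:
          - s3://my-bucket/data/model:
              cache: false

Marking an output with cache: false tells DVC to track it in place on S3 rather than copying it into the DVC cache.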

Example: XGBoost pipeline

For an example, see https://github.com/iterative/sagemaker-pipeline, which adapts an existing SageMaker tutorial from a notebook into a DVC pipeline. The first stage (prepare) downloads the data and tracks the output so that it doesn't have to be re-downloaded on each run. The destination bucket and prefix are parametrized in a separate params.yaml file so they can be modified easily. The DVC pipeline stage is defined in dvc.yaml like this:

    prepare:
      cmd:
        - wget
          https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
          -O bank-additional.zip
        - python sm_prepare.py --bucket ${bucket} --prefix ${prefix}
      deps:
        - sm_prepare.py
      outs:
        - s3://${bucket}/${prefix}/input_data:
            cache: false
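
The referenced parameters live in params.yaml; a sketch with placeholder values (substitute your own bucket and prefix):

    bucket: my-sagemaker-bucket
    prefix: xgboost-example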

The preprocessing script takes bucket and prefix as arguments and otherwise is copied directly from the original notebook code, which uses a SageMaker Processing job. The DVC pipeline stage tracks the command, scripts, input paths, and output paths, so that this stage will only be run again if any of those change:

    preprocessing:
      cmd: python sm_preprocessing.py --bucket ${bucket} --prefix ${prefix}
      deps:
        - sm_preprocessing.py
        - preprocessing.py
        - s3://${bucket}/${prefix}/input_data
      outs:
        - s3://${bucket}/${prefix}/train:
            cache: false
        - s3://${bucket}/${prefix}/validation:
            cache: false
        - s3://${bucket}/${prefix}/test:
            cache: false
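
Once stages are defined, running the pipeline is a single command; DVC compares the tracked inputs against the last run and skips any stage that hasn't changed:

$ dvc repro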

Finally, the training script uses the SageMaker Estimator for XGBoost to train a model. We add all the model hyperparameters as arguments to make them easy to tune and track. Hyperparameters are added under the train key in params.yaml. The DVC pipeline stage cmd includes ${train} to unpack all those arguments, pass them to the script, and track them as parameters, in addition to tracking the other inputs and outputs:

    training:
      cmd: python sm_training.py --bucket ${bucket} --prefix ${prefix} ${train}
      deps:
        - sm_training.py
        - s3://${bucket}/${prefix}/train
        - s3://${bucket}/${prefix}/validation
      outs:
        - s3://${bucket}/${prefix}/output:
            cache: false
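
With DVC's dict unpacking, ${train} expands each key under train into a --key value argument. A sketch of the train section in params.yaml (the hyperparameter names and values below are illustrative, not the repository's exact settings):

    train:
      max_depth: 5
      eta: 0.2
      num_round: 100

With these values, the cmd above would run python sm_training.py with --max_depth 5 --eta 0.2 --num_round 100 appended after the bucket and prefix arguments.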

The end result of running the pipeline looks like this:


