Commit fb8d14e

Implement per region instance type config for canary and e2e tests (#87)
Description of changes:

Currently, we use one fixed instance type across all AWS regions in our endpoint and training job tests. However, some regions either do not support that instance type or require a limit increase to use it. Specifically, canary tests in the eu-west-3 and eu-north-1 regions are failing for this reason.

This pull request updates the testing resource config file `replacement_values.py` to pass in the correct instance type for each region. Regions that did not experience this issue continue to use the previous instance type, supplied as the default in the new config, so canaries and e2e tests in those regions are not affected.

The changes have been tested in our eu-west-3 and eu-north-1 canary stacks, and the canaries now pass.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
1 parent 84b4de3 commit fb8d14e

7 files changed: 19 additions, 7 deletions
test/canary/Dockerfile.canary

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ RUN curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.18.6/b
 && cp ./kubectl /bin

 # Install eksctl
-RUN curl --silent --location "https://github.com/weaveworks/eksctl/releases/download/latest_release/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp && mv /tmp/eksctl /bin
+RUN curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp && mv /tmp/eksctl /bin

 # Install Helm
 RUN curl -q -L "https://get.helm.sh/helm-v3.2.4-linux-amd64.tar.gz" | tar zxf - -C /usr/local/bin/ \

test/e2e/replacement_values.py

Lines changed: 12 additions & 0 deletions
@@ -166,6 +166,16 @@
 "eu-south-1": "638885417683.dkr.ecr.eu-south-1.amazonaws.com",
 }

+ENDPOINT_INSTANCE_TYPES = {
+"eu-west-3": "ml.m5.large",
+"eu-north-1": "ml.m5.large",
+}
+
+TRAINING_JOB_INSTANCE_TYPES = {
+"eu-west-3": "ml.m5.xlarge",
+"eu-north-1": "ml.m5.xlarge",
+}
+
 REPLACEMENT_VALUES = {
 "SAGEMAKER_DATA_BUCKET": get_bootstrap_resources().DataBucketName,
 "XGBOOST_IMAGE_URI": f"{XGBOOST_IMAGE_URIS[get_region()]}/sagemaker-xgboost:1.0-1-cpu-py3",
@@ -175,4 +185,6 @@
 "SAGEMAKER_EXECUTION_ROLE_ARN": get_bootstrap_resources().ExecutionRoleARN,
 "MODEL_MONITOR_ANALYZER_IMAGE_URI": f"{MODEL_MONITOR_IMAGE_URIS[get_region()]}/sagemaker-model-monitor-analyzer",
 "CLARIFY_IMAGE_URI": f"{CLARIFY_IMAGE_URIS[get_region()]}/sagemaker-clarify-processing:1.0",
+"ENDPOINT_INSTANCE_TYPE": ENDPOINT_INSTANCE_TYPES.get(get_region(), 'ml.c5.large'),
+"TRAINING_JOB_INSTANCE_TYPE": TRAINING_JOB_INSTANCE_TYPES.get(get_region(), 'ml.m4.xlarge')
 }
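
The change to `replacement_values.py` amounts to a per-region override table with a fallback default: regions listed in the table get their own instance type, and every other region keeps the previous one. A minimal sketch of that pattern follows. The `instance_type_for` helper is a hypothetical name introduced only for illustration; the commit itself calls `dict.get` inline with `get_region()`.

```python
# Per-region overrides, copied from the diff in this commit.
ENDPOINT_INSTANCE_TYPES = {
    "eu-west-3": "ml.m5.large",
    "eu-north-1": "ml.m5.large",
}

TRAINING_JOB_INSTANCE_TYPES = {
    "eu-west-3": "ml.m5.xlarge",
    "eu-north-1": "ml.m5.xlarge",
}


def instance_type_for(region: str, overrides: dict, default: str) -> str:
    """Hypothetical helper: return the override for `region`, else `default`."""
    return overrides.get(region, default)


# A region with an override gets the region-specific type.
eu_endpoint = instance_type_for("eu-west-3", ENDPOINT_INSTANCE_TYPES, "ml.c5.large")

# A region without an override falls back to the previous fixed type.
us_endpoint = instance_type_for("us-west-2", ENDPOINT_INSTANCE_TYPES, "ml.c5.large")
us_training = instance_type_for("us-west-2", TRAINING_JOB_INSTANCE_TYPES, "ml.m4.xlarge")

print(eu_endpoint, us_endpoint, us_training)
```

Keeping the default at the call site rather than in the table means adding a new problem region is a one-line change and no existing region needs to be listed.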

test/e2e/resources/endpoint_config_data_capture_single_variant.yaml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ spec:
 productionVariants:
 - modelName: $MODEL_NAME
 variantName: AllTraffic
-instanceType: ml.c5.large
+instanceType: $ENDPOINT_INSTANCE_TYPE
 initialVariantWeight: 1
 initialInstanceCount: 1
 dataCaptureConfig:

test/e2e/resources/endpoint_config_multi_variant.yaml

Lines changed: 2 additions & 2 deletions
@@ -9,12 +9,12 @@ spec:
 modelName: $MODEL_NAME
 initialInstanceCount: 1
 # This is the smallest instance type which will support scaling
-instanceType: ml.c5.large
+instanceType: $ENDPOINT_INSTANCE_TYPE
 initialVariantWeight: 1
 - variantName: variant-2
 modelName: $MODEL_NAME
 initialInstanceCount: 1
-instanceType: ml.c5.large
+instanceType: $ENDPOINT_INSTANCE_TYPE
 initialVariantWeight: 1
 tags:
 - key: confidentiality

test/e2e/resources/endpoint_config_single_variant.yaml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ spec:
 # instanceCount is 2 to test retainAllVariantProperties
 initialInstanceCount: 2
 # This is the smallest instance type which will support scaling
-instanceType: ml.c5.large
+instanceType: $ENDPOINT_INSTANCE_TYPE
 initialVariantWeight: 1
 tags:
 - key: confidentiality

test/e2e/resources/xgboost_trainingjob.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ spec:
 s3OutputPath: s3://$SAGEMAKER_DATA_BUCKET/sagemaker/training/output
 resourceConfig:
 instanceCount: 1
-instanceType: ml.m4.xlarge
+instanceType: $TRAINING_JOB_INSTANCE_TYPE
 volumeSizeInGB: 5
 stoppingCondition:
 maxRuntimeInSeconds: 86400

test/e2e/resources/xgboost_trainingjob_debugger.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ spec:
 s3OutputPath: s3://$SAGEMAKER_DATA_BUCKET/sagemaker/training/debugger/output
 resourceConfig:
 instanceCount: 1
-instanceType: ml.m4.xlarge
+instanceType: $TRAINING_JOB_INSTANCE_TYPE
 volumeSizeInGB: 5
 stoppingCondition:
 maxRuntimeInSeconds: 86400
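
This commit does not show how the `$ENDPOINT_INSTANCE_TYPE` and `$TRAINING_JOB_INSTANCE_TYPE` placeholders in the resource files are filled from `REPLACEMENT_VALUES`. As an illustration only, the sketch below uses Python's standard `string.Template` to substitute a value into a YAML fragment. The test framework's actual loader may work differently; `render_resource` is a hypothetical name, and the indentation of the fragment is assumed.

```python
from string import Template

# Fragment modeled on xgboost_trainingjob.yaml; indentation is an assumption.
RESOURCE_TEMPLATE = """\
resourceConfig:
  instanceCount: 1
  instanceType: $TRAINING_JOB_INSTANCE_TYPE
  volumeSizeInGB: 5
"""


def render_resource(template_text: str, values: dict) -> str:
    """Hypothetical helper: replace $NAME placeholders with entries from `values`.

    safe_substitute leaves any placeholder without a matching key untouched
    instead of raising KeyError.
    """
    return Template(template_text).safe_substitute(values)


# Value a eu-west-3 run would receive under the new per-region config.
rendered = render_resource(
    RESOURCE_TEMPLATE,
    {"TRAINING_JOB_INSTANCE_TYPE": "ml.m5.xlarge"},
)
print(rendered)
```

Because the instance type is now a placeholder rather than a literal, the same resource file serves every region and only the replacement values differ.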
