This reference guide compiles best practices, prescriptive guidance, and code samples for running large-scale machine learning training workloads with TPU v4 and TPU v5e on Google Kubernetes Engine (GKE).
The guide covers two main topics:
- Configuring a GKE-based environment for large-scale training on Cloud TPUs: how to set up and optimize a GKE cluster for running large-scale machine learning training workloads on Cloud TPUs.
- Defining, submitting, and monitoring training jobs
We also include a Terraform configuration for provisioning the training environment and code samples for a variety of training workloads.
The diagram below depicts a high-level architecture of the training environment.
The foundation of the environment is a regional, VPC-native GKE cluster. The cluster has two types of node pools:
- A single node pool with CPU-only nodes
- Several multi-host TPU node pools
This cluster topology supports running both single-slice and multislice TPU training jobs.
The environment is supported by the following components:
- Cloud Storage buckets for saving training datasets and artifacts produced by training jobs (such as logs and checkpoints)
- Artifact Registry for packaging and managing the training, data processing, and other components of a training workload as Docker container images.
- Vertex AI TensorBoard for tracking and visualizing training metrics.
- Cloud Monitoring for collecting and analyzing non-functional performance metrics
- Cloud Logging for managing logs produced by training workloads.
- Training workloads impersonate Identity and Access Management (IAM) service accounts to access Google Cloud services, such as Cloud Storage and Vertex AI TensorBoard.
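To make that last point concrete, the sketch below shows the Workload Identity binding it relies on: a Kubernetes service account annotated to impersonate an IAM service account. The names are illustrative (cloud-tpu-sa mirrors the default tpu_sa_name used later in this guide), not the exact resources created by the setup.

```yaml
# A sketch of the Workload Identity binding: a Kubernetes service account
# annotated to impersonate an IAM service account. Names are illustrative.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: training-runner                # hypothetical Kubernetes service account
  namespace: tpu-training
  annotations:
    # Pods running under this Kubernetes service account call Google Cloud
    # services as the annotated IAM service account.
    iam.gke.io/gcp-service-account: cloud-tpu-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com
```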
The following diagram illustrates the process of submitting and processing training workloads in the training environment.
In this guide we advocate using the Kubernetes JobSet API as the preferred method of coordinating large-scale distributed machine learning training workloads on Kubernetes. When combined with the Kubernetes Kueue job queuing API, it provides flexible and comprehensive training job orchestration.
The training environment's Kueue configuration consists of a single ClusterQueue and multiple LocalQueues. This topology provides basic multi-tenancy and supports managing and prioritizing jobs submitted by multiple teams.
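The sketch below shows what a Kueue setup of this shape can look like: a ResourceFlavor for the TPU slices, a single ClusterQueue, and a namespaced LocalQueue. The flavor name, label values, and quota are illustrative assumptions, not the exact resources created by the provisioning job.

```yaml
# A sketch of a single-ClusterQueue, single-LocalQueue Kueue configuration.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: tpu-v4-16
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x2x2
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}                # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: tpu-v4-16
      resources:
      - name: "google.com/tpu"
        nominalQuota: 8                # e.g. total chips in one v4-16 slice
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: tpu-training-queue             # one LocalQueue per team namespace
  namespace: tpu-training
spec:
  clusterQueue: cluster-queue
```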
All training workloads are represented as JobSet resources. A JobSet resource may contain multiple job types, such as a core distributed training job and an auxiliary job that manages TensorBoard logs and other artifacts generated by the training job.
JobSet workloads are submitted to a namespaced LocalQueue that points to a ClusterQueue. As illustrated in the diagram, in our reference implementation, there is a single cluster queue.
Kueue monitors when resources (such as TPU slices) required by a workload (JobSet) are available, and then decides when to admit the workload and how to allocate the workload's components to the cluster's node pools.
For example, a training workload can contain two types of jobs:
- A multislice distributed training job
- A job that uploads TensorBoard logs generated by the training job to Vertex AI TensorBoard
When all the resources required by this workload become available, the training job's workers are started on the requested number of TPU slices. The TensorBoard uploader is started on one of the nodes in the CPU node pool.
If the compute resources required by other submitted workloads are not available, these workloads are queued and scheduled for admission based on the priorities that have been defined in the Kueue configuration.
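The sketch below shows how such a two-job workload can be expressed as a JobSet. The image names, replica counts, and chip counts are placeholders for a hypothetical two-slice v4-16 run, not values taken from this repository's manifests.

```yaml
# A sketch of the two-job workload described above as a JobSet.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-workload
  labels:
    kueue.x-k8s.io/queue-name: tpu-training-queue   # submit through the LocalQueue
spec:
  replicatedJobs:
  # Core multislice training job: one replicated Job per TPU slice.
  - name: slice
    replicas: 2                        # number of TPU slices in the multislice job
    template:
      spec:
        parallelism: 2                 # TPU VMs (hosts) per v4-16 slice
        completions: 2
        completionMode: Indexed        # multi-host TPU workers need stable indices
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
              cloud.google.com/gke-tpu-topology: 2x2x2
            containers:
            - name: train
              image: us-docker.pkg.dev/PROJECT/REPO/train:latest   # hypothetical image
              resources:
                limits:
                  google.com/tpu: 4    # TPU chips per VM
  # Auxiliary job that uploads TensorBoard logs; lands on the CPU node pool.
  - name: tb-uploader
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: uploader
              image: us-docker.pkg.dev/PROJECT/REPO/tb-uploader:latest   # hypothetical image
```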
To submit a JobSet-defined workload, you need to create a YAML JobSet resource definition. There are a few different ways to do this. In this guide, we demonstrate two approaches:
- Using Kustomize, which helps you create YAML JobSet resource definitions directly (see the sketch after this list).
- Using xpk, which provides an easy-to-use Python-based CLI.
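For the Kustomize approach, the sketch below shows the general shape of a kustomization.yaml that layers run-specific settings on top of a base JobSet manifest. The file names and literals are hypothetical, not files from this repository.

```yaml
# A hypothetical kustomization.yaml layering run-specific settings on top of
# a base JobSet manifest; file names and literals are placeholders.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: tpu-training
resources:
- jobset.yaml                 # base JobSet resource definition
configMapGenerator:
- name: training-config       # consumed by the training container as env vars or a mounted file
  literals:
  - BATCH_SIZE=256
  - CHECKPOINT_DIR=gs://YOUR_BUCKET/checkpoints
```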
The provisioning of the environment described in the previous section has been automated with Terraform and Cloud Build. The following are the tasks performed by the Terraform configuration:
- Creates a network and a subnet for a VPC-native GKE cluster.
- Creates a VPC-native cluster.
- Creates a node pool with nodes equipped with CPUs only.
- Creates a specified number of multi-host TPU node pools.
- Creates an IAM service account for Workload Identity and an IAM service account to be used as a custom node pool service account.
- Assigns a specified set of roles to these service accounts.
- Configures the cluster for Workload Identity.
- Creates a Google Cloud Storage bucket.
- Grants the service accounts the roles/storage.legacyBucketReader role at the bucket level.
Warning
A few things to note:
- You need to be a project owner to set up the environment.
- Your project must have sufficient quota to provision TPU resources. If it does not, you can request a quota increase.
You can use Cloud Shell to start and monitor the setup process. Click on the link below to navigate to Cloud Shell and clone the repo.
To set up the environment, complete the following steps:
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Clone the GitHub repo (skip this step if you launched through the Cloud Shell link):
git clone https://github.com/jarokaz/tpu-gke-sandbox.git
As mentioned earlier, environment provisioning is done by a Cloud Build job that runs the Terraform configuration and the remaining setup steps. The Terraform configuration is located in the env_setup/terraform folder and exposes a number of configurable inputs, which are set in the included env_setup/vars.env file. Cloud Build passes the values from this file to the Terraform variables defined in env_setup/terraform/variables.tf. The configuration uses Cloud Storage as the backend for maintaining Terraform state.
Important
To proceed, set the environment variables below in env_setup/vars.env to reflect your environment. By default, you only need to provide PROJECT_ID: replace YOUR_PROJECT_ID with the project ID of your Google Cloud project.
export PROJECT_ID=YOUR_PROJECT_ID
export RESOURCE_PREFIX=${PROJECT_ID}
export REGION=us-central2
export ZONE=us-central2-b
export NETWORK_NAME=${RESOURCE_PREFIX}-network
export SUBNET_NAME=${RESOURCE_PREFIX}-subnet
export CLUSTER_NAME=gke-tpu-training-cluster
export NAMESPACE=tpu-training
export TPU_TYPE=v4-16
export NUM_TPU_POOLS=1
export NUM_OF_CHIPS=8
export TENSORBOARD_REGION=us-central1
export ARTIFACT_REGISTRY_NAME=gke-tpu-training
export ARTIFACT_REPOSITORY_BUCKET_NAME=${RESOURCE_PREFIX}-aiml-repository
export JOBSET_API_VERSION="v0.2.3"
export KUEUE_API_VERSION="v0.4.2"
export TF_STATE_BUCKET=${PROJECT_ID}-tf-state
export TF_STATE_PREFIX=gke-tpu-training-environment
- PROJECT_ID - your project ID
- REGION - the region for the GKE cluster network (default: us-central2); choose it based on available Cloud TPU regions and zones
- ZONE - the zone for your GKE cluster (default: us-central2-b)
- NETWORK_NAME - the name for the network
- SUBNET_NAME - the name for the subnet
- CLUSTER_NAME - the name of your GKE cluster (default: gke-tpu-training-cluster)
- NAMESPACE - the Kubernetes namespace for TPU workloads (default: tpu-training)
- TPU_TYPE - the TPU slice type for the TPU node pools (default: v4-16)
- NUM_TPU_POOLS - the number of TPU slices (multi-host TPU node pools) to create (default: 1)
- NUM_OF_CHIPS - the total number of TPU chips, based on the selected TPU type and number of TPU pools
- TENSORBOARD_REGION - the region for a Vertex AI TensorBoard instance (default: us-central1)
- ARTIFACT_REPOSITORY_BUCKET_NAME - the name of the Cloud Storage bucket for model artifacts
- ARTIFACT_REGISTRY_NAME - the name of the Artifact Registry repository for managing Docker images
- JOBSET_API_VERSION - the version of the JobSet API to download and install
- KUEUE_API_VERSION - the version of the Kueue API to download and install
- TF_STATE_BUCKET - the name of the Cloud Storage bucket where Terraform maintains its state
- TF_STATE_PREFIX - the object prefix under which Terraform maintains its state in the bucket
Provisioning is performed by a Cloud Build job that runs Terraform to provision the resources, installs the JobSet and Kueue APIs, configures Kueue resources, and finalizes the setup. To start provisioning, execute the following commands:
export PROJECT_ID=YOUR_PROJECT_ID
./env_setup/build.sh
Navigate to the Cloud Build logs using the link displayed in Cloud Shell, or go to the Cloud Build page in the Google Cloud console. You should see a page similar to the following when the environment provisioning job completes successfully:
Note that env_setup/vars.env sets only a subset of the variables exposed by the Terraform configuration; the defaults are used for the others. If you want to change the default value of another variable, update the env_setup/cloudbuild.provision.yaml file and the gcloud builds submit command in the env_setup/build.sh file, as illustrated below.
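For orientation only, the fragment below shows the general mechanism: a Cloud Build substitution forwarded to Terraform as an input variable. It is a generic sketch, not the contents of env_setup/cloudbuild.provision.yaml.

```yaml
# A generic sketch of the mechanism (not the repo's actual build config):
# a Cloud Build substitution forwarded to Terraform as an input variable.
steps:
- name: hashicorp/terraform
  dir: env_setup/terraform
  args: ['apply', '-var=cpu_pool_node_count=${_CPU_POOL_NODE_COUNT}', '-auto-approve']
substitutions:
  _CPU_POOL_NODE_COUNT: '3'
```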
The Terraform configuration supports the following input variables:
Variable | Description | Default |
---|---|---|
region | The compute region for the Cluster | NA |
tensorboard_region | The compute region for Vertex AI TensorBoard | NA |
artifact_repository_bucket_name | The name of the GCS bucket | NA |
zone | The zone for the TPU node pools. Make sure that the zone supports the required TPU resources | NA |
network_name | The name of the network for the cluster | NA |
subnet_name | The name of the subnet for the cluster | NA |
subnet_ip_range | The IP address range for the subnet | 10.129.0.0/20 |
pods_ip_range | A secondary IP range for pods | 192.168.64.0/20 |
services_ip_range | A secondary IP range for services | 192.168.80.0/20 |
cluster_name | The name of the cluster. | NA |
gke_version | The version of GKE to deploy | 1.27.3-gke.100 |
cluster_description | The cluster's description | GKE cluster for running TPU training workloads |
cpu_pool_node_count | The number of nodes in a CPU node pool | 3 |
cpu_pool_machine_type | The machine type for the CPU node pool | n1-standard-4 |
cpu_pool_disk_type | The disk type for nodes in the CPU node pool | pd-standard |
cpu_pool_disk_size | The disk size for nodes in the CPU node pool | 200GB |
tpu_sa_name | The name of the service account that will be provisioned and used for Workload Identity | cloud-tpu-sa |
tpu_sa_roles | The roles to assign to the service account | roles/storage.objectAdmin, roles/logging.logWriter |
gke_sa_name | The name of the custom service account for node pools | gke-sa |
gke_sa_roles | The roles to assign to the custom service account for node pools | roles/storage.objectAdmin, roles/logging.logWriter |
tpu_namespace | The K8s namespace for TPU workloads | tpu-training |
tpu_type | The TPU slice type for TPU node pools. See below for more info | v4-16 |
num_tpu_pools | The number of multi-host TPU node pools to provision | 1 |
enable_tpu_autoscaling | Whether to enable autoscaling of TPU node pools | false |
tpu_node_pool_name_prefix | A prefix that will be used to name TPU node pools. An index starting with 0 will be appended to the prefix to form a TPU node pool name | tpu-node-pool |
multislice_group_name | A name that will be used to label TPU node pools to support multislice jobs | multi-slice-group |
The tpu_type variable is the name of a TPU slice configuration, as defined in the following table. A short sketch after the table shows how these values map to Pod scheduling.
TPU type name | Slice type | Slice topology | TPU VM type | Number of VMs in a slice | Number of chips in a VM |
---|---|---|---|---|---|
v5litepod-16 | tpu-v5-lite-podslice | 4x4 | ct5lp-hightpu-4t | 4 | 4 |
v5litepod-32 | tpu-v5-lite-podslice | 4x8 | ct5lp-hightpu-4t | 8 | 4 |
v5litepod-64 | tpu-v5-lite-podslice | 8x8 | ct5lp-hightpu-4t | 16 | 4 |
v5litepod-128 | tpu-v5-lite-podslice | 8x16 | ct5lp-hightpu-4t | 32 | 4 |
v5litepod-256 | tpu-v5-lite-podslice | 16x16 | ct5lp-hightpu-4t | 64 | 4 |
v4-8 | tpu-v4-podslice | 2x2x1 | ct4p-hightpu-4t | 1 | 4 |
v4-16 | tpu-v4-podslice | 2x2x2 | ct4p-hightpu-4t | 2 | 4 |
v4-32 | tpu-v4-podslice | 2x2x4 | ct4p-hightpu-4t | 4 | 4 |
v4-64 | tpu-v4-podslice | 2x4x4 | ct4p-hightpu-4t | 8 | 4 |
v4-128 | tpu-v4-podslice | 4x4x4 | ct4p-hightpu-4t | 16 | 4 |
v4-256 | tpu-v4-podslice | 4x4x8 | ct4p-hightpu-4t | 32 | 4 |
v4-512 | tpu-v4-podslice | 4x8x8 | ct4p-hightpu-4t | 64 | 4 |
v4-1024 | tpu-v4-podslice | 8x8x8 | ct4p-hightpu-4t | 128 | 4 |
v4-1536 | tpu-v4-podslice | 8x8x12 | ct4p-hightpu-4t | 192 | 4 |
v4-2048 | tpu-v4-podslice | 8x8x16 | ct4p-hightpu-4t | 256 | 4 |
v4-4096 | tpu-v4-podslice | 8x16x16 | ct4p-hightpu-4t | 512 | 4 |
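To show how a row of this table surfaces in workload manifests, the sketch below uses the v5litepod-16 row: the Slice type and Slice topology columns become node selector values, and the chips-per-VM column becomes the google.com/tpu resource request, while the VMs-per-slice column corresponds to the parallelism and completions of the Job that runs on the slice. The Pod spec itself is illustrative, not copied from the repo.

```yaml
# Illustrative mapping of the v5litepod-16 row to Pod scheduling fields.
# The enclosing Job would use parallelism: 4 and completions: 4, matching
# the "Number of VMs in a slice" column.
apiVersion: v1
kind: Pod
metadata:
  name: tpu-v5e-worker                                          # hypothetical
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice  # "Slice type" column
    cloud.google.com/gke-tpu-topology: 4x4                      # "Slice topology" column
  containers:
  - name: train
    image: us-docker.pkg.dev/PROJECT/REPO/train:latest          # hypothetical image
    resources:
      limits:
        google.com/tpu: 4                                       # "Number of chips in a VM" column
```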
The examples folder contains code samples that demonstrate how to configure, submit, and manage a number of different training workloads.
- Provision infrastructure
- Set up prerequisites for running the examples
- Examples (examples/) demonstrates two approaches to orchestrating large-scale distributed training workloads on GKE, using JobSet and xpk:
  - JobSet (examples/jobset) shows how to configure and run training workloads with the JobSet and Kueue APIs using Kustomize.
    - TPU Hello World (examples/jobset/tpu_hello_world) shows examples of experimenting with different data and model parallelism strategies in single-slice and multislice TPU configurations.
    - MaxText (examples/jobset/maxtext) shows examples of pre-training a MaxText 6.5B parameter model in both single-slice and multislice TPU configurations.
  - xpk (examples/xpk) (Accelerated Processing Kit) shows how to configure and run the same training workloads using xpk instead of the JobSet APIs directly.
Important
Refer to the README in the examples folder for detailed instructions.
To destroy the environment and clean up all the provisioned resources, run a Cloud Build job that uses Terraform to tear them down. The job reads the same environment variables in env_setup/vars.env that were used to provision the environment. To start the cleanup, execute the following command:
./env_setup/destroy.sh
You should see a page similar to the following when the environment cleanup job completes successfully: