This repository contains examples and utilities for distributed deep learning training on Databricks using several PyTorch-based frameworks. The examples focus on image classification with popular datasets and model architectures. This is NOT an official Databricks repository.
Training deep learning models at scale requires distributed training. This repository demonstrates how to use different distributed training frameworks on Databricks to accelerate training for computer vision tasks.
Tested on DBR:
- 15.4.x-gpu-ml-scala2.12

Cluster spec used:
- g5.24xlarge (4x NVIDIA A10G GPUs, 384 GB memory)
We can split training across multiple GPUs, and even across multiple GPU nodes, to speed it up.
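To make the idea concrete, here is a minimal sketch (not code from this repo) of the native PyTorch pattern: each process initializes a process group and wraps its model replica in `DistributedDataParallel`:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model: torch.nn.Module) -> DDP:
    # The launcher (torchrun, or TorchDistributor on Databricks) sets RANK,
    # WORLD_SIZE, LOCAL_RANK, and MASTER_ADDR/PORT in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Each process owns one GPU; DDP all-reduces gradients across replicas.
    return DDP(model.to(local_rank), device_ids=[local_rank])
```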

Datasets:
- CIFAR-10/100
- TinyImageNet
- ImageNet-1K

Model architectures:
- ResNet18
- ResNet50
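To make the pairing concrete, here is a minimal single-process sketch (not code from this repo) that loads CIFAR-10 and builds a ResNet18 with `torchvision`; the data path and batch size are arbitrary assumptions:

```python
import torch
import torchvision
from torchvision import transforms

# Commonly used CIFAR-10 channel means/stds for normalization.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(
    root="/tmp/data", train=True, download=True, transform=transform
)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

# ResNet18 with a 10-way head for CIFAR-10's 10 classes.
model = torchvision.models.resnet18(num_classes=10)
```

Each of the distributed frameworks listed below provides its own way to launch this kind of training loop across GPUs.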
This repository includes examples for the following distributed training frameworks:
- PyTorch Distributor - Native PyTorch distributed training with Databricks' `TorchDistributor` (see the sketch after this list)
- DeepSpeed - Microsoft's deep learning optimization library
- Composer - MosaicML's training library for efficient deep learning
- Accelerate - Hugging Face's library for easy distributed training
- Ray - Distributed computing framework with PyTorch integration
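As an example of the first entry, `TorchDistributor` ships with Databricks Runtime ML in `pyspark.ml.torch.distributor` and launches a standard PyTorch training function across GPUs; a minimal sketch with the training body elided:

```python
from pyspark.ml.torch.distributor import TorchDistributor

def train_loop():
    import torch.distributed as dist

    # TorchDistributor sets up the distributed environment variables and
    # runs one copy of this function per process/GPU.
    dist.init_process_group(backend="nccl")
    # ... build the model, wrap it in DDP, run the training epochs ...
    dist.destroy_process_group()

# num_processes=4 matches the 4 A10G GPUs on the g5.24xlarge above;
# local_mode=True keeps all processes on the driver node.
TorchDistributor(num_processes=4, local_mode=True, use_gpu=True).run(train_loop)
```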
Repository structure:

```
.
├── 01_torch_distributor/  # Examples using PyTorch's native distributed capabilities with TorchDistributor
├── 02_deepspeed/          # Examples using Microsoft DeepSpeed
├── 03_composer/           # Examples using MosaicML Composer
├── 04_accelerate/         # Examples using Hugging Face Accelerate
├── 05_ray/                # Examples using Ray for distributed training
├── setup/                 # Setup scripts and utilities for Databricks clusters
├── utils/                 # Common utility functions for data loading, metrics, etc.
├── .gitignore             # Git ignore file
├── LICENSE                # License information
└── requirements.txt       # Python package dependencies
```
Prerequisites:
- Databricks Runtime for Machine Learning (DBR ML) 15.4 or later
- A GPU-enabled Databricks cluster
To get started:
- Clone this repository into your Databricks workspace:

```bash
git clone https://github.com/username/distributed-pytorch-databricks.git
```

- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Configure your Databricks cluster using the setup scripts provided in the `setup/` directory.
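In a Databricks notebook, the dependencies can also be installed with the `%pip` magic, which is standard notebook functionality; the relative path assumes the notebook runs from the cloned repo folder:

```
%pip install -r requirements.txt
```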
Each framework directory contains notebooks and scripts that demonstrate:
- Data loading and preprocessing for the supported datasets
- Model definition and configuration
- Distributed training setup and execution
- Evaluation and metrics tracking
- Integration with MLflow on Databricks for experiment tracking (illustrated below)
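A common pattern for the MLflow integration (an illustrative sketch, not this repo's exact code) is to log from global rank 0 only, so each distributed job produces a single MLflow run:

```python
import os

import mlflow

def log_run(val_accuracies):
    # Only global rank 0 logs; the other workers skip MLflow entirely,
    # so one distributed job maps to one MLflow run in the workspace.
    if int(os.environ.get("RANK", "0")) == 0:
        with mlflow.start_run():
            mlflow.log_param("model", "resnet50")
            mlflow.log_param("dataset", "imagenet-1k")
            for epoch, acc in enumerate(val_accuracies):
                mlflow.log_metric("val_accuracy", acc, step=epoch)
```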
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the terms specified in the LICENSE file.
Acknowledgments:
- PyTorch team
- Microsoft DeepSpeed
- MosaicML Composer
- Hugging Face Accelerate
- Ray Project
- Databricks for their ML runtime and infrastructure