This repository contains examples and utilities for distributed deep learning training on Databricks using several PyTorch-based frameworks. The examples focus on image classification with popular datasets and model architectures. This is NOT an official Databricks repository.
Training deep learning models at scale requires distributed training. This repository demonstrates how to use different distributed training frameworks on Databricks to accelerate training for computer vision tasks.
Tested on DBR:
- 15.4.x-gpu-ml-scala2.12

Cluster spec used:
- g5.24xlarge (4x NVIDIA A10G GPUs, 384 GB memory)
We can split training across multiple GPUs, and even across multiple GPU nodes, to speed it up.
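To make the idea concrete, here is a minimal sketch (not code from this repo) of the native PyTorch pattern: each process initializes a process group and wraps its model replica in `DistributedDataParallel`:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model: torch.nn.Module) -> DDP:
    # The launcher (torchrun, or TorchDistributor on Databricks) sets RANK,
    # WORLD_SIZE, LOCAL_RANK, and MASTER_ADDR/PORT in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Each process owns one GPU; DDP all-reduces gradients across replicas.
    return DDP(model.to(local_rank), device_ids=[local_rank])
```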

Datasets:
- CIFAR-10/100
- TinyImageNet
- ImageNet-1K

Model architectures:
- ResNet18
- ResNet50
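To make the pairing concrete, here is a minimal single-process sketch (not code from this repo) that loads CIFAR-10 and builds a ResNet18 with `torchvision`; the data path and batch size are arbitrary assumptions:

```python
import torch
import torchvision
from torchvision import transforms

# Commonly used CIFAR-10 channel means/stds for normalization.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(
    root="/tmp/data", train=True, download=True, transform=transform
)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

# ResNet18 with a 10-way head for CIFAR-10's 10 classes.
model = torchvision.models.resnet18(num_classes=10)
```

Each of the distributed frameworks listed below provides its own way to launch this kind of training loop across GPUs.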
This repository includes examples for the following distributed training frameworks:
- PyTorch Distributor - Native PyTorch distributed training with Databricks' `TorchDistributor` (see the sketch after this list)
- DeepSpeed - Microsoft's deep learning optimization library
- Composer - MosaicML's training library for efficient deep learning
- Accelerate - Hugging Face's library for easy distributed training
- Ray - Distributed computing framework with PyTorch integration
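As an example of the first entry, `TorchDistributor` ships with Databricks Runtime ML in `pyspark.ml.torch.distributor` and launches a standard PyTorch training function across GPUs; a minimal sketch with the training body elided:

```python
from pyspark.ml.torch.distributor import TorchDistributor

def train_loop():
    import torch.distributed as dist

    # TorchDistributor sets up the distributed environment variables and
    # runs one copy of this function per process/GPU.
    dist.init_process_group(backend="nccl")
    # ... build the model, wrap it in DDP, run the training epochs ...
    dist.destroy_process_group()

# num_processes=4 matches the 4 A10G GPUs on the g5.24xlarge above;
# local_mode=True keeps all processes on the driver node.
TorchDistributor(num_processes=4, local_mode=True, use_gpu=True).run(train_loop)
```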
Repository structure:

```
.
├── 01_torch_distributor/  # Examples using PyTorch's native distributed capabilities with TorchDistributor
├── 02_deepspeed/          # Examples using Microsoft DeepSpeed
├── 03_composer/           # Examples using MosaicML Composer
├── 04_accelerate/         # Examples using Hugging Face Accelerate
├── 05_ray/                # Examples using Ray for distributed training
├── setup/                 # Setup scripts and utilities for Databricks clusters
├── utils/                 # Common utility functions for data loading, metrics, etc.
├── .gitignore             # Git ignore file
├── LICENSE                # License information
└── requirements.txt       # Python package dependencies
```
Prerequisites:
- Databricks Runtime for Machine Learning (DBR ML) 15.4 or later
- A GPU-enabled Databricks cluster
To get started:
- Clone this repository into your Databricks workspace:

```bash
git clone https://github.com/username/distributed-pytorch-databricks.git
```

- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Configure your Databricks cluster using the setup scripts provided in the `setup/` directory.
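In a Databricks notebook, the dependencies can also be installed with the `%pip` magic, which is standard notebook functionality; the relative path assumes the notebook runs from the cloned repo folder:

```
%pip install -r requirements.txt
```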
Each framework directory contains notebooks and scripts that demonstrate:
- Data loading and preprocessing for the supported datasets
- Model definition and configuration
- Distributed training setup and execution
- Evaluation and metrics tracking
- Integration with MLflow on Databricks for experiment tracking (illustrated below)
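A common pattern for the MLflow integration (an illustrative sketch, not this repo's exact code) is to log from global rank 0 only, so each distributed job produces a single MLflow run:

```python
import os

import mlflow

def log_run(val_accuracies):
    # Only global rank 0 logs; the other workers skip MLflow entirely,
    # so one distributed job maps to one MLflow run in the workspace.
    if int(os.environ.get("RANK", "0")) == 0:
        with mlflow.start_run():
            mlflow.log_param("model", "resnet50")
            mlflow.log_param("dataset", "imagenet-1k")
            for epoch, acc in enumerate(val_accuracies):
                mlflow.log_metric("val_accuracy", acc, step=epoch)
```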
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the terms specified in the LICENSE file.
Acknowledgments:
- PyTorch team
- Microsoft DeepSpeed
- MosaicML Composer
- Hugging Face Accelerate
- Ray Project
- Databricks for their ML runtime and infrastructure