
Distributed Model Inference Examples on Databricks with Hugging Face and Accelerate

This repository provides tools and scripts for performing distributed model inference on Databricks using Hugging Face and Accelerate. The focus is on leveraging data parallelism and model parallelism to make efficient use of GPU resources.

This is not official Databricks documentation, nor are these official Databricks assets.

Models Used

  • Meta-Llama-3-8B-Instruct from Hugging Face, split across 2 V100 GPUs [16GB VRAM each]
  • Llama-3.3-70B-Instruct from Hugging Face, split across 2 A100 GPUs [80GB VRAM each]
  • Meta-Llama-3-8B-Instruct from Hugging Face, loaded onto 1 V100 with 8-bit quantization (see the quantized-loading sketch after this list)
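
As a minimal sketch of the 8-bit path above (illustrative only; the repo's notebooks may differ), the quantized load can be expressed with transformers' `BitsAndBytesConfig`. The model ID matches the list above; the prompt and generation settings are placeholders:

```python
# Minimal sketch: load Meta-Llama-3-8B-Instruct in 8-bit on a single GPU.
# Assumes transformers, accelerate, and bitsandbytes are installed, and that
# you have accepted the model's license on Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # places the quantized model on the available GPU
)

inputs = tokenizer("What is data parallelism?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```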

Data Used

Concepts

Data Parallelism

Data parallelism involves splitting the input data across multiple devices: each device holds a full copy of the model and processes its portion of the data independently. This approach scales inference throughput, since batches of data are processed simultaneously.
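
As a hedged sketch of this pattern with Accelerate (not necessarily the exact code in this repo's notebooks), `PartialState.split_between_processes` shards a list of prompts across processes, with each GPU running a full model replica on its shard:

```python
# Minimal data-parallel inference sketch with Hugging Face Accelerate:
# every process holds a full model copy and handles a slice of the prompts.
# Launch with `accelerate launch script.py` on a multi-GPU node.
from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative; any causal LM works

state = PartialState()
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(state.device)

prompts = ["Explain GPUs.", "Explain TPUs.", "Explain CPUs.", "Explain RAM."]

# Each process receives roughly len(prompts) / num_processes of the inputs.
with state.split_between_processes(prompts) as shard:
    for prompt in shard:
        inputs = tokenizer(prompt, return_tensors="pt").to(state.device)
        outputs = model.generate(**inputs, max_new_tokens=32)
        print(f"[rank {state.process_index}]",
              tokenizer.decode(outputs[0], skip_special_tokens=True))
```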

Model Parallelism

Model parallelism involves splitting the model itself across multiple devices. This is useful for large models that cannot fit into the memory of a single device. By distributing the model, each device handles a portion of the model's layers or operations.
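
On a single multi-GPU node, a common way to get this (likely close to how the Llama models above are split across two GPUs, though the repo's notebooks may differ) is Accelerate's big-model inference via `device_map="auto"`, which assigns the model's layers across the visible GPUs:

```python
# Minimal model-parallelism sketch: device_map="auto" lets Accelerate shard
# the model's layers across all visible GPUs (spilling to CPU if needed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative example model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # shard layers across e.g. cuda:0 and cuda:1
    torch_dtype=torch.float16,
)

print(model.hf_device_map)    # shows which device each block was placed on

inputs = tokenizer("Why split a model across GPUs?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```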

You can also use pipeline parallelism, where as many micro-batches are fed in as there are GPUs and each GPU works on its own chunk of the model at the same time. This keeps the GPUs busy instead of letting them idle while other model chunks execute during generation.
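
To make the scheduling concrete, here is a toy two-stage pipeline in plain PyTorch (an illustration of the idea only, not this repo's implementation): the model is split into two chunks on two devices, and micro-batches are staggered so both chunks can work concurrently:

```python
# Toy pipeline-parallelism schedule: two model chunks on two devices,
# with micro-batches staggered so both stages stay busy.
import torch
import torch.nn as nn

multi_gpu = torch.cuda.device_count() > 1
dev0 = torch.device("cuda:0" if multi_gpu else "cpu")
dev1 = torch.device("cuda:1" if multi_gpu else "cpu")

stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(dev0)  # first chunk
stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(dev1)  # second chunk

batch = torch.randn(8, 512, device=dev0)
micro_batches = batch.chunk(2)   # one micro-batch in flight per stage

outputs, in_flight = [], None
with torch.no_grad():
    for mb in micro_batches:
        hidden = stage0(mb)                    # stage 0: newest micro-batch
        if in_flight is not None:
            outputs.append(stage1(in_flight))  # stage 1: previous micro-batch
        in_flight = hidden.to(dev1)
    outputs.append(stage1(in_flight))          # drain the pipeline

result = torch.cat([out.to(dev0) for out in outputs])
print(result.shape)  # torch.Size([8, 512])
```

Because CUDA kernel launches are asynchronous, the stage-0 and stage-1 work enqueued on different devices can genuinely overlap; for real models, libraries such as PyTorch's pipelining utilities automate this schedule.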


For more details, refer to the Hugging Face Accelerate Distributed Inference Guide.

Configuration

Before running the scripts, ensure you update the config.yaml file with the appropriate settings for your environment.
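
The exact schema of config.yaml depends on this repo's notebooks; the keys below are hypothetical placeholders, shown only to illustrate reading the file:

```python
# Hypothetical sketch of reading config.yaml; the key names here are
# assumptions for illustration, not the file's actual schema.
import yaml  # PyYAML, typically preinstalled on Databricks ML runtimes

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Pull out hypothetical settings, with defaults as a fallback.
model_id = config.get("model_id", "meta-llama/Meta-Llama-3-8B-Instruct")
max_new_tokens = config.get("max_new_tokens", 64)
print(model_id, max_new_tokens)
```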

Cluster Configurations

General Cluster Config

  • DBR Version: 16.3 ML: 16.3.x-gpu-ml-scala2.12
  • Instance Type: Standard_NC48ads_A100_v4 [A100]
  • Memory: 440GB
  • GPUs: 2
  • VRAM: 80GB x 2 GPUs

Cluster Config for Llama 8b

  • DBR Version: 16.3 ML: 16.3.x-gpu-ml-scala2.12
  • Instance Type: Standard_NC12s_v3 [V100]
  • Memory: 224GB
  • GPUs: 2
  • VRAM: 16GB x 2 GPUs

Ensure your cluster is configured according to the specifications above to achieve optimal performance.

License

This project is licensed under the Apache License.
