Merged
59 commits:
- e09d6cf GPU-accelerated Spark Connect: Spark SQL + Spark MLLib (gerashegalov, Sep 12, 2025)
- 22b9733 No real need to persist notebook (gerashegalov, Sep 13, 2025)
- 437434f Update Spark Connect example and Docker configuration (gerashegalov, Sep 14, 2025)
- 02238fb Enhance Spark Connect example with new configurations and schema files (gerashegalov, Sep 14, 2025)
- 43a0dee Update Spark Connect example with new data input path and notebook en… (gerashegalov, Sep 14, 2025)
- fb998fc Update Spark Connect example with enhanced configurations and noteboo… (gerashegalov, Sep 14, 2025)
- 4c7be97 Enhance Spark Connect demo notebook with ML modeling phase (gerashegalov, Sep 14, 2025)
- fe96979 Refactor Spark Connect demo notebook and update Dockerfile (gerashegalov, Sep 14, 2025)
- 72f1267 Enhance Spark Connect server configuration and demo notebook (gerashegalov, Sep 14, 2025)
- 25dc328 Refactor Spark Connect Dockerfile and update demo notebook (gerashegalov, Sep 14, 2025)
- 6b3b747 Update Spark Connect Dockerfile and configuration for GPU support (gerashegalov, Sep 15, 2025)
- 37d2e6c Enhance Spark Connect demo notebook and Dockerfiles for improved ML e… (gerashegalov, Sep 15, 2025)
- d83192d Refactor Spark Connect configuration and enhance demo notebook for im… (gerashegalov, Sep 15, 2025)
- e34bf10 Copyright check (gerashegalov, Sep 15, 2025)
- f3b6f1d Add configs to README (gerashegalov, Sep 15, 2025)
- 5bb73f2 Add an entry to the main README (gerashegalov, Sep 15, 2025)
- 5b84f46 Adjust notebook name (gerashegalov, Sep 15, 2025)
- cf3dbf1 Update Spark Connect configurations and enhance demo notebook (gerashegalov, Sep 16, 2025)
- effc7eb Reorganize requirements.txt files for Spark Connect server and worker (gerashegalov, Sep 16, 2025)
- 9ec75a3 Reviews and NVdashboard on spark-worker (gerashegalov, Sep 17, 2025)
- 150cd2a Review iteration (gerashegalov, Sep 17, 2025)
- 5696636 remove nvdashboard (gerashegalov, Sep 20, 2025)
- 3bd9a6a Update requirements and environment configuration for Spark Connect (gerashegalov, Oct 2, 2025)
- 563301e Remove irrelevant files from index (gerashegalov, Oct 7, 2025)
- 00d078b address comments including making UI links navigable over ssh tunnel (eordentlich, Oct 13, 2025)
- 80de902 cleanup + correct UI pointers for proxy (eordentlich, Oct 13, 2025)
- ab275c1 no need for worker and ui port forward with UI reverse proxy (eordentlich, Oct 13, 2025)
- 1f90c81 Add SOCKS5 proxy service to Docker Compose and update environment con… (gerashegalov, Oct 14, 2025)
- 0272f9c Remove GPU and CPU demo notebooks for Spark Connect (gerashegalov, Oct 14, 2025)
- 3600f46 Update spark-env.sh to set SPARK_PUBLIC_DNS dynamically (gerashegalov, Oct 14, 2025)
- dc0ad2b Enhance Docker Compose setup with NGINX proxy and volume configuration (gerashegalov, Oct 15, 2025)
- 01645da Remove unnecessary whitespace in nginx.conf for cleaner configuration (gerashegalov, Oct 15, 2025)
- dbe8307 http_proxy (gerashegalov, Oct 15, 2025)
- 8787ba7 Enhance Docker Compose configuration for Spark Connect (gerashegalov, Oct 15, 2025)
- e5491a1 Refactor NGINX proxy service in Docker Compose setup (gerashegalov, Oct 15, 2025)
- aa3c11a Refactor NGINX configuration for proxy service (gerashegalov, Oct 16, 2025)
- 965ed8f Merge remote-tracking branch 'eordentlich/issue543_eo' into gerashega… (gerashegalov, Oct 16, 2025)
- 26be358 Update NGINX configuration to use $http_host for Host header (gerashegalov, Oct 16, 2025)
- 5e4556c Update README.md to change WORK_DIR permissions from 777 to 1777 for … (gerashegalov, Oct 17, 2025)
- b6c8e19 Update README and demo notebook for improved Spark Connect access and… (gerashegalov, Oct 17, 2025)
- 0f40360 Update README and configuration files for clarity and licensing (gerashegalov, Oct 17, 2025)
- d78d1a3 Add GPU resource configuration to spark-defaults.conf (gerashegalov, Oct 17, 2025)
- 349a601 Enhance Docker Compose and configuration for Spark Connect setup (gerashegalov, Oct 18, 2025)
- e0c0f21 Remove outdated CSV file sizes from README.md for clarity (gerashegalov, Oct 18, 2025)
- 84ad509 Add spark.executor.cores configuration to spark-defaults.conf (gerashegalov, Oct 18, 2025)
- ef20371 Add shuffle manager configuration to spark-defaults.conf (gerashegalov, Oct 18, 2025)
- 86b55a6 Update spark-defaults.conf to include locality wait configuration (gerashegalov, Oct 19, 2025)
- fb82d55 Enhance Spark Connect configuration for improved performance (gerashegalov, Oct 19, 2025)
- 63e9b9d Refactor Spark Connect configuration and documentation (gerashegalov, Oct 20, 2025)
- 5a8e6ab Enhance Spark Connect configuration with dynamic versioning and permi… (gerashegalov, Oct 20, 2025)
- 0a5bbde Update requirements.txt to replace pyspark[connect] with pyspark-client (gerashegalov, Oct 20, 2025)
- 4dd9991 Enhance spark-connect-demo notebook with data directory configuration (gerashegalov, Oct 21, 2025)
- aadc33e Refactor spark-connect-demo notebook for improved path management and… (gerashegalov, Oct 21, 2025)
- ddf7181 Enhance Spark Connect configuration and documentation (gerashegalov, Oct 22, 2025)
- be4ecd5 Add example acceleration chart and update README for GPU demo (gerashegalov, Oct 22, 2025)
- 16299d4 Apply suggestion from @rishic3 (gerashegalov, Oct 23, 2025)
- 3e5cdc8 Merge remote-tracking branch 'gerashegalov/gerashegalov/issue543' int… (gerashegalov, Oct 23, 2025)
- 1bd6d4b Update README and spark-connect-demo notebook for improved clarity an… (gerashegalov, Oct 23, 2025)
- 99879e8 Merge remote-tracking branch 'origin/main' into gerashegalov/issue543 (gerashegalov, Oct 23, 2025)

2 changes: 2 additions & 0 deletions README.md
@@ -25,6 +25,8 @@ Here is the list of notebooks in this repo:
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
| 6 | ML/DL | PCA | [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) based PCA example to train and transform with a synthetic dataset
| 7 | ML/DL | DL Inference | 11 notebooks demonstrating distributed model inference on Spark using the `predict_batch_udf` across various frameworks: PyTorch, HuggingFace, and TensorFlow
| 8 | SQL/DF + MLlib | GPU-Accelerated Spark Connect | End-to-end SQL/DF + MLlib acceleration to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) using the lightweight Spark Connect integration for Apache Spark 4.0+
| 9 | SQL/DF | [TPC-DS](https://www.tpc.org/tpcds/) Scale Factor 10 | Comparison of Spark SQL CPU vs GPU. Easy to run locally and on Google Colab

Here is the list of Apache Spark applications (Scala and PySpark) that
can be built for running on GPU with RAPIDS Accelerator in this repo:
288 changes: 288 additions & 0 deletions examples/spark-connect-for-etl-and-ml/README.md
@@ -0,0 +1,288 @@
# GPU-Accelerated Spark Connect for ETL and ML (Spark 4.0)

This project demonstrates a complete GPU-accelerated ETL and Machine Learning pipeline using Apache Spark 4.0 with Spark Connect, featuring the RAPIDS Accelerator. The example showcases the capabilities presented in the Data and AI Summit 2025 session:
[GPU Accelerated Spark Connect](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect).
It is similar to the XGBoost example in this repo.
The key differences are that it uses Spark Connect, so the notebook server node has no heavy dependencies, and that it uses
LogisticRegression to demonstrate accelerated Spark MLlib functionality.

## 🚀 Key Features

- **Apache Spark 4.0** with cutting-edge Spark Connect capabilities
- **GPU acceleration** via RAPIDS Accelerator
- **MLlib over Spark Connect** - new in Spark 4.0
- **Zero-code-change acceleration** - existing Spark applications automatically benefit
- **Complete ETL and ML pipeline** demonstration with mortgage data
- **Jupyter Lab integration** for interactive development
- **Docker Compose** setup for easy deployment, with a clear distinction of which dependencies are
required by which service and where GPUs are actually used

## 📊 Performance Highlights

The included demonstration shows:
- **Comprehensive ETL pipeline** processing mortgage data with complex transformations for feature engineering
- **Machine Learning workflow** using Logistic Regression with Feature Hashing
- **GPU vs CPU performance comparison** with visualization of the speedup achieved on the hardware the demo is run on

## 🏗️ Architecture

The setup consists of four Docker services:

1. **Spark Master** (`spark-master`) - Cluster coordination and job scheduling
2. **Spark Worker** (`spark-worker`) - GPU-enabled worker node for task execution
3. **Spark Connect Server** (`spark-connect-server`) - gRPC interface with RAPIDS integration
4. **Jupyter Lab - Spark Connect Client** (`spark-connect-client`) - Interactive development environment

## 📋 Prerequisites

### Required
- [Docker](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/linux)
- At least 8GB of available RAM
- Available ports: 2080, 8080, 8081, 8888, 7077, 4040, 15002

### For GPU Acceleration
- NVIDIA GPU with CUDA compute capability supported by RAPIDS
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- Docker Compose version should be `2.30.x` or newer to avoid an NVIDIA Container Toolkit related bug. [Update](https://docs.docker.com/compose/install/linux) if necessary
- CUDA 12.x drivers
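
A quick way to verify these prerequisites before starting the stack. The CUDA image tag below is only an illustrative example; use any tag compatible with your driver:

```bash
# Docker Compose should report version 2.30.x or newer
docker compose version

# The NVIDIA Container Toolkit should let a container see the GPU
docker run --rm --gpus all nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi
```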

## 🚀 Quick Start

1. **Clone and navigate to the project:**
```bash
cd examples/spark-connect-for-etl-and-ml
```

2. **Set up data directory (if needed):**
```bash
export DATA_DIR=$(pwd)/data
mkdir -p $DATA_DIR/mortgage.input.csv $DATA_DIR/spark-events
chmod 1777 $DATA_DIR $DATA_DIR/spark-events

```
Download a few quarters' worth of the [Mortgage Dataset](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
to the `$DATA_DIR/mortgage.input.csv` location. The demo at the Data+AI Summit '25 used the following quarters:

```bash
$ du -h *
503M 2023Q1.csv
412M 2023Q2.csv
162M 2023Q3.csv
1.1G 2023Q4.csv
```
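
The quarterly archives are downloaded manually from the Fannie Mae site; after unzipping them, the CSV files are expected to sit directly under the input directory. For example:

```bash
# Example layout after manually downloading and unzipping the quarterly archives
ls $DATA_DIR/mortgage.input.csv
# 2023Q1.csv  2023Q2.csv  2023Q3.csv  2023Q4.csv
```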

3. **Start all services:**


```bash
$ docker compose up -d
```

(`docker compose` and `docker-compose` can be used interchangeably here and throughout)
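
Once the stack is up, you can confirm that all four services from the Architecture section are running and that the worker container sees the GPU (the NVIDIA runtime normally injects `nvidia-smi` into the container):

```bash
# All four services should show as up/running
docker compose ps

# The GPU should be visible from inside the worker container
docker compose exec spark-worker nvidia-smi
```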

4. **Access the Web UI interfaces:**

***Option 1 (default)***

All containers' web UIs are available at localhost URLs by default:

- **Jupyter Lab**: http://localhost:8888 (no password required) - Interactive notebook environment
- **Spark Master UI**: http://localhost:8080 - Cluster coordination and resource management
- **Spark Worker UI**: http://localhost:8081 - GPU-enabled worker node status and tasks
- **Spark Driver UI**: http://localhost:4040 - Application monitoring and SQL queries

***Option 2***

If you launch Docker Compose in an environment with `SPARK_PUBLIC_DNS=container-hostname`, all containers'
web UIs except Jupyter Lab's are available at the corresponding container host names, such as `spark-master`:

- **Jupyter Lab**: http://localhost:8888 (no password required) - Interactive notebook environment
- **Spark Master UI**: http://spark-master:8080 - Cluster coordination and resource management
- **Spark Worker UI**: http://spark-worker:8081 - GPU-enabled worker node status and tasks
- **Spark Driver UI**: http://spark-connect-server:4040 - Application monitoring and SQL queries

Docker DNS names require configuring your browser to use the HTTP proxy on the Docker network, exposed at
http://localhost:2080.

Here are examples of launching Google Chrome with a temporary user profile, without making persistent changes to the browser:

***Linux***

```bash
$ google-chrome --user-data-dir="/tmp/chrome-proxy-profile" --proxy-server="http=http://localhost:2080"
```

***macOS***

```bash
$ open -n -a "Google Chrome" --args --user-data-dir="/tmp/chrome-proxy-profile" --proxy-server="http=http://localhost:2080"
```

***Launching containers on a remote machine***

Your local machine might not have a GPU; in this case it is common to use a
remote machine or cluster with GPUs residing in a cloud or on-prem environment.

If you followed the default Option 1, make sure to create local port forwards for
every web UI port:

```bash
ssh <user@gpu-host> -L 8888:localhost:8888 -L 8080:localhost:8080 -L 8081:localhost:8081 -L 4040:localhost:4040
```

If you used Option 2, it is sufficient to forward only the ports for the HTTP proxy and the notebook app:

```bash
ssh <user@gpu-host> -L 2080:localhost:2080 -L 8888:localhost:8888
```


5. **Open the demo notebook:**
- Navigate to `work/spark-connect-demo.ipynb` in Jupyter Lab
- You can also open it in VS Code by selecting http://localhost:8888 as the
existing notebook server connection
- Run the complete ETL and ML pipeline demonstration


## 📝 Demo Notebook Overview

The `spark-connect-demo.ipynb` notebook demonstrates:

### ETL Pipeline
- **Data ingestion** from CSV with custom schema
- **Complex transformations** including date parsing and delinquency calculations
- **String-to-numeric encoding** for categorical features
- **Data joins and aggregations** with mortgage performance data
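
A minimal sketch of the ingestion step above; the column names, delimiter, and input path here are illustrative only, and the actual schema and paths are defined in the notebook:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

# Illustrative subset of the performance-data schema; the notebook defines the full column list
perf_schema = StructType([
    StructField('loan_id', StringType()),
    StructField('monthly_reporting_period', StringType()),
    StructField('current_loan_delinquency_status', StringType()),
])

raw = (
    spark.read
    .option('header', False)
    .option('delimiter', '|')
    .schema(perf_schema)
    .csv('/data/mortgage.input.csv')
)

# Example transformation: parse the reporting period into a proper date column
parsed = raw.withColumn('reporting_date', F.to_date('monthly_reporting_period', 'MMyyyy'))
```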

### Machine Learning Workflow
- **Feature engineering** with FeatureHasher and VectorAssembler
- **Logistic Regression** training for multi-class prediction
- **Model evaluation** with performance metrics
- **GPU vs CPU timing comparisons**

### Key Code Examples

**Connecting to Spark with GPU acceleration:**
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote('sc://spark-connect-server')
    .appName('GPU-Accelerated-ETL-ML-Demo')
    .getOrCreate()
)
```
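
One quick way to sanity-check that the client is talking to the remote Spark 4.0 server:

```python
print(spark.version)             # expect a 4.x version string
print(spark.range(5).collect())  # trivial round trip through the Connect server
```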

**GPU acceleration test:**
```python
from pyspark.sql.functions import col, count, lit

spark.conf.set('spark.rapids.sql.enabled', True)
df = (
    spark.range(2 ** 35)
    .withColumn('mod10', col('id') % lit(10))
    .groupBy('mod10').agg(count('*'))
    .orderBy('mod10')
)
df.explain(mode='extended')  # Shows GPU operations in the physical plan
```
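
To compare against a CPU baseline, the same query can be re-run with the plugin disabled. A rough sketch of such a timing comparison (actual numbers depend on your hardware):

```python
import time

def time_query(use_gpu: bool) -> float:
    # Toggle the RAPIDS SQL plugin on the live session and time the aggregation
    spark.conf.set('spark.rapids.sql.enabled', use_gpu)
    start = time.time()
    df.collect()  # only 10 result rows come back to the client
    return time.time() - start

gpu_s = time_query(True)
cpu_s = time_query(False)
print(f'GPU: {gpu_s:.1f}s, CPU: {cpu_s:.1f}s, speedup: {cpu_s / gpu_s:.1f}x')
```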

**Machine Learning with GPU acceleration:**
```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, FeatureHasher

spark.conf.set('spark.connect.ml.backend.classes', 'com.nvidia.rapids.ml.Plugin')

# Feature preparation (categorical_cols and numerical_cols are column lists defined earlier in the notebook)
hasher = FeatureHasher(inputCols=categorical_cols, outputCol='hashed_categorical')
assembler = VectorAssembler().setInputCols(numerical_cols + ['hashed_categorical']).setOutputCol('features')

# Model training
logistic = LogisticRegression().setFeaturesCol('features').setLabelCol('delinquency_12')
pipeline = Pipeline().setStages([hasher, assembler, logistic])
model = pipeline.fit(training_data)
```
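
A sketch of the model-evaluation step mentioned above, assuming a held-out split named `test_data` is prepared in the notebook:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(
    labelCol='delinquency_12', predictionCol='prediction', metricName='accuracy'
)
print('accuracy:', evaluator.evaluate(predictions))
```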

### Results

This demo was tested on a machine with a 24GiB Quadro RTX 6000:

```bash
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro RTX 6000 Off | 00000000:01:00.0 Off | Off |
| 33% 25C P8 7W / 260W | 10354MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
```

and a 64-vcore CPU.

## 🐳 Service Details

### Spark Master
- **Image**: `apache/spark:4.0.0`
- **Ports**: 8080 (Web UI), 7077 (Master)
- **Role**: Cluster coordination and resource management

### Spark Worker (the only GPU node role)
- **Image**: Custom build based on `apache/spark:4.0.0`
- **GPU**: NVIDIA GPU support via Docker Compose deploy configuration
- **Ports**: 8081 (Web UI)
- **Features**: GPU resource discovery and task execution

### Spark Connect Server
- **Image**: Custom build based on `apache/spark:4.0.0` with Spark RAPIDS ETL and ML Plugins
- **RAPIDS Version**: 25.08.0 for CUDA 12
- **Ports**: 15002 (gRPC), 4040 (Driver UI)
- **Configuration**: Optimized for GPU acceleration with memory management

### JupyterLab - Spark Connect Client
- **Image**: Based on `jupyter/minimal-notebook:latest`
- **Environment**: Pre-configured with PySpark Connect Client
- **Ports**: 8888 (Jupyter Lab)
- **Volumes**: Notebooks and work directory mounted

## 📊 Performance Monitoring

You can use tools like nvtop, nvitop, btop, or jupyterlab_nvdashboard running on the GPU host(s).
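
For example, to watch GPU utilization on the GPU host while the notebook runs (installing nvitop via pip is an assumption, not part of this setup):

```bash
# Built-in: refresh nvidia-smi every second
watch -n 1 nvidia-smi

# Or install nvitop for a richer interactive view
pip install nvitop
nvitop
```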


## 🧹 Cleanup

Stop and remove all services:
```bash
docker-compose down -v
```

Remove built images:
```bash
docker-compose down --rmi all -v
```

### Logs
Logs for the spark driver/connect server, standalone master, standalone worker, and jupyter server can be viewed using the respective commands:
```bash
docker logs spark-connect-server
docker logs spark-master
docker logs spark-worker
docker logs spark-connect-client
```
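
To follow a service's log live rather than dump it once, `docker compose logs -f` works as well, e.g.:

```bash
docker compose logs -f spark-connect-server
```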

Spark executor logs can be accessed via the Spark UI as usual.

## 📖 Additional Resources

- [Apache Spark 4.0 Documentation](https://spark.apache.org/docs/latest/)
- [Spark Connect Guide](https://spark.apache.org/docs/latest/spark-connect-overview.html)
- [NVIDIA RAPIDS Accelerator](https://nvidia.github.io/spark-rapids/)
- [Data and AI Summit Session](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect)