Merged
59 commits:
- e09d6cf GPU-accelerated Spark Connect: Spark SQL + Spark MLLib (gerashegalov, Sep 12, 2025)
- 22b9733 No real need to persist notebook (gerashegalov, Sep 13, 2025)
- 437434f Update Spark Connect example and Docker configuration (gerashegalov, Sep 14, 2025)
- 02238fb Enhance Spark Connect example with new configurations and schema files (gerashegalov, Sep 14, 2025)
- 43a0dee Update Spark Connect example with new data input path and notebook en… (gerashegalov, Sep 14, 2025)
- fb998fc Update Spark Connect example with enhanced configurations and noteboo… (gerashegalov, Sep 14, 2025)
- 4c7be97 Enhance Spark Connect demo notebook with ML modeling phase (gerashegalov, Sep 14, 2025)
- fe96979 Refactor Spark Connect demo notebook and update Dockerfile (gerashegalov, Sep 14, 2025)
- 72f1267 Enhance Spark Connect server configuration and demo notebook (gerashegalov, Sep 14, 2025)
- 25dc328 Refactor Spark Connect Dockerfile and update demo notebook (gerashegalov, Sep 14, 2025)
- 6b3b747 Update Spark Connect Dockerfile and configuration for GPU support (gerashegalov, Sep 15, 2025)
- 37d2e6c Enhance Spark Connect demo notebook and Dockerfiles for improved ML e… (gerashegalov, Sep 15, 2025)
- d83192d Refactor Spark Connect configuration and enhance demo notebook for im… (gerashegalov, Sep 15, 2025)
- e34bf10 Copyright check (gerashegalov, Sep 15, 2025)
- f3b6f1d Add configs to README (gerashegalov, Sep 15, 2025)
- 5bb73f2 Add an entry to the main README (gerashegalov, Sep 15, 2025)
- 5b84f46 Adjust notebook name (gerashegalov, Sep 15, 2025)
- cf3dbf1 Update Spark Connect configurations and enhance demo notebook (gerashegalov, Sep 16, 2025)
- effc7eb Reorganize requirements.txt files for Spark Connect server and worker (gerashegalov, Sep 16, 2025)
- 9ec75a3 Reviews and NVdashboard on spark-worker (gerashegalov, Sep 17, 2025)
- 150cd2a Review iteration (gerashegalov, Sep 17, 2025)
- 5696636 remove nvdashboard (gerashegalov, Sep 20, 2025)
- 3bd9a6a Update requirements and environment configuration for Spark Connect (gerashegalov, Oct 2, 2025)
- 563301e Remove irrelevant files from index (gerashegalov, Oct 7, 2025)
- 00d078b address comments including making UI links navigable over ssh tunnel (eordentlich, Oct 13, 2025)
- 80de902 cleanup + correct UI pointers for proxy (eordentlich, Oct 13, 2025)
- ab275c1 no need for worker and ui port forward with UI reverse proxy (eordentlich, Oct 13, 2025)
- 1f90c81 Add SOCKS5 proxy service to Docker Compose and update environment con… (gerashegalov, Oct 14, 2025)
- 0272f9c Remove GPU and CPU demo notebooks for Spark Connect (gerashegalov, Oct 14, 2025)
- 3600f46 Update spark-env.sh to set SPARK_PUBLIC_DNS dynamically (gerashegalov, Oct 14, 2025)
- dc0ad2b Enhance Docker Compose setup with NGINX proxy and volume configuration (gerashegalov, Oct 15, 2025)
- 01645da Remove unnecessary whitespace in nginx.conf for cleaner configuration (gerashegalov, Oct 15, 2025)
- dbe8307 http_proxy (gerashegalov, Oct 15, 2025)
- 8787ba7 Enhance Docker Compose configuration for Spark Connect (gerashegalov, Oct 15, 2025)
- e5491a1 Refactor NGINX proxy service in Docker Compose setup (gerashegalov, Oct 15, 2025)
- aa3c11a Refactor NGINX configuration for proxy service (gerashegalov, Oct 16, 2025)
- 965ed8f Merge remote-tracking branch 'eordentlich/issue543_eo' into gerashega… (gerashegalov, Oct 16, 2025)
- 26be358 Update NGINX configuration to use $http_host for Host header (gerashegalov, Oct 16, 2025)
- 5e4556c Update README.md to change WORK_DIR permissions from 777 to 1777 for … (gerashegalov, Oct 17, 2025)
- b6c8e19 Update README and demo notebook for improved Spark Connect access and… (gerashegalov, Oct 17, 2025)
- 0f40360 Update README and configuration files for clarity and licensing (gerashegalov, Oct 17, 2025)
- d78d1a3 Add GPU resource configuration to spark-defaults.conf (gerashegalov, Oct 17, 2025)
- 349a601 Enhance Docker Compose and configuration for Spark Connect setup (gerashegalov, Oct 18, 2025)
- e0c0f21 Remove outdated CSV file sizes from README.md for clarity (gerashegalov, Oct 18, 2025)
- 84ad509 Add spark.executor.cores configuration to spark-defaults.conf (gerashegalov, Oct 18, 2025)
- ef20371 Add shuffle manager configuration to spark-defaults.conf (gerashegalov, Oct 18, 2025)
- 86b55a6 Update spark-defaults.conf to include locality wait configuration (gerashegalov, Oct 19, 2025)
- fb82d55 Enhance Spark Connect configuration for improved performance (gerashegalov, Oct 19, 2025)
- 63e9b9d Refactor Spark Connect configuration and documentation (gerashegalov, Oct 20, 2025)
- 5a8e6ab Enhance Spark Connect configuration with dynamic versioning and permi… (gerashegalov, Oct 20, 2025)
- 0a5bbde Update requirements.txt to replace pyspark[connect] with pyspark-client (gerashegalov, Oct 20, 2025)
- 4dd9991 Enhance spark-connect-demo notebook with data directory configuration (gerashegalov, Oct 21, 2025)
- aadc33e Refactor spark-connect-demo notebook for improved path management and… (gerashegalov, Oct 21, 2025)
- ddf7181 Enhance Spark Connect configuration and documentation (gerashegalov, Oct 22, 2025)
- be4ecd5 Add example acceleration chart and update README for GPU demo (gerashegalov, Oct 22, 2025)
- 16299d4 Apply suggestion from @rishic3 (gerashegalov, Oct 23, 2025)
- 3e5cdc8 Merge remote-tracking branch 'gerashegalov/gerashegalov/issue543' int… (gerashegalov, Oct 23, 2025)
- 1bd6d4b Update README and spark-connect-demo notebook for improved clarity an… (gerashegalov, Oct 23, 2025)
- 99879e8 Merge remote-tracking branch 'origin/main' into gerashegalov/issue543 (gerashegalov, Oct 23, 2025)

2 changes: 2 additions & 0 deletions README.md
@@ -25,6 +25,8 @@ Here is the list of notebooks in this repo:
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
| 6 | ML/DL | PCA | [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) based PCA example to train and transform with a synthetic dataset
| 7 | ML/DL | DL Inference | 11 notebooks demonstrating distributed model inference on Spark using the `predict_batch_udf` across various frameworks: PyTorch, HuggingFace, and TensorFlow
| 8 | SQL/DF + MLlib | GPU-Accelerated Spark Connect | End-to-end SQL/DF + MLlib acceleration to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) using the lightweight Spark Connect integration for Apache Spark 4.0+
| 9 | SQL/DF | [TPC-DS](https://www.tpc.org/tpcds/) Scale Factor 10 | Comparison of Spark SQL CPU vs GPU. Easy to run locally and on Google Colab

Here is the list of Apache Spark applications (Scala and PySpark) that
can be built for running on GPU with RAPIDS Accelerator in this repo:
288 changes: 288 additions & 0 deletions examples/spark-connect-for-etl-and-ml/README.md
@@ -0,0 +1,288 @@
# GPU-Accelerated Spark Connect for ETL and ML (Spark 4.0)

This project demonstrates a complete GPU-accelerated ETL and Machine Learning pipeline using Apache Spark 4.0 with Spark Connect, featuring the RAPIDS Accelerator. The example showcases the capabilities presented in the Data and AI Summit 2025 session:
[GPU Accelerated Spark Connect](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect).
It is similar to the XGBoost example in this repo.
The key differences are that it uses Spark Connect, so the notebook server node has no heavy dependencies, and that it uses
LogisticRegression to demonstrate accelerated Spark MLlib functionality.

## 🚀 Key Features

- **Apache Spark 4.0** with cutting-edge Spark Connect capabilities
- **GPU acceleration** via RAPIDS Accelerator
- **MLlib over Spark Connect** - new in Spark 4.0
- **Zero-code-change acceleration** - existing Spark applications automatically benefit
- **Complete ETL and ML pipeline** demonstration with mortgage data
- **Jupyter Lab integration** for interactive development
- **Docker Compose** setup for easy deployment, with a clear distinction of which dependencies are
required by which service and where GPUs are actually used

## 📊 Performance Highlights

The included demonstration shows:
- **Comprehensive ETL pipeline** processing mortgage data with complex transformations for feature engineering
- **Machine Learning workflow** using Logistic Regression with Feature Hashing
- **GPU vs CPU performance comparison** with visualization of the speedup achieved on the hardware the demo is run on

## 🏗️ Architecture

The setup consists of four Docker services:

1. **Spark Master** (`spark-master`) - Cluster coordination and job scheduling
2. **Spark Worker** (`spark-worker`) - GPU-enabled worker node for task execution
3. **Spark Connect Server** (`spark-connect-server`) - gRPC interface with RAPIDS integration
4. **Jupyter Lab - Spark Connect Client** (`spark-connect-client`) - Interactive development environment

## 📋 Prerequisites

### Required
- [Docker](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/linux)
- At least 8GB of available RAM
- Available ports: 2080, 8080, 8081, 8888, 7077, 4040, 15002

### For GPU Acceleration
- NVIDIA GPU with CUDA compute capability supported by RAPIDS
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- Docker Compose version should be `2.30.x` or newer to avoid an NVIDIA Container Toolkit related bug. [Update](https://docs.docker.com/compose/install/linux) if necessary
- CUDA 12.x drivers
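
A quick way to verify these prerequisites before starting the stack. The CUDA image tag below is only an illustrative example; use any tag compatible with your driver:

```bash
# Docker Compose should report version 2.30.x or newer
docker compose version

# The NVIDIA Container Toolkit should let a container see the GPU
docker run --rm --gpus all nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi
```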

## 🚀 Quick Start

1. **Clone and navigate to the project:**
```bash
cd examples/spark-connect-for-etl-and-ml
```

2. **Set up data directory (if needed):**
```bash
export DATA_DIR=$(pwd)/data
mkdir -p $DATA_DIR/mortgage.input.csv $DATA_DIR/spark-events
chmod 1777 $DATA_DIR $DATA_DIR/spark-events

```
Download a few quarters' worth of the [Mortgage Dataset](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
to the `$DATA_DIR/mortgage.input.csv` location. The demo at the Data+AI Summit '25 used the following quarters:

```bash
$ du -h *
503M 2023Q1.csv
412M 2023Q2.csv
162M 2023Q3.csv
1.1G 2023Q4.csv
```
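
The quarterly archives are downloaded manually from the Fannie Mae site; after unzipping them, the CSV files are expected to sit directly under the input directory. For example:

```bash
# Example layout after manually downloading and unzipping the quarterly archives
ls $DATA_DIR/mortgage.input.csv
# 2023Q1.csv  2023Q2.csv  2023Q3.csv  2023Q4.csv
```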

3. **Start all services:**


```bash
$ docker compose up -d
```

(`docker compose` and `docker-compose` can be used interchangeably here and throughout)
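
Once the stack is up, you can confirm that all four services from the Architecture section are running and that the worker container sees the GPU (the NVIDIA runtime normally injects `nvidia-smi` into the container):

```bash
# All four services should show as up/running
docker compose ps

# The GPU should be visible from inside the worker container
docker compose exec spark-worker nvidia-smi
```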

4. **Access the Web UI interfaces:**

***Option 1 (default)***

All containers' web UIs are available at localhost URLs by default:

- **Jupyter Lab**: http://localhost:8888 (no password required) - Interactive notebook environment
- **Spark Master UI**: http://localhost:8080 - Cluster coordination and resource management
- **Spark Worker UI**: http://localhost:8081 - GPU-enabled worker node status and tasks
- **Spark Driver UI**: http://localhost:4040 - Application monitoring and SQL queries

***Option 2***

If you launch Docker Compose in an environment with `SPARK_PUBLIC_DNS=container-hostname`, all containers'
web UIs except Jupyter Lab's are available at the corresponding container host names, such as `spark-master`:

- **Jupyter Lab**: http://localhost:8888 (no password required) - Interactive notebook environment
- **Spark Master UI**: http://spark-master:8080 - Cluster coordination and resource management
- **Spark Worker UI**: http://spark-worker:8081 - GPU-enabled worker node status and tasks
- **Spark Driver UI**: http://spark-connect-server:4040 - Application monitoring and SQL queries

Docker DNS names require configuring your browser to use the HTTP proxy on the Docker network, exposed at
http://localhost:2080.

Here are examples of launching Google Chrome with a temporary user profile, without making persistent changes to the browser:

***Linux***

```bash
$ google-chrome --user-data-dir="/tmp/chrome-proxy-profile" --proxy-server="http=http://localhost:2080"
```

***macOS***

```bash
$ open -n -a "Google Chrome" --args --user-data-dir="/tmp/chrome-proxy-profile" --proxy-server="http=http://localhost:2080"
```

***Launching containers on a remote machine***

Your local machine might not have a GPU; in this case it is common to use a
remote machine or cluster with GPUs residing in a cloud or on-prem environment.

If you followed the default Option 1, make sure to create local port forwards for
every web UI port:

```bash
ssh <user@gpu-host> -L 8888:localhost:8888 -L 8080:localhost:8080 -L 8081:localhost:8081 -L 4040:localhost:4040
```

If you used Option 2, it is sufficient to forward only the ports for the HTTP proxy and the notebook app:

```bash
ssh <user@gpu-host> -L 2080:localhost:2080 -L 8888:localhost:8888
```


5. **Open the demo notebook:**
- Navigate to `work/spark-connect-demo.ipynb` in Jupyter Lab
- You can also open it in VS Code by selecting http://localhost:8888 as the
existing notebook server connection
- Run the complete ETL and ML pipeline demonstration


## 📝 Demo Notebook Overview

The `spark-connect-demo.ipynb` notebook demonstrates:

### ETL Pipeline
- **Data ingestion** from CSV with custom schema
- **Complex transformations** including date parsing and delinquency calculations
- **String-to-numeric encoding** for categorical features
- **Data joins and aggregations** with mortgage performance data
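
A minimal sketch of the ingestion step above; the column names, delimiter, and input path here are illustrative only, and the actual schema and paths are defined in the notebook:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

# Illustrative subset of the performance-data schema; the notebook defines the full column list
perf_schema = StructType([
    StructField('loan_id', StringType()),
    StructField('monthly_reporting_period', StringType()),
    StructField('current_loan_delinquency_status', StringType()),
])

raw = (
    spark.read
    .option('header', False)
    .option('delimiter', '|')
    .schema(perf_schema)
    .csv('/data/mortgage.input.csv')
)

# Example transformation: parse the reporting period into a proper date column
parsed = raw.withColumn('reporting_date', F.to_date('monthly_reporting_period', 'MMyyyy'))
```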

### Machine Learning Workflow
- **Feature engineering** with FeatureHasher and VectorAssembler
- **Logistic Regression** training for multi-class prediction
- **Model evaluation** with performance metrics
- **GPU vs CPU timing comparisons**

### Key Code Examples

**Connecting to Spark with GPU acceleration:**
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote('sc://spark-connect-server')
    .appName('GPU-Accelerated-ETL-ML-Demo')
    .getOrCreate()
)
```
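
One quick way to sanity-check that the client is talking to the remote Spark 4.0 server:

```python
print(spark.version)             # expect a 4.x version string
print(spark.range(5).collect())  # trivial round trip through the Connect server
```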

**GPU acceleration test:**
```python
from pyspark.sql.functions import col, count, lit

spark.conf.set('spark.rapids.sql.enabled', True)
df = (
    spark.range(2 ** 35)
    .withColumn('mod10', col('id') % lit(10))
    .groupBy('mod10').agg(count('*'))
    .orderBy('mod10')
)
df.explain(mode='extended')  # Shows GPU operations in the physical plan
```
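
To compare against a CPU baseline, the same query can be re-run with the plugin disabled. A rough sketch of such a timing comparison (actual numbers depend on your hardware):

```python
import time

def time_query(use_gpu: bool) -> float:
    # Toggle the RAPIDS SQL plugin on the live session and time the aggregation
    spark.conf.set('spark.rapids.sql.enabled', use_gpu)
    start = time.time()
    df.collect()  # only 10 result rows come back to the client
    return time.time() - start

gpu_s = time_query(True)
cpu_s = time_query(False)
print(f'GPU: {gpu_s:.1f}s, CPU: {cpu_s:.1f}s, speedup: {cpu_s / gpu_s:.1f}x')
```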

**Machine Learning with GPU acceleration:**
```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, FeatureHasher

spark.conf.set('spark.connect.ml.backend.classes', 'com.nvidia.rapids.ml.Plugin')

# Feature preparation (categorical_cols and numerical_cols are column lists defined earlier in the notebook)
hasher = FeatureHasher(inputCols=categorical_cols, outputCol='hashed_categorical')
assembler = VectorAssembler().setInputCols(numerical_cols + ['hashed_categorical']).setOutputCol('features')

# Model training
logistic = LogisticRegression().setFeaturesCol('features').setLabelCol('delinquency_12')
pipeline = Pipeline().setStages([hasher, assembler, logistic])
model = pipeline.fit(training_data)
```
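
A sketch of the model-evaluation step mentioned above, assuming a held-out split named `test_data` is prepared in the notebook:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(
    labelCol='delinquency_12', predictionCol='prediction', metricName='accuracy'
)
print('accuracy:', evaluator.evaluate(predictions))
```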

### Results

This demo was tested on a machine with a 24GiB Quadro RTX 6000:

```bash
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro RTX 6000 Off | 00000000:01:00.0 Off | Off |
| 33% 25C P8 7W / 260W | 10354MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
```

and a 64-vcore CPU.

## 🐳 Service Details

### Spark Master
- **Image**: `apache/spark:4.0.0`
- **Ports**: 8080 (Web UI), 7077 (Master)
- **Role**: Cluster coordination and resource management

### Spark Worker (the only GPU node role)
- **Image**: Custom build based on `apache/spark:4.0.0`
- **GPU**: NVIDIA GPU support via Docker Compose deploy configuration
- **Ports**: 8081 (Web UI)
- **Features**: GPU resource discovery and task execution

### Spark Connect Server
- **Image**: Custom build based on `apache/spark:4.0.0` with Spark RAPIDS ETL and ML Plugins
- **RAPIDS Version**: 25.08.0 for CUDA 12
- **Ports**: 15002 (gRPC), 4040 (Driver UI)
- **Configuration**: Optimized for GPU acceleration with memory management

### JupyterLab - Spark Connect Client
- **Image**: Based on `jupyter/minimal-notebook:latest`
- **Environment**: Pre-configured with PySpark Connect Client
- **Ports**: 8888 (Jupyter Lab)
- **Volumes**: Notebooks and work directory mounted

## 📊 Performance Monitoring

You can use tools like nvtop, nvitop, btop, or jupyterlab_nvdashboard running on the GPU host(s).
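
For example, to watch GPU utilization on the GPU host while the notebook runs (installing nvitop via pip is an assumption, not part of this setup):

```bash
# Built-in: refresh nvidia-smi every second
watch -n 1 nvidia-smi

# Or install nvitop for a richer interactive view
pip install nvitop
nvitop
```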


## 🧹 Cleanup

Stop and remove all services:
```bash
docker-compose down -v
```

Remove built images:
```bash
docker-compose down --rmi all -v
```

### Logs
Logs for the spark driver/connect server, standalone master, standalone worker, and jupyter server can be viewed using the respective commands:
```bash
docker logs spark-connect-server
docker logs spark-master
docker logs spark-worker
docker logs spark-connect-client
```
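
To follow a service's log live rather than dump it once, `docker compose logs -f` works as well, e.g.:

```bash
docker compose logs -f spark-connect-server
```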

Spark executor logs can be accessed via the Spark UI as usual.

## 📖 Additional Resources

- [Apache Spark 4.0 Documentation](https://spark.apache.org/docs/latest/)
- [Spark Connect Guide](https://spark.apache.org/docs/latest/spark-connect-overview.html)
- [NVIDIA RAPIDS Accelerator](https://nvidia.github.io/spark-rapids/)
- [Data and AI Summit Session](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect)