# Add a demo for GPU-accelerated Spark Connect with SQL/ETL and ML #570
@@ -0,0 +1,202 @@
# GPU-Accelerated Spark Connect for ETL and ML (Spark 4.0)

This project demonstrates a complete GPU-accelerated ETL and Machine Learning pipeline using Apache Spark 4.0 with Spark Connect, featuring the RAPIDS Accelerator. The example showcases the capabilities presented in the Data and AI Summit 2025 session:
[GPU Accelerated Spark Connect](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect).
It is similar to the XGBoost example in this repo.
The key differences are that it uses Spark Connect, so the notebook server node has no heavy dependencies, and that it uses
LogisticRegression to demonstrate accelerated Spark MLlib functionality.

## 🚀 Key Features

- **Apache Spark 4.0** with cutting-edge Spark Connect capabilities
- **GPU acceleration** via the RAPIDS Accelerator (up to 9x performance improvement)
- **MLlib over Spark Connect** - new in Spark 4.0
- **Zero-code-change acceleration** - existing Spark applications benefit automatically
- **Complete ETL and ML pipeline** demonstration with mortgage data
- **Jupyter Lab integration** for interactive development
- **Docker Compose** setup for easy deployment, with a clear distinction of which dependencies are
  required by which service and where GPUs are actually used
## 📊 Performance Highlights

The included demonstration shows:
- **Comprehensive ETL pipeline** processing mortgage data with complex transformations for feature engineering
- **Machine Learning workflow** using Logistic Regression with Feature Hashing
- **GPU vs CPU performance comparison** with a visualization of the speedup achieved on the hardware the demo is run on

## 🏗️ Architecture

The setup consists of four Docker services:

1. **Spark Master** (`spark-master`) - Cluster coordination and job scheduling
2. **Spark Worker** (`spark-worker`) - GPU-enabled worker node for task execution
3. **Spark Connect Server** (`spark-connect-server`) - gRPC interface with RAPIDS integration
4. **Jupyter Lab Client** (`spark-connect-client`) - Interactive development environment
## 📋 Prerequisites

### Required
- Docker and Docker Compose
- At least 8GB of available RAM
- Available ports: 8080, 8081, 8888, 8889, 7077, 4040, 15002

### For GPU Acceleration
- NVIDIA GPU with a CUDA compute capability supported by RAPIDS
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) (see the verification command below)
- CUDA 12.x drivers
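To confirm that Docker can see the GPU through the NVIDIA Container Toolkit, you can run a quick smoke test; the CUDA image tag below is only an example:

```bash
# Should print the nvidia-smi table from inside a container; if it fails,
# revisit the NVIDIA Container Toolkit installation.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```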
## 🚀 Quick Start

1. **Clone and navigate to the project:**
```bash
cd spark-connect-for-etl-and-ml
```

2. **Set up the data directory (if needed):**
```bash
export WORK_DIR=$(pwd)/work
export DATA_DIR=$(pwd)/data/mortgage.input.csv
mkdir -p work data
chmod 777 work
```
Download a few quarters' worth of the [Mortgage Dataset](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
to the `$DATA_DIR` location.

3. **Start all services** (a status check is shown after step 5):
```bash
docker-compose up -d
```
4. **Access the interfaces:**
- **Jupyter Lab**: http://localhost:8888 (no password required)
- **Spark Master UI**: http://localhost:8080
- **Spark Worker UI**: http://localhost:8081
- **Spark Worker NVdashboard**: http://localhost:8889
- **Spark Driver UI**: http://localhost:4040 (when jobs are running)

5. **Open the demo notebook:**
- Navigate to `work/spark-connect-demo.ipynb` in Jupyter Lab
- You can also open it in VS Code by selecting http://localhost:8888 as the
  existing notebook server connection
- Run the complete ETL and ML pipeline demonstration
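Once the services from step 3 are up, you can confirm that all four containers are running:

```bash
# spark-master, spark-worker, spark-connect-server, and spark-connect-client
# should all be listed as Up/running
docker-compose ps
```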
## 📝 Demo Notebook Overview

The `spark-connect-demo.ipynb` notebook demonstrates:
### ETL Pipeline
- **Data ingestion** from CSV with a custom schema
- **Complex transformations** including date parsing and delinquency calculations
- **String-to-numeric encoding** for categorical features
- **Data joins and aggregations** with mortgage performance data (a simplified sketch of this pattern follows the list)
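The sketch below illustrates the general shape of these steps over a Spark Connect session; the schema and column names are illustrative placeholders, not the notebook's actual mortgage schema.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# `spark` is the Spark Connect session created as shown under Key Code Examples below.
# Illustrative schema; the notebook defines the full mortgage performance schema.
schema = StructType([
    StructField('loan_id', StringType()),
    StructField('monthly_reporting_period', StringType()),
    StructField('servicer', StringType()),
    StructField('current_actual_upb', DoubleType()),
])

perf = spark.read.csv('/data/mortgage.input.csv', schema=schema, header=False)

features = (
    perf
    # date parsing
    .withColumn('report_month', F.to_date('monthly_reporting_period', 'MM/yyyy'))
    # string-to-numeric encoding of a categorical column
    .withColumn('servicer_code',
                F.when(F.col('servicer').isNull(), 0).otherwise(F.hash('servicer')))
    # aggregation joined back to the detail rows
    .join(
        perf.groupBy('loan_id').agg(F.max('current_actual_upb').alias('max_upb')),
        on='loan_id', how='inner',
    )
)
features.show(5)
```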
### Machine Learning Workflow
- **Feature engineering** with FeatureHasher and VectorAssembler
- **Logistic Regression** training for multi-class prediction
- **Model evaluation** with performance metrics
- **GPU vs CPU timing comparisons**
### Key Code Examples

**Connecting to Spark with GPU acceleration:**
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote('sc://spark-connect-server')
    .appName('GPU-Accelerated-ETL-ML-Demo')
    .getOrCreate()
)
```
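A quick, purely illustrative sanity check that the remote session works:

```python
print(spark.version)   # Spark version reported by the Connect server
spark.range(5).show()  # trivial query executed on the cluster
```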
**GPU acceleration test:**
```python
from pyspark.sql.functions import col, count, lit

spark.conf.set('spark.rapids.sql.enabled', True)
df = (
    spark.range(2 ** 35)
    .withColumn('mod10', col('id') % lit(10))
    .groupBy('mod10').agg(count('*'))
    .orderBy('mod10')
)
df.explain(mode='extended')  # GPU operators (e.g. GpuHashAggregate) appear in the physical plan when RAPIDS is active
```
**Machine Learning with GPU acceleration:**
```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, FeatureHasher

spark.conf.set('spark.connect.ml.backend.classes', 'com.nvidia.rapids.ml.Plugin')

# Feature preparation (categorical_cols, numerical_cols, and training_data come from the ETL steps in the notebook)
hasher = FeatureHasher(inputCols=categorical_cols, outputCol='hashed_categorical')
assembler = VectorAssembler().setInputCols(numerical_cols + ['hashed_categorical']).setOutputCol('features')

# Model training
logistic = LogisticRegression().setFeaturesCol('features').setLabelCol('delinquency_12')
pipeline = Pipeline().setStages([hasher, assembler, logistic])
model = pipeline.fit(training_data)
```
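The notebook also evaluates the trained model; a minimal sketch using the standard MLlib evaluator is shown below (`test_data` and the accuracy metric are assumptions rather than the notebook's exact code):

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Hypothetical held-out split; the notebook derives its own train/test data
predictions = model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(
    labelCol='delinquency_12',
    predictionCol='prediction',
    metricName='accuracy',
)
print('accuracy:', evaluator.evaluate(predictions))
```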
## 🐳 Service Details

### Spark Master
- **Image**: `apache/spark:4.0.0`
- **Ports**: 8080 (Web UI), 7077 (Master)
- **Role**: Cluster coordination and resource management
### Spark Worker (the only GPU node role)
- **Image**: Custom build based on `apache/spark:4.0.0`
- **GPU**: NVIDIA GPU support via the Docker Compose deploy configuration
- **Ports**: 8081 (Web UI), 8889 (NVdashboard)
- **Features**: GPU resource discovery and task execution

### Spark Connect Server
- **Image**: Custom build based on `apache/spark:4.0.0` with the Spark RAPIDS ETL and ML plugins
- **RAPIDS Version**: 25.08.0 for CUDA 12
- **Ports**: 15002 (gRPC), 4040 (Driver UI)
- **Configuration**: Optimized for GPU acceleration with memory management
### JupyterLab - Spark Connect Client
- **Image**: Based on `jupyter/minimal-notebook:latest`
- **Environment**: Pre-configured with the PySpark Connect client
- **Ports**: 8888 (Jupyter Lab)
- **Volumes**: Notebooks and work directory mounted

## 📊 Performance Monitoring

You can use tools such as `nvtop` or `jupyterlab_nvdashboard` running on the GPU host(s).
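For a quick command-line view of GPU utilization while the demo runs, something along these lines on the GPU host also works (the sampling interval is just an example):

```bash
# Stream per-GPU utilization and memory usage once per second
nvidia-smi dmon -s um -d 1
# or simply
watch -n 1 nvidia-smi
```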
## 🧹 Cleanup

Stop and remove all services:
```bash
docker-compose down -v
```

Remove built images:
```bash
docker-compose down --rmi all -v
```
## Troubleshooting

Repeated executions of the notebook sometimes result in unexpected side effects
such as a `FileNotFoundException`. To mitigate this, restart the spark-connect-server
service

```bash
$ WORK_DIR=~/work DATA_DIR=~/dais-2025/data/mortgage/raw docker compose restart spark-connect-server
```

and/or restart the Jupyter kernel.
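If a restart does not help, the Connect server logs usually show the failing stage:

```bash
docker compose logs --tail=100 spark-connect-server
```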
## 📖 Additional Resources

- [Apache Spark 4.0 Documentation](https://spark.apache.org/docs/latest/)
- [Spark Connect Guide](https://spark.apache.org/docs/latest/spark-connect-overview.html)
- [NVIDIA RAPIDS Accelerator](https://nvidia.github.io/spark-rapids/)
- [Data and AI Summit Session](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect)
@@ -0,0 +1,114 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# YAML anchors for shared configurations
x-spark-common: &spark-common
  networks:
    - spark-network
  volumes:
    - ${WORK_DIR}:/opt/spark/work-dir
    - ${DATA_DIR}:/data/mortgage.input.csv

x-spark-common-env: &spark-common-env
  SPARK_NO_DAEMONIZE: "1"
  SPARK_WORKER_DIR: /opt/spark/work-dir

services:
  # Spark Master Node
  spark-master:
    <<: *spark-common
    image: apache/spark:4.0.0
    container_name: spark-master
    hostname: spark-master
    environment:
      <<: *spark-common-env
    ports:
      - "8080:8080"   # Spark Master Web UI
      - "7077:7077"   # Spark Master Port
    command: /opt/spark/sbin/start-master.sh

  # Spark Worker Node (GPU-enabled)
  spark-worker:
    <<: *spark-common
    image: dais25/spark-worker
    build:
      context: ./spark-worker
      dockerfile: Dockerfile
    container_name: spark-worker
    hostname: spark-worker
    environment:
      <<: *spark-common-env
    ports:
      - "8081:8081"   # Spark Worker Web UI
      - "8889:8889"   # Jupyter Lab port for nvdashboard
    depends_on:
      - spark-master
    command: >
      bash -c '/opt/spark/sbin/start-worker.sh spark://spark-master:7077 &
      jupyter-lab --port 8889 --no-browser --IdentityProvider.token="" --ServerApp.password="" --ServerApp.ip='0.0.0.0' --ServerApp.allow_origin='*' /home/spark/work/README.ipynb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]

  # Spark Connect Server
  spark-connect-server:
    <<: *spark-common
    image: dais25/spark-connect-server
    build:
      context: ./spark-connect-server
      dockerfile: Dockerfile
    container_name: spark-connect-server
    hostname: spark-connect-server
    environment:
      <<: *spark-common-env
    ports:
      - "4040:4040"     # Spark Driver Web UI
      - "15002:15002"   # Spark Connect gRPC
    depends_on:
      - spark-master
      - spark-worker
    command: /opt/spark/sbin/start-connect-server.sh --driver-memory 4G

  spark-connect-client:
    <<: *spark-common
    image: dais25/spark-connect-client
    # hack: same uid as spark
    user: "185:185"
    group_add:
      - "users"
    build:
      context: ./spark-connect-client
      dockerfile: Dockerfile
    container_name: spark-connect-client
    hostname: spark-connect-client
    environment:
      - CHOWN_HOME=yes
      - PYDEVD_DISABLE_FILE_VALIDATION=1
      - MPLCONFIGDIR=/opt/spark/work-dir/mpl
    ports:
      - "8888:8888"   # Jupyter Lab port
    depends_on:
      - spark-connect-server
    command: >
      start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''

networks:
  spark-network:
    driver: bridge
@@ -0,0 +1,22 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM jupyter/minimal-notebook:latest
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is best practice around aligning python versions between client and server side? I think we are ok with anything here, but in general might need to line up. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I expect any combination within allowed version range 3.9+ to work across the wire at least from the Connect point of view. We can actually deliberately try to demonstrate in this PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I had issues with python/pandas udfs, but we are not using those here. Asking more generally. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we can add and document constraints arising from python serialization |
COPY notebooks /home/jovyan/work

RUN pip install -r /home/jovyan/work/requirements.txt