# Add a demo for GPU-accelerated Spark Connect with SQL/ETL and ML #570
@@ -0,0 +1,202 @@
# GPU-Accelerated Spark Connect for ETL and ML (Spark 4.0)

This project demonstrates a complete GPU-accelerated ETL and Machine Learning pipeline using Apache Spark 4.0 with Spark Connect, featuring the RAPIDS Accelerator. The example showcases the capabilities presented in the Data and AI Summit 2025 session:
[GPU Accelerated Spark Connect](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect).
It is similar to the XGBoost example in this repo.
The key differences are that it uses Spark Connect, so the notebook server node has no heavy dependencies, and that it uses
LogisticRegression to demonstrate accelerated Spark MLlib functionality.

## 🚀 Key Features

- **Apache Spark 4.0** with cutting-edge Spark Connect capabilities
- **GPU acceleration** via the RAPIDS Accelerator (up to 9x performance improvement)
- **MLlib over Spark Connect** - new in Spark 4.0
- **Zero-code-change acceleration** - existing Spark applications benefit automatically
- **Complete ETL and ML pipeline** demonstration with mortgage data
- **Jupyter Lab integration** for interactive development
- **Docker Compose** setup for easy deployment, with a clear distinction of which dependencies are
  required by which service and where GPUs are actually used
## 📊 Performance Highlights

The included demonstration shows:
- **Comprehensive ETL pipeline** processing mortgage data with complex transformations for feature engineering
- **Machine Learning workflow** using Logistic Regression with Feature Hashing
- **GPU vs CPU performance comparison** with a visualization of the speedup achieved on the hardware the demo is run on

## 🏗️ Architecture

The setup consists of four Docker services:

1. **Spark Master** (`spark-master`) - Cluster coordination and job scheduling
2. **Spark Worker** (`spark-worker`) - GPU-enabled worker node for task execution
3. **Spark Connect Server** (`spark-connect-server`) - gRPC interface with RAPIDS integration
4. **Jupyter Lab Client** (`spark-connect-client`) - Interactive development environment
## 📋 Prerequisites

### Required
- Docker and Docker Compose
- At least 8GB of available RAM
- Available ports: 8080, 8081, 8888, 8889, 7077, 4040, 15002

### For GPU Acceleration
- NVIDIA GPU with a CUDA compute capability supported by RAPIDS
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) (see the verification command below)
- CUDA 12.x drivers
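To confirm that Docker can see the GPU through the NVIDIA Container Toolkit, you can run a quick smoke test; the CUDA image tag below is only an example:

```bash
# Should print the nvidia-smi table from inside a container; if it fails,
# revisit the NVIDIA Container Toolkit installation.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```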
## 🚀 Quick Start

1. **Clone and navigate to the project:**
```bash
cd spark-connect-for-etl-and-ml
```

2. **Set up the data directory (if needed):**
```bash
export WORK_DIR=$(pwd)/work
export DATA_DIR=$(pwd)/data/mortgage.input.csv
mkdir -p work data
chmod 777 work
```
Download a few quarters' worth of the [Mortgage Dataset](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
to the `$DATA_DIR` location.

3. **Start all services** (a status check is shown after step 5):
```bash
docker-compose up -d
```
4. **Access the interfaces:**
- **Jupyter Lab**: http://localhost:8888 (no password required)
- **Spark Master UI**: http://localhost:8080
- **Spark Worker UI**: http://localhost:8081
- **Spark Worker NVdashboard**: http://localhost:8889
- **Spark Driver UI**: http://localhost:4040 (when jobs are running)

5. **Open the demo notebook:**
- Navigate to `work/spark-connect-demo.ipynb` in Jupyter Lab
- You can also open it in VS Code by selecting http://localhost:8888 as the
  existing notebook server connection
- Run the complete ETL and ML pipeline demonstration
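Once the services from step 3 are up, you can confirm that all four containers are running:

```bash
# spark-master, spark-worker, spark-connect-server, and spark-connect-client
# should all be listed as Up/running
docker-compose ps
```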
## 📝 Demo Notebook Overview

The `spark-connect-demo.ipynb` notebook demonstrates:
### ETL Pipeline
- **Data ingestion** from CSV with a custom schema
- **Complex transformations** including date parsing and delinquency calculations
- **String-to-numeric encoding** for categorical features
- **Data joins and aggregations** with mortgage performance data (a simplified sketch of this pattern follows the list)
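The sketch below illustrates the general shape of these steps over a Spark Connect session; the schema and column names are illustrative placeholders, not the notebook's actual mortgage schema.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# `spark` is the Spark Connect session created as shown under Key Code Examples below.
# Illustrative schema; the notebook defines the full mortgage performance schema.
schema = StructType([
    StructField('loan_id', StringType()),
    StructField('monthly_reporting_period', StringType()),
    StructField('servicer', StringType()),
    StructField('current_actual_upb', DoubleType()),
])

perf = spark.read.csv('/data/mortgage.input.csv', schema=schema, header=False)

features = (
    perf
    # date parsing
    .withColumn('report_month', F.to_date('monthly_reporting_period', 'MM/yyyy'))
    # string-to-numeric encoding of a categorical column
    .withColumn('servicer_code',
                F.when(F.col('servicer').isNull(), 0).otherwise(F.hash('servicer')))
    # aggregation joined back to the detail rows
    .join(
        perf.groupBy('loan_id').agg(F.max('current_actual_upb').alias('max_upb')),
        on='loan_id', how='inner',
    )
)
features.show(5)
```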
### Machine Learning Workflow
- **Feature engineering** with FeatureHasher and VectorAssembler
- **Logistic Regression** training for multi-class prediction
- **Model evaluation** with performance metrics
- **GPU vs CPU timing comparisons**
### Key Code Examples

**Connecting to Spark with GPU acceleration:**
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote('sc://spark-connect-server')
    .appName('GPU-Accelerated-ETL-ML-Demo')
    .getOrCreate()
)
```
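A quick, purely illustrative sanity check that the remote session works:

```python
print(spark.version)   # Spark version reported by the Connect server
spark.range(5).show()  # trivial query executed on the cluster
```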
**GPU acceleration test:**
```python
from pyspark.sql.functions import col, count, lit

spark.conf.set('spark.rapids.sql.enabled', True)
df = (
    spark.range(2 ** 35)
    .withColumn('mod10', col('id') % lit(10))
    .groupBy('mod10').agg(count('*'))
    .orderBy('mod10')
)
df.explain(mode='extended')  # GPU operators (e.g. GpuHashAggregate) appear in the physical plan when RAPIDS is active
```
**Machine Learning with GPU acceleration:**
```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, FeatureHasher

spark.conf.set('spark.connect.ml.backend.classes', 'com.nvidia.rapids.ml.Plugin')

# Feature preparation (categorical_cols, numerical_cols, and training_data come from the ETL steps in the notebook)
hasher = FeatureHasher(inputCols=categorical_cols, outputCol='hashed_categorical')
assembler = VectorAssembler().setInputCols(numerical_cols + ['hashed_categorical']).setOutputCol('features')

# Model training
logistic = LogisticRegression().setFeaturesCol('features').setLabelCol('delinquency_12')
pipeline = Pipeline().setStages([hasher, assembler, logistic])
model = pipeline.fit(training_data)
```
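The notebook also evaluates the trained model; a minimal sketch using the standard MLlib evaluator is shown below (`test_data` and the accuracy metric are assumptions rather than the notebook's exact code):

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Hypothetical held-out split; the notebook derives its own train/test data
predictions = model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(
    labelCol='delinquency_12',
    predictionCol='prediction',
    metricName='accuracy',
)
print('accuracy:', evaluator.evaluate(predictions))
```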
## 🐳 Service Details

### Spark Master
- **Image**: `apache/spark:4.0.0`
- **Ports**: 8080 (Web UI), 7077 (Master)
- **Role**: Cluster coordination and resource management
### Spark Worker (the only GPU node role)
- **Image**: Custom build based on `apache/spark:4.0.0`
- **GPU**: NVIDIA GPU support via the Docker Compose deploy configuration
- **Ports**: 8081 (Web UI), 8889 (NVdashboard)
- **Features**: GPU resource discovery and task execution

### Spark Connect Server
- **Image**: Custom build based on `apache/spark:4.0.0` with the Spark RAPIDS ETL and ML plugins
- **RAPIDS Version**: 25.08.0 for CUDA 12
- **Ports**: 15002 (gRPC), 4040 (Driver UI)
- **Configuration**: Optimized for GPU acceleration with memory management
### JupyterLab - Spark Connect Client
- **Image**: Based on `jupyter/minimal-notebook:latest`
- **Environment**: Pre-configured with the PySpark Connect client
- **Ports**: 8888 (Jupyter Lab)
- **Volumes**: Notebooks and work directory mounted

## 📊 Performance Monitoring

You can use tools such as `nvtop` or `jupyterlab_nvdashboard` running on the GPU host(s).
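For a quick command-line view of GPU utilization while the demo runs, something along these lines on the GPU host also works (the sampling interval is just an example):

```bash
# Stream per-GPU utilization and memory usage once per second
nvidia-smi dmon -s um -d 1
# or simply
watch -n 1 nvidia-smi
```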
## 🧹 Cleanup

Stop and remove all services:
```bash
docker-compose down -v
```

Remove built images:
```bash
docker-compose down --rmi all -v
```
## Troubleshooting

Repeated executions of the notebook sometimes result in unexpected side effects
such as a `FileNotFoundException`. To mitigate this, restart the spark-connect-server
service

```bash
$ WORK_DIR=~/work DATA_DIR=~/dais-2025/data/mortgage/raw docker compose restart spark-connect-server
```

and/or restart the Jupyter kernel.
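If a restart does not help, the Connect server logs usually show the failing stage:

```bash
docker compose logs --tail=100 spark-connect-server
```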
## 📖 Additional Resources

- [Apache Spark 4.0 Documentation](https://spark.apache.org/docs/latest/)
- [Spark Connect Guide](https://spark.apache.org/docs/latest/spark-connect-overview.html)
- [NVIDIA RAPIDS Accelerator](https://nvidia.github.io/spark-rapids/)
- [Data and AI Summit Session](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect)
@@ -0,0 +1,114 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# YAML anchors for shared configurations
x-spark-common: &spark-common
  networks:
    - spark-network
  volumes:
    - ${WORK_DIR}:/opt/spark/work-dir
    - ${DATA_DIR}:/data/mortgage.input.csv

x-spark-common-env: &spark-common-env
  SPARK_NO_DAEMONIZE: "1"
  SPARK_WORKER_DIR: /opt/spark/work-dir

services:
  # Spark Master Node
  spark-master:
    <<: *spark-common
    image: apache/spark:4.0.0
    container_name: spark-master
    hostname: spark-master
    environment:
      <<: *spark-common-env
    ports:
      - "8080:8080"   # Spark Master Web UI
      - "7077:7077"   # Spark Master Port
    command: /opt/spark/sbin/start-master.sh

  # Spark Worker Node (GPU-enabled)
  spark-worker:
    <<: *spark-common
    image: dais25/spark-worker
    build:
      context: ./spark-worker
      dockerfile: Dockerfile
    container_name: spark-worker
    hostname: spark-worker
    environment:
      <<: *spark-common-env
    ports:
      - "8081:8081"   # Spark Worker Web UI
      - "8889:8889"   # Jupyter Lab port for nvdashboard
    depends_on:
      - spark-master
    command: >
      bash -c '/opt/spark/sbin/start-worker.sh spark://spark-master:7077 &
      jupyter-lab --port 8889 --no-browser --IdentityProvider.token="" --ServerApp.password="" --ServerApp.ip='0.0.0.0' --ServerApp.allow_origin='*' /home/spark/work/README.ipynb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]

  # Spark Connect Server
  spark-connect-server:
    <<: *spark-common
    image: dais25/spark-connect-server
    build:
      context: ./spark-connect-server
      dockerfile: Dockerfile
    container_name: spark-connect-server
    hostname: spark-connect-server
    environment:
      <<: *spark-common-env
    ports:
      - "4040:4040"     # Spark Driver Web UI
      - "15002:15002"   # Spark Connect gRPC
    depends_on:
      - spark-master
      - spark-worker
    command: /opt/spark/sbin/start-connect-server.sh --driver-memory 4G

  spark-connect-client:
    <<: *spark-common
    image: dais25/spark-connect-client
    # hack: same uid as spark
    user: "185:185"
    group_add:
      - "users"
    build:
      context: ./spark-connect-client
      dockerfile: Dockerfile
    container_name: spark-connect-client
    hostname: spark-connect-client
    environment:
      - CHOWN_HOME=yes
      - PYDEVD_DISABLE_FILE_VALIDATION=1
      - MPLCONFIGDIR=/opt/spark/work-dir/mpl
    ports:
      - "8888:8888"   # Jupyter Lab port
    depends_on:
      - spark-connect-server
    command: >
      start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''

networks:
  spark-network:
    driver: bridge
@@ -0,0 +1,22 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM jupyter/minimal-notebook:latest
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is best practice around aligning python versions between client and server side? I think we are ok with anything here, but in general might need to line up. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I expect any combination within allowed version range 3.9+ to work across the wire at least from the Connect point of view. We can actually deliberately try to demonstrate in this PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I had issues with python/pandas udfs, but we are not using those here. Asking more generally. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we can add and document constraints arising from python serialization |
COPY notebooks /home/jovyan/work

RUN pip install -r /home/jovyan/work/requirements.txt