Add a demo for GPU-accelerated Spark Connect with SQL/ETL and ML #570
Merged
Commits
e09d6cf
GPU-accelerated Spark Connect: Spark SQL + Spark MLLib
gerashegalov 22b9733
No real need to persist notebook
gerashegalov 437434f
Update Spark Connect example and Docker configuration
gerashegalov 02238fb
Enhance Spark Connect example with new configurations and schema files
gerashegalov 43a0dee
Update Spark Connect example with new data input path and notebook en…
gerashegalov fb998fc
Update Spark Connect example with enhanced configurations and noteboo…
gerashegalov 4c7be97
Enhance Spark Connect demo notebook with ML modeling phase
gerashegalov fe96979
Refactor Spark Connect demo notebook and update Dockerfile
gerashegalov 72f1267
Enhance Spark Connect server configuration and demo notebook
gerashegalov 25dc328
Refactor Spark Connect Dockerfile and update demo notebook
gerashegalov 6b3b747
Update Spark Connect Dockerfile and configuration for GPU support
gerashegalov 37d2e6c
Enhance Spark Connect demo notebook and Dockerfiles for improved ML e…
gerashegalov d83192d
Refactor Spark Connect configuration and enhance demo notebook for im…
gerashegalov e34bf10
Copyright check
gerashegalov f3b6f1d
Add configs to README
gerashegalov 5bb73f2
Add an entry to the main README
gerashegalov 5b84f46
Adjust notebook name
gerashegalov cf3dbf1
Update Spark Connect configurations and enhance demo notebook
gerashegalov effc7eb
Reorganize requirements.txt files for Spark Connect server and worker
gerashegalov 9ec75a3
Reviews and NVdashboard on spark-worker
gerashegalov 150cd2a
Review iteration
gerashegalov 5696636
remove nvdashboard
gerashegalov 3bd9a6a
Update requirements and environment configuration for Spark Connect
gerashegalov 563301e
Remove irrelevant files from index
gerashegalov 00d078b
address comments including making UI links navigable over ssh tunnel
eordentlich 80de902
cleanup + correct UI pointers for proxy
eordentlich ab275c1
no need for worker and ui port forward with UI reverse proxy
eordentlich 1f90c81
Add SOCKS5 proxy service to Docker Compose and update environment con…
gerashegalov 0272f9c
Remove GPU and CPU demo notebooks for Spark Connect
gerashegalov 3600f46
Update spark-env.sh to set SPARK_PUBLIC_DNS dynamically
gerashegalov dc0ad2b
Enhance Docker Compose setup with NGINX proxy and volume configuration
gerashegalov 01645da
Remove unnecessary whitespace in nginx.conf for cleaner configuration
gerashegalov dbe8307
http_proxy
gerashegalov 8787ba7
Enhance Docker Compose configuration for Spark Connect
gerashegalov e5491a1
Refactor NGINX proxy service in Docker Compose setup
gerashegalov aa3c11a
Refactor NGINX configuration for proxy service
gerashegalov 965ed8f
Merge remote-tracking branch 'eordentlich/issue543_eo' into gerashega…
gerashegalov 26be358
Update NGINX configuration to use $http_host for Host header
gerashegalov 5e4556c
Update README.md to change WORK_DIR permissions from 777 to 1777 for …
gerashegalov b6c8e19
Update README and demo notebook for improved Spark Connect access and…
gerashegalov 0f40360
Update README and configuration files for clarity and licensing
gerashegalov d78d1a3
Add GPU resource configuration to spark-defaults.conf
gerashegalov 349a601
Enhance Docker Compose and configuration for Spark Connect setup
gerashegalov e0c0f21
Remove outdated CSV file sizes from README.md for clarity
gerashegalov 84ad509
Add spark.executor.cores configuration to spark-defaults.conf
gerashegalov ef20371
Add shuffle manager configuration to spark-defaults.conf
gerashegalov 86b55a6
Update spark-defaults.conf to include locality wait configuration
gerashegalov fb82d55
Enhance Spark Connect configuration for improved performance
gerashegalov 63e9b9d
Refactor Spark Connect configuration and documentation
gerashegalov 5a8e6ab
Enhance Spark Connect configuration with dynamic versioning and permi…
gerashegalov 0a5bbde
Update requirements.txt to replace pyspark[connect] with pyspark-client
gerashegalov 4dd9991
Enhance spark-connect-demo notebook with data directory configuration
gerashegalov aadc33e
Refactor spark-connect-demo notebook for improved path management and…
gerashegalov ddf7181
Enhance Spark Connect configuration and documentation
gerashegalov be4ecd5
Add example acceleration chart and update README for GPU demo
gerashegalov 16299d4
Apply suggestion from @rishic3
gerashegalov 3e5cdc8
Merge remote-tracking branch 'gerashegalov/gerashegalov/issue543' int…
gerashegalov 1bd6d4b
Update README and spark-connect-demo notebook for improved clarity an…
gerashegalov 99879e8
Merge remote-tracking branch 'origin/main' into gerashegalov/issue543
gerashegalov
# GPU-Accelerated Spark Connect for ETL and ML (Spark 4.0)

This project demonstrates a complete GPU-accelerated ETL and machine learning pipeline using Apache Spark 4.0 with Spark Connect, featuring the RAPIDS Accelerator. The example showcases the capabilities presented in the Data and AI Summit 2025 session:
[GPU Accelerated Spark Connect](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect).
It is similar to the XGBoost example in this repo. The key differences are that it uses Spark Connect, so the notebook server node has no heavy dependencies, and that it uses
LogisticRegression to demonstrate accelerated Spark MLlib functionality.

## 🚀 Key Features

- **Apache Spark 4.0** with cutting-edge Spark Connect capabilities
- **GPU acceleration** via the RAPIDS Accelerator
- **MLlib over Spark Connect** - new in Spark 4.0
- **Zero-code-change acceleration** - existing Spark applications automatically benefit
- **Complete ETL and ML pipeline** demonstration with mortgage data
- **Jupyter Lab integration** for interactive development
- **Docker Compose** setup for easy deployment, with a clear distinction of which dependencies are
  required by which service and where GPUs are actually used

## 📊 Performance Highlights

The included demonstration shows:
- **Comprehensive ETL pipeline** processing mortgage data with complex transformations for feature engineering
- **Machine learning workflow** using Logistic Regression with Feature Hashing
- **GPU vs CPU performance comparison** with a visualization of the speedup achieved on the hardware the demo is run on

## 🏗️ Architecture

The setup consists of four Docker services:

1. **Spark Master** (`spark-master`) - Cluster coordination and job scheduling
2. **Spark Worker** (`spark-worker`) - GPU-enabled worker node for task execution
3. **Spark Connect Server** (`spark-connect-server`) - gRPC interface with RAPIDS integration
4. **Jupyter Lab - Spark Connect Client** (`spark-connect-client`) - Interactive development environment

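
Because the client talks to the cluster only over gRPC, nothing heavier than the thin `pyspark-client` package is needed on the client side. As a minimal sketch (assuming `pyspark-client` is installed locally and the Connect server's gRPC port 15002 is published on localhost, as listed under Prerequisites), you can verify the endpoint from any Python environment:

```python
# Minimal thin-client connectivity check (a sketch, not part of the demo notebook).
# Assumes `pip install pyspark-client` locally and that port 15002 is published
# on localhost by the compose setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote('sc://localhost:15002').getOrCreate()
print(spark.version)           # version reported by the Spark Connect server
print(spark.range(5).count())  # trivial query executed on the cluster
spark.stop()
```
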
## 📋 Prerequisites

### Required
- [Docker](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/linux)
- At least 8GB of available RAM
- Available ports: 2080, 8080, 8081, 8888, 7077, 4040, 15002

### For GPU Acceleration
- NVIDIA GPU with a CUDA compute capability supported by RAPIDS
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- Docker Compose version `2.30.x` or newer to avoid an NVIDIA Container Toolkit related bug. [Update](https://docs.docker.com/compose/install/linux) if necessary
- CUDA 12.x drivers

## 🚀 Quick Start

1. **Clone and navigate to the project:**
```bash
cd examples/spark-connect-for-etl-and-ml
```

2. **Set up data directory (if needed):**
```bash
export DATA_DIR=$(pwd)/data
mkdir -p $DATA_DIR/mortgage.input.csv $DATA_DIR/spark-events
chmod 1777 $DATA_DIR $DATA_DIR/spark-events
```

Download a few quarters worth of the [Mortgage Dataset](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
to the `$DATA_DIR/mortgage.input.csv` location. The demo at the Data+AI Summit'25 used the following quarters:

```bash
$ du -h *
503M  2023Q1.csv
412M  2023Q2.csv
162M  2023Q3.csv
1.1G  2023Q4.csv
```

3. **Start all services:**

```bash
$ docker compose up -d
```

(`docker compose` and `docker-compose` can be used interchangeably here and throughout)

4. **Access the Web UI interfaces:**

***Option 1 (default)***

All containers' web UIs are available via localhost URIs by default:

- **Jupyter Lab**: http://localhost:8888 (no password required) - Interactive notebook environment
- **Spark Master UI**: http://localhost:8080 - Cluster coordination and resource management
- **Spark Worker UI**: http://localhost:8081 - GPU-enabled worker node status and tasks
- **Spark Driver UI**: http://localhost:4040 - Application monitoring and SQL queries

***Option 2***

If you launch Docker Compose in an environment with `SPARK_PUBLIC_DNS=container-hostname`, all containers'
web UIs except Jupyter Lab are available via the corresponding container hostnames, such as spark-master:

- **Jupyter Lab**: http://localhost:8888 (no password required) - Interactive notebook environment
- **Spark Master UI**: http://spark-master:8080 - Cluster coordination and resource management
- **Spark Worker UI**: http://spark-worker:8081 - GPU-enabled worker node status and tasks
- **Spark Driver UI**: http://spark-connect-server:4040 - Application monitoring and SQL queries

Docker DNS names require configuring your browser to use the HTTP proxy on the Docker network exposed at
http://localhost:2080.

Here are examples of launching Google Chrome with a temporary user profile, without making persistent changes to the browser:

***Linux***

```bash
$ google-chrome --user-data-dir="/tmp/chrome-proxy-profile" --proxy-server="http=http://localhost:2080"
```

***macOS***

```bash
$ open -n -a "Google Chrome" --args --user-data-dir="/tmp/chrome-proxy-profile" --proxy-server="http=http://localhost:2080"
```

***Launching containers on a remote machine***

Your local machine might not have a GPU; in that case it is common to use a
remote machine/cluster with GPUs residing in a remote cloud or on-prem environment.

If you followed the default Option 1, make sure to create local port forwards for
every web UI port:

```bash
ssh <user@gpu-host> -L 8888:localhost:8888 -L 8080:localhost:8080 -L 8081:localhost:8081 -L 4040:localhost:4040
```

If you used Option 2, it is sufficient to forward ports only for the HTTP proxy and the notebook app:

```bash
ssh <user@gpu-host> -L 2080:localhost:2080 -L 8888:localhost:8888
```

5. **Open the demo notebook:**
- Navigate to `work/spark-connect-demo.ipynb` in Jupyter Lab
- You can also open it in VS Code by selecting http://localhost:8888 as the
  existing notebook server connection
- Run the complete ETL and ML pipeline demonstration

## 📝 Demo Notebook Overview

The `spark-connect-demo.ipynb` notebook demonstrates:

### ETL Pipeline
- **Data ingestion** from CSV with custom schema
- **Complex transformations** including date parsing and delinquency calculations
- **String-to-numeric encoding** for categorical features
- **Data joins and aggregations** with mortgage performance data

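
For orientation, a minimal sketch of what the ingestion step looks like over Spark Connect is shown below; the column names, delimiter, and container-side path are illustrative assumptions, and the notebook's own schema files are authoritative:

```python
# Illustrative ingestion sketch (column names, delimiter, and path are assumptions;
# see the notebook and its schema files for the exact definitions).
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

raw_schema = StructType([
    StructField('loan_id', StringType(), True),
    StructField('monthly_reporting_period', StringType(), True),  # parsed to a date later
    StructField('current_loan_delinquency_status', IntegerType(), True),
    StructField('original_loan_term', IntegerType(), True),
    StructField('orig_interest_rate', DoubleType(), True),
])

raw_df = (
    spark.read
    .option('header', False)
    .option('delimiter', '|')                   # assumption about the quarterly file format
    .schema(raw_schema)
    .csv('/opt/spark/data/mortgage.input.csv')  # path as mounted inside the containers (assumption)
)
raw_df.printSchema()
```
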

### Machine Learning Workflow
- **Feature engineering** with FeatureHasher and VectorAssembler
- **Logistic Regression** training for multi-class prediction
- **Model evaluation** with performance metrics
- **GPU vs CPU timing comparisons**

### Key Code Examples

**Connecting to Spark with GPU acceleration:**
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote('sc://spark-connect-server')
    .appName('GPU-Accelerated-ETL-ML-Demo')
    .getOrCreate()
)
```

**GPU acceleration test:**
```python
from pyspark.sql.functions import col, count, lit

spark.conf.set('spark.rapids.sql.enabled', True)
df = (
    spark.range(2 ** 35)
    .withColumn('mod10', col('id') % lit(10))
    .groupBy('mod10').agg(count('*'))
    .orderBy('mod10')
)
df.explain(mode='extended')  # Shows GPU operations in the physical plan
```
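
The GPU vs CPU comparison in the notebook amounts to running the same workload with the RAPIDS plugin enabled and disabled and timing both runs. A rough sketch of such a harness (the query and timing code here are illustrative, not the notebook's exact code):

```python
# Rough timing harness (illustrative only; the notebook drives its own
# comparison and chart). Toggles the RAPIDS SQL plugin per run.
import time
from pyspark.sql.functions import col, count, lit

def timed_run(use_gpu: bool) -> float:
    spark.conf.set('spark.rapids.sql.enabled', use_gpu)
    start = time.perf_counter()
    (
        spark.range(2 ** 30)
        .withColumn('mod10', col('id') % lit(10))
        .groupBy('mod10').agg(count('*'))
        .collect()
    )
    return time.perf_counter() - start

gpu_s, cpu_s = timed_run(True), timed_run(False)
print(f'GPU: {gpu_s:.1f}s  CPU: {cpu_s:.1f}s  speedup: {cpu_s / gpu_s:.1f}x')
```
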

**Machine Learning with GPU acceleration:**
```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, FeatureHasher

spark.conf.set('spark.connect.ml.backend.classes', 'com.nvidia.rapids.ml.Plugin')

# Feature preparation
hasher = FeatureHasher(inputCols=categorical_cols, outputCol='hashed_categorical')
assembler = VectorAssembler().setInputCols(numerical_cols + ['hashed_categorical']).setOutputCol('features')

# Model training
logistic = LogisticRegression().setFeaturesCol('features').setLabelCol('delinquency_12')
pipeline = Pipeline().setStages([hasher, assembler, logistic])
model = pipeline.fit(training_data)
```
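
The notebook also scores the fitted pipeline, as listed under "Model evaluation" above. A minimal sketch of that step (the metric choice and the `test_data` split are assumptions; the label column name comes from the example above):

```python
# Evaluation sketch (assumes a held-out `test_data` DataFrame exists;
# the accuracy metric is illustrative).
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(
    labelCol='delinquency_12',
    predictionCol='prediction',
    metricName='accuracy',
)
print(f'accuracy: {evaluator.evaluate(predictions):.4f}')
```
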

### Results

This demo was tested on a machine with a 24GiB Quadro RTX 6000

```bash
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 6000                Off |   00000000:01:00.0 Off |                  Off |
| 33%   25C    P8              7W /  260W |   10354MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

and a 64-vcore CPU.

## 🐳 Service Details

### Spark Master
- **Image**: `apache/spark:4.0.0`
- **Ports**: 8080 (Web UI), 7077 (Master)
- **Role**: Cluster coordination and resource management

### Spark Worker (the only GPU node role)
- **Image**: Custom build based on `apache/spark:4.0.0`
- **GPU**: NVIDIA GPU support via the Docker Compose deploy configuration
- **Ports**: 8081 (Web UI)
- **Features**: GPU resource discovery and task execution

### Spark Connect Server
- **Image**: Custom build based on `apache/spark:4.0.0` with Spark RAPIDS ETL and ML Plugins
- **RAPIDS Version**: 25.08.0 for CUDA 12
- **Ports**: 15002 (gRPC), 4040 (Driver UI)
- **Configuration**: Optimized for GPU acceleration with memory management

### JupyterLab - Spark Connect Client
- **Image**: Based on `jupyter/minimal-notebook:latest`
- **Environment**: Pre-configured with the PySpark Connect client
- **Ports**: 8888 (Jupyter Lab)
- **Volumes**: Notebooks and work directory mounted

## 📊 Performance Monitoring

You can use tools like nvtop, nvitop, btop, or jupyterlab_nvdashboard running on the GPU host(s).
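
If you prefer a scriptable check over an interactive dashboard, a small sketch using the NVIDIA Management Library bindings is shown below; it assumes the `nvidia-ml-py` package is installed on the GPU host, which is not part of this demo's requirements files:

```python
# Poll GPU utilization and memory on the GPU host (assumes `pip install nvidia-ml-py`;
# not part of this demo's requirements files).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f'GPU util: {util.gpu}%  memory: {mem.used / 2**20:.0f} MiB / {mem.total / 2**20:.0f} MiB')
    time.sleep(1)
pynvml.nvmlShutdown()
```
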

## 🧹 Cleanup

Stop and remove all services:
```bash
docker-compose down -v
```

Remove built images:
```bash
docker-compose down --rmi all -v
```

### Logs
Logs for the Spark driver/Connect server, standalone master, standalone worker, and Jupyter server can be viewed using the respective commands:
```bash
docker logs spark-connect-server
docker logs spark-master
docker logs spark-worker
docker logs spark-connect-client
```

Spark executor logs can be accessed via the Spark UI as usual.

## 📖 Additional Resources

- [Apache Spark 4.0 Documentation](https://spark.apache.org/docs/latest/)
- [Spark Connect Guide](https://spark.apache.org/docs/latest/spark-connect-overview.html)
- [NVIDIA RAPIDS Accelerator](https://nvidia.github.io/spark-rapids/)
- [Data and AI Summit Session](https://www.databricks.com/dataaisummit/session/gpu-accelerated-spark-connect)