
Conversation

@gerashegalov (Collaborator) commented Sep 15, 2025

This PR adds a runnable demo that shows how to use Spark Connect with the NVIDIA RAPIDS Accelerator to run GPU-accelerated workloads. The demo includes code, configuration, and instructions for launching a small end-to-end example so users can validate and benchmark Spark Connect + RAPIDS in GPU-enabled environments.

Resolves #543
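
For orientation, here is a minimal Spark Connect client sketch of the kind of session the demo opens; the endpoint is the Spark Connect default port, an assumption rather than the demo's exact configuration:

```python
# Minimal Spark Connect client sketch. "sc://localhost:15002" is the
# Spark Connect default endpoint, assumed here for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.version)  # reports the server-side Spark version over the wire
```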

- Modified docker-compose.yaml to use a custom spark-worker image and added a build context.
- Updated spark-connect-demo.ipynb to streamline the GPU acceleration demonstration and improve the Spark version output.
- Added a Dockerfile for the custom spark-worker image, including spark-env.sh for GPU resource configuration.

Signed-off-by: Gera Shegalov <[email protected]>
- Simplified the command in docker-compose.yaml for the Spark Connect server.
- Added csv_raw_schema.ddl to define the schema for the raw CSV data.
- Introduced name_mapping.csv for mapping seller names (see the sketch after this list).
- Updated spark-connect-demo.ipynb to improve GPU processing demonstration and added normalization for bank references.
- Created spark-defaults.conf for Spark server configuration, including GPU resource settings.
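
A hypothetical sketch of how the two new data files might be used together; the paths, delimiter, and column names (`seller_name`, `from`, `to`) are assumptions, not read from the actual files:

```python
# Hypothetical: read the raw CSV with the DDL schema, then normalize
# seller names via the mapping file. All names and paths are illustrative.
from pyspark.sql.functions import coalesce

raw_schema = open("csv_raw_schema.ddl").read()                # DDL string schema
raw_df = spark.read.csv("raw/", schema=raw_schema, sep="|")
mapping_df = spark.read.csv("name_mapping.csv", header=True)  # columns: from, to

normalized_df = (
    raw_df.join(mapping_df, raw_df["seller_name"] == mapping_df["from"], "left")
          .withColumn("seller_name",
                      coalesce(mapping_df["to"], raw_df["seller_name"]))
          .drop("from", "to")
)
```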

Signed-off-by: Gera Shegalov <[email protected]>
…hancements

- Added a new volume mapping in docker-compose.yaml for raw mortgage data input.
- Simplified markdown header in spark-connect-demo.ipynb.
- Updated GPU processing output identifiers in the notebook for consistency.
- Introduced new functions for parsing dates and creating delinquency data frames in the notebook.
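
A sketch of what the date-parsing helper might look like; the function and column names are illustrative, not the notebook's actual ones:

```python
# Hypothetical date-parsing helper; the real notebook functions may differ.
from pyspark.sql import functions as F

def parse_dates(df, col_name, fmt="MM/dd/yyyy"):
    """Replace a string date column with a parsed DATE column."""
    return df.withColumn(col_name, F.to_date(F.col(col_name), fmt))

# e.g. normalize the reporting period before computing delinquency windows
perf_df = parse_dates(perf_df, "monthly_reporting_period")
```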

Signed-off-by: Gera Shegalov <[email protected]>
…k improvements

- Changed the input file path in docker-compose.yaml for mortgage data to a more descriptive name.
- Cleaned up the spark-connect-demo.ipynb by removing unnecessary execution counts and outputs, and added new ETL processing steps.
- Updated spark-defaults.conf to enable event logging and specify the log directory.
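
For reference, the event-logging change corresponds to settings like these, shown as they would look if set programmatically on a plain (non-Connect) session; the log directory is an assumption:

```python
# Event-logging settings equivalent to the spark-defaults.conf change;
# the directory path is illustrative. In the demo these live server-side
# in spark-defaults.conf, not in client code.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/data/spark-events")
    .getOrCreate()
)
```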

Signed-off-by: Gera Shegalov <[email protected]>
- Added a new section for the ML modeling phase using the `spark.ml` Pipeline API (sketched after this list).
- Included steps for feature hashing, vector assembly, and logistic regression model training.
- Updated the notebook to sample the ETL output and prepare data for modeling.
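
A minimal sketch of a Pipeline along those lines, assuming hypothetical column names (`seller_name`, `orig_interest_rate`, `delinquency_12`):

```python
# Hypothetical pipeline mirroring the steps above; column names are
# assumptions, and train_df stands for the sampled ETL output.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import FeatureHasher, VectorAssembler

hasher = FeatureHasher(inputCols=["seller_name"], outputCol="hashed")
assembler = VectorAssembler(inputCols=["hashed", "orig_interest_rate"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="delinquency_12")

model = Pipeline(stages=[hasher, assembler, lr]).fit(train_df)
```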

Signed-off-by: Gera Shegalov <[email protected]>
- Updated the Spark Connect demo notebook to improve clarity and organization by modifying markdown headers and restructuring code cells.
- Removed redundant imports and added necessary imports for ML processing.
- Enhanced the Dockerfile by adding `scikit-learn` as a dependency for machine learning tasks.

Signed-off-by: Gera Shegalov <[email protected]>
- Added a new requirements.txt file to manage Python dependencies for the Spark Connect server, including `spark-rapids-ml`, `scikit-learn`, and others.
- Updated the Dockerfile to install dependencies from requirements.txt for better maintainability.
- Modified spark-defaults.conf to increase executor memory to 8G for improved performance.
- Added an empty code cell in the spark-connect-demo.ipynb for future use.

Signed-off-by: Gera Shegalov <[email protected]>
- Added installation of Python dependencies from requirements.txt in the Dockerfile for better management of ML libraries.
- Introduced a new requirements.txt file specifying necessary packages like `spark-rapids-ml`, `scikit-learn`, and others.
- Cleaned up the spark-connect-demo.ipynb by removing empty code cells to enhance clarity.

Signed-off-by: Gera Shegalov <[email protected]>
- Modified Dockerfile to download the CUDA-enabled version of the RAPIDS library.
- Updated spark-defaults.conf to include a GPU discovery script and adjusted the jar path for the RAPIDS library.
- Enhanced requirements.txt to include additional dependencies for GPU processing.
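
The settings involved look roughly like the following; the plugin class and discovery script are standard Spark/RAPIDS pieces, but the jar path and file name here are assumptions, not the demo's exact values:

```python
# Illustrative RAPIDS-related settings, expressed as a Python dict of the
# keys one would put in spark-defaults.conf. Paths are assumptions.
rapids_conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.executor.resource.gpu.amount": "1",
    "spark.executor.resource.gpu.discoveryScript":
        "/opt/spark/examples/src/main/scripts/getGpusResources.sh",
    "spark.jars": "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-cuda12.jar",  # hypothetical path
}
```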

Signed-off-by: Gera Shegalov <[email protected]>
…valuation and GPU support

- Updated spark-connect-demo.ipynb to include ML evaluation using MulticlassClassificationEvaluator and added code for visualizing GPU acceleration results.
- Modified Dockerfiles for both server and worker to ensure proper installation of dependencies and configuration for GPU support, including adjustments to the order of commands for better clarity.
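
A sketch of the evaluation step; the label and prediction column names are assumed, and test_df stands for a held-out split:

```python
# Evaluate the fitted model on held-out data; column names are illustrative.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="delinquency_12",
    predictionCol="prediction",
    metricName="accuracy",
)
print("accuracy:", evaluator.evaluate(model.transform(test_df)))
```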

Signed-off-by: Gera Shegalov <[email protected]>
…proved usability

- Updated docker-compose.yaml to use environment variables for work and data directories, enhancing flexibility.
- Enhanced the spark-connect-demo.ipynb with clearer output messages and improved GPU operation visualization.
- Added a requirements.txt file for the Jupyter Lab client to manage Python dependencies effectively.
- Modified Dockerfiles for the Spark Connect client, server, and worker to streamline dependency installation and improve clarity.

Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov self-assigned this Sep 15, 2025
@gerashegalov linked an issue Sep 15, 2025 that may be closed by this pull request
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov added the `documentation` label Sep 15, 2025
Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
- Increased driver memory allocation in docker-compose.yaml for improved performance.
- Added a blank line in the Dockerfile for better readability.
- Updated spark-connect-demo.ipynb to reflect changes in Spark session ID and execution counts, and improved file path references for better clarity.
- Adjusted requirements.txt by removing the jupyterlab-nvdashboard dependency to streamline package management.
@eordentlich (Collaborator) left a comment:

Nice! I had a few comments inline.

```dockerfile
# limitations under the License.

FROM jupyter/minimal-notebook:latest
```
Collaborator:

What is best practice for aligning Python versions between the client and the server side? I think we are OK with anything here, but in general they might need to line up.

Collaborator (Author):

I expect any combination within the allowed version range (3.9+) to work across the wire, at least from the Connect point of view. We could deliberately demonstrate that in this PR.

Collaborator:

I think I had issues with Python/pandas UDFs, but we are not using those here. I'm asking more generally.

Collaborator (Author):

We can add and document the constraints arising from Python serialization.
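
To illustrate where such constraints bite: a pandas UDF is pickled on the client and unpickled by the server's Python workers, so mismatched interpreter or pandas versions can break deserialization. A sketch with a hypothetical column name:

```python
# The function body is serialized client-side and executed by the server's
# Python workers, which is where version alignment matters.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def zscore(v: pd.Series) -> pd.Series:
    return (v - v.mean()) / v.std()

df.select(zscore("orig_interest_rate")).show()  # hypothetical column name
```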


1. **Clone and navigate to the project:**
   ```bash
   cd spark-connect-for-etl-and-ml
   ```
Collaborator:

Assuming we start at the repo top level, this should be `cd examples/spark-connect-for-etl-and-ml`.

- Set spark.executor.cores to 16 for better resource allocation.
- Introduced the spark.shuffle.manager setting to use RapidsShuffleManager for faster shuffles (see the sketch after this list).
- Added the spark.locality.wait setting to optimize task scheduling.
- Updated docker-compose.yaml to increase driver and executor memory to 16G for better resource allocation.
- Cleaned up spark-connect-demo.ipynb by removing unnecessary smoke test code cells.
- Removed outdated spark.executor.memory setting from spark-defaults.conf to avoid conflicts with the new configuration.
- Updated docker-compose.yaml to simplify volume mappings and improve data directory structure.
- Revised README.md to reflect changes in data directory setup and removed outdated troubleshooting information.
- Modified spark-connect-demo.ipynb to adjust file paths for data storage and enhance clarity in data processing steps.
- Updated Dockerfile to change the download location for the Rapids jar file, ensuring proper directory structure for Spark resources.
- Adjusted spark-defaults.conf to correct the event log directory and local directory paths for consistency with new configurations.
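
In spark-defaults.conf terms, the cores, shuffle-manager, and locality settings named at the top of this list correspond to entries like the following; the shuffle-manager class name is shim-version-specific (`spark350` here is an assumption) and the locality value is a common choice, not necessarily the demo's exact one:

```python
# Illustrative key/value pairs for the settings named above, expressed as a
# Python dict of spark-defaults.conf keys. Values are assumptions.
tuning_conf = {
    "spark.executor.cores": "16",
    "spark.shuffle.manager": "com.nvidia.spark.rapids.spark350.RapidsShuffleManager",
    "spark.locality.wait": "0s",
}
```
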
…ssions

- Updated docker-compose.yaml to include build arguments for CUDA and RAPIDS versions, allowing for flexible configuration.
- Revised README.md to add a permission setting for the data directory, ensuring proper access for data processing.
- Modified Dockerfile to utilize build arguments for downloading the RAPIDS jar, improving maintainability and version control.
- Adjusted spark-defaults.conf to reference the newly downloaded RAPIDS jar file, ensuring correct Spark resource integration.
@gerashegalov changed the title from "Add a demo for GPU-accelerated Spark Connect" to "Add a demo for GPU-accelerated Spark Connect with SQL/ETL and ML" Oct 20, 2025
- pyspark[connect] is a package that bundles all the Spark jars
- pyspark-client is a pure-Python package
- Added a markdown cell to specify the required writable data directory and its contents.
- Updated file paths in code cells to use a variable for the data directory, improving flexibility and clarity in data processing steps.
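
A sketch of the variable-based path pattern; the variable name, schema, and directory layout are assumptions:

```python
# Hypothetical data-directory variable; the notebook's actual name and
# layout may differ.
data_dir = "/data"
raw_schema = "loan_id LONG, orig_interest_rate DOUBLE"  # illustrative DDL

raw_df = spark.read.csv(f"{data_dir}/mortgage/input", schema=raw_schema)
raw_df.write.mode("overwrite").parquet(f"{data_dir}/mortgage/etl-output")
```
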
… clarity

- Removed hardcoded paths in favor of a variable for the notebook directory, enhancing flexibility.
- Added a new markdown cell to clarify the purpose of the notebook's parent directory.
- Cleaned up the notebook structure by removing unnecessary metadata and execution counts for a more streamlined presentation.
- Increased driver memory allocation in docker-compose.yaml from 16G to 24G for improved performance.
- Added a SPARK_REMOTE environment variable in docker-compose.yaml for client-server communication (see the sketch after this list).
- Updated README.md to reflect changes in data directory setup, including new directories for spark-events and improved instructions for data processing.
- Introduced requirements.txt for the spark-connect-client Dockerfile to manage dependencies more effectively.
- Added a new spark-connect-demo.ipynb notebook for demonstrating GPU-accelerated Spark Connect with the mortgage dataset.
- Removed the outdated spark-connect-demo notebook to streamline the project structure.
- Introduced a new image file `example-acceleration-chart.png` to illustrate GPU acceleration results.
- Updated `README.md` to reflect the use of a 6GiB RTX A3000 Laptop GPU instead of the previous Quadro RTX 6000, and modified CPU description for clarity.
- Added a reference to the new acceleration chart in the README to enhance documentation.
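
With SPARK_REMOTE exported in the client container, PySpark picks up the endpoint without an explicit URL; the service name and port below are assumptions based on the compose setup and the Spark Connect default:

```python
# SparkSession.builder.getOrCreate() honors the SPARK_REMOTE environment
# variable; the fallback URL here is illustrative.
import os
os.environ.setdefault("SPARK_REMOTE", "sc://spark-connect-server:15002")

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
```
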
@eordentlich previously approved these changes Oct 22, 2025

@eordentlich (Collaborator) left a comment:

👍

Co-authored-by: Rishi Chandra <[email protected]>
@eordentlich previously approved these changes Oct 23, 2025

@eordentlich (Collaborator) left a comment:

👍

@eordentlich (Collaborator):

With the branching strategy change, should this be retargeted to the main branch?

@nvliyuan (Collaborator) commented Oct 23, 2025

> With the branching strategy change, should this be retargeted to the main branch?

Yes, please retarget to the main branch, thanks.

…d storage management

- Enhanced README.md to provide clearer descriptions of Docker services and their roles in the Apache Spark standalone cluster setup.
- Added sections on local and global storage access in spark-connect-demo.ipynb, including variable definitions for data directories to improve flexibility.
- Updated file paths in the notebook to utilize the new directory variables, ensuring consistency and ease of use across different environments.
@gerashegalov changed the base branch from branch-25.10 to main October 23, 2025 05:50
@eordentlich (Collaborator):

Looks like we need to fix some markdown links.

@wbo4958 (Collaborator) commented Oct 24, 2025

> Looks like we need to fix some markdown links.

✖ http://localhost:8888/ → Status: 0
✖ http://localhost:8080/ → Status: 0
✖ http://localhost:8081/ → Status: 0
✖ http://localhost:4040/ → Status: 0
✖ http://spark-master:8080/ → Status: 0
✖ http://spark-worker:8081/ → Status: 0
✖ http://spark-connect-server:4040/ → Status: 0
✖ http://localhost:2080/ → Status: 0

The checker flags these links as dead, but they are only reachable while the demo is running, so the failures are expected and it's OK to merge.

@wbo4958 merged commit 32df0bd into NVIDIA:main Oct 24, 2025
3 of 4 checks passed
@gerashegalov deleted the gerashegalov/issue543 branch October 24, 2025 03:28
@gerashegalov added the `enhancement` label Oct 24, 2025

Successfully merging this pull request may close these issues:

Add DAIS 2025 ETL + ML over Spark Connect demo example
