
Conversation

@gerashegalov (Collaborator) commented Sep 15, 2025

This PR adds a runnable demo that shows how to use Spark Connect with the NVIDIA RAPIDS Accelerator to run GPU-accelerated workloads. The demo includes code, configuration, and instructions for launching a small end-to-end example so users can validate and benchmark Spark Connect + RAPIDS in GPU-enabled environments.

Resolves #543
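
For orientation, here is a minimal Spark Connect client sketch of the kind of session the demo opens; the endpoint is the Spark Connect default port, an assumption rather than the demo's exact configuration:

```python
# Minimal Spark Connect client sketch. "sc://localhost:15002" is the
# Spark Connect default endpoint, assumed here for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.version)  # reports the server-side Spark version over the wire
```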

- Modified docker-compose.yaml to use a custom spark-worker image and added a build context.
- Updated spark-connect-demo.ipynb to streamline the GPU acceleration demonstration and improve the Spark version output.
- Added a Dockerfile for the custom spark-worker image, including spark-env.sh for GPU resource configuration.

Signed-off-by: Gera Shegalov <[email protected]>
- Simplified the command in docker-compose.yaml for the Spark Connect server.
- Added csv_raw_schema.ddl to define the schema for the raw CSV data.
- Introduced name_mapping.csv for mapping seller names (see the sketch after this list).
- Updated spark-connect-demo.ipynb to improve GPU processing demonstration and added normalization for bank references.
- Created spark-defaults.conf for Spark server configuration, including GPU resource settings.
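
A hypothetical sketch of how the two new data files might be used together; the paths, delimiter, and column names (`seller_name`, `from`, `to`) are assumptions, not read from the actual files:

```python
# Hypothetical: read the raw CSV with the DDL schema, then normalize
# seller names via the mapping file. All names and paths are illustrative.
from pyspark.sql.functions import coalesce

raw_schema = open("csv_raw_schema.ddl").read()                # DDL string schema
raw_df = spark.read.csv("raw/", schema=raw_schema, sep="|")
mapping_df = spark.read.csv("name_mapping.csv", header=True)  # columns: from, to

normalized_df = (
    raw_df.join(mapping_df, raw_df["seller_name"] == mapping_df["from"], "left")
          .withColumn("seller_name",
                      coalesce(mapping_df["to"], raw_df["seller_name"]))
          .drop("from", "to")
)
```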

Signed-off-by: Gera Shegalov <[email protected]>
…hancements

- Added a new volume mapping in docker-compose.yaml for raw mortgage data input.
- Simplified markdown header in spark-connect-demo.ipynb.
- Updated GPU processing output identifiers in the notebook for consistency.
- Introduced new functions for parsing dates and creating delinquency data frames in the notebook.
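
A sketch of what the date-parsing helper might look like; the function and column names are illustrative, not the notebook's actual ones:

```python
# Hypothetical date-parsing helper; the real notebook functions may differ.
from pyspark.sql import functions as F

def parse_dates(df, col_name, fmt="MM/dd/yyyy"):
    """Replace a string date column with a parsed DATE column."""
    return df.withColumn(col_name, F.to_date(F.col(col_name), fmt))

# e.g. normalize the reporting period before computing delinquency windows
perf_df = parse_dates(perf_df, "monthly_reporting_period")
```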

Signed-off-by: Gera Shegalov <[email protected]>
…k improvements

- Changed the input file path in docker-compose.yaml for mortgage data to a more descriptive name.
- Cleaned up the spark-connect-demo.ipynb by removing unnecessary execution counts and outputs, and added new ETL processing steps.
- Updated spark-defaults.conf to enable event logging and specify the log directory.
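
For reference, the event-logging change corresponds to settings like these, shown as they would look if set programmatically on a plain (non-Connect) session; the log directory is an assumption:

```python
# Event-logging settings equivalent to the spark-defaults.conf change;
# the directory path is illustrative. In the demo these live server-side
# in spark-defaults.conf, not in client code.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/data/spark-events")
    .getOrCreate()
)
```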

Signed-off-by: Gera Shegalov <[email protected]>
- Added a new section for the ML modeling phase using the `spark.ml` Pipeline API (sketched after this list).
- Included steps for feature hashing, vector assembly, and logistic regression model training.
- Updated the notebook to sample the ETL output and prepare data for modeling.
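
A minimal sketch of a Pipeline along those lines, assuming hypothetical column names (`seller_name`, `orig_interest_rate`, `delinquency_12`):

```python
# Hypothetical pipeline mirroring the steps above; column names are
# assumptions, and train_df stands for the sampled ETL output.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import FeatureHasher, VectorAssembler

hasher = FeatureHasher(inputCols=["seller_name"], outputCol="hashed")
assembler = VectorAssembler(inputCols=["hashed", "orig_interest_rate"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="delinquency_12")

model = Pipeline(stages=[hasher, assembler, lr]).fit(train_df)
```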

Signed-off-by: Gera Shegalov <[email protected]>
- Updated the Spark Connect demo notebook to improve clarity and organization by modifying markdown headers and restructuring code cells.
- Removed redundant imports and added necessary imports for ML processing.
- Enhanced the Dockerfile by adding `scikit-learn` as a dependency for machine learning tasks.

Signed-off-by: Gera Shegalov <[email protected]>
- Added a new requirements.txt file to manage Python dependencies for the Spark Connect server, including `spark-rapids-ml`, `scikit-learn`, and others.
- Updated the Dockerfile to install dependencies from requirements.txt for better maintainability.
- Modified spark-defaults.conf to increase executor memory to 8G for improved performance.
- Added an empty code cell in the spark-connect-demo.ipynb for future use.

Signed-off-by: Gera Shegalov <[email protected]>
- Added installation of Python dependencies from requirements.txt in the Dockerfile for better management of ML libraries.
- Introduced a new requirements.txt file specifying necessary packages like `spark-rapids-ml`, `scikit-learn`, and others.
- Cleaned up the spark-connect-demo.ipynb by removing empty code cells to enhance clarity.

Signed-off-by: Gera Shegalov <[email protected]>
- Modified Dockerfile to download the CUDA-enabled version of the RAPIDS library.
- Updated spark-defaults.conf to include a GPU discovery script and adjusted the jar path for the RAPIDS library.
- Enhanced requirements.txt to include additional dependencies for GPU processing.
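
The settings involved look roughly like the following; the plugin class and discovery script are standard Spark/RAPIDS pieces, but the jar path and file name here are assumptions, not the demo's exact values:

```python
# Illustrative RAPIDS-related settings, expressed as a Python dict of the
# keys one would put in spark-defaults.conf. Paths are assumptions.
rapids_conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.executor.resource.gpu.amount": "1",
    "spark.executor.resource.gpu.discoveryScript":
        "/opt/spark/examples/src/main/scripts/getGpusResources.sh",
    "spark.jars": "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-cuda12.jar",  # hypothetical path
}
```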

Signed-off-by: Gera Shegalov <[email protected]>
…valuation and GPU support

- Updated spark-connect-demo.ipynb to include ML evaluation using MulticlassClassificationEvaluator and added code for visualizing GPU acceleration results.
- Modified Dockerfiles for both server and worker to ensure proper installation of dependencies and configuration for GPU support, including adjustments to the order of commands for better clarity.
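
A sketch of the evaluation step; the label and prediction column names are assumed, and test_df stands for a held-out split:

```python
# Evaluate the fitted model on held-out data; column names are illustrative.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="delinquency_12",
    predictionCol="prediction",
    metricName="accuracy",
)
print("accuracy:", evaluator.evaluate(model.transform(test_df)))
```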

Signed-off-by: Gera Shegalov <[email protected]>
…proved usability

- Updated docker-compose.yaml to use environment variables for work and data directories, enhancing flexibility.
- Enhanced the spark-connect-demo.ipynb with clearer output messages and improved GPU operation visualization.
- Added a requirements.txt file for the Jupyter Lab client to manage Python dependencies effectively.
- Modified Dockerfiles for the Spark Connect client, server, and worker to streamline dependency installation and improve clarity.

Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov self-assigned this Sep 15, 2025
@gerashegalov linked an issue Sep 15, 2025 that may be closed by this pull request
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov added the `documentation` label Sep 15, 2025
Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
- Increased driver memory allocation in docker-compose.yaml for improved performance.
- Added a blank line in the Dockerfile for better readability.
- Updated spark-connect-demo.ipynb to reflect changes in Spark session ID and execution counts, and improved file path references for better clarity.
- Adjusted requirements.txt by removing the jupyterlab-nvdashboard dependency to streamline package management.
@eordentlich (Collaborator) left a comment:

Nice! I had a few comments inline.

```dockerfile
# limitations under the License.

FROM jupyter/minimal-notebook:latest
```
Collaborator:

What is best practice for aligning Python versions between the client and the server side? I think we are OK with anything here, but in general they might need to line up.

Collaborator (Author):

I expect any combination within the allowed version range (3.9+) to work across the wire, at least from the Connect point of view. We could deliberately demonstrate that in this PR.

Collaborator:

I think I had issues with Python/pandas UDFs, but we are not using those here. I'm asking more generally.

Collaborator (Author):

We can add and document the constraints arising from Python serialization.
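
To illustrate where such constraints bite: a pandas UDF is pickled on the client and unpickled by the server's Python workers, so mismatched interpreter or pandas versions can break deserialization. A sketch with a hypothetical column name:

```python
# The function body is serialized client-side and executed by the server's
# Python workers, which is where version alignment matters.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def zscore(v: pd.Series) -> pd.Series:
    return (v - v.mean()) / v.std()

df.select(zscore("orig_interest_rate")).show()  # hypothetical column name
```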


1. **Clone and navigate to the project:**
   ```bash
   cd spark-connect-for-etl-and-ml
   ```
Collaborator:

Assuming we start at the repo top level, this should be `cd examples/spark-connect-for-etl-and-ml`.

- Set spark.executor.cores to 16 for better resource allocation.
- Introduced the spark.shuffle.manager setting to use RapidsShuffleManager for faster shuffles (see the sketch after this list).
- Added the spark.locality.wait setting to optimize task scheduling.
- Updated docker-compose.yaml to increase driver and executor memory to 16G for better resource allocation.
- Cleaned up spark-connect-demo.ipynb by removing unnecessary smoke test code cells.
- Removed outdated spark.executor.memory setting from spark-defaults.conf to avoid conflicts with the new configuration.
- Updated docker-compose.yaml to simplify volume mappings and improve data directory structure.
- Revised README.md to reflect changes in data directory setup and removed outdated troubleshooting information.
- Modified spark-connect-demo.ipynb to adjust file paths for data storage and enhance clarity in data processing steps.
- Updated Dockerfile to change the download location for the Rapids jar file, ensuring proper directory structure for Spark resources.
- Adjusted spark-defaults.conf to correct the event log directory and local directory paths for consistency with new configurations.
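
In spark-defaults.conf terms, the cores, shuffle-manager, and locality settings named at the top of this list correspond to entries like the following; the shuffle-manager class name is shim-version-specific (`spark350` here is an assumption) and the locality value is a common choice, not necessarily the demo's exact one:

```python
# Illustrative key/value pairs for the settings named above, expressed as a
# Python dict of spark-defaults.conf keys. Values are assumptions.
tuning_conf = {
    "spark.executor.cores": "16",
    "spark.shuffle.manager": "com.nvidia.spark.rapids.spark350.RapidsShuffleManager",
    "spark.locality.wait": "0s",
}
```
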
…ssions

- Updated docker-compose.yaml to include build arguments for CUDA and RAPIDS versions, allowing for flexible configuration.
- Revised README.md to add a permission setting for the data directory, ensuring proper access for data processing.
- Modified Dockerfile to utilize build arguments for downloading the RAPIDS jar, improving maintainability and version control.
- Adjusted spark-defaults.conf to reference the newly downloaded RAPIDS jar file, ensuring correct Spark resource integration.
@gerashegalov changed the title from "Add a demo for GPU-accelerated Spark Connect" to "Add a demo for GPU-accelerated Spark Connect with SQL/ETL and ML" Oct 20, 2025
- pyspark[connect] is a package that bundles all the Spark jars
- pyspark-client is a pure-Python package
- Added a markdown cell to specify the required writable data directory and its contents.
- Updated file paths in code cells to use a variable for the data directory, improving flexibility and clarity in data processing steps.
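
A sketch of the variable-based path pattern; the variable name, schema, and directory layout are assumptions:

```python
# Hypothetical data-directory variable; the notebook's actual name and
# layout may differ.
data_dir = "/data"
raw_schema = "loan_id LONG, orig_interest_rate DOUBLE"  # illustrative DDL

raw_df = spark.read.csv(f"{data_dir}/mortgage/input", schema=raw_schema)
raw_df.write.mode("overwrite").parquet(f"{data_dir}/mortgage/etl-output")
```
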
… clarity

- Removed hardcoded paths in favor of a variable for the notebook directory, enhancing flexibility.
- Added a new markdown cell to clarify the purpose of the notebook's parent directory.
- Cleaned up the notebook structure by removing unnecessary metadata and execution counts for a more streamlined presentation.
- Increased driver memory allocation in docker-compose.yaml from 16G to 24G for improved performance.
- Added a SPARK_REMOTE environment variable in docker-compose.yaml for client-server communication (see the sketch after this list).
- Updated README.md to reflect changes in data directory setup, including new directories for spark-events and improved instructions for data processing.
- Introduced requirements.txt for the spark-connect-client Dockerfile to manage dependencies more effectively.
- Added a new spark-connect-demo.ipynb notebook for demonstrating GPU-accelerated Spark Connect with the mortgage dataset.
- Removed the outdated spark-connect-demo notebook to streamline the project structure.
- Introduced a new image file `example-acceleration-chart.png` to illustrate GPU acceleration results.
- Updated `README.md` to reflect the use of a 6GiB RTX A3000 Laptop GPU instead of the previous Quadro RTX 6000, and modified CPU description for clarity.
- Added a reference to the new acceleration chart in the README to enhance documentation.
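
With SPARK_REMOTE exported in the client container, PySpark picks up the endpoint without an explicit URL; the service name and port below are assumptions based on the compose setup and the Spark Connect default:

```python
# SparkSession.builder.getOrCreate() honors the SPARK_REMOTE environment
# variable; the fallback URL here is illustrative.
import os
os.environ.setdefault("SPARK_REMOTE", "sc://spark-connect-server:15002")

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
```
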
@eordentlich previously approved these changes Oct 22, 2025

@eordentlich (Collaborator) left a comment:

👍

Co-authored-by: Rishi Chandra <[email protected]>
@eordentlich previously approved these changes Oct 23, 2025

@eordentlich (Collaborator) left a comment:

👍

@eordentlich (Collaborator):

With the branching strategy change, should this be retargeted to the main branch?

@nvliyuan (Collaborator) commented Oct 23, 2025

> With the branching strategy change, should this be retargeted to the main branch?

Yes, please retarget to the main branch, thanks.

…d storage management

- Enhanced README.md to provide clearer descriptions of Docker services and their roles in the Apache Spark standalone cluster setup.
- Added sections on local and global storage access in spark-connect-demo.ipynb, including variable definitions for data directories to improve flexibility.
- Updated file paths in the notebook to utilize the new directory variables, ensuring consistency and ease of use across different environments.
@gerashegalov changed the base branch from branch-25.10 to main October 23, 2025 05:50
@eordentlich (Collaborator):

Looks like we need to fix some markdown links.

@wbo4958 (Collaborator) commented Oct 24, 2025

> Looks like we need to fix some markdown links.

✖ http://localhost:8888/ → Status: 0
✖ http://localhost:8080/ → Status: 0
✖ http://localhost:8081/ → Status: 0
✖ http://localhost:4040/ → Status: 0
✖ http://spark-master:8080/ → Status: 0
✖ http://spark-worker:8081/ → Status: 0
✖ http://spark-connect-server:4040/ → Status: 0
✖ http://localhost:2080/ → Status: 0

The checker flags these links as dead, but they are only reachable while the demo is running, so the failures are expected and it's OK to merge.

@wbo4958 merged commit 32df0bd into NVIDIA:main Oct 24, 2025
3 of 4 checks passed
@gerashegalov deleted the gerashegalov/issue543 branch October 24, 2025 03:28
@gerashegalov added the `enhancement` label Oct 24, 2025

Successfully merging this pull request may close these issues:

Add DAIS 2025 ETL + ML over Spark Connect demo example
