Add a demo for GPU-accelerated Spark Connect with SQL/ETL and ML #570
Conversation
Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
- Modified docker-compose.yaml to use a custom spark-worker image and added build context.
- Updated spark-connect-demo.ipynb to streamline GPU acceleration demonstration and improved Spark version output.
- Added Dockerfile for custom spark-worker image and included spark-env.sh for GPU resource configuration.

Signed-off-by: Gera Shegalov <[email protected]>
- Simplified the command in docker-compose.yaml for the Spark Connect server.
- Added csv_raw_schema.ddl to define the schema for raw CSV data.
- Introduced name_mapping.csv for mapping seller names.
- Updated spark-connect-demo.ipynb to improve GPU processing demonstration and added normalization for bank references.
- Created spark-defaults.conf for Spark server configuration, including GPU resource settings.

Signed-off-by: Gera Shegalov <[email protected]>
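For readers following along, here is a minimal PySpark sketch of how a DDL schema file and a name-mapping CSV typically come together on the read path. The column names, delimiter, paths, and `sc://` URL are assumptions for illustration, not the notebook's exact code.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes a running Spark Connect server; the URL is illustrative.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# csv_raw_schema.ddl holds a DDL string such as "loan_id BIGINT, seller_name STRING, ..."
with open("csv_raw_schema.ddl") as f:
    raw_schema = f.read()

raw_df = spark.read.schema(raw_schema).option("delimiter", "|").csv("/data/mortgage/raw")
name_mapping = spark.read.option("header", True).csv("/data/mortgage/name_mapping.csv")

# Normalize seller names: use the canonical name where the mapping provides one.
normalized = (
    raw_df.join(name_mapping, on="seller_name", how="left")
          .withColumn("seller_name", F.coalesce("new_seller_name", "seller_name"))
          .drop("new_seller_name")
)
```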
…hancements
- Added a new volume mapping in docker-compose.yaml for raw mortgage data input.
- Simplified markdown header in spark-connect-demo.ipynb.
- Updated GPU processing output identifiers in the notebook for consistency.
- Introduced new functions for parsing dates and creating delinquency data frames in the notebook.

Signed-off-by: Gera Shegalov <[email protected]>
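As a rough illustration of what a date-parsing helper for the raw mortgage columns can look like (the function name and format string are assumptions, not the notebook's actual code):

```python
from pyspark.sql import DataFrame, functions as F

def parse_dates(df: DataFrame, columns: list[str], fmt: str = "MM/dd/yyyy") -> DataFrame:
    """Convert string-typed date columns to DateType using the given format."""
    for c in columns:
        df = df.withColumn(c, F.to_date(F.col(c), fmt))
    return df
```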
…k improvements
- Changed the input file path in docker-compose.yaml for mortgage data to a more descriptive name.
- Cleaned up the spark-connect-demo.ipynb by removing unnecessary execution counts and outputs, and added new ETL processing steps.
- Updated spark-defaults.conf to enable event logging and specify the log directory.

Signed-off-by: Gera Shegalov <[email protected]>
- Added a new section for the ML modeling phase using the `spark.ml` Pipeline API.
- Included steps for feature hashing, vector assembly, and logistic regression model training.
- Updated the notebook to sample the ETL output and prepare data for modeling.

Signed-off-by: Gera Shegalov <[email protected]>
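A hedged sketch of such a modeling pipeline with the `spark.ml` Pipeline API; the stage parameters and column names below are illustrative rather than the notebook's exact choices.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import FeatureHasher, VectorAssembler

categorical_cols = ["seller_name", "zip"]           # assumed categorical inputs
numeric_cols = ["orig_interest_rate", "orig_upb"]   # assumed numeric inputs

# Hash categorical columns into a fixed-width sparse vector, then combine with
# the numeric columns into a single feature vector for the classifier.
hasher = FeatureHasher(inputCols=categorical_cols, outputCol="hashed_features")
assembler = VectorAssembler(inputCols=["hashed_features"] + numeric_cols,
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="delinquency_12")

pipeline = Pipeline(stages=[hasher, assembler, lr])
# model = pipeline.fit(train_df)  # train_df: sampled ETL output with a label column
```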
- Updated the Spark Connect demo notebook to improve clarity and organization by modifying markdown headers and restructuring code cells.
- Removed redundant imports and added necessary imports for ML processing.
- Enhanced the Dockerfile by adding `scikit-learn` as a dependency for machine learning tasks.

Signed-off-by: Gera Shegalov <[email protected]>
- Added a new requirements.txt file to manage Python dependencies for the Spark Connect server, including `spark-rapids-ml`, `scikit-learn`, and others.
- Updated the Dockerfile to install dependencies from requirements.txt for better maintainability.
- Modified spark-defaults.conf to increase executor memory to 8G for improved performance.
- Added an empty code cell in the spark-connect-demo.ipynb for future use.

Signed-off-by: Gera Shegalov <[email protected]>
- Added installation of Python dependencies from requirements.txt in the Dockerfile for better management of ML libraries.
- Introduced a new requirements.txt file specifying necessary packages like `spark-rapids-ml`, `scikit-learn`, and others.
- Cleaned up the spark-connect-demo.ipynb by removing empty code cells to enhance clarity.

Signed-off-by: Gera Shegalov <[email protected]>
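For context, `spark-rapids-ml` ships GPU-backed estimators that mirror the `spark.ml` API, so a model like the one above can be swapped for an accelerated variant. The snippet below is a sketch based on the package's documented interface, not code taken from this PR; the column names are the same assumed ones as in the earlier pipeline sketch.

```python
# Drop-in style replacement for pyspark.ml's LogisticRegression (illustrative).
from spark_rapids_ml.classification import LogisticRegression as GpuLogisticRegression

gpu_lr = GpuLogisticRegression(featuresCol="features", labelCol="delinquency_12")
# gpu_model = gpu_lr.fit(assembled_train_df)  # assembled_train_df: assumed feature frame
```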
- Modified Dockerfile to download the CUDA-enabled version of the RAPIDS library.
- Updated spark-defaults.conf to include a GPU discovery script and adjusted the jar path for the RAPIDS library.
- Enhanced requirements.txt to include additional dependencies for GPU processing.

Signed-off-by: Gera Shegalov <[email protected]>
…valuation and GPU support
- Updated spark-connect-demo.ipynb to include ML evaluation using MulticlassClassificationEvaluator and added code for visualizing GPU acceleration results.
- Modified Dockerfiles for both server and worker to ensure proper installation of dependencies and configuration for GPU support, including adjustments to the order of commands for better clarity.

Signed-off-by: Gera Shegalov <[email protected]>
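The evaluation step mentioned above typically looks like the following; the metric and column names are assumptions, not necessarily the notebook's values.

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Score the fitted pipeline's predictions against the held-out label column.
evaluator = MulticlassClassificationEvaluator(labelCol="delinquency_12",
                                              predictionCol="prediction",
                                              metricName="accuracy")
# accuracy = evaluator.evaluate(model.transform(test_df))
```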
…proved usability
- Updated docker-compose.yaml to use environment variables for work and data directories, enhancing flexibility.
- Enhanced the spark-connect-demo.ipynb with clearer output messages and improved GPU operation visualization.
- Added a requirements.txt file for the Jupyter Lab client to manage Python dependencies effectively.
- Modified Dockerfiles for the Spark Connect client, server, and worker to streamline dependency installation and improve clarity.

Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
- Increased driver memory allocation in docker-compose.yaml for improved performance.
- Added a blank line in the Dockerfile for better readability.
- Updated spark-connect-demo.ipynb to reflect changes in Spark session ID and execution counts, and improved file path references for better clarity.
- Adjusted requirements.txt by removing the jupyterlab-nvdashboard dependency to streamline package management.
Nice! I had a few comments in line.
examples/spark-connect-for-etl-and-ml/spark-connect-server/spark-defaults.conf
`FROM jupyter/minimal-notebook:latest`
What is best practice around aligning python versions between client and server side? I think we are ok with anything here, but in general might need to line up.
I expect any combination within the allowed version range (3.9+) to work across the wire, at least from the Connect point of view. We can actually deliberately try to demonstrate that in this PR.
I think I had issues with python/pandas udfs, but we are not using those here. Asking more generally.
We can add and document constraints arising from Python serialization.
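One lightweight way to surface such constraints at runtime (illustrative; not part of the demo as submitted) is to print the client package version next to the server's Spark version from the Connect session:

```python
import pyspark
from pyspark.sql import SparkSession

# The sc:// URL is an assumption about the local compose setup.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print("client pyspark:", pyspark.__version__)
print("server Spark:  ", spark.version)
```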
examples/spark-connect-for-etl-and-ml/spark-connect-server/spark-defaults.conf
Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
1. **Clone and navigate to the project:**

   ```bash
   cd spark-connect-for-etl-and-ml
   ```
assuming at top level -> `cd examples/spark-connect-for-etl-and-ml`
- Set spark.executor.cores to 16 to enhance resource allocation for Spark applications, improving performance and efficiency.
- Introduced spark.shuffle.manager setting to specify the use of RapidsShuffleManager, enhancing shuffle performance in Spark applications.
- Added spark.locality.wait setting to optimize task scheduling and resource allocation in Spark applications.
- Updated docker-compose.yaml to increase driver and executor memory to 16G for better resource allocation.
- Cleaned up spark-connect-demo.ipynb by removing unnecessary smoke test code cells.
- Removed outdated spark.executor.memory setting from spark-defaults.conf to avoid conflicts with the new configuration.
- Updated docker-compose.yaml to simplify volume mappings and improve data directory structure.
- Revised README.md to reflect changes in data directory setup and removed outdated troubleshooting information.
- Modified spark-connect-demo.ipynb to adjust file paths for data storage and enhance clarity in data processing steps.
- Updated Dockerfile to change the download location for the Rapids jar file, ensuring proper directory structure for Spark resources.
- Adjusted spark-defaults.conf to correct the event log directory and local directory paths for consistency with new configurations.
…ssions
- Updated docker-compose.yaml to include build arguments for CUDA and RAPIDS versions, allowing for flexible configuration.
- Revised README.md to add a permission setting for the data directory, ensuring proper access for data processing.
- Modified Dockerfile to utilize build arguments for downloading the RAPIDS jar, improving maintainability and version control.
- Adjusted spark-defaults.conf to reference the newly downloaded RAPIDS jar file, ensuring correct Spark resource integration.
- pyspark[connect] is a package with all the Spark jars
- pyspark-client is a pure Python package
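Either package drives the same client API; a minimal connection sketch follows (the `sc://` address is an assumed compose service endpoint, not necessarily the demo's exact value):

```python
from pyspark.sql import SparkSession

# Works identically with pyspark[connect] and pyspark-client installed on the client.
spark = SparkSession.builder.remote("sc://spark-connect-server:15002").getOrCreate()
spark.range(5).show()
```

When SPARK_REMOTE is set in the environment, as a later commit does in docker-compose.yaml, the explicit `.remote(...)` call can be dropped and `SparkSession.builder.getOrCreate()` picks the endpoint up automatically.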
- Added a markdown cell to specify the required writable data directory and its contents.
- Updated file paths in code cells to use a variable for the data directory, improving flexibility and clarity in data processing steps.
… clarity
- Removed hardcoded paths in favor of a variable for the notebook directory, enhancing flexibility.
- Added a new markdown cell to clarify the purpose of the notebook's parent directory.
- Cleaned up the notebook structure by removing unnecessary metadata and execution counts for a more streamlined presentation.
- Increased driver memory allocation in docker-compose.yaml from 16G to 24G for improved performance.
- Added SPARK_REMOTE environment variable in docker-compose.yaml for better client-server communication.
- Updated README.md to reflect changes in data directory setup, including new directories for spark-events and improved instructions for data processing.
- Introduced requirements.txt for the spark-connect-client Dockerfile to manage dependencies more effectively.
- Added a new spark-connect-demo.ipynb notebook for demonstrating GPU-accelerated Spark Connect with the mortgage dataset.
- Removed the outdated spark-connect-demo notebook to streamline the project structure.
- Introduced a new image file `example-acceleration-chart.png` to illustrate GPU acceleration results.
- Updated `README.md` to reflect the use of a 6GiB RTX A3000 Laptop GPU instead of the previous Quadro RTX 6000, and modified CPU description for clarity.
- Added a reference to the new acceleration chart in the README to enhance documentation.
👍
Co-authored-by: Rishi Chandra <[email protected]>
…o gerashegalov/issue543
👍
With the branching strategy change, should this be redirected to the main branch?
Yes, please help retarget to the main branch, thx
…d storage management
- Enhanced README.md to provide clearer descriptions of Docker services and their roles in the Apache Spark standalone cluster setup.
- Added sections on local and global storage access in spark-connect-demo.ipynb, including variable definitions for data directories to improve flexibility.
- Updated file paths in the notebook to utilize the new directory variables, ensuring consistency and ease of use across different environments.
Looks like we need to fix some markdown links.
[:heavy_multiplication_x:] http://localhost:8888/ → Status: 0 — the checker flags these as dead links, but they are expected here, so it's OK to merge.
This PR adds a runnable demo that shows how to use Spark Connect with the NVIDIA RAPIDS Accelerator to run GPU-accelerated workloads. The demo includes code, configuration, and instructions to launch a small end-to-end example so users can validate and benchmark Spark Connect + RAPIDS on GPU-enabled environments.
Resolves #543
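Once the stack is up, one quick illustrative way to confirm the RAPIDS Accelerator is picking up work over Spark Connect is to look for Gpu* operators in a physical plan; this is a sketch under assumed local settings, not a cell from the demo notebook:

```python
from pyspark.sql import SparkSession

# The sc:// URL is an assumption about the local compose setup.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(0, 1_000_000).selectExpr("id % 10 AS k", "id AS v")
# With the RAPIDS Accelerator enabled, the plan shows GPU operators such as
# GpuHashAggregate in place of their CPU counterparts.
df.groupBy("k").sum("v").explain()
```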