Releases: aws-samples/spark-on-aws-lambda
Release 0.4.0
Summary
Refactored Spark Lambda Dockerfile to use multi-stage builds for optimal container size and added Ubuntu variant with comprehensive documentation.
Key Changes
🚀 Performance & Optimization:
Implemented multi-stage Docker build reducing final image size
Consolidated RUN commands to minimize Docker layers
Added --no-cache-dir flags for pip installations
Improved cleanup procedures removing temporary files and caches
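The multi-stage approach described above can be sketched as follows. The image tags, package versions, and handler file name here are assumptions for illustration, not the repository's exact Dockerfile:

```dockerfile
# Build stage: install dependencies with no pip cache, then clean up
FROM public.ecr.aws/lambda/python:3.10 AS builder
RUN pip install --no-cache-dir pyspark -t /opt/python \
    && find /opt/python -type d -name '__pycache__' -exec rm -rf {} +

# Final stage: copy only the installed packages, keeping the image small
FROM public.ecr.aws/lambda/python:3.10
COPY --from=builder /opt/python /var/lang/lib/python3.10/site-packages
COPY sparkLambdaHandler.py ${LAMBDA_TASK_ROOT}
CMD ["sparkLambdaHandler.lambda_handler"]
```

Because only the final stage ships, the builder layer's caches and temporary files never reach the deployed image.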
⬆️ Runtime Modernization:
Upgraded Python runtime from 3.8 → 3.10
Upgraded Java from OpenJDK 1.8 → Amazon Corretto 11
Updated environment paths to reflect Python 3.10 structure
Improved security by removing outdated version pins
🐧 Platform Extension:
Added Dockerfile.ubuntu for Ubuntu 22.04 deployment
Created generic Spark runner with S3 integration
Implemented non-root user execution for improved security
Added comprehensive documentation in UBUNTU_DOCKERFILE_GUIDE.md
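A generic Spark runner of this shape might download a script from S3 and hand it to `spark-submit`. The function names, the boto3 call, and the working directory are assumptions for illustration, not the repository's exact code:

```python
import os
import subprocess
from urllib.parse import urlparse

def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

def run_spark_script(script_uri: str, workdir: str = "/tmp") -> int:
    """Download a PySpark script from S3 and run it with spark-submit."""
    import boto3  # imported lazily; only needed when actually running
    bucket, key = parse_s3_uri(script_uri)
    local_path = os.path.join(workdir, os.path.basename(key))
    boto3.client("s3").download_file(bucket, key, local_path)
    return subprocess.call(["spark-submit", local_path])
```

Running as a non-root user, the runner only needs write access to its working directory, which fits the security posture described above.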
🛠️ Code Quality:
Removed commented-out legacy code for the Deequ installation
Improved conditional framework installation logic
Better error handling and logging in the build process
Standardized environment variable organization
📋 Framework Support:
Maintained compatibility with Delta, Hudi, Iceberg, and Deequ frameworks
Preserved all existing build arguments and configurations
Enhanced JAR download process with better error handling
Benefits
Reduced image size through multi-stage builds
Improved security with latest runtime versions and non-root execution
Better maintainability with cleaner, more organized code
Extended deployment options supporting both Lambda and Ubuntu environments
Enhanced developer experience with comprehensive documentation
Breaking Changes
Python runtime path changed from /var/lang/lib/python3.8/ to /var/lang/lib/python3.10/
The Java runtime upgrade may require application compatibility testing
v0.3.0
Releasing SoAL v0.3.0
- Added integration with AWS Glue catalog
- Added connectors for Snowflake and Amazon Redshift
- Added an option to split large files into 128 MB chunks
- Added a sample script showing Deequ integration for data quality checks
- Added a library to read large files for micro-batch ingestion
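The 128 MB split option above amounts to a byte-range calculation over the source file; `chunk_ranges` below is an illustrative helper, not the library's actual API:

```python
def chunk_ranges(total_size: int, chunk_size: int = 128 * 1024 * 1024):
    """Yield (start, end) byte ranges covering a file of total_size bytes.

    Each range is at most chunk_size bytes; the last range may be shorter.
    """
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size)
        yield start, end
        start = end
```

Each range can then be fetched independently (for example, as an S3 ranged GET) and ingested as its own micro-batch.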
v0.2.0
Release v0.2.0 introduces several new features and improvements, including:
- Architecture to submit a PySpark script from Amazon S3 to AWS Lambda using Spark on Docker. This lets users run PySpark jobs on AWS Lambda easily and reduces redeployment impact when the PySpark code needs updating.
- SAM (Serverless Application Model) templates that automatically build and deploy Docker images to Amazon ECR (Elastic Container Registry) and AWS Lambda, making it easy to deploy and manage Lambda container images.
- Apache Hudi integration with Spark on AWS Lambda. This enables users to use Apache Hudi, a storage framework for managing data sets on Amazon S3, for small-to-medium workloads (up to a 200 MB payload).
These features enhance the usability and scalability of Spark on AWS Lambda, providing users with more flexibility and options for running PySpark jobs on AWS Lambda.
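A Hudi write from Spark typically passes datasource options like these. The helper function and the table/field names are illustrative assumptions; the option keys themselves come from Hudi's Spark datasource:

```python
def hudi_write_options(table_name: str, record_key: str, precombine_field: str) -> dict:
    """Build the core option map for a Hudi upsert via the Spark datasource."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
    }

# Usage inside a PySpark job (requires the Hudi bundle jar; not run here):
# df.write.format("hudi") \
#   .options(**hudi_write_options("trips", "trip_id", "ts")) \
#   .mode("append").save("s3a://my-bucket/hudi/trips")
```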
v0.1.0: Update README.md
Spark on AWS Lambda is a standalone installation of Spark that runs on AWS Lambda using a Docker container. It provides a cost-effective solution for event-driven pipelines with smaller files, where heavier engines like Amazon EMR or AWS Glue incur overhead costs and operate more slowly.
Release 0.1.0 Features:
- Dockerfile with PySpark and its dependencies installed.
- Sample script to read and write CSV files on Amazon S3
- Authentication and authorization framework for connecting to Amazon S3
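A minimal sketch of what such a CSV read/write script looks like, assuming Spark's s3a connector is configured with the credentials framework above; the bucket and paths are placeholders, and `s3a_path` is an illustrative helper:

```python
def s3a_path(bucket: str, key: str) -> str:
    """Build the s3a:// path that Spark's Hadoop S3 connector expects."""
    return f"s3a://{bucket}/{key.lstrip('/')}"

# Inside a PySpark job (requires pyspark and hadoop-aws; not run here):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("csv-on-lambda").getOrCreate()
# df = spark.read.option("header", "true").csv(s3a_path("my-bucket", "in/data.csv"))
# df.write.mode("overwrite").csv(s3a_path("my-bucket", "out/"))
```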