Releases: aws-samples/spark-on-aws-lambda

Release 0.4.0

16 Jul 13:05
abfcb58

Summary
Refactored the Spark Lambda Dockerfile to use multi-stage builds for a smaller container image and added an Ubuntu variant with comprehensive documentation.

Key Changes
🚀 Performance & Optimization:

Implemented multi-stage Docker build reducing final image size

Consolidated RUN commands to minimize Docker layers

Added --no-cache-dir flags for pip installations

Improved cleanup procedures removing temporary files and caches

⬆️ Runtime Modernization:

Upgraded Python runtime from 3.8 → 3.10

Upgraded Java from OpenJDK 1.8 → Amazon Corretto 11

Updated environment paths to reflect Python 3.10 structure

Enhanced security by removing stale version locks

🐧 Platform Extension:

Added Dockerfile.ubuntu for Ubuntu 22.04 deployment

Created a generic Spark runner with S3 integration (see the sketch after this list)

Implemented non-root user execution for improved security

Added comprehensive documentation in UBUNTU_DOCKERFILE_GUIDE.md
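
The generic runner added for the Ubuntu variant can be pictured as a small wrapper that fetches the job script from Amazon S3 and hands it to spark-submit. The sketch below is illustrative only; the environment variable names (SCRIPT_BUCKET, SCRIPT_KEY) and the local path are assumptions, not the repository's actual interface:

```python
import os
import subprocess
import sys

import boto3


def main() -> None:
    # SCRIPT_BUCKET / SCRIPT_KEY are hypothetical names used for illustration.
    bucket = os.environ["SCRIPT_BUCKET"]
    key = os.environ["SCRIPT_KEY"]
    local_path = "/tmp/job.py"

    # Fetch the PySpark job script from Amazon S3.
    boto3.client("s3").download_file(bucket, key, local_path)

    # Run it with spark-submit in local mode, which is what a single
    # container (Lambda or Ubuntu) can realistically provide.
    result = subprocess.run(
        ["spark-submit", "--master", "local[*]", local_path],
        check=False,
    )
    sys.exit(result.returncode)


if __name__ == "__main__":
    main()
```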

🛠️ Code Quality:

Removed commented-out legacy code for the Deequ installation

Improved conditional framework installation logic

Better error handling and logging in build process

Standardized environment variable organization

📋 Framework Support:

Maintained compatibility with Delta, Hudi, Iceberg, and Deequ frameworks

Preserved all existing build arguments and configurations

Enhanced JAR download process with better error handling (a sketch follows below)
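
As a rough illustration of a download step with retries, the sketch below uses Python's standard library; the actual image most likely fetches framework JARs at build time with shell tooling, and the Maven coordinate shown is only an example:

```python
import time
import urllib.request

# Example coordinate only; real framework versions come from Docker build arguments.
JARS = {
    "hudi-spark-bundle.jar": (
        "https://repo1.maven.org/maven2/org/apache/hudi/"
        "hudi-spark3.3-bundle_2.12/0.13.1/hudi-spark3.3-bundle_2.12-0.13.1.jar"
    ),
}


def download(url: str, dest: str, attempts: int = 3) -> None:
    """Download a file, retrying a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return
        except OSError as exc:
            print(f"Attempt {attempt}/{attempts} failed for {url}: {exc}")
            time.sleep(2 * attempt)
    raise RuntimeError(f"Could not download {url}")


for name, url in JARS.items():
    download(url, f"/tmp/{name}")
```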

Benefits
Reduced image size through multi-stage builds

Improved security with latest runtime versions and non-root execution

Better maintainability with cleaner, more organized code

Extended deployment options supporting both Lambda and Ubuntu environments

Enhanced developer experience with comprehensive documentation

Breaking Changes
Python runtime path changed from /var/lang/lib/python3.8/ to /var/lang/lib/python3.10/

The Java runtime upgrade may require application compatibility testing

v0.3.0

27 Feb 19:48
64251f0

Releasing SoAL v0.3.0

  1. Added integration with the AWS Glue Data Catalog
  2. Added connectors to Snowflake and Amazon Redshift
  3. Added an option to split large files into smaller 128 MB chunks
  4. Added a sample script showing Deequ integration for data quality checks (see the sketch after this list)
  5. Added a library for reading large files for micro-batch ingestion
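
A minimal sketch of such a Deequ data quality check, assuming the PyDeequ bindings and placeholder column names and S3 paths (the repository's actual sample script may differ):

```python
import os

# PyDeequ requires SPARK_VERSION to pick the matching Deequ artifact.
os.environ.setdefault("SPARK_VERSION", "3.3")

import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import SparkSession

# Pull the Deequ JAR that matches the Spark version onto the classpath.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/input/")  # hypothetical input path

# Declare a few constraints and run the verification suite.
check = Check(spark, CheckLevel.Error, "basic data quality checks")
result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.hasSize(lambda n: n > 0)  # table must not be empty
        .isComplete("id")               # "id" is a placeholder column
        .isUnique("id")
    )
    .run()
)

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```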

v0.2.0

28 Feb 20:27
ecb27f3
Pre-release

Release v0.2.0 introduces several new features and improvements, including:

  • Architecture to submit a PySpark script from Amazon S3 to AWS Lambda using Spark on Docker. This feature lets users run PySpark jobs on AWS Lambda easily and reduces redeployment impact when the PySpark code needs updating.

  • SAM (Serverless Application Model) templates to automatically build and deploy Docker images to Amazon ECR (Elastic Container Registry) and AWS Lambda, making it easy to deploy and manage Docker images on AWS Lambda.

  • Apache Hudi integration with Spark on AWS Lambda. This feature enables users to use Apache Hudi, a storage system for managing small-to-medium (up to a 200 MB payload) and complex data sets on Amazon S3. A usage sketch follows below.
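
A minimal sketch of a Hudi upsert from PySpark, following Hudi's standard quickstart options; the table name, key fields, and S3 paths are placeholders, and the Hudi Spark bundle is assumed to already be on the classpath (e.g. baked into the image at build time):

```python
from pyspark.sql import SparkSession

# Hudi recommends Kryo serialization for Spark jobs.
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.read.json("s3a://my-bucket/incoming/")  # hypothetical input

# Placeholder table name and field names; adjust to the actual schema.
hudi_options = {
    "hoodie.table.name": "customer_events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert the batch into a Hudi table stored on Amazon S3.
(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://my-bucket/hudi/customer_events/")
)
```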

These features enhance the usability and scalability of Spark on AWS Lambda, providing users with more flexibility and options for running PySpark jobs on AWS Lambda.

v0.1.0: Update README.md

23 Feb 19:44
a4852aa
Pre-release

Spark on AWS Lambda is a standalone installation of Spark that runs on AWS Lambda using a Docker container. It provides a cost-effective solution for event-driven pipelines with smaller files, where heavier engines like Amazon EMR or AWS Glue incur overhead costs and operate more slowly.

Release 0.1.0 Features:

  1. Dockerfile that has PySpark and its dependencies installed.
  2. Sample script to read and write CSV files on Amazon S3 (see the sketch after this list)
  3. Authentication and authorization framework for connecting to Amazon S3
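
A minimal sketch of the kind of CSV round trip the sample script performs, with placeholder bucket names and paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("soal-csv-sample").getOrCreate()

# Read a CSV file from Amazon S3 (placeholder bucket and key).
df = (
    spark.read
    .option("header", "true")
    .csv("s3a://my-input-bucket/data/input.csv")
)

# Write the result back to Amazon S3 as CSV.
(
    df.write
    .option("header", "true")
    .mode("overwrite")
    .csv("s3a://my-output-bucket/data/output/")
)
```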