Summary
Refactored the Spark Lambda Dockerfile to use a multi-stage build, which reduces the final image size, and added an Ubuntu 22.04 variant with accompanying documentation.
Key Changes
🚀 Performance & Optimization:
Implemented a multi-stage Docker build to reduce the final image size
Consolidated RUN commands to minimize Docker layers
Added --no-cache-dir flags for pip installations
Improved cleanup steps to remove temporary files and caches
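
For reviewers, the pattern behind these changes looks roughly like the sketch below. It is illustrative only, not the Dockerfile in this PR: the `builder` stage name, the `/opt/python-deps` directory, and the bare `pyspark` install are assumptions made for the example.

```dockerfile
# Illustrative sketch, not the actual Dockerfile from this PR.
# Stage name, target directory, and package list are assumptions.
FROM public.ecr.aws/lambda/python:3.10 AS builder

# --no-cache-dir keeps pip's download cache out of the layer.
RUN pip install --no-cache-dir --target /opt/python-deps pyspark

FROM public.ecr.aws/lambda/python:3.10

# Install and clean up in a single RUN so the caches never persist in a layer.
RUN yum install -y java-11-amazon-corretto-headless \
    && yum clean all \
    && rm -rf /var/cache/yum

# Only the installed packages are copied out of the builder stage.
COPY --from=builder /opt/python-deps /var/lang/lib/python3.10/site-packages/
```

Anything created in the builder stage and not copied across stays out of the final image, which is where the size reduction comes from.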
⬆️ Runtime Modernization:
Upgraded Python runtime from 3.8 → 3.10
Upgraded Java from OpenJDK 1.8 → Amazon Corretto 11
Updated environment paths to reflect Python 3.10 structure
Removed version locking to improve security
🐧 Platform Extension:
Added Dockerfile.ubuntu for Ubuntu 22.04 deployment
Created generic Spark runner with S3 integration
Implemented non-root user execution for improved security
Added comprehensive documentation in UBUNTU_DOCKERFILE_GUIDE.md
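
A minimal sketch of non-root execution on Ubuntu 22.04 is shown below. It is not the content of Dockerfile.ubuntu: the `spark` user name, the home directory, and the package list are hypothetical, chosen only to illustrate the pattern.

```dockerfile
# Illustrative sketch, not the actual Dockerfile.ubuntu.
# The user name "spark" and the package list are assumptions.
FROM ubuntu:22.04

# Install packages as root, clear the apt lists, then create an unprivileged user.
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/* \
    && useradd --create-home --shell /bin/bash spark

# Everything after this line, including the container's entrypoint, runs as the non-root user.
USER spark
WORKDIR /home/spark
```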
🛠️ Code Quality:
Removed commented legacy code for DEEQU installation
Improved conditional framework installation logic
Improved error handling and logging in the build process
Standardized environment variable organization
📋 Framework Support:
Maintained compatibility with Delta, Hudi, Iceberg, and Deequ frameworks
Preserved all existing build arguments and configurations
Enhanced JAR download process with better error handling
Benefits
Reduced image size through multi-stage builds
Improved security with newer runtime versions and non-root execution
Better maintainability with cleaner, more organized code
Extended deployment options supporting both Lambda and Ubuntu environments
Enhanced developer experience with comprehensive documentation
Breaking Changes
Python runtime path changed from /var/lang/lib/python3.8/ to /var/lang/lib/python3.10/
Java runtime upgraded from OpenJDK 1.8 to Amazon Corretto 11; applications may require compatibility testing
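
Downstream images that hardcode the old Python path could guard against the change with a build-time check along these lines. This is a sketch under assumptions: `PYTHON_LIB_DIR` is a hypothetical variable, and the `SPARK_HOME` value is illustrative rather than taken from this PR.

```dockerfile
# Illustrative sketch; PYTHON_LIB_DIR and the SPARK_HOME value are assumptions.
FROM public.ecr.aws/lambda/python:3.10

# Derive dependent paths from one prefix so a future runtime bump touches a single line.
ENV PYTHON_LIB_DIR=/var/lang/lib/python3.10
ENV SPARK_HOME=${PYTHON_LIB_DIR}/site-packages/pyspark

# Fail the build early if the interpreter version or library path is not what is expected.
RUN python3 -c "import sys; assert sys.version_info[:2] == (3, 10), sys.version" \
    && test -d "${PYTHON_LIB_DIR}"
```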