Apache Airflow is already present in tdp-collection-extras, but only in a basic state.
It needs additional configuration to work with a TDP cluster, and the role needs a refactor (see #59).
Here are the requirements I think are necessary to consider Airflow integrated:
- Airflow version should be at least 2.3.x. Airflow 2.4 dropped compatibility with Python 3.6 (is that really an issue?)
- A secure webserver needs to be installed (with authentication and SSL)
- One scheduler should be installed
- At least one worker needs to be installed
- Workers rely on Celery task queues, which require a message transport backend (RabbitMQ and Redis are popular choices)
- For production use, Airflow should connect to a PostgreSQL or MySQL database
- Airflow needs to be able to use Hive and Spark in its workflow (configure Spark and Hive providers)
- An HDFS provider would be nice to have, but it seems very out of date
- Kerberos and proxy users need to be configured
- Airflow daemons need to be installed as systemd services
- Enable multi-tenancy if possible
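To make the database, Celery, SSL, and Kerberos requirements above more concrete, here is a minimal `airflow.cfg` sketch. All hostnames, ports, credentials, and file paths below are placeholders I made up for illustration, not values agreed on in this issue:

```ini
[core]
# CeleryExecutor is required to distribute tasks to one or more workers
executor = CeleryExecutor
# Enables Kerberos-aware behavior in hooks that support it
security = kerberos

[database]
# PostgreSQL for production use; the connection string is illustrative
sql_alchemy_conn = postgresql+psycopg2://airflow:CHANGEME@db-host:5432/airflow

[celery]
# RabbitMQ (AMQP) or Redis can serve as the message transport backend
broker_url = amqp://airflow:CHANGEME@broker-host:5672/airflow
result_backend = db+postgresql://airflow:CHANGEME@db-host:5432/airflow

[webserver]
# Secure webserver: SSL, with authentication handled by the Flask AppBuilder RBAC UI
web_server_ssl_cert = /etc/airflow/ssl/airflow.crt
web_server_ssl_key = /etc/airflow/ssl/airflow.key

[kerberos]
# The `airflow kerberos` sidecar renews tickets into this ccache
ccache = /tmp/airflow_krb5_ccache
principal = airflow/_HOST@REALM
keytab = /etc/security/keytabs/airflow.keytab
```

The scheduler, webserver, workers, and the `airflow kerberos` ticket renewer would each run as their own systemd unit, pointing at this shared configuration.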