
Fully integrate Airflow into TDP #105

@PACordonnier

Description


Apache Airflow is already present in tdp-collection-extras but in a basic state.

It needs more configuration to work with a TDP cluster. The role also needs a refactor; see #59.

Here are the requirements I think are necessary to consider Airflow integrated:

  • The Airflow version should be at least 2.3.x. Airflow 2.4 dropped compatibility with Python 3.6 (is that really an issue?)
  • A secure webserver needs to be installed (with authentication and SSL); see the webserver_config.py sketch after this list
  • One scheduler should be installed
  • At least one worker needs to be installed
    • Workers rely on Celery task queues, which require a message transport backend (RabbitMQ and Redis are popular choices)
  • Airflow should connect to a PostgreSQL or MySQL database for production use
  • Airflow needs to be able to use Hive and Spark in its workflows (configure the Spark and Hive providers); see the example DAG after this list
  • The HDFS provider is nice to have, but it seems very out of date
  • Kerberos and proxy users need to be configured
  • Airflow daemons need to be installed as systemd services
  • Enable multi-tenancy if possible
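
For the secure webserver item, one possible way to wire up authentication is through the Flask AppBuilder `webserver_config.py` that the Airflow webserver reads (SSL itself is handled by the `web_server_ssl_cert` and `web_server_ssl_key` options in `airflow.cfg`). A minimal sketch, assuming an LDAP directory is available on the TDP side; the server URL, bind user and search base are placeholders, not actual TDP values:

```python
# webserver_config.py -- sketch only; LDAP endpoint and DNs are hypothetical
import os
from flask_appbuilder.security.manager import AUTH_LDAP

# Use LDAP authentication instead of the default password-based AUTH_DB
AUTH_TYPE = AUTH_LDAP
AUTH_LDAP_SERVER = "ldaps://ldap.tdp.example.com:636"    # placeholder endpoint
AUTH_LDAP_SEARCH = "ou=people,dc=tdp,dc=example,dc=com"
AUTH_LDAP_UID_FIELD = "uid"
AUTH_LDAP_BIND_USER = "cn=airflow,ou=services,dc=tdp,dc=example,dc=com"
AUTH_LDAP_BIND_PASSWORD = os.environ.get("AIRFLOW_LDAP_BIND_PASSWORD", "")

# Auto-register users on first login with a restricted default role
AUTH_USER_REGISTRATION = True
AUTH_USER_REGISTRATION_ROLE = "Viewer"
```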
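
For the Hive and Spark requirement, the integration could be validated with a DAG that exercises both providers (`apache-airflow-providers-apache-hive` and `apache-airflow-providers-apache-spark`). A sketch is below; the connection ids, HQL and application path are hypothetical and only illustrate what the role would need to configure:

```python
# dags/tdp_smoke_test.py -- sketch only; connection ids and paths are placeholders
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="tdp_smoke_test",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Runs HQL through a hive CLI connection pointing at the TDP cluster
    create_table = HiveOperator(
        task_id="create_table",
        hive_cli_conn_id="hive_cli_tdp",          # hypothetical connection id
        hql="CREATE TABLE IF NOT EXISTS smoke_test (id INT);",
    )

    # Submits a Spark application via spark-submit from the Airflow worker
    spark_job = SparkSubmitOperator(
        task_id="spark_job",
        conn_id="spark_tdp",                      # hypothetical connection id
        application="/opt/tdp/jobs/smoke_test.py",
    )

    create_table >> spark_job
```

Both operators go through Airflow connections, so the role would presumably also have to create Kerberized Hive and Spark connections on the worker hosts.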
