I am using Apache Airflow 2.2.3 with Python 3.9 and run everything in Docker containers. When I add connections to Airflow I do it via the GUI, because this way the passwords are supposed to be encrypted. For the encryption to work, I installed the Python package "apache-airflow[crypto]" on my local machine and generated a Fernet key that I then put into my docker-compose.yaml as the variable "AIRFLOW__CORE__FERNET_KEY: 'MY_KEY'". I also added the package "apache-airflow[crypto]" to my Airflow repository's requirements.txt so that Airflow can handle Fernet keys.
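For reference, a Fernet key is just 32 random bytes, url-safe base64-encoded. Normally you would generate it with `cryptography.fernet.Fernet.generate_key()` (which "apache-airflow[crypto]" pulls in); this stdlib-only sketch produces an equivalent value:

```python
# Stdlib-only sketch of generating a value for AIRFLOW__CORE__FERNET_KEY.
# Fernet.generate_key() does exactly this internally.
import base64
import os

key = base64.urlsafe_b64encode(os.urandom(32)).decode()
print(key)  # a 44-character url-safe base64 string
```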
My questions are the following:
- When I add the Fernet key as an environment variable as described, I can see the Fernet key in docker-compose.yaml, and when I enter the container, os.environ["AIRFLOW__CORE__FERNET_KEY"] shows it as well. Isn't that unsafe? As far as I understand it, credentials can be decrypted using this Fernet key.
- When I add connections to Airflow I can get their properties via the container CLI using "airflow connections get CONNECTION_NAME". Although I added the Fernet key, I see the password in plain text here. Isn't it supposed to be hidden?
- Unlike passwords, the values (connection strings) in the GUI's "Extra" field do not disappear and are even readable in the GUI. How can I hide those credentials from the GUI and from the CLI?
The Airflow GUI tells me that my connections are encrypted, so the encryption seems to have worked somehow. But what does that statement actually mean when I can clearly see the passwords?
CodePudding user response:
I think you are making wrong assumptions about "encryption" and "security". The assumption that you can hide secrets from a user who has access to the running software (which the Airflow CLI gives you) is unrealistic and not really achievable.
The Fernet key is used to encrypt data "at rest" in the database. If your database content is stolen (but not your Airflow program/configuration), your data is protected. This is the ONLY purpose of the Fernet key: it protects the data stored in the database "at rest". But once you have the key (from the Airflow runtime) you can decrypt it. Usually the database is on some remote server and has backups. As long as the backups are not kept together with the key, then even if your database or a backup gets stolen while your Airflow instance stays safe, no-one will be able to use that data.
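Conceptually, what the Fernet key buys you can be sketched with the `cryptography` package itself (a simplified illustration, not Airflow's actual code path):

```python
# Simplified sketch of "encryption at rest": the password is encrypted
# before it is written to the metadata database, and decrypted again
# whenever the running Airflow needs it.
from cryptography.fernet import Fernet

key = Fernet.generate_key()         # what AIRFLOW__CORE__FERNET_KEY holds
fernet = Fernet(key)

password = b"my-db-password"
at_rest = fernet.encrypt(password)  # this ciphertext is what lands in the DB

assert at_rest != password                   # useless to whoever steals only the DB
assert fernet.decrypt(at_rest) == password   # trivially readable with the key
```

So the ciphertext in the database is worthless without the key, but any process that holds the key (i.e. the Airflow runtime) can always recover the plain text.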
Yes. If you have access to a running Airflow instance, you are supposed to be able to read passwords in clear text. How else would you expect Airflow to work? It needs to read the passwords to authenticate. If you can run the Airflow program, the data must be accessible to it; there is no way around that, it is impossible by design. Airflow is written in Python, and you can easily write code that uses its runtime, so there is no way to physically protect the passwords that the "Airflow core" has to know: at runtime it needs those credentials to connect and communicate with external systems. Once you have access to a system (for example via the CLI), you have, by definition, access to all secrets that system uses at runtime. No system in the world can do it differently; that is just the nature of it. What you CAN do to protect your data better is use a Secrets Manager (https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/index.html), but at most that gives you the possibility of frequent rotation of secrets. Frequent rotation and short-lived credentials are the only way to ensure that a potentially leaked secret cannot be used for long.
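As an illustration, a secrets backend is configured through environment variables like these (the backend class and its kwargs below are just an example for the HashiCorp Vault provider; adapt them to whatever secrets manager you actually use):

```yaml
# docker-compose.yaml excerpt (illustrative values)
environment:
  AIRFLOW__SECRETS__BACKEND: airflow.providers.hashicorp.secrets.vault.VaultBackend
  AIRFLOW__SECRETS__BACKEND_KWARGS: '{"connections_path": "connections", "url": "http://vault:8200"}'
```

With this in place, Airflow looks up connections in the external secrets manager first, so rotating a secret there takes effect without touching the Airflow database.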
Modern Airflow (since 2.1, I believe) has a secret masker that also masks sensitive data in extras when you specify it: https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/mask-sensitive-values.html. The secret masker masks sensitive data in logs as well, because logs can also be archived and backed up, so, similarly to the database, it makes sense to protect them. The UI, unlike the CLI (which gives you access to the runtime of the Airflow "core"), is just a front-end and does not give you access to the running core, so masking sensitive data there also makes sense.
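For the "Extra" field specifically, masking is driven by the names of the keys inside the extra JSON: values whose key matches one of the configured sensitive names get masked in the UI and logs. You can extend that list via configuration, for example (the names below are just illustrative):

```yaml
# docker-compose.yaml excerpt (illustrative values)
environment:
  AIRFLOW__CORE__SENSITIVE_VAR_CONN_NAMES: 'token,keyfile_dict'
```

This corresponds to the `sensitive_var_conn_names` option in the `[core]` section of airflow.cfg.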