We have a requirement to compare SQL files stored in a git repository from Airflow (we are using GCP Cloud Composer). I was searching for a solution, but everywhere I only find information about CI/CD. Can anyone help me here? Thanks in advance.
- Is it possible to access a file from git inside an Airflow DAG?
- If yes, can you please share any link/steps/reference document that would guide me? I want to read files from a git repo and process them in a DAG.
CodePudding user response:
Option 1
With Python, you can access GitHub repo files via the API, so a GET request with the requests
library is enough to read your SQL files:
import requests

response = requests.get("https://raw.githubusercontent.com/<user>/<repo>/master/<file>?token=<token>")
sql_txt = response.text  # .text is a property, not a method
Option 2
You can also use the PyGithub library:
from github import Github

# using an access token
github_client = Github("<access token>")

# GitHub Enterprise with a custom hostname:
# github_client = Github(base_url="https://{hostname}/api/v3", login_or_token="<access token>")

sql_text = (
    github_client
    .get_user("<repo owner>")
    .get_repo("<repo name>")
    .get_contents("<file path>")
    .decoded_content.decode()
)
Airflow
But with Airflow, you can store the GitHub credentials in an Airflow connection, then use GithubHook, which reads the connection and initializes the GitHub client for you. You can then use this client to read the file as explained in option 2:
from airflow.providers.github.hooks.github import GithubHook
github_hook = GithubHook(github_conn_id="<github conn id>")
github_client = github_hook.client
# read the file ...
You can implement this code in a new operator that extends PythonOperator
if the processing you want to do is simple; if the processing is more complicated, you can extend another operator to run your code outside the Airflow cluster (e.g. KubernetesPodOperator
).
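Putting options 2 and 3 together, the file-reading step can be factored into a small helper that takes any PyGithub-style client (in Airflow, that would be `GithubHook(github_conn_id="...").client`). The owner/repo/path values below are hypothetical placeholders:

```python
def read_repo_sql(client, owner: str, repo: str, path: str, ref: str = "master") -> str:
    """Fetch one file's decoded text via the PyGithub object chain."""
    return (
        client.get_user(owner)
        .get_repo(repo)
        .get_contents(path, ref=ref)
        .decoded_content.decode()
    )


# Inside a DAG task, this could be wired up roughly as:
#
# from airflow.providers.github.hooks.github import GithubHook
#
# def process_sql():
#     client = GithubHook(github_conn_id="<github conn id>").client
#     sql = read_repo_sql(client, "<repo owner>", "<repo name>", "<file path>")
#     ...  # compare / process the SQL here
```

Keeping the helper free of Airflow imports also makes it easy to unit-test with a stub client, without touching the network.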