My Process: I am extracting data from several on-premises data sources with a Python script and loading it into Azure Blob Storage (raw data). After that, I perform some ETL actions and use the result in Power BI.
What I want to achieve: Currently I run the ETL actions via a scheduled trigger in Azure Data Factory. However, I do not want to "lose" time, so I would like the ETL to be triggered as soon as the Python script is finished.
Example: The Python script finishes at 8 AM. To build in some margin, the scheduled Data Factory run is at 8:30 AM. This is the part where I am looking for some optimization, so that the Data Factory run starts automatically once the Python script is finished.
(FYI: afterwards, the Power BI reports refresh automatically when the ETL actions are finished.)
With these tool(s): Microsoft Azure Data Factory, Microsoft Azure Synapse Analytics, or Python?
Your advice?
Thanks a lot, Kind regards
CodePudding user response:
You can try the two approaches below.
Using a Storage event trigger:
Create a new container in Blob Storage. At the end of your Python code, upload a small text file (or any file type) to this container. Then add a Storage event trigger on this container to your ETL pipeline. Every time the Python script completes, it uploads the small file to that container, which triggers your ETL pipeline in ADF.
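For illustration, here is a minimal sketch of the "marker blob" upload at the end of your Python script. It assumes the azure-storage-blob package, a connection string in the AZURE_STORAGE_CONNECTION_STRING environment variable, and placeholder container/blob names ("triggers", "extract-finished.txt"); adjust these to your setup.

```python
import os
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient

def signal_etl_start() -> None:
    # Connect using a connection string kept outside the code.
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    blob = service.get_blob_client(container="triggers", blob="extract-finished.txt")
    # The content is irrelevant; the "Blob created" event raised by this
    # upload is what fires the Storage event trigger in ADF.
    blob.upload_blob(datetime.now(timezone.utc).isoformat(), overwrite=True)

# ... run your extraction/load steps, then:
signal_etl_start()
```

Using a dedicated container (rather than the raw-data container) keeps ordinary data uploads from firing the trigger; you can also narrow the trigger with its blob path filters.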
Using a custom activity for the Python script:
Use the Custom activity in ADF to execute your Python script, then chain an Execute Pipeline activity pointing at your ETL pipeline after the Custom activity. The Custom activity runs your code and, once it completes, the Execute Pipeline activity starts your ETL pipeline.
Please refer to this material from azurelib by Deepak Goyal to learn about executing a Python script using the Custom activity in ADF.
NOTE: Use the Custom activity approach only if your Python script can access your on-premises sources from within Azure, since the Custom activity runs on an Azure Batch pool rather than on your local machine.
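If you prefer to author this pipeline in code rather than in the ADF Studio UI, below is a rough sketch using the azure-mgmt-datafactory SDK. All names (AzureBatchLinkedService, extract.py, EtlPipeline, the subscription/resource group/factory placeholders) are assumptions for illustration, and the exact model constructor keywords can differ between SDK versions, so treat it as a starting point rather than a finished implementation.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    CustomActivity,
    ExecutePipelineActivity,
    LinkedServiceReference,
    PipelineReference,
    PipelineResource,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Custom activity: runs the extraction script on the Azure Batch pool
# referenced by the (placeholder) AzureBatchLinkedService.
extract = CustomActivity(
    name="RunPythonExtract",
    command="python extract.py",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureBatchLinkedService"
    ),
)

# Execute Pipeline activity: starts the existing ETL pipeline, but only
# after the custom activity reports success.
run_etl = ExecutePipelineActivity(
    name="RunEtlPipeline",
    pipeline=PipelineReference(
        type="PipelineReference", reference_name="EtlPipeline"
    ),
    depends_on=[
        ActivityDependency(
            activity="RunPythonExtract", dependency_conditions=["Succeeded"]
        )
    ],
    wait_on_completion=True,
)

# Publish a pipeline that runs the two activities in sequence.
adf.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "ExtractThenEtl",
    PipelineResource(activities=[extract, run_etl]),
)
```

The dependency condition "Succeeded" is what guarantees the ETL only starts when the extraction actually finished without errors; the same two-activity structure can be built visually in ADF Studio if you would rather not script it.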