We use Synapse notebooks to transform data and load it into fact and dimension tables in our ADLS Gen2 data lake. We are disappointed with the performance of Synapse and the time it takes to write the transformed tables to the data lake, and we don't know how to make the notebooks run more quickly or in parallel.
Some of the things we have tried are as follows:
- Co-located Synapse with our data lake
- Switched to premium data lake storage with solid-state drives
- Cleaned up cached temporary SQL views and data frames the moment they are no longer needed
- Used PySpark (Python) functions for better performance
- Used serverless SQL pools to minimize costs (we have tried small, medium, and large Spark pools)
- Used unmanaged external tables with Parquet format (see the sketch after this list)
- Tried nested notebooks
- Tried pipelines that invoke notebooks in parallel (we just learned that this is not possible: https://docs.microsoft.com/en-us/azure/synapse-analytics/synapse-notebook-activity?tabs=classical)
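For context, the cleanup and unmanaged Parquet table steps look roughly like the sketch below (the table name and lake path are placeholders rather than our actual ones):

```python
# `spark` is the SparkSession provided by the Synapse notebook runtime
df = spark.read.parquet("abfss://silver@<storageaccount>.dfs.core.windows.net/equipment/")
df.createOrReplaceTempView("vw_equipment")

result = spark.sql("SELECT * FROM vw_equipment")  # transformation SQL goes here

# write as an unmanaged external table in Parquet format
(result.write
       .mode("overwrite")
       .format("parquet")
       .option("path", "abfss://gold@<storageaccount>.dfs.core.windows.net/fact_equipment/")
       .saveAsTable("fact_equipment"))

# drop the temporary view and release the dataframe as soon as it is no longer needed
spark.catalog.dropTempView("vw_equipment")
df.unpersist()
del df, result
```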
We are not dealing with large amounts of data:
- Bronze - 110 tables (3 GB) ~1,000 blobs
- Silver - 14 tables (9 GB) ~16,000 blobs
- Gold - 66 tables (4 GB) ~10,000 blobs
In our Azure SQL Managed Instance, the same transformations (implemented as SQL stored procedures) complete in approximately 15 minutes. Using Synapse notebooks with premium data lake storage, we get the following timings to load the data model:
- Small Spark pool (4 vCores / 32 GB), 3-10 nodes: 38 minutes
- Medium Spark pool (8 vCores / 64 GB), 3-50 nodes: 30 minutes
- Large Spark pool (16 vCores / 128 GB), 3-70 nodes: 25 minutes
The cost of running the medium and large Spark pools is prohibitive. Our goal is to refresh our data model every half hour.
How can we optimize the notebooks? The bottleneck is the time it takes to write the tables to the data lake.
Is there any way to run the loading of the tables in parallel instead of sequentially? Should we abandon serverless SQL pools and use dedicated ones instead? Any suggestions would be greatly appreciated.
Thanks
CodePudding user response:
As mentioned by @Veikko, the root cause is the transformations using stored procedure-like logic, not the time it takes to load the data into the data lake. The bulk of the time is spent loading the data frames from the data lake and performing the SQL transformations.
To test this, I ran two notebooks that loaded identical amounts of data into the data lake:
Notebook 1:
- defined 4 functions
- loaded 29 dataframes into memory
- converted the dataframes into temporary SQL views
- created 21 temporary SQL views that perform the transformations
- created the table in the data lake (fact_equipment)
- dropped the temporary SQL views and deleted the dataframes from memory
Total time = 5 minutes
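In stripped-down form, the Notebook 1 pattern looks roughly like this sketch (the table, view, and path names are illustrative, not the real ones):

```python
# `spark` is the SparkSession provided by the Synapse notebook runtime
source_tables = ["equipment", "equipment_type", "site"]   # 29 source tables in the real notebook
gold_path = "abfss://gold@<storageaccount>.dfs.core.windows.net/fact_equipment/"

# load the source tables and expose them as temporary views
for name in source_tables:
    spark.read.parquet(f"abfss://silver@<storageaccount>.dfs.core.windows.net/{name}/") \
         .createOrReplaceTempView(name)

# transformation layer: temporary views stacked on the source views (21 in the real notebook)
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW vw_fact_equipment AS
    SELECT e.equipment_id, t.type_name, s.site_name
    FROM equipment e
    JOIN equipment_type t ON e.type_id = t.type_id
    JOIN site s ON e.site_id = s.site_id
""")

# materialize the final view in the data lake as fact_equipment
spark.table("vw_fact_equipment").write.mode("overwrite").format("parquet") \
     .option("path", gold_path).saveAsTable("fact_equipment")

# cleanup
for name in source_tables + ["vw_fact_equipment"]:
    spark.catalog.dropTempView(name)
```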
Notebook 2:
- defined 2 functions
- loaded 1 dataframe into memory
- converted the dataframe into a temporary SQL view
- created 1 temporary SQL view (SELECT * FROM fact_equipment)
- created the table in the data lake (fact_equipment)
Total time = 1 minute
So my primary issue is how to speed up the transformations, and my secondary issue is finding a way to run notebooks in parallel. I'll post another question showing how I did my transformations to see what I can do to make them run more quickly.
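On the parallel side, one commonly suggested pattern (which I have not benchmarked here) is a driver notebook that calls the loading notebooks through mssparkutils.notebook.run inside a thread pool; the notebook names, timeout, and parameters below are placeholders:

```python
# Driver notebook sketch: run several loading notebooks concurrently
from concurrent.futures import ThreadPoolExecutor
from notebookutils import mssparkutils  # available in the Synapse Spark runtime

notebooks = ["load_fact_equipment", "load_fact_orders", "load_dim_site"]  # placeholders

def run_notebook(path):
    # mssparkutils.notebook.run(path, timeout_in_seconds, parameters)
    return mssparkutils.notebook.run(path, 1800, {})

with ThreadPoolExecutor(max_workers=len(notebooks)) as pool:
    exit_values = list(pool.map(run_notebook, notebooks))

print(exit_values)
```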
CodePudding user response:
There are two approaches to loading and transforming data: ETL and ELT. The best practice for larger data sets is to use ELT instead of ETL, for the reason described below.
Traditional SMP dedicated SQL pools use an Extract, Transform, and Load (ETL) process for loading data. Synapse SQL, within Azure Synapse Analytics, uses a distributed query processing architecture that takes advantage of the scalability and flexibility of compute and storage resources.
Using an Extract, Load, and Transform (ELT) process leverages built-in distributed query processing capabilities and eliminates the resources needed for data transformation prior to loading.
For a quick overview, see:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-elt-data-loading
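As a rough illustration of the ELT ordering, expressed in the notebook context used in the question rather than the dedicated SQL pool tooling the article describes (names, paths, and the filter column are placeholders):

```python
# ELT ordering sketch: land the extracted data in the lake unchanged first,
# then let the distributed engine transform it in place.
bronze = "abfss://bronze@<storageaccount>.dfs.core.windows.net/equipment/"
silver = "abfss://silver@<storageaccount>.dfs.core.windows.net/equipment/"
gold   = "abfss://gold@<storageaccount>.dfs.core.windows.net/fact_equipment/"

# Extract + Load: copy the data as-is, with no transformation work on the way in
spark.read.parquet(bronze).write.mode("overwrite").parquet(silver)

# Transform: run the transformation as a distributed query over the loaded data
spark.read.parquet(silver).createOrReplaceTempView("equipment")
spark.sql("SELECT equipment_id, site_id FROM equipment WHERE active = 1") \
     .write.mode("overwrite").parquet(gold)
```

The point is that the load step does no transformation work; the transformation runs afterwards as a distributed query over data that is already in the lake.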