Consult SparkSQL performance problems

Time:10-08

My requirement, developed on SparkSQL, is as follows:
1. There are two tables with identical structure but different business meanings. Table A holds the billing details of an e-commerce system and is extracted into our big-data platform from system A every day; table B holds the billing details extracted from the financial settlement system.
2. Using the SparkSQL APIs, compare A and B dimension by dimension and find the differences (no differences at all would of course be the best outcome).
3. For small data volumes, e.g. when A and B contain only dozens of billing records, I have already finished developing and testing this. The basic idea: load the A and B tables into DataFrames, then use the intersect(), join(), and filter() APIs to find the differences.
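As an aside, the row-level difference between two same-schema DataFrames can also be expressed with except(), the set-difference counterpart of intersect(). A minimal sketch of the comparison idea described above (the table contents and column names here are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BillingDiff {
  // Rows present in `a` but missing from `b`, and vice versa.
  // except() compares whole rows, so the two schemas must match.
  def diff(a: DataFrame, b: DataFrame): (DataFrame, DataFrame) =
    (a.except(b), b.except(a))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("billing-diff")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the real billing tables (hypothetical data).
    val a = Seq(("o1", 10.0), ("o2", 20.0)).toDF("order_id", "amount")
    val b = Seq(("o1", 10.0), ("o2", 25.0)).toDF("order_id", "amount")

    val (onlyInA, onlyInB) = diff(a, b)
    onlyInA.show() // the ("o2", 20.0) row, which disagrees with B
    onlyInB.show() // the ("o2", 25.0) row

    spark.stop()
  }
}
```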

My questions are:
1. If the data volume is large, say tens of millions of billing records, is loading table A or B from the database done on a single Worker, or across multiple Workers? I did not explicitly set any partitioning in the code, and I don't know how to set it up :), so I'm quite nervous:

val options = ... // omitted
val dataFrame = sparkSession.read.format("jdbc").options(options).load()

2. When calling intersect() and join(), which parts of the code should I pay attention to for performance?
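On question 1: with only url/dbtable-style options, Spark's JDBC source reads the whole table through a single connection into one partition, i.e. effectively one task on one Worker. To parallelize the load you can supply the four partitioning options, which make Spark issue numPartitions range queries concurrently. A hedged sketch (the connection details, column name, and bounds are placeholders, not taken from the original post):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

// Placeholder connection settings; adjust for your own database.
val dataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/billing") // hypothetical
  .option("dbtable", "billing_detail_a")              // hypothetical
  .option("user", "reader")
  .option("password", "secret")
  .option("partitionColumn", "id")  // a numeric, date, or timestamp column
  .option("lowerBound", "1")        // roughly min(id)
  .option("upperBound", "50000000") // roughly max(id)
  .option("numPartitions", "32")    // 32 parallel range scans
  .load()
```

Each of the 32 partitions is scanned by a separate task, so the read is spread across the available executors instead of running on a single Worker.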

CodePudding user response:

I never recommend pulling data from JDBC directly into Spark (except for a few small lookup or dimension tables). First do proper data-warehouse construction: extract the data into Hive on a schedule, then manage it with business-field partitions, for example partitioning by time (date) and by the commonly used join-key fields. Pre-partitioning the data according to its access logic ahead of time is the most important performance optimization.
Also, Hive table partitioning, like a database index, follows the leftmost-match principle.
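The advice above — land the extract in Hive, partitioned by date, before running comparisons — might be sketched like this (all table and column names are illustrative, not from the original thread):

```scala
import org.apache.spark.sql.SparkSession

// Hive support must be enabled to write managed Hive tables.
val spark = SparkSession.builder()
  .appName("billing-etl")
  .enableHiveSupport()
  .getOrCreate()

// Pull from the source database once (JDBC options omitted here).
val extracted = spark.read.format("jdbc") /* ...jdbc options... */ .load()

// Persist as a Hive table partitioned by date, so later comparison
// jobs read only the partitions they need instead of re-querying
// the source database.
extracted
  .write
  .mode("overwrite")
  .partitionBy("bill_date")            // date partition, per the advice
  .saveAsTable("dw.billing_detail_a")  // hypothetical Hive table
```

A downstream job can then filter on bill_date and have Spark prune untouched partitions before the join or except step runs.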

CodePudding user response:

Quoting the 1st-floor response from LinkSe7en:
I never recommend pulling data from JDBC directly into Spark (except for a few small lookup or dimension tables). First do proper data-warehouse construction: extract the data into Hive on a schedule, then manage it with business-field partitions, for example partitioning by time (date) and by the commonly used join-key fields. Pre-partitioning the data according to its access logic ahead of time is the most important performance optimization.
Also, Hive table partitioning, like a database index, follows the leftmost-match principle.

Thank you for your advice.