The mechanism behind the join operation between one local table and one DB table

Time:10-22

When I register one table from a local RDD and one table from a DB, the join between the two tables is really slow.

The table from the DB is actually a SQL query that contains multiple joins, and the local RDD has only 20 records.

I am curious about the mechanism behind it.

Do we pull the data from the remote DB and execute all tasks in the local Spark cluster?

Or is there an 'interesting' SQL engine that sends an optimized query to the DB and waits for the query result to come back? In my opinion, that would not explain the slowness, because the query executes really fast in the DB.

CodePudding user response:

Spark SQL runs the join on its own side. When a query joins several tables, Spark first fetches those tables into the cluster and keeps them in memory as RDDs or DataFrames, and only then runs tasks to perform the query operations.

In your example, suppose the first RDD is already in memory but the second table still has to be fetched. The data source asks your SQL engine (the database) for the table. Because that table is really a query with multiple joins, the SQL engine first runs the query on its own side (separate from the Spark cluster) and delivers the result once it is ready. Suppose the SQL engine takes TA seconds to run the query (producing the full result, not just the first rows you see in a SQL client), and moving the data to the Spark cluster (possibly over the network) takes TB seconds. After TA + TB seconds the data is ready for Spark to work on.
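To make the "the database runs the multi-join query first" step concrete, here is a minimal sketch of how such a query is usually handed to Spark's JDBC data source: wrapped as a parenthesised subquery with an alias in the `dbtable` option. The helper name `jdbc_subquery_options`, the connection string, and the SQL text are all made up for illustration, and the Spark calls are left as comments because they need a running `SparkSession`.

```python
def jdbc_subquery_options(url, sql, alias="db_side_result"):
    """Build JDBC reader options so the whole SQL statement runs in the database."""
    # Spark's JDBC source accepts a parenthesised subquery with an alias
    # anywhere a table name is expected in the `dbtable` option.
    cleaned = sql.strip().rstrip(";")
    return {"url": url, "dbtable": f"({cleaned}) {alias}"}


# Hypothetical connection string and query, for illustration only.
opts = jdbc_subquery_options(
    "jdbc:postgresql://db-host:5432/shop",
    "SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id;",
)
print(opts["dbtable"])

# With an existing SparkSession called `spark` (not created here),
# the options would be used roughly like this:
# remote_df = spark.read.format("jdbc").options(**opts).load()
# remote_df.createOrReplaceTempView("remote_table")
```

Spark only sees the result set of the subquery, so the join against the 20-record local table still happens in Spark after the full result has been transferred.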

If TC is the time for the join operation itself, the total time is total = TA + TB + TC. You need to check where your bottleneck is. I think TB may be the critical one for large data.
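As a small worked example of that breakdown, the sketch below adds up the three stages and reports the largest one. The timings are invented placeholders; in practice you would measure each stage yourself (for example, time the query in the DB, then time a plain read into Spark, then the join).

```python
def total_time(ta, tb, tc):
    """Return (total seconds, name of the slowest stage) for total = TA + TB + TC."""
    stages = {
        "TA (query runs in the DB)": ta,
        "TB (result moves to Spark)": tb,
        "TC (join runs in Spark)": tc,
    }
    # The bottleneck is simply the stage with the largest measured time.
    bottleneck = max(stages, key=stages.get)
    return sum(stages.values()), bottleneck


# Made-up measurements in seconds, for illustration only.
total, bottleneck = total_time(ta=4.0, tb=30.0, tc=2.0)
print(f"total = {total:.1f} s, bottleneck = {bottleneck}")
```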

Also, when using a cluster with two or more workers, make sure that all nodes take part in the operation. Sometimes, because of a coding mistake, Spark uses just one node to do the work. Make sure your data is spread out over the cluster so that you benefit from data locality.
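One common reason for the "only one node does the work" situation with a DB table is that a JDBC read without partitioning options arrives as a single partition. A hedged sketch of the options that ask Spark to split the read is shown below; `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` are Spark's JDBC option names, while the helper `partitioned_read_options`, the column name `id`, and the bounds are assumptions for illustration.

```python
def partitioned_read_options(column, lower, upper, num_partitions):
    """Build the four JDBC options that make Spark read a table in parallel."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    # Spark expects option values as strings; the bounds only decide how the
    # column range is split into partitions, they do not filter any rows.
    return {
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


# Hypothetical numeric key column and range, for illustration only.
part_opts = partitioned_read_options("id", 1, 1_000_000, 8)
print(part_opts)

# With an existing SparkSession `spark` and the usual url/dbtable options:
# df = spark.read.format("jdbc").options(url=..., dbtable=..., **part_opts).load()
# print(df.rdd.getNumPartitions())  # check how many partitions Spark created
```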
