Questions about using Spark for data processing

Time:09-21

First of all, the data source is SQL Server, and I'm considering reading it over JDBC.
Spark's built-in DataFrameReader can map a table or a subquery to a DataFrame.
The problem: suppose the table is very large; reading and converting it all in one go may run out of memory and crash.
Is there a good way to read the data and then do the subsequent analysis?
I'd prefer an answer using the Scala API, though Java would also do.

Before this I personally read from the database with Python's pandas read_sql,
because it supports returning an iterator; by traversing the iterator I could
write the data out to local files.
With that approach the data can also be split into chunks, processed in parallel,
and then "reduced" into a single result.
For the corresponding sample code see
http://blog.csdn.net/sinat_30665603/article/details/72794256

I don't know how to achieve the same thing in Spark.

My habit is to treat the data pulled from the database, once localized (serialized),
as objects that can be operated on in a uniform way (preferably not through SQL).
Please correct me where I'm wrong.

Could someone show a similar style of operation with Spark in Scala? I'm hoping to
make good use of multi-core parallelism, and at least be faster than the Python
scripts in the link above.

Also, do you have any good books on this kind of data processing to recommend? English ones preferred.
Thank you

CodePudding user response:


Sorry, my own ability is limited, so I can't help you.



CodePudding user response:

Bumping my own post.

CodePudding user response:

Update: the export can be done with JDBC's own result set (ResultSet), which is basically an iterator.
But I'm still asking for book recommendations and processing approaches.
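
For reference, a minimal sketch of the plain-JDBC export mentioned above, treating the ResultSet as an iterator. The connection string, table and column names (dbo.big_table, id, name) are hypothetical and just for illustration:

import java.sql.DriverManager

object ResultSetExport {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection string; adjust server, database and credentials.
    val url = "jdbc:sqlserver://myhost:1433;databaseName=mydb;user=me;password=secret"
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      // Fetch-size hint; whether rows are actually streamed rather than
      // buffered depends on the JDBC driver's settings.
      stmt.setFetchSize(1000)
      val rs = stmt.executeQuery("SELECT id, name FROM dbo.big_table")
      // Walk the ResultSet like an iterator, one row at a time.
      while (rs.next()) {
        val id = rs.getLong("id")
        val name = rs.getString("name")
        // ... write the row to a local file or process it here
      }
      rs.close()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}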

CodePudding user response:

Although I worked it out myself:
http://blog.csdn.net/sinat_30665603/article/details/74161591

I'll still give out the points.
If you have a good book to recommend, the points go to you.

CodePudding user response:

Please refer to:
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#microsoft-sqlserver-example
If you run out of memory, increase the read parallelism so the data is spread across multiple machines; the chance of overflowing memory drops.
Also look at the push-down optimization part of that article: part of the data can actually be filtered out in advance at read time, which may help you.
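
A minimal Scala sketch of what that page describes, in case it helps: a partitioned JDBC read plus a filter that Spark can push down to the database. The host, table and column names here are made up; partitionColumn, lowerBound, upperBound and numPartitions are the standard Spark JDBC options.

import org.apache.spark.sql.SparkSession

object JdbcReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sqlserver-jdbc-read")
      .getOrCreate()

    val df = spark.read
      .format("jdbc")
      // Hypothetical connection details; the SQL Server JDBC driver
      // must be on the classpath.
      .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.big_table")
      .option("user", "me")
      .option("password", "secret")
      // Split the read into parallel tasks over a numeric column, so no
      // single executor has to hold the whole table in memory.
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .load()

    // This filter can be pushed down to SQL Server, so only matching
    // rows are transferred over the network.
    val recent = df.filter("created_at >= '2017-01-01'")
    println(recent.count())

    spark.stop()
  }
}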

CodePudding user response:

You could also import the data from the database into Hive first.
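
If you go that route, a rough sketch could look like the following; the table names are hypothetical, and it assumes Spark was built with Hive support and a metastore is configured:

import org.apache.spark.sql.SparkSession

object JdbcToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-to-hive")
      .enableHiveSupport() // needs a working Hive metastore
      .getOrCreate()

    // Hypothetical connection details.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.big_table")
      .option("user", "me")
      .option("password", "secret")
      .load()

    // Persist the table in Hive so later jobs read from distributed
    // storage instead of hitting the source database again.
    df.write.mode("overwrite").saveAsTable("analytics.big_table")

    spark.stop()
  }
}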

CodePudding user response:

It depends on whether you have a single mysql instance or a mysql cluster. If you only have one mysql, there is no data locality to exploit, so you can use an RDD: create multiple partitions (partitions are distributed), control in your own logic which node reads which slice of mysql, and then convert the result to a DataFrame; a sketch is below. As for how to split the data, what condition to split on depends on your business. If it is a cluster, then try to follow the data-locality principle.
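
One way to express "create multiple partitions and control which node reads which slice" is Spark's JdbcRDD, which runs a bounded query per partition and can then be converted to a DataFrame. A rough sketch, with hypothetical connection, table and column names:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.sql.SparkSession

object PartitionedJdbcRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-jdbc-rdd").getOrCreate()
    import spark.implicits._

    // Hypothetical mysql connection string; adjust to your environment.
    val url = "jdbc:mysql://myhost:3306/mydb?user=me&password=secret"

    // Each partition runs the bounded query on its own executor, so each
    // id range is fetched and processed on one node.
    val rdd = new JdbcRDD(
      spark.sparkContext,
      () => DriverManager.getConnection(url),
      "SELECT id, name FROM big_table WHERE id >= ? AND id <= ?",
      lowerBound = 1L,
      upperBound = 10000000L,
      numPartitions = 16,
      mapRow = (rs: ResultSet) => (rs.getLong("id"), rs.getString("name"))
    )

    // Convert to a DataFrame for the usual column-oriented analysis.
    val df = rdd.toDF("id", "name")
    df.show()

    spark.stop()
  }
}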