Take more partitions, spark through JDBC RDD serious data skew, everyone to see-CodePudding

Spark version 2.4.5, oracle version 11.2.0.4, spark through JDBC fetch the data from oracle, use oracle rowid way to take over, points and 12 district, the method is such SUBSTR (rowid, 1)), 16, fetch the data there is no problem, take back the data in theory should be generated according to the 16 partitions 16 RDD uniform store? Now my situation is 16 RDD partition is generated, but all the data in the 0 partition, which is in the first partition, 15 other partitions are empty, you everybody to help me see what's going on. There is a problem is to take over should not return an integer? Why into the RDD remainder instead of 9, 9.000000000 code and data are as follows:

JdbcDF=spark. Read. The format (" JDBC ") \
Option (" driver "and" oracle. JDBC. OracleDriver ") \
Option (" url ", "JDBC: oracle: thin: @//172.28.88.26:1521/DSHIELD") \
Option (" dbtable ", "(SELECT the MOD (ASCII (SUBSTR (ROWID, 1)), 20) RN, a. * FROM CFA_PERSONBASEINFO_2 A)") \
Option (" user ", "FISS_NEW") \
Option (" password ", "FISS_NEW") \
Option (" numPartitions ", 16) \
Option (" partitionColumn ", "RN") \
LowerBound. Option (" ", 1) \
Option (" upperBound ", 1000000),
Option (" fetchsize ", 100000),
The load ()

Row(RN=Decimal('9.0000000000'), PERSONID=Decimal('25'), NAME='王艳的', EVERNAME='25', CARDTYPE='I', CARDID='430700199308225054', BIRTHDAY='1993-08-22', SEX='M', GETJOBTIME='26', WORKINGYEARS='19', NATIONLITY='中国', BIRTHPLACE='广东揭阳', NATION='汉族', POLITY='Q', FOREIGNLANG='英语', HIGHESTDEGREE='U', HIGHSTUDY=None, TECHNICRANK='N', ADDRESS='河北省邯郸市邯山区宜都花园8号楼2单元512', POSTCODE='528248', TELEPHONE='0757-28785008', CELLPHONE='13690170786', EMAIL='[email protected]', LINKMAN='叶颖怡', LINKMANPHONE='13392231816', ADDRESSBIRTH='广州', FAMILYINFO=None, UPDATEDATE='2015-09-28', UPDATETIME='21:10:21', MODIFYID='1064818', MODIFYTYPE='M', MODIFYSTATE='X', MODIFYUSER='G01137-B02')

CodePudding user response:

Why into the RDD remainder instead of 9 9.000000000 because you are using Oracle () function returns a decimal
MOD (ASCII (SUBSTR (ROWID, 1)), 20) you this fog, I why want to die 20?

 
Val numPartition=16 
JdbcDF=spark. Read. The format (" JDBC ") \ 
Option (" driver "and" oracle. JDBC. OracleDriver ") \ 
Option (" url ", "JDBC: oracle: thin: @//172.28.88.26:1521/DSHIELD") \ 
Option (" dbtable, "s" (SELECT the MOD (ASCII (SUBSTR (ROWID, 1)), ${numPartition}) RN, a. * FROM CFA_PERSONBASEINFO_2 A) ") \ 
Option (" user ", "FISS_NEW") \ 
Option (" password ", "FISS_NEW") \ 
Option (" numPartitions, "numPartition) \ 
Option (" partitionColumn ", "RN") \ 
Option (" lowerBound ", 0) \ 
Option (" upperBound, "numPartition) \ 
Option (" fetchsize ", 100000), 
The load ()

Have a try

CodePudding user response:

reference 1st floor LinkSe7en response:

why into the RDD remainder instead of 9 9.000000000 because you are using Oracle () function returns a decimal
MOD (ASCII (SUBSTR (ROWID, 1)), 20) you this fog, I why want to die 20?

 
Val numPartition=16 
JdbcDF=spark. Read. The format (" JDBC ") \ 
Option (" driver "and" oracle. JDBC. OracleDriver ") \ 
Option (" url ", "JDBC: oracle: thin: @//172.28.88.26:1521/DSHIELD") \ 
Option (" dbtable, "s" (SELECT the MOD (ASCII (SUBSTR (ROWID, 1)), ${numPartition}) RN, a. * FROM CFA_PERSONBASEINFO_2 A) ") \ 
Option (" user ", "FISS_NEW") \ 
Option (" password ", "FISS_NEW") \ 
Option (" numPartitions, "numPartition) \ 
Option (" partitionColumn ", "RN") \ 
Option (" lowerBound ", 0) \ 
Option (" upperBound, "numPartition) \ 
Option (" fetchsize ", 100000), 
The load ()

Try

Mod20 is to divide 20 area, good read data parallel, now has identified, is a question of upperbound, is changed to recommend to you is the same as partition number. Now meet another problem, the spark to read data quickly, processing RDD also soon, but after processing the data written to the oracle again, cause severe congestion, have what good method can spark even write

CodePudding user response:

refer to the second floor liangmin11 response:

Quote: refer to 1st floor LinkSe7en response:

Why into the RDD remainder instead of 9 9.000000000 because you are using Oracle () function returns a decimal
MOD (ASCII (SUBSTR (ROWID, 1)), 20) you this fog, I why want to die 20?

Val numPartition=16
JdbcDF=spark. Read. The format (" JDBC ") \
Option (" driver "and" oracle. JDBC. OracleDriver ") \
Option (" url ", "JDBC: oracle: thin: @//172.28.88.26:1521/DSHIELD") \
Option (" dbtable, "s" (SELECT the MOD (ASCII (SUBSTR (ROWID, 1)), ${numPartition}) RN, a. * FROM CFA_PERSONBASEINFO_2 A) ") \
Option (" user ", "FISS_NEW") \
Option (" password ", "FISS_NEW") \
Option (" numPartitions, "numPartition) \
Option (" partitionColumn ", "RN") \
Option (" lowerBound ", 0) \
Option (" upperBound, "numPartition) \
Option (" fetchsize ", 100000),
The load ()

Try

Mod20 is to divide 20 area, good read data parallel, now has identified, is a question of upperbound, is changed to recommend to you is the same as partition number. Now meet another problem, the spark to read data quickly, processing RDD also soon, but after processing the data written to the oracle again, cause severe congestion, have what good method can spark even write

Look into the Oracle of time how many Task (partition), too much is not good, in df. Before write to df. Repartition (xx.) write would be much better, specific how many, to look at the performance of your Oracle, to less than other access Oracle application of JDBC connection pool maximum number of connections,