The target table contains 2,494,068 rows, but reading it with newAPIHadoopRDD and calling count returns 1,440,966. What is the reason? Why is the data set read back incomplete?
The code is as follows:
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "protal")
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("hbase.zookeeper.quorum", "data6,data7,data8")

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD.count
I tested two tables: table 1 has over 2 million rows, about 40 GB in total; table 2 also has over 2 million rows, about 10 GB in total.
Neither table 1 nor table 2 can be read completely.
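As a first diagnostic (my suggestion, not something from the original post), it is worth counting distinct rowkeys rather than raw records, since one record is not guaranteed to equal one row. Note that newAPIHadoopRDD reuses the Writable key object, so the key bytes must be copied before any shuffle:

import org.apache.hadoop.hbase.util.Bytes

// Count distinct rowkeys instead of raw records. copyBytes() is needed
// because the record reader reuses the same ImmutableBytesWritable instance.
val distinctRowCount = hBaseRDD
  .map { case (key, _) => Bytes.toStringBinary(key.copyBytes()) }
  .distinct()
  .count()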
CodePudding user response:
Can any expert give me an answer? Just one answer, I beg you! Bumping my own thread.
CodePudding user response:
RDD.count returns the number of HBase rows, i.e. the number of rowkeys; the "target data quantity" you quoted may actually be the total column (cell) count rather than the row count.
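To check which of the two is being counted, one can compare the record count with the total cell count (a sketch reusing hBaseRDD from the question; rawCells() assumes an HBase 0.96+ client, older clients expose raw() instead):

val recordCount = hBaseRDD.count()

// Sum the number of cells in every Result; if 2,494,068 matches this total,
// the table is not actually missing any rows.
val cellCount = hBaseRDD
  .map { case (_, result) => result.rawCells().length.toLong }
  .reduce(_ + _)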
CodePudding user response:
My situation is the opposite of yours: I read with newAPIHadoopRDD and then called count(), and the result came back larger than the number of rows in the HBase table. I later found that Spark does not necessarily return each row as a single record; a row may be read back as several records. I don't know why.
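One known way this can happen (an assumption about that poster's setup, not confirmed in the thread): if the Scan handed to TableInputFormat has a batch size set, a wide row comes back as several partial Result objects, and count() sees each partial as a separate record:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}

val scan = new Scan()
scan.setBatch(100)   // at most 100 cells per Result: rows wider than 100 cells are split
scan.setCaching(500) // rows fetched per RPC; affects throughput only, not record boundaries
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))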
CodePudding user response:
The way I read HBase is a little different from yours. Do you use a Scan? Did you set a Filter? And how do you read the RDD? See the sketch below.
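For comparison, this is how an explicit Scan with a Filter is wired into the same conf before calling newAPIHadoopRDD (a sketch; FirstKeyOnlyFilter is just an example filter that keeps one cell per row, which makes count() a cheap and reliable row count):

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}

val scan = new Scan()
scan.setFilter(new FirstKeyOnlyFilter()) // server-side: return only the first cell of each row
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))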
CodePudding user response:
Could it be a problem with the Spark environment or parameter configuration? Did any WARN messages appear while the job ran, for example shuffle problems?