Spark reads HBase data but can't print it

Time:09-18

Following the official documentation, I wrote the code below:

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// read the data into an RDD
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

val count = hBaseRDD.count()
println(count)

hBaseRDD.foreach { case (_, result) =>
  // get the row key
  val key = Bytes.toString(result.getRow)
  // get the cell values by column family and qualifier
  val name = Bytes.toString(result.getValue("A".getBytes, "name".getBytes))
  val age = Bytes.toInt(result.getValue("A".getBytes, "age".getBytes))
  println("Row key: " + key + " name: " + name + " age: " + age)
}

The job runs successfully, but the per-row details are never printed. The log is as follows:

18/05/14 16:26:50 INFO DAGScheduler: ResultStage 0 (count at test.scala:64) finished in 2.515 s
18/05/14 16:26:50 INFO DAGScheduler: Job 0 finished: count at test.scala:64, took 2.642359 s
18/05/14 16:26:50 INFO SparkContext: Starting job: foreach at test.scala:71
18/05/14 16:26:50 INFO DAGScheduler: Got job 1 (foreach at test.scala:71) with 1 output partitions
18/05/14 16:26:50 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at test.scala:71)
18/05/14 16:26:50 INFO DAGScheduler: Parents of final stage: List()
18/05/14 16:26:50 INFO DAGScheduler: Missing parents: List()
18/05/14 16:26:50 INFO DAGScheduler: Submitting ResultStage 1 (NewHadoopRDD[0] at newAPIHadoopRDD at test.scala:60), which has no missing parents
18/05/14 16:26:50 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.1 KB, free 897.2 MB)
18/05/14 16:26:50 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1334.0 B, free 897.2 MB)
18/05/14 16:26:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.251.6.153:56001 (size: 1334.0 B, free: 897.6 MB)
18/05/14 16:26:50 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
18/05/14 16:26:50 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (NewHadoopRDD[0] at newAPIHadoopRDD at test.scala:60) (first 15 tasks are for partitions Vector(0))
18/05/14 16:26:50 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
18/05/14 16:26:50 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 10.124.130.14, executor 2, partition 0, ANY, 4919 bytes)
18/05/14 16:26:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.124.130.14:54410 (size: 1334.0 B, free: 366.3 MB)
18/05/14 16:26:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.124.130.14:54410 (size: 29.8 KB, free: 366.3 MB)
18/05/14 16:26:52 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 2453 ms on 10.124.130.14 (executor 2) (1/1)
18/05/14 16:26:52 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/05/14 16:26:52 INFO DAGScheduler: ResultStage 1 (foreach at test.scala:71) finished in 2.454 s
18/05/14 16:26:52 INFO DAGScheduler: Job 1 finished: foreach at test.scala:71, took 2.469872 s


The count (shown in red in the original post) matches the total number of rows in the HBase table, yet the foreach prints no data at all. I have been checking this for several days without finding the problem.
Any guidance would be greatly appreciated.

CodePudding user response:

What is { case (_, result) => { supposed to be doing? You don't know what values are actually inside `result`, so what exactly are you reading out of it?

CodePudding user response:

In reply to the 1st floor (qq_39869388):
> What is { case (_, result) => { supposed to be doing? You don't know what values are actually inside `result`, so what exactly are you reading out of it?

I suspected that too, but it shouldn't be the problem:
1. The count above returns the correct number of rows.
2. The example I followed online is this one: https://blog.csdn.net/u013468917/article/details/52822074

So what else could be the cause?

CodePudding user response:

No one knows? Bumping my own thread.

CodePudding user response:

There is no value in `result`.

CodePudding user response:

Has the data actually been written into HBase?
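One way to answer that question is to bypass Spark entirely. Here is a minimal sketch using the plain HBase client; the table name "test" is an assumption (the thread never names the table), and the column family "A" is taken from the code above. It scans the first few rows directly, so if this prints nothing either, the problem is the data, not Spark.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical table name "test"; column family "A" as in the poster's code.
val hbaseConf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(hbaseConf)
val table = connection.getTable(TableName.valueOf("test"))
val scanner = table.getScanner(new Scan())

// Peek at the first few rows only, without pulling the whole table.
var seen = 0
var row = scanner.next()
while (row != null && seen < 5) {
  val key  = Bytes.toString(row.getRow)
  val name = Bytes.toString(row.getValue("A".getBytes, "name".getBytes))
  println(s"$key -> $name")
  seen += 1
  row = scanner.next()
}

scanner.close()
table.close()
connection.close()
```

This needs the HBase client jars on the classpath and a reachable cluster (the configuration is picked up from hbase-site.xml), so it is a debugging aid rather than something to leave in the job.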

CodePudding user response:

RDD operations are lazy. Is foreach actually triggering the computation?

CodePudding user response:

map and foreach on an RDD run distributed across the executors, so any println inside them goes to the executor logs, not your console. If you want to see the output locally, collect() the data back to the driver first, then foreach over it.
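The driver-vs-executor distinction above can be sketched with a toy RDD standing in for the HBase one (the app name and local[2] master are just for the demo; paste into spark-shell or run with spark-submit):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("print-demo").setMaster("local[2]")
val sc = new SparkContext(sparkConf)

val rdd = sc.parallelize(Seq(("row1", "alice"), ("row2", "bob")))

// Runs on the executors: on a real cluster these lines land in each
// executor's stdout, which is why the original foreach seemed silent.
rdd.foreach { case (key, name) => println(s"executor side: $key -> $name") }

// collect() ships the rows back to the driver, so this println is
// guaranteed to appear in the driver's console.
val collected = rdd.collect()
collected.foreach { case (key, name) => println(s"driver side: $key -> $name") }

sc.stop()
```

Note that collect() pulls the entire RDD into driver memory, so it is fine for debugging a small table, but on a large HBase table prefer take(n) to inspect just a few rows.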

CodePudding user response:

Your foreach runs on the executors. Open the Spark History Server page, find the corresponding application, go to the Executors page, and check each executor's stdout/stderr: that is where the printed output ends up.

CodePudding user response:

Change hBaseRDD.foreach { case (_, result) =>
to hBaseRDD.collect().foreach { case (_, result) => and it should work.

CodePudding user response:

hBaseRDD.collect().foreach { case (_, result) =>