How to improve the performance of spark batch read HBase data

Time:09-26

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;
import org.apache.hadoop.hbase.util.Base64;
import org.apache.spark.api.java.JavaPairRDD;

Configuration conf = HBaseConfiguration.create();
String tableName = "testTable";
Scan scan = new Scan();
scan.setCaching(10000);      // rows fetched per RPC round trip
scan.setCacheBlocks(false);  // do not pollute the block cache on a full scan
conf.set(TableInputFormat.INPUT_TABLE, tableName);
ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
String scanToString = Base64.encodeBytes(proto.toByteArray());
conf.set(TableInputFormat.SCAN, scanToString);
JavaPairRDD<ImmutableBytesWritable, Result> myRDD =
    sc.newAPIHadoopRDD(conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);
Spark reads the HBase table through the standard Hadoop InputFormat interface (a full-table scan). Reading about 500 million rows takes 20+ minutes, while reading the same data stored in Hive takes under 1 minute, so the performance gap is huge.
The project has essentially settled on HBase as its big-data store, yet Spark reads HBase this slowly. I have tried reading the HFiles directly, but even parsing a single HFile is slow (roughly 90 s for 400 MB of data). Is there any other solution? Or should Spark simply not use HBase as its underlying store?
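
For what it is worth, one documented way to scan the HFiles directly from Spark is HBase's TableSnapshotInputFormat, which reads a table snapshot straight from HDFS and bypasses the RegionServers. This is only a sketch, not a tested fix: the snapshot name testTable-snapshot and the restore directory /tmp/snapshot-restore are hypothetical, and conf, scan, and sc are reused from the snippet above.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.mapreduce.Job;

// Prerequisite, from the HBase shell: snapshot 'testTable', 'testTable-snapshot'
Job job = Job.getInstance(conf);
job.getConfiguration().set(TableInputFormat.SCAN,
    TableMapReduceUtil.convertScanToString(scan));
// restoreDir is a scratch directory where links to the snapshot HFiles are created
TableSnapshotInputFormat.setInput(job, "testTable-snapshot",
    new Path("/tmp/snapshot-restore"));
JavaPairRDD<ImmutableBytesWritable, Result> snapshotRDD =
    sc.newAPIHadoopRDD(job.getConfiguration(), TableSnapshotInputFormat.class,
        ImmutableBytesWritable.class, Result.class);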

CodePudding user response:

Did you ever solve this problem? Could you share your experience?

CodePudding user response:

Recent HBase releases provide a Spark connector (the hbase-spark module).
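
For context, that connector ships a JavaHBaseContext that runs distributed scans with one Spark partition per region. A minimal sketch, assuming the hbase-spark artifact matching your HBase version is on the classpath and that sc and conf come from the question's snippet:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Wrap the existing JavaSparkContext and HBase configuration
JavaHBaseContext hbaseContext = new JavaHBaseContext(sc, conf);
Scan scan = new Scan();
scan.setCaching(10000);      // same client-side scan tuning as before
scan.setCacheBlocks(false);
// Distributed full-table scan, parallelized across regions
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> rdd =
    hbaseContext.hbaseRDD(TableName.valueOf("testTable"), scan);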

CodePudding user response:

Take a look at the HBase documentation.