import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;
import org.apache.hadoop.hbase.util.Base64;
import org.apache.spark.api.java.JavaPairRDD;

// conf is an existing Hadoop Configuration, sc an existing JavaSparkContext
String tableName = "testTable";
Scan scan = new Scan();
scan.setCaching(10000);      // rows fetched per RPC
scan.setCacheBlocks(false);  // skip the block cache for a full scan
conf.set(TableInputFormat.INPUT_TABLE, tableName);
// TableInputFormat expects the Scan serialized as a Base64 string
ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
String scanToString = Base64.encodeBytes(proto.toByteArray());
conf.set(TableInputFormat.SCAN, scanToString);
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
    sc.newAPIHadoopRDD(conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);
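(Incidentally, the ProtobufUtil/Base64 steps above are what HBase's own MapReduce helper does internally; if the hbase mapreduce classes are on the classpath, the same serialization should be expressible in one line, a sketch, not verified against every HBase version:)

import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));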
We use Spark's standard Hadoop input interface to read HBase table data (a full-table scan). Reading roughly 500 million rows takes 20+ minutes, while reading the same data stored in Hive takes under 1 minute; the performance gap is enormous.
The project has essentially settled on HBase as its big-data store, yet Spark reads from HBase are this slow. I have tried reading HFiles directly, but even parsing a single HFile is slow (about 90 s for roughly 400 MB of data). Is there any other solution? Or should Spark simply not be paired with HBase as the underlying storage?
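For context, the direct HFile read mentioned above would look roughly like this. A minimal sketch against the HBase 1.x internal API; the path below is hypothetical, and reading store files directly bypasses the region server (and any cells still in the memstore):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

Configuration conf = HBaseConfiguration.create();
FileSystem fs = FileSystem.get(conf);
// Hypothetical path: /hbase/data/<namespace>/<table>/<region>/<cf>/<hfile>
Path hfilePath = new Path("/hbase/data/default/testTable/region/cf/hfile");
HFile.Reader reader = HFile.createReader(fs, hfilePath, new CacheConfig(conf), conf);
HFileScanner scanner = reader.getScanner(false, false); // no block cache, no pread
long cells = 0;
if (scanner.seekTo()) {          // position at the first cell
    do {
        scanner.getKeyValue();   // current cell; real code would decode it
        cells++;
    } while (scanner.next());
}
reader.close();
System.out.println("cells read: " + cells);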
CodePudding user response:
Has your problem been solved? Could you share your experience?
CodePudding user response:
The latest HBase ships a Spark integration (the hbase-spark module); see the sketch below.
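A minimal sketch of that route via the hbase-spark module's JavaHBaseContext, assuming an hbase-spark artifact matching your HBase and Spark versions is on the classpath and sc is an existing JavaSparkContext:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaHBaseContext hbaseContext = new JavaHBaseContext(sc, HBaseConfiguration.create());
Scan scan = new Scan();
scan.setCaching(10000);
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> rdd =
    hbaseContext.hbaseRDD(TableName.valueOf("testTable"), scan);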
CodePudding user response:
Take a look at the HBase docs.