I have a REST API that retrieves a lot of rows from HBase with a Java client.
I have implemented the following code to convert the HBase rows to an array of arrays:
List<List<String>> finalList = new ArrayList<List<String>>();
ResultScanner scanner = table.getScanner(scan);
List<String> rowSection = new ArrayList<String>();
try {
    for (Result result = scanner.next(); (result != null); result = scanner.next()) {
        rowSection.clear();
        for (Cell cell : result.listCells()) {
            String value = Bytes.toString(CellUtil.cloneValue(cell));
            rowSection.add(value);
        }
        finalList.add(rowSection);
    }
    System.out.println("ARRAY SIZE: " + finalList.size());
} finally {
    if (scanner != null) {
        scanner.close();
    }
}
I can easily convert it to an array of objects, but I'm not sure this is the right way to convert the HBase result.
Is there a more performant way to do it, or are there functions that do it automatically?
CodePudding user response:
I suspect the key performance bottleneck in your code is that you are calling scanner.next() without specifying how many rows you want returned. This way you do a round-trip to the cluster for every single row. It's better to indicate how many rows you want returned at once, so they are all packed into one RPC call. The right value here depends on how large your rows actually are (i.e. column content) and on the memory available on the server. Play around with this setting, but I normally use thousands or tens of thousands for small rows. You should see a significant speedup this way.
Edit with an example for clarity:
This will return 1 row with all columns for that row in one RPC to the server:
Result result = scanner.next();
This will return up to 20 rows, with all related columns for each row, all in a single RPC to the server (note that next(int) returns an array):
Result[] results = scanner.next(20);
If fewer than 20 rows actually exist that satisfy your scan, then the scan will return whatever is available.
So try doing a two-level loop: the outer level checks whether the batch is smaller than what you asked for (i.e. no more results available), while the inner loop iterates through all rows returned per RPC. I haven't tested this, but something like this:
final int HOW_MANY_ROWS = 20;
Result[] results;
do {
    results = scanner.next(HOW_MANY_ROWS);
    for (Result row : results) {
        byte[] rowKey = row.getRow();
        // rest of your code for this row goes here
    }
} while (results.length == HOW_MANY_ROWS);
Something like this should be a lot faster than what you have now, particularly if your scans have to return thousands of rows.
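The two-level loop above can be seen end to end in plain Java, with a hypothetical fetchBatch method standing in for scanner.next(n) so there is no HBase dependency (BatchedScanDemo, fetchBatch and TOTAL_ROWS are all made up for illustration):

```java
public class BatchedScanDemo {
    static int cursor = 0;
    static final int TOTAL_ROWS = 47; // pretend the table holds 47 rows

    // Hypothetical stand-in for scanner.next(n): returns up to n row keys
    // per call, advancing an internal cursor -- like one RPC per batch.
    static String[] fetchBatch(int n) {
        int count = Math.min(n, TOTAL_ROWS - cursor);
        String[] batch = new String[count];
        for (int i = 0; i < count; i++) {
            batch[i] = "row-" + (cursor + i);
        }
        cursor += count;
        return batch;
    }

    // The two-level loop: one fetch per batch, inner loop per row.
    static int scanAll(int batchSize) {
        cursor = 0;
        int processed = 0;
        String[] batch;
        do {
            batch = fetchBatch(batchSize);      // one "RPC" per batch
            for (String rowKey : batch) {
                processed++;                    // per-row work goes here
            }
        } while (batch.length == batchSize);    // short batch => no more rows
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(scanAll(20)); // 3 fetches of 20, 20 and 7 rows
    }
}
```

With a batch size of 20 the 47 rows are fetched in 3 round-trips instead of 47, which is where the speedup comes from.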
CodePudding user response:
First of all there is a bug: finalList
is filled with the same list object multiple times, so at the end it contains the last row's values repeated for every row. That should be done as:
List<List<String>> finalList = new ArrayList<>();
ResultScanner scanner = table.getScanner(scan);
try {
    for (Result result = scanner.next(); result != null; result = scanner.next()) {
        List<String> rowSection = new ArrayList<>();
Then there is the representation of one row. It could be a List<String[]>
(if you know the number of cells), an Object[], or your own RowSection class with int, String and other fields.
And since cloneValue
returns a byte[]
, you could simply use that directly:
List<byte[]> rowSection = new ArrayList<>();
byte[] value = CellUtil.cloneValue(cell);
rowSection.add(value);
There could be more to improve.
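The clear()-versus-new-list bug described above is easy to demonstrate with plain lists, independent of HBase (FreshListDemo and both method names are made up for this sketch):

```java
import java.util.ArrayList;
import java.util.List;

public class FreshListDemo {
    // Buggy pattern: one shared list, cleared and re-added each iteration.
    static List<List<String>> collectShared(String[] rows) {
        List<List<String>> finalList = new ArrayList<>();
        List<String> rowSection = new ArrayList<>();
        for (String row : rows) {
            rowSection.clear();
            rowSection.add(row);
            finalList.add(rowSection); // same object added every time
        }
        return finalList;
    }

    // Fixed pattern: a fresh list per row.
    static List<List<String>> collectFresh(String[] rows) {
        List<List<String>> finalList = new ArrayList<>();
        for (String row : rows) {
            List<String> rowSection = new ArrayList<>();
            rowSection.add(row);
            finalList.add(rowSection);
        }
        return finalList;
    }

    public static void main(String[] args) {
        String[] rows = {"a", "b", "c"};
        System.out.println(collectShared(rows)); // [[c], [c], [c]]
        System.out.println(collectFresh(rows));  // [[a], [b], [c]]
    }
}
```

The buggy version stores three references to one list, so every entry shows whatever the last row wrote into it.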