HBase RowKey design principles

1. Overview
HBase stores data in three-dimensional sorted order. Data in HBase can be located quickly through three dimensions: the rowkey (row key), the column key (column family and qualifier), and the timestamp.

A rowkey uniquely identifies a row. There are several ways to query HBase:
With the get method: specify a rowkey to fetch a single record
With a scan: set the startRow and stopRow parameters to match a range of rows
With a full table scan: scan every row in the table directly
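
The three access paths can be pictured with a small in-memory model. This is only a sketch in Python, not the HBase client API: the names `table`, `get`, and `scan` are hypothetical, and it assumes HBase's default scan semantics, where startRow is inclusive and stopRow is exclusive.

```python
import bisect

# Hypothetical in-memory stand-in for a table: rowkeys kept in sorted byte order.
table = {b"user01": "a", b"user02": "b", b"user03": "c", b"user10": "d"}
sorted_keys = sorted(table)


def get(rowkey):
    """Point lookup: exactly one rowkey, at most one record."""
    return table.get(rowkey)


def scan(start_row=b"", stop_row=None):
    """Range scan over sorted keys: start_row inclusive, stop_row exclusive."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    if stop_row is None:
        hi = len(sorted_keys)
    else:
        hi = bisect.bisect_left(sorted_keys, stop_row)
    return [(k, table[k]) for k in sorted_keys[lo:hi]]


one_row = get(b"user02")
some_rows = scan(b"user02", b"user10")
all_rows = scan()  # no bounds: equivalent to a full table scan
```

The point lookup and the bounded scan only touch the keys they need, while the unbounded scan walks everything, which is why the rowkey layout matters so much for query cost.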

2. Rowkey length principle
A rowkey is a binary byte stream and can be any string, with a maximum length of 64 KB. In practice it is usually 10-100 bytes, stored as a byte[], and is generally designed to be fixed-length.

The shorter the better, preferably no more than 16 bytes, for the following reasons:
Persisted data files (HFiles) store data as KeyValues. If the rowkey is too long, for example 100 bytes, then for 10 million rows the rowkeys alone take 100 * 10,000,000 = 1 billion bytes, nearly 1 GB, which greatly hurts HFile storage efficiency;
The MemStore caches data in memory. If the rowkey field is too long, effective memory utilization drops, the system can cache less data, and retrieval efficiency falls;
Current operating systems are mostly 64-bit, with memory aligned on 8 bytes. Keeping the rowkey at 16 bytes, an integer multiple of 8 bytes, makes the best use of this operating system characteristic.
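
The storage arithmetic above can be checked with a few lines of Python. This is a rough sketch: `rowkey_bytes` is a hypothetical helper, `cells_per_row` is an assumed parameter, and the figure ignores compression and block encoding.

```python
def rowkey_bytes(rowkey_len, rows, cells_per_row=1):
    """Bytes spent on rowkeys alone; each stored KeyValue carries its own copy."""
    return rowkey_len * rows * cells_per_row


rows = 10_000_000  # 10 million rows
long_keys = rowkey_bytes(100, rows)   # 100-byte rowkeys
short_keys = rowkey_bytes(16, rows)   # 16-byte rowkeys
print(long_keys, short_keys)
```

Since every cell repeats the rowkey, a row with several columns multiplies the cost, which is what `cells_per_row` stands for here.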

3. Rowkey hash principle
If the rowkey grows with a timestamp, do not put the time at the front of the binary key. It is recommended to use the high-order part of the rowkey as a hash field, generated randomly by the program, and to put the time field in the low-order part. This improves the chance that data is spread evenly across the RegionServers, which balances the load. Without a hash field, if the first field is the time itself, all data concentrates on one RegionServer. At retrieval time the load then falls on individual RegionServers, which causes hot spots and reduces query efficiency.
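
A small simulation makes the difference visible. It is only a sketch under stated assumptions: the split points, the one-digit high-order field, and the `region_for` helper are all hypothetical, and a real table's region boundaries depend on how it was split.

```python
import bisect
import random

# Hypothetical split points for a table pre-split into 4 regions.
SPLIT_POINTS = [b"1", b"2", b"3"]


def region_for(rowkey):
    """Index of the region whose key range contains rowkey."""
    return bisect.bisect_right(SPLIT_POINTS, rowkey)


rng = random.Random(42)  # seeded only so the sketch is repeatable
timestamps = [1700000000000 + i for i in range(1000)]

# Time first: every key starts with the same leading digits.
time_first = [str(ts).encode() for ts in timestamps]
# Random high-order field (0-3) first, time in the low-order part.
prefixed = [f"{rng.randrange(4)}{ts}".encode() for ts in timestamps]

regions_time_first = {region_for(k) for k in time_first}
regions_prefixed = {region_for(k) for k in prefixed}
print(regions_time_first, regions_prefixed)
```

All time-first keys land in a single region, while the prefixed keys are spread over all four.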

4. Rowkey uniqueness principle
The rowkey must be unique by design. Rowkeys are stored in dictionary (lexicographic) order, so the design should take full advantage of this sorting: store data that is often read together in one place, and keep recently written data that is likely to be accessed together in one place.
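
Dictionary order is a byte-by-byte comparison, which can surprise when keys contain numbers of varying width. The sketch below uses made-up ids and an assumed padding width of 5 digits to show why fixed-length keys sort the way one usually wants.

```python
ids = [1, 2, 9, 10, 100]

# Variable-width keys: byte order does not match numeric order.
raw_keys = sorted(str(i).encode() for i in ids)

# Fixed-width keys, zero-padded to a hypothetical width of 5 digits.
padded_keys = sorted(f"{i:05d}".encode() for i in ids)

print(raw_keys)
print(padded_keys)
```

With the raw keys, "10" and "100" sort before "2". With padded keys, byte order and numeric order agree.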

5. What is a hot spot

Rows in HBase are sorted by rowkey in dictionary order. This design optimizes scan operations: related rows, and rows that will be read together, can be stored next to each other, which makes scanning easy. However, a badly designed rowkey is the source of hot spots. A hot spot occurs when a large number of clients directly access one or a few nodes of the cluster (the access may be reads, writes, or other operations). The heavy traffic pushes the single machine hosting the hot region beyond its capacity, causing performance degradation or even making the region unavailable. It also affects the other regions on the same RegionServer, because the host cannot serve their requests. Data access patterns should be designed so that the cluster is used fully and evenly.
To avoid write hot spots, design the rowkey so that rows that really need to be together stay in the same region, while in the bigger picture, with more data, writes go to multiple regions of the cluster instead of just one.
Here are some common methods for avoiding hot spots, with their advantages and disadvantages:

5.1 Salting
Salting here has nothing to do with cryptography. It means adding random data to the front of the rowkey: specifically, assigning the rowkey a random prefix so that it sorts differently than it otherwise would. The number of possible prefixes should match the number of regions you want to spread the data across. After salting, rowkeys are scattered across the regions according to their randomly generated prefix, which avoids hot spots.
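
A minimal sketch of salting in Python follows. The bucket count, the `N-` prefix format, and the helper names are assumptions for illustration only. It also shows the cost on the read side: because the salt is random, a reader cannot know which prefix a row received and has to try every one.

```python
import random

NUM_BUCKETS = 4  # assumption: one salt value per target region


def salted_key(rowkey, rng=random):
    """Write path: prepend a randomly chosen one-digit salt."""
    return f"{rng.randrange(NUM_BUCKETS)}-".encode() + rowkey


def candidate_keys(rowkey):
    """Read path: the salt is random, so every possible prefix must be tried."""
    return [f"{b}-".encode() + rowkey for b in range(NUM_BUCKETS)]


written = salted_key(b"20240101-order-1", random.Random(7))
candidates = candidate_keys(b"20240101-order-1")
```

Salting therefore trades cheaper, better balanced writes for more expensive reads.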
5.2 Hashing
Hashing makes the same row always get the same prefix. A hash spreads the load across the whole cluster, yet reads stay predictable: with a deterministic hash, the client can reconstruct the complete rowkey and use a get operation to fetch exactly that row.
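
The deterministic variant can be sketched as follows. The choice of MD5, the bucket count, and the `hashed_key` helper are assumptions made for the example, not something the text prescribes.

```python
import hashlib

NUM_BUCKETS = 4  # assumption: number of prefixes to spread rows across


def hashed_key(rowkey):
    """Prepend a bucket derived from a digest of the rowkey itself."""
    bucket = hashlib.md5(rowkey).digest()[0] % NUM_BUCKETS
    return f"{bucket}-".encode() + rowkey


first = hashed_key(b"20240101-order-1")
second = hashed_key(b"20240101-order-1")
```

Because the prefix depends only on the original rowkey, `first` and `second` are identical, so a client that knows the original key can rebuild the full key and issue a single get.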
5.3 Reversing
The third way to prevent hot spots is to reverse a fixed-length or numeric rowkey, so that the part of the rowkey that changes most often (the least significant part) comes first. This effectively randomizes the rowkeys, but at the expense of their ordering.
An example is using a mobile phone number as the rowkey: the reversed phone number string can be used as the rowkey, which avoids the hot spot caused by phone numbers that all begin with the same fixed prefix.
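
The phone number case can be sketched in a couple of lines of Python. The numbers are made up and `reversed_key` is a hypothetical helper.

```python
def reversed_key(phone):
    """Reverse the digit string so the fastest-changing digits come first."""
    return phone[::-1].encode()


phones = ["13800000001", "13800000002", "13800000003"]  # made-up numbers
keys = [reversed_key(p) for p in phones]
print(keys)
```

The three numbers share the prefix "138", but their reversed keys start with three different digits. Reversing the key again recovers the original number, so nothing is lost except the original sort order.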
5.4 Timestamp reversal
A common data processing problem is quickly fetching the most recent version of a piece of data. Using a reversed timestamp as part of the rowkey is very useful here: append Long.MAX_VALUE - timestamp to the end of the key, for example [key][reverse_timestamp]. The latest value for [key] can then be found by scanning [key] and taking the first record, because HBase rowkeys are sorted and the first record is the most recently written one.
For example, to keep a user's operation records sorted by operation time, the rowkey can be designed as
[reversed userId][Long.MAX_VALUE - timestamp]. To query all of a user's operation records, specify the reversed userId directly: startRow is [reversed userId][000000000000] and stopRow is [reversed userId][Long.MAX_VALUE - timestamp]
To query the operation records for a given period, startRow is [reversed userId][Long.MAX_VALUE - end time] and stopRow is [reversed userId][Long.MAX_VALUE - start time]. The later time gives the smaller reversed value, so it must be the startRow.
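
The sketch below works through the [reversed userId][Long.MAX_VALUE - timestamp] layout in Python. The user id, the timestamps, the zero-padding to 19 digits, and the helper names are assumptions for illustration. Note that a scan's startRow must be the smaller key, which is the one built from the later time, and that stopRow is exclusive by default.

```python
LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE


def op_key(user_id, ts_millis):
    """[reversed userId][Long.MAX_VALUE - timestamp], zero-padded to 19 digits."""
    reverse_ts = LONG_MAX - ts_millis
    return f"{user_id[::-1]}{reverse_ts:019d}".encode()


def scan_bounds(user_id, start_millis, end_millis):
    """Bounds for start_millis < timestamp <= end_millis.

    The returned startRow is inclusive and the stopRow is exclusive.
    """
    return op_key(user_id, end_millis), op_key(user_id, start_millis)


keys = sorted(op_key("u1001", ts) for ts in (1000, 2000, 3000))
# Strip the 5-character reversed user id and undo the reversal.
newest_first = [LONG_MAX - int(k[5:]) for k in keys]

start_row, stop_row = scan_bounds("u1001", 1000, 3000)
in_range = [LONG_MAX - int(k[5:]) for k in keys if start_row <= k < stop_row]
print(newest_first, in_range)
```

Sorting the keys puts the newest operation first, and the range check returns the records at 3000 and 2000 while excluding the one at 1000, because the stop key is exclusive.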
Some other advice
Keep rows and columns in HBase as small as possible. A value always travels together with its key: whenever a specific value is passed through the system, its rowkey, column name, and timestamp are passed along with it. If your rowkeys and column names are very large, even comparable in size to the values themselves, you will run into some interesting problems. The indexes on HBase storefiles (which support random access) can end up taking a large share of the memory allocated to HBase, because the keys attached to each value are so large. You can increase the block size so that storefile index entries occur at larger intervals, or modify the table model to minimize the size of rowkeys and column names. Compression also helps with large indexes.
Keep column family names as short as possible, preferably a single character
Long attribute names are more readable, but shorter attribute names are better for storage in HBase

CodePudding user response:

Summary is very detailed

CodePudding user response:

Studied it

CodePudding user response:

The summary is very detailed, learning from it