Chapter 11: Big Data Technology and Practice
1. The definition of big data:
The definition given by the McKinsey Global Institute is: a data set whose scale is so large that its acquisition, storage, management, and analysis are far beyond the capabilities of traditional database software tools. Big data has four major "V" characteristics: large data volume (Volume), fast data change (Velocity), many data types (Variety), and low value density (Value).
2. Big data storage platforms:
(1) HDFS (Hadoop Distributed File System)
Problem: when the amount of data to be stored is so large that a single machine cannot hold it, the data must be cut into blocks and stored across multiple machines in a distributed fashion.
HDFS has three main roles:
NameNode (hereinafter NN): the controller; it is responsible for storing the metadata and does not take part in the actual data transfer between clients and DataNodes.
SecondaryNameNode (hereinafter SNN).
DataNode (hereinafter DN): stores the real data.
The NameNode's main job is to interact with the Client: it accepts client requests and manages the metadata and the cluster (the DataNodes). The NameNode keeps the metadata in memory, but memory is volatile. If the NN persisted the metadata to disk itself, it would have to perform complex IO operations and lock the memory, during which it could not serve external requests, so the efficiency of the whole system would drop sharply. That is clearly unacceptable, and it is why the SecondaryNameNode was introduced.
The SecondaryNameNode's job is to share the NN's workload: it persists all of the metadata except the block location information (which is rebuilt at runtime from the DataNodes' block reports).
The DataNode's job is to store the block data and serve the Client's read and write operations.
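To make this division of labor concrete, here is a minimal sketch of an HDFS client in Java: the client asks the NameNode for metadata only, while the block data itself streams between the client and the DataNodes. The NameNode URI hdfs://localhost:9000 and the file path are illustrative assumptions, not values from this chapter.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client contacts the NameNode (metadata) through this URI;
        // block reads/writes then go directly to the DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path path = new Path("/demo/hello.txt");   // illustrative path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[16];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}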
(2) The backup mechanism
The backup mechanism guarantees data safety. By default HDFS keeps 2 backup copies of each block (three replicas in total, counting the original); when the backup mechanism is enabled, the number of DataNodes must be greater than or equal to the number of backups + 1.
The first backup copy (backup 1) of a block is stored on a random server in a rack different from the one holding the original block (the source).
The second copy (backup 2) is stored on another random server in the same rack as the first copy.
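The replication factor is an ordinary setting, so it can be adjusted per cluster or per file. A hedged sketch using the Java API; the value 3 means one original plus two backups, matching the scheme above, and the file path is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replication for newly created files.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        // Or change the replication of an existing file explicitly.
        fs.setReplication(new Path("/demo/hello.txt"), (short) 3);
        fs.close();
    }
}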
3. HBase (Hadoop's distributed, open-source, multi-version, non-relational database)
An HBase cluster generally consists of one HMaster and multiple HRegionServers; in CAP terms, HBase chooses consistency and partition tolerance (CP).
(1) Client
The Client provides the interface for accessing HBase. It queries the meta table for the location of the target region (this information is cached), then connects to the corresponding RegionServer to read and write data.
When the master rebalances regions, the Client looks the region locations up again.
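A minimal sketch of this client path using the HBase Java client API; the table name demo, column family cf, and the localhost ZooKeeper quorum are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost"); // assumed quorum
        // The connection consults ZooKeeper/meta to locate regions,
        // caches the locations, then talks to the RegionServers directly.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo"))) {
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
            table.put(put);

            Result r = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"))));
        }
    }
}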
(2) ZooKeeper
The HMaster and the RegionServers all register with ZooKeeper (ZK), which lets the HMaster perceive RegionServers going online and offline.
Elections guarantee the high availability (HA) of the HMaster.
ZooKeeper stores the location of the RegionServer that holds the META table.
(3) HMaster
Monitors the state of each RegionServer and assigns Regions to them in order to keep the whole cluster load-balanced.
Maintains the cluster's metadata through the HMasterInterface interface and manages users' create, delete, and alter operations on tables.
Region failover: finds failed Regions and assigns them to healthy RegionServers so they can be recovered.
RegionServer failover: the HMaster migrates the failed server's Regions to other RegionServers.
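Table DDL and balancing requests of this kind go through the HMaster. A hedged sketch using the HBase 2.x Admin API; the table and column-family names are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class AdminSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName name = TableName.valueOf("demo");
            if (!admin.tableExists(name)) {
                // Create-table request: served by the HMaster.
                admin.createTable(TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                        .build());
            }
            // Ask the master to run the balancer; it may move regions
            // between RegionServers, after which clients re-look them up.
            admin.balance();
        }
    }
}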
(4) HRegionServer
Main responsibilities:
Handles users' read and write requests and interacts with the underlying HDFS. A Region's reads, writes, and operations such as flush are managed by the RegionServer that currently holds that Region; if the RegionServer has no local HDFS DataNode, the underlying data is read from and written to remote DataNode nodes.
Responsible for splitting Regions that have grown too large.
Responsible for merging (compacting) StoreFiles.
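Flushes, compactions, and splits normally happen automatically inside the RegionServer, but the Admin API can also trigger them by hand. A minimal sketch; the table name demo is illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RegionOpsSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName name = TableName.valueOf("demo");
            admin.flush(name);        // flush MemStores to StoreFiles
            admin.majorCompact(name); // merge StoreFiles together
            admin.split(name);        // request splits of oversized regions
        }
    }
}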
HRegionServer components:
A RegionServer contains multiple Regions and a single HLog.
The HLog is a WAL (Write-Ahead Log), equivalent to the redo log of an RDBMS: a write first appends the data to the HLog. MultiWAL can be configured so that Regions write to multiple WAL streams in parallel over multiple HDFS pipelines (see the configuration sketch below).
The reason one RS shares a single HLog is to reduce disk I/O cost by reducing disk seek time.
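A minimal sketch of enabling the MultiWAL provider mentioned above; in practice this property is set in hbase-site.xml on each RegionServer rather than in code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MultiWalSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Switch the WAL provider from a single shared HLog to
        // multiple parallel WAL pipelines per RegionServer.
        conf.set("hbase.wal.provider", "multiwal");
    }
}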
A Region is a table-level split: a table initially has one Region, and as Regions split, the table's Regions end up distributed across multiple RegionServers.
Within a Region, the data is divided by column family into multiple Stores.
Each Store has one MemStore; read and write requests go to the MemStore first.
Each Store has multiple StoreFiles.
HFile is the actual storage format of the data: it is a binary file, and StoreFile is a wrapper around HFile. At the bottom layer, HBase stores data in the form of key-value pairs. HBase has no data types; HFile stores everything as bytes, arranged in lexicographic (dictionary) byte order.
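A short sketch of the "everything is bytes, sorted lexicographically" point, using HBase's Bytes utility class:

import org.apache.hadoop.hbase.util.Bytes;

public class BytesOrderSketch {
    public static void main(String[] args) {
        byte[] a = Bytes.toBytes("row-10");
        byte[] b = Bytes.toBytes("row-9");
        // Lexicographic byte order: "row-10" < "row-9" because '1' < '9',
        // so numeric row keys should be zero-padded or encoded carefully.
        System.out.println(Bytes.compareTo(a, b)); // negative

        // HBase itself is type-agnostic; the application encodes/decodes.
        byte[] n = Bytes.toBytes(42L);
        System.out.println(Bytes.toLong(n)); // 42
    }
}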