Chapter 11: Big data technology and practice

1. Big data definition: the emphasis is on the "big" in big data. "Big" is relative: it refers to data too large for current mainstream database software tools to capture, store, manage, and analyze within an acceptable time, and from which human-understandable information must be extracted.
2. HDFS replica placement strategy: in an HDFS cluster, the NameNode uses Hadoop rack awareness to determine which rack each DataNode belongs to. A simple but non-optimal strategy is to place every replica of the same block on a different rack. This guarantees data reliability, because if an entire rack crashes the data is still available on the other racks, and it lets reads make full use of the network bandwidth of each rack and balance the load. However, this strategy has the following problems:
Writes are expensive, because a large amount of data must be transferred between racks.
When a local replica fails, restoring it from a remote node takes considerable data transfer time.
Storage nodes are chosen at random, which may lead to an uneven storage load.
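The simple strategy above (one replica per rack) can be sketched in a few lines. This is only an illustration of the idea described in these notes, not HDFS's actual placement code; the node names, rack ids, and both function names are hypothetical.

```python
import random
from collections import defaultdict


def place_replicas_on_distinct_racks(datanodes, replication, rng=None):
    """Pick `replication` DataNodes, each from a different rack.

    `datanodes` maps a DataNode name to its rack id (hypothetical names).
    """
    rng = rng or random.Random()
    by_rack = defaultdict(list)
    for node, rack in datanodes.items():
        by_rack[rack].append(node)
    if replication > len(by_rack):
        raise ValueError("not enough racks for one replica per rack")
    # Choose racks at random, then one random node inside each chosen rack.
    chosen_racks = rng.sample(sorted(by_rack), replication)
    return [rng.choice(by_rack[rack]) for rack in chosen_racks]


def cross_rack_transfers(writer_rack, targets, datanodes):
    """Count replicas that must be written to a rack other than the writer's."""
    return sum(1 for node in targets if datanodes[node] != writer_rack)


cluster = {
    "dn1": "rack-a", "dn2": "rack-a",
    "dn3": "rack-b", "dn4": "rack-b",
    "dn5": "rack-c", "dn6": "rack-c",
}
targets = place_replicas_on_distinct_racks(cluster, 3, random.Random(0))
print(targets, cross_rack_transfers("rack-a", targets, cluster))
```

With three racks and three replicas, two of the three copies always cross a rack boundary, which is the write cost listed above. The random choice of node inside each rack is also where the uneven storage load can come from.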
3. HBase introduction: Apache HBase is a database platform with a column-oriented storage format. It was inspired by the ideas of Google BigTable and runs on Hadoop. HBase is an open source, scalable, distributed data storage system that stores data in multiple versions and is column-oriented.
4. HBase architecture: an HBase cluster generally consists of one HMaster and multiple HRegionServers, with ZooKeeper acting as the synchronization coordinator for the whole cluster.
Clients communicate with the HMaster and the HRegionServers through HBase's RPC mechanism.
ZooKeeper is the synchronization coordinator for the running cluster.
The HMaster is not a single point of failure: HBase can start multiple HMasters, and ZooKeeper ensures that one master is always running. The HRegionServer is the core module of HBase. It is mainly responsible for responding to user I/O requests and for reading and writing data in the HDFS file system. Each HStore corresponds to the storage of one column family in a table.
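As a rough picture of "one store per column family" and of versioned cells, the sketch below models a table in memory. It is not the HBase client API: the class names, the "family:qualifier" string convention used for addressing, and the limit of 3 kept versions are assumptions made only for this example.

```python
import time
from collections import defaultdict


class ColumnFamilyStore:
    """Toy stand-in for one HStore: holds the cells of a single column family."""

    def __init__(self, max_versions=3):  # 3 is an arbitrary illustrative limit
        self.max_versions = max_versions
        # (row_key, qualifier) -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row_key, qualifier, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time_ns()
        versions = self.cells[(row_key, qualifier)]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop the oldest versions

    def get(self, row_key, qualifier, versions=1):
        return self.cells.get((row_key, qualifier), [])[:versions]


class Table:
    """A table is a set of column families, each backed by its own store."""

    def __init__(self, families):
        self.stores = {name: ColumnFamilyStore() for name in families}

    def put(self, row_key, column, value, timestamp=None):
        family, qualifier = column.split(":", 1)
        self.stores[family].put(row_key, qualifier, value, timestamp)

    def get(self, row_key, column, versions=1):
        family, qualifier = column.split(":", 1)
        return self.stores[family].get(row_key, qualifier, versions)


table = Table(["info", "stats"])
table.put("row1", "info:name", "alice", timestamp=1)
table.put("row1", "info:name", "alicia", timestamp=2)
print(table.get("row1", "info:name"))
print(table.get("row1", "info:name", versions=2))
```

Reads return the newest version first, and older versions stay available until the per-cell limit pushes them out.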
5. Cassandra introduction
Cassandra is based on Amazon's fully distributed Dynamo, combined with Google BigTable's column-family data model. Its storage is peer-to-peer and decentralized. It has been used at Twitter and Digg.
Cassandra data model: the column is the lowest-level unit of data. It is a triple of name, value, and timestamp.
Partition strategy: in Cassandra, the token of a key is used to partition the data. Each node has a unique token that indicates the range of data the node is responsible for. With consistent hash partitioning, consistent hashing decides which token, and therefore which node, a key/value pair belongs to.
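The token-based partitioning described above can be sketched as a small consistent-hash ring. This is a minimal illustration, not Cassandra's partitioner: the use of MD5, the 2^128 ring size, the evenly spaced tokens, and the names `token_for` and `TokenRing` are all assumptions for the example.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 128  # MD5 digests are 128 bits wide


def token_for(key: str) -> int:
    """Map a key to a position on the ring (MD5 chosen only for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class TokenRing:
    def __init__(self, node_tokens):
        # node_tokens: node name -> unique token owned by that node
        if len(set(node_tokens.values())) != len(node_tokens):
            raise ValueError("each node needs a unique token")
        self.ring = sorted((tok, node) for node, tok in node_tokens.items())
        self.tokens = [tok for tok, _ in self.ring]

    def node_for_token(self, token: int) -> str:
        """Return the node owning the first token >= the given token."""
        idx = bisect.bisect_left(self.tokens, token)
        if idx == len(self.ring):  # past the largest token: wrap around
            idx = 0
        return self.ring[idx][1]

    def node_for(self, key: str) -> str:
        return self.node_for_token(token_for(key))


ring = TokenRing({
    "node-a": RING_SIZE // 4,
    "node-b": RING_SIZE // 2,
    "node-c": 3 * RING_SIZE // 4,
})
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", ring.node_for(key))
```

Each node owns the range from the previous node's token (exclusive) up to its own token (inclusive). Adding a node with a new token therefore only takes over part of one neighbour's range instead of reshuffling every key.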
6. Redis description
Redis is a distributed key-value NoSQL database system. Its characteristics are high performance and persistent storage, and it suits high-concurrency scenarios. A single Redis string value is limited to 512 MB in current versions (older documentation gave 1 GB).
Redis distribution mode: data on the master is synchronized to the slaves, but data on a slave is not synchronized back to the master. When a slave starts, it connects to the master to synchronize data.
The drawback of this read/write splitting model is that every node, whether master or slave, must hold the complete data set. When the amount of data is very large, the storage capacity of the cluster is therefore limited by the capacity of a single node. Each master can in turn be designed as a group consisting of one master and multiple slaves.
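The one-way master-to-slave synchronization can be illustrated with an in-memory model. This sketch does not talk to a Redis server and does not use any Redis client library; `Master`, `Slave`, and `attach` are hypothetical names, and propagation is done synchronously here for simplicity, whereas real replication is asynchronous.

```python
class Master:
    def __init__(self):
        self.data = {}
        self.slaves = []

    def attach(self, slave):
        # A slave connecting at startup first receives a full copy of the data.
        slave.data = dict(self.data)
        self.slaves.append(slave)

    def set(self, key, value):
        self.data[key] = value
        for slave in self.slaves:  # one-way propagation: master -> slave
            slave.data[key] = value


class Slave:
    def __init__(self, master):
        self.data = {}
        master.attach(self)  # synchronize with the master on startup

    def get(self, key):
        return self.data.get(key)


master = Master()
master.set("a", 1)         # written before the slave exists
replica = Slave(master)    # full sync on startup
master.set("b", 2)         # propagated after the slave is attached
print(replica.get("a"), replica.get("b"))
print(replica.data == master.data)
```

Writes go to the master and reads can be served by any slave, but every node ends up holding the whole data set, which is the capacity limit described above.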