Chapter 11: Big Data Technology and Practice


1. The 4 Vs of big data
1) Volume: the primary characteristic of big data is its sheer volume of data.
2) Variety (data type): the challenge of big data lies not only in its volume but also in the diversity of its data types.
3) Velocity (processing speed): the value of information is time-sensitive; beyond a specific time limit, information loses its usefulness.
4) Value: the business value of big data is high, but its value density is low.

2. Big data storage platforms

HDFS
1) Introduction
HBase is a highly reliable, high-performance, column-oriented, scalable, distributed database with real-time read/write capability, built on Hadoop: it uses HDFS as its file storage system, MapReduce to process its massive data, and ZooKeeper as its distributed coordination service. It is mainly used to store loosely structured and semi-structured data (it is a column-oriented database).
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also significant: HDFS is highly fault-tolerant and designed to be deployed on inexpensive (low-cost) hardware, and it provides high-throughput access to application data, making it well suited to applications with very large data sets. HDFS relaxes some POSIX requirements in order to allow streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch search engine project and is a core part of the Apache Hadoop project.
2) HDFS architecture
HDFS uses a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is the master server: it manages the file system namespace and regulates client access to files. The DataNodes manage the data stored on the nodes of the cluster.
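To make the client/NameNode/DataNode division concrete, here is a minimal sketch using the third-party Python hdfs package (a WebHDFS client); the NameNode address, user, and path are illustrative assumptions, not part of the original text:

```python
from hdfs import InsecureClient  # third-party WebHDFS client: pip install hdfs

# The client asks the NameNode about the namespace; block data itself
# flows to and from the DataNodes.
client = InsecureClient('http://namenode:9870', user='hadoop')  # hypothetical host/user

# Write a file: the NameNode allocates blocks, the DataNodes store them.
client.write('/tmp/sample.txt', data=b'hello hdfs', overwrite=True)

# Read it back as a stream (HDFS favors streaming, sequential access).
with client.read('/tmp/sample.txt') as reader:
    print(reader.read())
```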
3) HDFS replica placement strategy
The replica placement strategy is critical to the reliability and performance of HDFS: it affects data reliability, data availability, and the utilization of network bandwidth.
A strategy that spreads replicas evenly across racks makes full use of the network bandwidth of each rack when files are read and achieves load balance, but it has the following problems:
1) The cost of writing is too high, since large amounts of data must be transferred between different racks.
2) When a local replica is lost, restoring the data from a remote node takes a large amount of data transfer time.
3) Storage nodes are chosen at random, which may leave the data storage load unbalanced.
Therefore, selecting the remote-rack replica placement nodes based on node network distance and data load can balance the data storage load while still achieving good data transfer performance.
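The idea can be illustrated with a small sketch. The heuristic below is a toy, not HDFS's actual BlockPlacementPolicy; the distance values and load weighting are assumptions chosen for illustration:

```python
# Toy placement heuristic: score each candidate DataNode by network
# distance from the writer plus its current storage load, pick the best.
def network_distance(rack_a: str, rack_b: str) -> int:
    # Simplified tree topology: same rack = 2 hops, cross-rack = 4 hops.
    return 2 if rack_a == rack_b else 4

def choose_replica_node(writer_rack: str, candidates: list) -> dict:
    # Each candidate: {'name': ..., 'rack': ..., 'load': used fraction 0..1}
    def score(node: dict) -> float:
        return network_distance(writer_rack, node['rack']) + 10 * node['load']
    return min(candidates, key=score)

nodes = [
    {'name': 'dn1', 'rack': 'r1', 'load': 0.90},
    {'name': 'dn2', 'rack': 'r2', 'load': 0.20},
    {'name': 'dn3', 'rack': 'r1', 'load': 0.35},
]
print(choose_replica_node('r1', nodes)['name'])  # dn3: nearby and lightly loaded
```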

HBase
1) Introduction
HBase is a distributed, column-oriented, open-source database. The technology originates from the Google paper written by Fay Chang, "Bigtable: A Distributed Storage System for Structured Data." Just as Bigtable builds on the distributed data storage provided by the Google File System (GFS), HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of Apache Hadoop. Unlike a typical relational database, it is suited to storing unstructured data, and its model is column-based rather than row-based.
2) Features
HBase's features include:
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables.
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.
An easy-to-use Java API for client access.
Block caching and Bloom filters for real-time queries.
A Thrift gateway and a RESTful Web service supporting XML, Protobuf, and binary encodings.
An extensible JRuby-based (JIRB) shell.
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
3) Architecture
Physically, HBase is a master/slave structure made up of three types of server. Region Servers serve data for reads and writes; when accessing data, clients communicate directly with Region Servers. Region assignment and DDL operations (creating and deleting tables) are handled by the HBase Master process. ZooKeeper maintains the live cluster state.
The Hadoop DataNodes store the data that the Region Servers manage. All HBase data is stored in HDFS files. Region Servers are collocated with the HDFS DataNodes, which gives the RegionServers data locality (the data is close to where it is needed). HBase data is local when it is written, but when a region is moved it is not local again until a compaction.
The NameNode maintains metadata information about all the physical data blocks that make up the files.
4) Data model
Cell
(1) A cell is determined by the intersection of a row and a column.
(2) Cells are versioned.
(3) The content of a cell is an uninterpreted byte array.
(4) A cell is the unit uniquely determined by {row key, column (=<Family> + <Qualifier>), version}.
(5) There are no data types inside a cell; everything is stored as bytes.
RowKey
(1) The row key identifies a row of data; data is retrieved by row key, which acts like a primary index.
(2) Data is stored sorted by row key in lexicographic (dictionary) order.
(3) A row key can hold up to 64 KB of data, but it should be kept as short as possible.
Column Family
(1) Every column in an HBase table belongs to a column family, and column families must be declared in advance as part of the table schema. Column names are prefixed with their column family; each column family can contain multiple columns, and new members (columns) can be added dynamically, on demand.
(2) Access control, disk storage, and tuning are all performed at the column-family level.
(3) Data in the same column family is stored in the same directory, kept in several files.
Timestamp
(1) Each storage cell in HBase can hold multiple versions of the same data; versions are distinguished by timestamp, and the different versions of a piece of data are sorted in reverse chronological order, so the newest version comes first.
(2) The timestamp is a 64-bit integer. It is usually assigned by HBase automatically when data is written, in which case it is the current system time accurate to the millisecond.
Timestamps can also be assigned explicitly by the client; if the application needs to avoid version conflicts, it must generate unique timestamps itself.
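A short sketch can tie the data model together. It uses the third-party happybase package, which talks to HBase through its Thrift gateway; the host, table name 'user', and column family 'info' are illustrative assumptions:

```python
import happybase  # client for HBase's Thrift gateway: pip install happybase

# Hypothetical Thrift server; assume a table 'user' with column family 'info'.
conn = happybase.Connection('hbase-thrift-host', port=9090)
table = conn.table('user')

# A cell is addressed by {row key, family:qualifier, timestamp};
# HBase attaches no types, so values are raw bytes.
table.put(b'row-001', {b'info:name': b'alice', b'info:city': b'beijing'})

# A read returns the latest version of each cell by default.
print(table.row(b'row-001', columns=[b'info:name']))

# Older versions (up to the family's VERSIONS setting) remain retrievable,
# newest first.
for value, ts in table.cells(b'row-001', b'info:name',
                             versions=3, include_timestamp=True):
    print(ts, value)
```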

Cassandra
1) Introduction
Cassandra is an open-source distributed NoSQL database system. It was originally developed by Facebook to store inbox data and other simple-format data, combining Google Bigtable's data model with Amazon Dynamo's fully distributed architecture. Facebook open-sourced Cassandra in 2008, and since then, thanks to its good scalability, it has been adopted by well-known Web 2.0 sites such as Digg and Twitter, becoming a popular distributed structured data storage solution.
Cassandra is a hybrid non-relational database, similar to Google's Bigtable. Its functionality is richer than Dynamo (a distributed key-value storage system), but it does not support document storage like MongoDB (an open-source product positioned between relational and non-relational databases; among non-relational databases it is the most feature-rich and the most like a relational database, supporting very loose data structures in a JSON-like BSON format that can store fairly complex data types). Cassandra was originally developed by Facebook and later became an open-source project. It is an ideal database for social cloud computing: on top of Amazon's fully distributed Dynamo, it combines Google Bigtable's column family data model. With its peer-to-peer, decentralized storage, it can in many ways be regarded as Dynamo 2.0.

2) Data model
Cassandra adopts a data model similar to HBase's, with the same column and column family mechanisms, while adding its own super columns and super column families.
The column is the smallest unit of stored data: it is a triple of a name, a value, and a timestamp.
The difference between a super column and a column is that a column's value is a byte array, whereas a super column's value contains multiple columns. A super column has no timestamp of its own, and each column inside a super column can have a different timestamp.
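As a rough sketch of these structures (plain Python, purely illustrative; Cassandra's real storage engine is far more involved):

```python
import time
from dataclasses import dataclass, field
from typing import Dict

# Toy model of Cassandra's storage units, for illustration only.
@dataclass
class Column:
    name: bytes
    value: bytes                                          # a plain byte array
    timestamp: float = field(default_factory=time.time)  # (name, value, ts) triple

@dataclass
class SuperColumn:
    name: bytes
    columns: Dict[bytes, Column]  # its value holds columns; no timestamp of its own

msg = SuperColumn(b'msg-42', {
    b'from': Column(b'from', b'alice'),
    b'body': Column(b'body', b'hello'),  # each inner column keeps its own timestamp
})
print(msg.columns[b'from'].value, msg.columns[b'from'].timestamp)
```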
3) Partition strategy
A token is used to partition the data; each node has a unique token that identifies the range of data values that the node is responsible for.
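A toy token ring in Python can illustrate the idea; the hash function and the token values below are assumptions for illustration, not Cassandra's exact partitioner:

```python
import hashlib
from bisect import bisect_left

# Toy token ring: each node owns the keys whose hashed token falls at or
# below its token (and above its predecessor's), wrapping around the ring.
nodes = {'node-a': 2**40, 'node-b': 2**80, 'node-c': 2**120}
ring = sorted((token, name) for name, token in nodes.items())
tokens = [t for t, _ in ring]

def owner(key: str) -> str:
    # Hash the key onto the same 128-bit space as the tokens.
    token = int.from_bytes(hashlib.md5(key.encode()).digest(), 'big')
    idx = bisect_left(tokens, token) % len(ring)  # wrap around the ring
    return ring[idx][1]

for key in ('user:1', 'user:2', 'user:3'):
    print(key, '->', owner(key))
```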

Redis
1) Introduction
Redis is one of the most popular NoSQL implementations. It is an open-source key-value storage system written in ANSI C (unlike MySQL, which stores data in two-dimensional tables). It is similar to Memcache but makes up for many of Memcache's shortcomings. Like Memcache, Redis caches its data in memory; the difference is that Memcache can only cache data in memory and cannot automatically write it to disk periodically, which means that after a power failure or restart the memory is cleared and the data is lost. Memcache is therefore only suited to caching data that does not need persistence. Redis, by contrast, periodically writes updated data to disk or appends modification operations to a log file, thereby achieving data persistence.
Features:
Redis reads at roughly 100,000 operations per second and writes at roughly 81,000 operations per second.
Atomicity: every Redis operation is atomic, and Redis also supports executing several operations together atomically (all or nothing).
Support for multiple data structures: string, list, hash, set, and zset (sorted set).
Persistence and master-slave replication (clustering).
Support for key expiration, transactions, and message subscription (publish/subscribe).
Windows is not officially supported, though third-party builds exist.
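A brief sketch with the redis-py client shows several of these features together; it assumes a Redis server on localhost, and the key names are illustrative:

```python
import redis  # pip install redis

r = redis.Redis(host='localhost', port=6379)  # assumes a local Redis server

# Several of the built-in structures: string, list, hash, sorted set.
r.set('greeting', 'hello')
r.rpush('tasks', 'a', 'b', 'c')
r.hset('user:1', mapping={'name': 'alice', 'city': 'beijing'})
r.zadd('scores', {'alice': 95, 'bob': 88})

# Expiration: this key disappears automatically after 60 seconds.
r.set('session:abc', 'token', ex=60)

# MULTI/EXEC transaction: the queued commands execute atomically.
with r.pipeline(transaction=True) as pipe:
    pipe.incr('counter')
    pipe.lpush('log', 'incremented')
    pipe.execute()

print(r.get('greeting'), r.zscore('scores', 'alice'))
```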

Mongo
1) Introduction
MongoDB is a database based on distributed file storage. It is written in C++ and aims to provide scalable, high-performance data storage solutions for web applications.
MongoDB is a product positioned between relational and non-relational databases; among non-relational databases it is the most feature-rich and the most like a relational database. The data structures it supports are very loose, in a JSON-like BSON format, so it can store fairly complex data types. MongoDB's biggest feature is its very powerful query language: its syntax is somewhat similar to object-oriented query languages, it can implement almost all of the single-table query functionality of a relational database, and it also supports indexing of data.
Features:
Collection-oriented storage, easy for storing object-typed data.
Schema-free.
Support for dynamic queries.
Full index support, including on fields of embedded documents.
Support for queries.
Support for replication and failover.
Efficient binary data storage, including large objects (such as video).
Automatic sharding to support cloud-level scalability.
Support for many languages, including Golang, Ruby, Python, Java, C++, PHP, and C#.
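A short sketch with the pymongo driver illustrates the schema-free documents, embedded-document indexing, and dynamic queries described above; the database and collection names are illustrative, and a local mongod is assumed:

```python
from pymongo import MongoClient, ASCENDING  # pip install pymongo

client = MongoClient('mongodb://localhost:27017')  # assumes a local mongod
users = client['demo']['users']  # database and collection are created lazily

# Schema-free: documents in one collection may have different shapes.
users.insert_one({'name': 'alice', 'address': {'city': 'beijing'}, 'tags': ['a', 'b']})
users.insert_one({'name': 'bob', 'age': 30})

# Full index support, including fields of embedded documents.
users.create_index([('address.city', ASCENDING)])

# Dynamic query: filters are built as plain dictionaries at runtime.
for doc in users.find({'address.city': 'beijing'}, {'_id': 0, 'name': 1}):
    print(doc)
```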