Chapter 11 Big Data Technology and Practice


The core of the Hadoop framework is MapReduce and HDFS. The idea of MapReduce comes from a widely circulated Google paper; explained in one sentence, MapReduce is "task decomposition and result aggregation". HDFS is the Hadoop Distributed File System, which provides the underlying support for distributed computing and storage. MapReduce can be roughly understood from its name alone, through the two verbs Map and Reduce: "Map" decomposes a job into multiple tasks, and "Reduce" aggregates the results of those decomposed tasks to arrive at the final analysis. This is not a new idea; its shadow can already be seen in the earlier discussion of multithreading and multitask design. Whether in the real world or in programming, a job can usually be broken into multiple tasks, and the relationships between tasks fall into two kinds: one kind is unrelated tasks, which can be executed in parallel; the other kind is tasks with mutual dependencies, whose order cannot be reversed and which cannot be processed in parallel. Back in college, when the professor had the class analyze critical paths, the point was the same: to find the most direct way to execute a decomposed job. In a distributed system, the machine cluster can be regarded as a hardware resource pool; splitting off the parallel tasks and handing them to idle machine resources greatly improves computational efficiency, and at the same time this independence of resources provides the best design for scaling out the compute cluster.
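To make "task decomposition and result aggregation" concrete, here is a minimal, self-contained Python sketch of the idea (the word-count job, function names, and three-process pool are illustrative assumptions, not Hadoop APIs): independent sub-tasks are processed in parallel by a pool of workers, and their partial results are then merged into a final summary.

```python
# Sketch of "task decomposition and result aggregation":
# independent sub-tasks run in parallel, then partial results are merged.
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    """Map step: count words in one independent chunk of text."""
    return Counter(chunk.split())

def merge_counts(partials):
    """Reduce step: aggregate the partial results into a final summary."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    documents = [
        "big data needs distributed computing",
        "distributed computing needs big clusters",
        "clusters store big data",
    ]
    # The worker pool plays the role of idle machines in the resource pool.
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, documents)
    print(merge_counts(partial_counts).most_common(3))
```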
Batch computing: for batch processing of large-scale data, MapReduce can execute data-processing tasks in parallel on a large scale. It is a parallel computing framework for large data sets (a single-input, two-phase, coarse-grained data-parallel distributed framework): it abstracts the complex parallel computation running on a large cluster into the two functions Map and Reduce, cuts a large data set into many small data sets, and distributes them to different machines for parallel processing, which greatly simplifies distributed programming. In MapReduce, data flows from a stable source, through a series of processing steps, into a stable file system (HDFS).
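The two-phase model above is often written as a separate mapper and reducer. The following is a minimal sketch in the Hadoop Streaming style, where each script reads from standard input and writes to standard output (the file names mapper.py and reducer.py and the word-count example are assumptions; no job or cluster configuration is shown):

```python
# mapper.py -- Map phase: emit one (word, 1) pair per input word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Reduce phase: Hadoop Streaming delivers mapper output sorted
# by key, so equal words arrive consecutively and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```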
Spark is a low-latency distributed computing system for clusters working on large data sets. It provides distributed in-memory data sets and supports interactive queries and optimized iterative workloads. Whereas MapReduce streams data from a stable source through a series of processing steps into a stable file system (HDFS), Spark uses memory instead of HDFS or local disk to store intermediate results, and is therefore much faster.
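As a rough illustration of in-memory intermediate results, here is a minimal PySpark word-count sketch (it assumes a local Spark installation with the pyspark package; the application name and sample data are made up). The cached RDD `words` is reused from memory instead of being recomputed from the source or written out to HDFS.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")
lines = sc.parallelize([
    "spark keeps intermediate data in memory",
    "mapreduce writes intermediate data to hdfs",
])
# Intermediate result kept in memory for reuse across actions.
words = lines.flatMap(lambda l: l.split()).cache()
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())
print(words.count())   # reuses the cached RDD instead of recomputing it
sc.stop()
```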
Stream computing: stream data (or a data stream) is a series of dynamic data that is unbounded in its time distribution and quantity; the value of the data decreases as time passes, so it must be computed in real time to give second-level responses. Industry offers many stream-computing frameworks and platforms: the first category, commercial-grade stream-computing platforms (IBM InfoSphere Streams, IBM StreamBase, etc.); the second category, open-source stream-computing frameworks (Twitter Storm, S4, etc.); the third category, stream-computing frameworks that companies develop to support their own business.
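The following pure-Python sketch illustrates the stream-computing idea without using any of the frameworks above (the window length, event source, and function name are assumptions): records arrive continuously, data outside a short sliding window is discarded because its value has decayed, and each query is answered with low latency.

```python
import time
from collections import deque

WINDOW_SECONDS = 5
window = deque()   # (timestamp, value) pairs inside the current window

def on_event(value, now=None):
    """Ingest one record and return the count over the last WINDOW_SECONDS."""
    now = now or time.time()
    window.append((now, value))
    while window and now - window[0][0] > WINDOW_SECONDS:
        window.popleft()          # expired data has lost its value
    return len(window)

# Simulated unbounded source: in a real platform this loop never ends.
for i in range(10):
    print("events in window:", on_event(i))
    time.sleep(0.5)
```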
Graph computing: for example Pregel, Giraph, GraphX, PowerGraph.
Query analysis computing: storage management and query analysis of large-scale data, which must provide real-time or near-real-time responses, for example Dremel, Impala.
Key technologies of cloud computing: virtualization, distributed storage, distributed computing, multi-tenancy (data isolation, customized configuration, architecture extension, performance customization), etc.
The Internet of Things: the technical architecture of the Internet of Things consists of the perception layer, the network layer, the processing layer, and the application layer.
Key technologies of the Internet of Things: recognition and perception technology, network and communication technology, data mining and fusion technology.
Relations among big data, cloud computing, and the Internet of Things: 1. Differences: big data focuses on the storage, processing, and analysis of massive data, discovering the value in the data and serving production and life; the essence of cloud computing is to integrate and optimize various IT resources and provide them to users as low-cost services over the network; the development goal of the Internet of Things is to connect things with one another, and innovation is the core of its development.
2. Connections: the three permeate and complement one another.