A summary of three important Google papers

Time:10-12


The first paper is on Bigtable, a distributed storage system for structured data. It is designed to handle massive amounts of data, and many Google projects use it. Although their requirements differ widely, Bigtable still provides a flexible, high-performance solution for all of them.
Bigtable keeps rows in lexicographic order by row key. A tablet, which holds a contiguous range of rows, is the smallest unit of data distribution and load balancing. As a result, reads that touch only a short range of rows are very efficient and usually need communication with only a few machines. Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in one column family is usually of the same type, and a table should not have too many column families. Access control and disk and memory accounting are all performed at the column-family level. Through the API, clients can write or delete values in Bigtable, look up values from individual rows, or iterate over a subset of the data in a table.
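The data model described above can be illustrated with a small in-memory toy. This is only a sketch under stated assumptions, not Bigtable's actual API: the class `ToyTable`, its method names, and the sample row keys are hypothetical, and column names are assumed to take the form `family:qualifier`.

```python
import bisect


class ToyTable:
    """Toy in-memory table (hypothetical, not the Bigtable API).

    Rows are kept sorted by row key; columns are named 'family:qualifier'.
    """

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.row_keys = []             # row keys in lexicographic order
        self.rows = {}                 # row key -> {column name -> value}

    def write(self, row_key, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self.rows:
            bisect.insort(self.row_keys, row_key)
            self.rows[row_key] = {}
        self.rows[row_key][column] = value

    def delete(self, row_key, column):
        self.rows.get(row_key, {}).pop(column, None)

    def read_row(self, row_key):
        return dict(self.rows.get(row_key, {}))

    def scan(self, start_key, end_key):
        """Yield (row_key, columns) for start_key <= row_key < end_key, in key order."""
        lo = bisect.bisect_left(self.row_keys, start_key)
        hi = bisect.bisect_left(self.row_keys, end_key)
        for key in self.row_keys[lo:hi]:
            yield key, dict(self.rows[key])

    def split_into_tablets(self, rows_per_tablet):
        """Cut the sorted key space into contiguous row ranges ("tablets")."""
        return [self.row_keys[i:i + rows_per_tablet]
                for i in range(0, len(self.row_keys), rows_per_tablet)]


# Made-up sample data for illustration.
table = ToyTable(families=["contents", "anchor"])
table.write("com.cnn.www", "contents:", "<html>...")
table.write("com.cnn.www", "anchor:cnnsi.com", "CNN")
table.write("com.example.www", "contents:", "<html>...")
table.write("org.apache.www", "contents:", "<html>...")

tablets = table.split_into_tablets(rows_per_tablet=2)
com_rows = [key for key, _ in table.scan("com.", "com/")]
```

Because the keys are sorted, a scan over a short key range only touches neighbouring rows, which in the toy all fall into a single "tablet". That is the property that lets a real deployment answer such reads from a small number of machines.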
A master server tracks which tablet server holds each tablet and balances the load across the tablet servers. To improve read performance, tablet servers use a two-level caching strategy: the Scan Cache is the first level and the Block Cache is the second level.
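A two-level read path of this kind can be sketched as follows. In the Bigtable paper, the Scan Cache holds key-value pairs and the Block Cache holds blocks read from the underlying files. Everything else here is an assumption made for illustration: the LRU eviction policy, the capacities, and the names `LRUCache` and `ToyTabletReader` are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache (eviction policy is an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


class ToyTabletReader:
    """Hypothetical reader: scan cache first, then block cache, then storage."""

    def __init__(self, blocks, scan_capacity=2, block_capacity=1):
        self.blocks = blocks  # list of dicts, standing in for on-disk blocks
        self.index = {key: i for i, block in enumerate(blocks) for key in block}
        self.scan_cache = LRUCache(scan_capacity)    # caches key -> value
        self.block_cache = LRUCache(block_capacity)  # caches block id -> block
        self.disk_reads = 0

    def read(self, key):
        value = self.scan_cache.get(key)
        if value is not None:
            return value                  # first-level hit
        block_id = self.index[key]
        block = self.block_cache.get(block_id)
        if block is None:                 # second-level miss: go to storage
            self.disk_reads += 1
            block = self.blocks[block_id]
            self.block_cache.put(block_id, block)
        value = block[key]
        self.scan_cache.put(key, value)
        return value


reader = ToyTabletReader([{"a": "1", "b": "2"}, {"c": "3"}])
for key in ["a", "a", "b", "c"]:
    reader.read(key)
```

In this toy run, the repeated read of `a` is served by the scan cache and the read of `b` by the block cache, so only the first touch of each block goes to storage. The first level helps workloads that reread the same data, and the second helps workloads that read data close to what was read recently.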
Personalized Search uses Bigtable to store each user's data. Each user has a unique id and is assigned a row named by that id, and a separate column family is reserved for each type of user action. Personalized Search runs a MapReduce job over the data stored in Bigtable to generate user profiles, which are then used to personalize live search results.
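As a rough illustration of this layout, the sketch below keys each row by user id, uses one column family per action type, and derives a simple profile per user. The family names (`query`, `click`), the timestamps, the values, and the `build_profile` function are all made up for the example; the real profile generation runs as a MapReduce job and is not described in this summary.

```python
from collections import Counter

# Row key = user id; column name = "<action family>:<qualifier>".
# All names and values are hypothetical sample data.
user_actions = {
    "user-001": {
        "query:2024-01-01T10:00": "bigtable paper",
        "query:2024-01-02T09:30": "gfs chunk size",
        "click:2024-01-02T09:31": "example.com",
    },
    "user-002": {
        "query:2024-01-03T12:00": "mapreduce tutorial",
    },
}


def build_profile(columns):
    """Summarize one user's row: actions per column family plus frequent query terms."""
    actions_per_family = Counter(name.split(":", 1)[0] for name in columns)
    query_terms = Counter(
        term
        for name, value in columns.items()
        if name.startswith("query:")
        for term in value.split()
    )
    return {
        "actions": dict(actions_per_family),
        "top_terms": query_terms.most_common(3),
    }


profiles = {user_id: build_profile(columns)
            for user_id, columns in user_actions.items()}
```

Keeping each action type in its own column family means a job that only needs, say, queries can read that family without touching the others.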
The authors report several lessons from designing, implementing, maintaining and supporting Bigtable. First, large distributed systems are vulnerable to many types of failure, and the team had to deal with these problems by changing various protocols. Second, it is important to delay adding a new feature until it is clear how that feature will be used. Third, system-level monitoring is very important for Bigtable; for example, the team extended its RPC system, which allowed it to detect and fix many problems. The most important lesson of all is the value of simple design: simple design and code bring huge benefits for maintenance and debugging.
The second paper introduces MapReduce, a programming model and an associated implementation for processing and generating large data sets. MapReduce programs can run in parallel on a large number of commodity machines. MapReduce exists because many problems appear once the input data becomes huge, and with it the programmer no longer has to care about these complex problems: the library hides the technically difficult details of parallelization, fault tolerance, data locality optimization and load balancing, which makes it easy to use. It can be used for sorting, data mining, machine learning and many other systems.
The authors learned several things from developing MapReduce. First, restricting the programming model makes it easy to parallelize and distribute computations, and also easy to make them fault-tolerant. Second, network bandwidth is a scarce resource, so the system is designed to save it. Third, the system handles machine failures and the data loss they cause.
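The programming model can be shown with the classic word-count example. This is a single-process sketch only: the helper `run_mapreduce` is a hypothetical stand-in for the library, and it does none of the parallelization, fault tolerance or locality optimization that the real system provides. The user writes only `map_fn` and `reduce_fn`.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: merge all counts emitted for the same word."""
    return sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical sequential driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # the "shuffle" step
    return {key: reduce_fn(key, values)
            for key, values in sorted(groups.items())}


# Made-up input documents.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

Because `map_fn` sees one input record at a time and `reduce_fn` sees one key at a time, a real implementation is free to run many map and reduce calls on different machines and to rerun any of them after a failure. That is the sense in which the restricted model makes parallel and fault-tolerant execution easy.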
The third paper is on GFS. In GFS, the authors re-examined the design trade-offs of traditional file systems and arrived at a radically different design, one that fully meets their storage needs. While building and deploying GFS they ran into all sorts of problems, some technical and some operational, but found a solution for each.