Some thoughts triggered by Hadoop

Time: 10-04

Hadoop is the widely accepted mainstream architecture. At the hardware level, its design embodies two important philosophical ideas:
1) Share nothing: every hardware node is completely independent, with its CPU, memory, and local disks entirely private, in pursuit of maximum efficiency on each node.
2) Data exchange through the distributed file system: data sharing is realized indirectly via the distributed file system (a node that needs a remote data block reaches it through the DataNode holding that block; see the sketch after this list).
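A minimal sketch of what that indirect access looks like from the client side, using the standard Hadoop FileSystem API (shown here from Scala); the namenode address and file path are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:9000") // placeholder address

    // Opening a file: the client asks the NameNode for block locations,
    // then streams each block from a DataNode that holds a replica.
    // Remote data is always reached through the file system, never by
    // touching another node's disks directly (share nothing).
    val fs = FileSystem.get(conf)
    val in = fs.open(new Path("/data/input.txt")) // placeholder path
    try {
      val buf = new Array[Byte](4096)
      var n = in.read(buf)
      while (n > 0) {
        // process buf(0 until n) here
        n = in.read(buf)
      }
    } finally in.close()
  }
}
```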

The system works well, and its biggest advantage is scalability: it can be extended almost without limit.

But few posts on this forum discuss big-data analysis frameworks other than Hadoop. Is it the only architectural choice for large-scale data analysis? And is unlimited scalability really the most important goal of data analysis?

The first question to ask: is your data really that big? Does it really take tens of thousands of disk-packed servers to hold it?

The second question: if your analysis refines the data step by step over multiple iterations, does the data shrink greatly with each iteration?

And a further question: does your big data have to be processed in real time? Is the data exchanged between processing nodes during analysis small, or is a great deal swapped back and forth?

An open topic; everyone is welcome to come and discuss.

CodePudding user response:

Well, in the field of big-data analysis it definitely has a big advantage ~

CodePudding user response:

Statistics from Microsoft and Yahoo show that the average input to their Hadoop clusters is only 14 GB, and 90% of Facebook's Hadoop tasks are under 100 GB.
For many big-data analysis tasks the data fits in the memory of a single server; a framework that used the combined memory of all the servers would probably cover most applications.
That is why in-memory big-data analysis platforms have become the new hot topic. UC Berkeley's Spark system, for example, is optimized for exactly this kind of analysis; see the sketch below.
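A minimal Spark sketch of that in-memory style, using the classic RDD API; the input path is a placeholder, and local mode is used only for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InMemorySketch {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration; on a cluster the cached partitions
    // would live in the combined memory of all executors.
    val sc = new SparkContext(
      new SparkConf().setAppName("in-memory-analysis").setMaster("local[*]"))

    // Load once and pin the working set in memory; subsequent actions
    // reuse the cached partitions instead of rereading from disk.
    val events = sc.textFile("hdfs:///data/events").cache() // placeholder path

    println(s"total lines: ${events.count()}") // first action materializes the cache
    println(s"error lines: ${events.filter(_.contains("ERROR")).count()}") // served from memory

    sc.stop()
  }
}
```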

CodePudding user response:

I agree with the original poster: big-data analysis has great value, no doubt about it. But in today's mobile Internet era, with all kinds of application scenarios and new business models emerging endlessly, it is hard for any one analysis framework to take a winner-take-all position. That is why Spark, which suits real-time and iterative analysis, is now so hot.

CodePudding user response:

Let me share my understanding, which is still relatively shallow, so let's discuss it together. I think Hadoop's greatest value is that it provides a highly scalable distributed file storage system. Once data reaches the PB level, designing a storage structure with acceptable performance on your own is almost impossible. Without Hadoop or a similar solution you would have to design such a storage structure yourself; it would cost a great deal of time, and your experience and knowledge would limit whether you could really pull the design off. You could store the data in an existing relational database, but designing a distributed storage architecture for your own project is no easier than designing your own relational database.

I would also like to discuss the few questions the original poster asked:

The first question to ask: is your data really that big? Does it really take tens of thousands of disk-packed servers to hold it?
I think there is another issue to consider: a handful or a dozen disk-packed Hadoop servers should deliver the same performance much more cheaply than traditional-architecture equipment, so applications at the scale of a handful to a dozen machines may already be worth putting on Hadoop; nothing as extreme as tens of thousands of servers is needed. But there is another cost to weigh as well: development cost. Developing applications on Hadoop is hard.

The second question: if your analysis refines the data step by step over multiple iterations, does the data shrink greatly with each iteration?
The efficiency of MapReduce is a real problem here, but I believe reasonable design of the storage structures and algorithms can improve it; see the sketch after this reply.
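Purely to illustrate the contrast (a hypothetical workload, not anyone's real job): in classic MapReduce each pass of an iterative analysis is a separate job that writes its output back to HDFS, whereas a Spark-style loop can keep the shrinking candidate set cached in memory between passes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-refinement").setMaster("local[*]"))

    // Hypothetical scoring data: (id, score) pairs.
    var candidates = sc.parallelize(1 to 1000000).map(i => (i, i % 1000)).cache()

    // Each pass keeps only candidates above a rising threshold, so the
    // working set shrinks every iteration and stays in memory; a chain of
    // MapReduce jobs would pay a full HDFS write/read between passes.
    for (threshold <- Seq(100, 500, 900)) {
      candidates = candidates.filter { case (_, score) => score > threshold }.cache()
      println(s"threshold $threshold -> ${candidates.count()} candidates remain")
    }

    sc.stop()
  }
}
```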

A further question: does your big data have to be processed in real time?
Real-time data is indeed a problem too, so in a large system the data should perhaps be split between a relational database and a system such as Hadoop, with the two cooperating.

Is the data exchanged between processing nodes during analysis small, or is a great deal swapped back and forth?
This too can be reduced through reasonable design, and the data structures and algorithms can be adjusted continually to improve on such problems.

CodePudding user response:

I think in the end it comes down to changes and updates in rack-level hardware architecture, and to solutions that couple computation with storage ~~~

CodePudding user response:

I feel that any architecture has to weigh many considerations; perhaps the questions the original poster asks were all considered, and that is why Hadoop turned out the way it did.

CodePudding user response:

Hadoop as an architecture has been very successful, but there is still plenty of room for improvement; no design can consider everything, and it is still evolving. For example, Hadoop V2 has come out and brought a lot of flexibility, and architectures such as Spark have been integrated on top of the Hadoop framework.
So whether or not development/improvement is the theme, while we use it we can keep thinking about how to improve it; that is the interesting part.

CodePudding user response:

Hadoop, as a mainstream software architecture, does have its rationality and practical considerations, and CMIC, your insights contain many practical considerations. Let me try to summarize:
1) Using Hadoop's ready-made storage architecture (HDFS), or another mature database or file-system architecture, for development and application is a realistic choice.
2) Even on a small cluster, Hadoop offers a feasible parallel-processing architecture, and it is cheaper than traditional standalone equipment such as high-performance servers.
3) MapReduce's inefficient data exchange creates possible performance problems for iterative and real-time workloads on Hadoop, but these can be compensated for by software optimizations such as adjusting data structures and data layout, enough to make it practical.
CMIC feels like an experienced software architect. From a software architect's perspective, is hardware sacred and unchangeable? I am very curious: in the world of software engineers, are the OS and platform taken as given while only the application and algorithm are variable, so that when problems occur one skips past hardware/OS/Hadoop and focuses on the application running on top in an attempt to improve performance? If one thinks outside that framework, would there be a totally different answer?

CodePudding user response:

refer to the 6th floor, coolbamboo2008's response:
I feel that any architecture has to weigh many considerations; perhaps the questions the original poster asks were all considered, and that is why Hadoop turned out the way it did.


May I partly disagree with "whatever exists must be reasonable"? There is no denying that in its early days and over the past few years of development Hadoop has delivered, even fully realized, the value of its design. But in today's rapidly changing environment its disadvantages are increasingly prominent. The original poster agrees with cdb81's view that models must evolve and update to cope with change; the Spark that cdb81 mentioned, now so hot, is exactly such an example.