When it comes to big data, Hadoop and Apache Spark are probably familiar names, but our understanding of them often stops at the literal level without much deeper thought. So let's take a look at what they have in common and where they differ.
They solve problems at different levels
First of all, Hadoop and Apache Spark are both big data frameworks, but they exist for different purposes. Hadoop is essentially a distributed data infrastructure: it distributes huge data sets across multiple nodes in a cluster of ordinary commodity machines for storage, which means you do not need to buy and maintain expensive server hardware.
Hadoop also indexes and tracks that data, raising the efficiency of big data processing and analysis to unprecedented heights. Spark, by contrast, is a tool dedicated to processing data that is already held in distributed storage; it does not do distributed storage itself.
They can be combined or used separately
Besides the distributed data storage everyone associates with Hadoop, it also provides a data processing component called MapReduce. So we can set Spark aside entirely and use Hadoop's own MapReduce to process the data.
Here is the simplest explanation of MapReduce I have found online:
Suppose we need to count all the books in a library. You count shelf 1, I count shelf 2. That is the "Map". The more of us there are, the faster the counting goes.
Then we get together and add up everyone's counts. That is the "Reduce".
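To make the analogy concrete, here is a minimal sketch in plain Python; the shelf names and book counts are made up. Each shelf is tallied independently (Map), then the partial tallies are combined into one total (Reduce).

```python
from functools import reduce

# Hypothetical shelf contents: each shelf can be tallied independently.
shelves = {
    "shelf_1": ["book"] * 120,
    "shelf_2": ["book"] * 95,
    "shelf_3": ["book"] * 143,
}

# Map: each counter tallies one shelf (on a real cluster, in parallel).
partial_counts = [len(books) for books in shelves.values()]

# Reduce: combine the partial tallies into a single total.
total = reduce(lambda a, b: a + b, partial_counts)
print(total)  # 358
```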
Conversely, Spark does not depend on Hadoop to survive either. But as mentioned above, it provides no file management system of its own, so it has to be integrated with some distributed file system to operate. That can be Hadoop's HDFS, or another cloud-based data platform. By default, though, Spark is still run on top of Hadoop; bluntly put, most people consider the combination the best fit.
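As a sketch of what that pairing looks like in practice, the following PySpark snippet reads a text file from HDFS. The namenode address and file path are hypothetical, and pointing Spark at a different storage system is mostly a matter of changing the URI.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()

# Spark stores nothing itself; it reads from whatever file system the
# URI points at. Host, port, and path here are hypothetical.
lines = spark.sparkContext.textFile("hdfs://namenode:9000/data/input.txt")
print(lines.count())

spark.stop()
```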
Spark processes data far faster than MapReduce
Spark is much faster than MapReduce because of the different way it handles data. MapReduce processes data step by step: "read data from the cluster, process it once, write the results to the cluster; read the updated data from the cluster, run the next processing step, write the results to the cluster; and so on," as Booz Allen Hamilton data scientist Kirk Borne explains it.
Spark, in contrast, completes the entire data analysis in memory in close to "real time": "read data from the cluster, perform all the necessary analysis, write the results back to the cluster, done," Borne says. Spark's batch processing is nearly 10 times faster than MapReduce, and its in-memory analytics are nearly 100 times faster.
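The difference is easy to see in code. In this minimal PySpark sketch, the dataset is loaded once, cached in memory, and then queried repeatedly, avoiding the read-process-write round trip to storage that each step of a MapReduce chain would require.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-sketch").getOrCreate()
sc = spark.sparkContext

# Keep the dataset in memory across passes instead of re-reading it
# from storage at every step, as a chain of MapReduce jobs would.
numbers = sc.parallelize(range(1_000_000)).cache()

total = numbers.sum()                                 # pass 1
evens = numbers.filter(lambda n: n % 2 == 0).count()  # pass 2
print(total, evens)

spark.stop()
```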
If the data you process and the results you need are mostly static, and you have the patience to wait for batch jobs to finish, MapReduce's approach is perfectly acceptable.
But if you need to analyze streaming data, such as readings collected from factory sensors, or if your application requires multiple processing passes over the data, then you should probably use Spark.
Most machine learning algorithms require multiple passes over the data. Beyond that, Spark's usual application scenarios include real-time marketing campaigns, online product recommendation, network security analytics, and machine log monitoring.
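As an illustration of the streaming case, here is a minimal Spark Structured Streaming sketch that averages sensor readings arriving as text lines on a TCP socket; the host name and port are hypothetical stand-ins for a real sensor feed.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.appName("sensor-stream-sketch").getOrCreate()

# Hypothetical source: one numeric reading per line on a TCP socket.
lines = (spark.readStream
              .format("socket")
              .option("host", "sensor-gateway")  # hypothetical host
              .option("port", 9999)
              .load())

# Maintain a running average of all readings seen so far.
averages = (lines
            .select(col("value").cast("double").alias("reading"))
            .agg(avg("reading")))

query = (averages.writeStream
                 .outputMode("complete")
                 .format("console")
                 .start())
query.awaitTermination()
```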
Disaster recovery
The two take quite different approaches to disaster recovery, but both are very good. Because Hadoop writes the data back to disk after every processing step, it is naturally resilient to system failures.
Spark stores its data objects in what are called Resilient Distributed Datasets (RDDs), spread across the cluster. "These data objects can be kept in memory or on disk, so an RDD can provide full disaster recovery as well," Borne points out.
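A minimal PySpark sketch of that choice: an RDD persisted with MEMORY_AND_DISK keeps partitions in memory and spills to disk when memory runs short, and any partition lost to a failure is recomputed from the RDD's recorded lineage of transformations.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-persist-sketch").getOrCreate()
sc = spark.sparkContext

# Persist in memory, spilling to disk if needed; if a partition is
# lost, Spark rebuilds it from the lineage of transformations rather
# than from a saved copy.
squares = sc.parallelize(range(100)).map(lambda n: n * n)
squares.persist(StorageLevel.MEMORY_AND_DISK)

print(squares.sum())

spark.stop()
```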