Using Hadoop to process data from multiple databases

Time: 09-25

Scenario: we have many MySQL databases spread across the network, and we want to use Hadoop to process the data in those databases.
There are currently two opinions on how to do this:
1. Export the data from MySQL and import it into the Hadoop cluster (the cluster and the MySQL database servers are not in the same place), then store the processing results back in MySQL or in HBase.
2. Use Hadoop's database support directly: read the data with Hadoop's DBInputFormat. Concretely, deploy the MySQL database servers as nodes of the Hadoop cluster, so that when Hadoop distributes tasks, each task is assigned to the node that holds the corresponding data, reads that data through the API, and runs the analysis there (a minimal sketch follows below).
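
For concreteness, here is a minimal sketch of what the second approach looks like with Hadoop's DBInputFormat (the org.apache.hadoop.mapreduce.lib.db API). The table name, column names, JDBC URL, and credentials below are hypothetical placeholders; the structural point is that the JDBC connection is fixed in the job configuration, so one job reads from one database:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MySqlReadJob {

    // Maps one row of a hypothetical "orders" table.
    public static class OrderRecord implements Writable, DBWritable {
        long id;
        String product;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            product = rs.getString("product");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, product);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            product = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(product);
        }
        public String toString() {
            return id + "\t" + product;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The JDBC connection is fixed here, in the job configuration:
        // one URL per job, which is why a single job cannot span servers.
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://db-host-1:3306/shop",  // hypothetical server/db
                "user", "password");

        Job job = Job.getInstance(conf, "read-from-mysql");
        job.setJarByClass(MySqlReadJob.class);

        // Read the id and product columns of the orders table, split by id.
        DBInputFormat.setInput(job, OrderRecord.class,
                "orders", null /* conditions */, "id" /* orderBy */,
                "id", "product");

        // Identity map/reduce: just dump the rows to HDFS as text.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(OrderRecord.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/orders-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that each map task opens its own JDBC connection to that single server and pulls a slice of the rows, so the parallelism is bounded by what one MySQL server can serve.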
I currently think the second approach is not feasible, but most of my classmates and my teacher support it, so I just want to ask which of the two is more suitable. Regarding the second approach, as I remember it, the database connection used by the API is configured when the Job starts, so a single job can only read from one database at a time.
Thank you.

CodePudding user response:

Use Sqoop to import the MySQL data directly into Hadoop; our company does this, and it is not slow.
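
For reference, a typical Sqoop import for one of the databases looks like the command below; the connect string, credentials, table, and target directory are placeholders for your own setup:

sqoop import \
    --connect jdbc:mysql://db-host-1:3306/shop \
    --username user \
    --password password \
    --table orders \
    --target-dir /data/shop/orders \
    --num-mappers 4

Run one import per source database, then process the combined data in HDFS with a single MapReduce job, and write the results back with sqoop export if they need to end up in MySQL.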