How to monitor Hadoop and Spark job progress and results via API from your own code?
Our company recently started a big data project, so I had to learn from scratch. By now I have the basic theory down and can write simple computation programs and run them individually, but the project requirements are more complex and involve several technologies, mainly Hadoop 2 (HDFS, YARN, MapReduce), Sqoop, Flume, and Spark 1.6. The overall pipeline can be summarized as:

1. Import the data in Oracle into Hive/HDFS via Sqoop, and import file data into Hive/HDFS via Flume.
2. Run a Spark program over the imported data and save the results as a Hive table.
3. Export the results to Oracle via Sqoop.

Each step must run to completion, and succeed, before the next one can start, so the whole process is serial. The company therefore wants a Java "master control" program that drives each step. My first thought, which feels a bit crude, is to use Java's Runtime to exec commands such as sqoop import/export and spark-submit (see the first sketch below). But how can the Java program monitor the Hadoop MapReduce jobs (Sqoop runs MapReduce underneath) and the Spark jobs? It doesn't matter if there is no progress information; the key is to get each job's status and whether it succeeded or failed, so the next step can start. Begging the gurus for guidance!!!!
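For context, here is a minimal sketch of what I mean by exec'ing the commands from Java (the main class, jar path, and arguments are just placeholders, not a real job):

import java.io.IOException;

public class StepRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Launch spark-submit as an external process; the class and jar
        // path below are placeholders for the real job.
        ProcessBuilder pb = new ProcessBuilder(
                "spark-submit", "--master", "yarn",
                "--class", "com.example.MyApp", "/path/to/app.jar");
        pb.inheritIO(); // forward the child's stdout/stderr to this JVM's console
        Process p = pb.start();
        int exitCode = p.waitFor(); // block until the command finishes
        System.out.println(exitCode == 0 ? "step succeeded" : "step failed: " + exitCode);
    }
}

As far as I know, sqoop and spark-submit (at least in yarn-client mode) exit non-zero on failure, so the exit code alone might already cover the success/failure part, but I don't know how reliable that is, which is why I'm asking about a proper API.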
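One direction I have been reading about, though I'm not sure it is the right one, is polling the ResourceManager through YARN's YarnClient API, since both the Sqoop MapReduce jobs and the Spark-on-YARN jobs show up as YARN applications. A rough sketch, assuming the application id (e.g. application_1474329708040_0001) can be taken from the submitting command's output:

import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class YarnJobMonitor {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration()); // picks up yarn-site.xml from the classpath
        yarnClient.start();

        // Application id string as printed by the submitting tool
        ApplicationId appId = ConverterUtils.toApplicationId(args[0]);

        EnumSet<YarnApplicationState> terminal = EnumSet.of(
                YarnApplicationState.FINISHED,
                YarnApplicationState.FAILED,
                YarnApplicationState.KILLED);

        ApplicationReport report = yarnClient.getApplicationReport(appId);
        while (!terminal.contains(report.getYarnApplicationState())) {
            Thread.sleep(5000); // poll every 5 seconds
            report = yarnClient.getApplicationReport(appId);
            System.out.println("state=" + report.getYarnApplicationState()
                    + " progress=" + report.getProgress());
        }
        // For a FINISHED application this distinguishes SUCCEEDED from FAILED
        System.out.println("final status: " + report.getFinalApplicationStatus());
        yarnClient.stop();
    }
}

I have also seen that Spark 1.6 ships org.apache.spark.launcher.SparkLauncher, which can start an application and report its state through a SparkAppHandle; would that be the better way to handle the Spark part? And is the YarnClient polling above a reasonable way to handle the Sqoop part?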