Possible solutions:
1. Roll back everything the failed job has already written, then re-run the whole job
2. Resume from the point of failure, i.e. something like a breakpoint/checkpoint restart
3. Anything else?
Question: what feasible solutions does the industry actually use, and what do concrete implementations look like? If I go with option 2, how does the job resume, and where does it get the checkpoint from? A simple explanation would be great. I'm a rookie who has only just started learning Spark these last two days; I hope an expert can point me in the right direction.
CodePudding user response:
Lessons learned in blood and tears:
1. When processing data from a source table, never write the results back into the source table; the source table must stay read-only.
2. If you have any data-integrity requirements, you can: (1) delete the duplicate data from the result table before writing (for example, for a daily computation, first delete that day's existing rows from the result table, then write the new results); or (2) write into a temporary table while processing, and only move the data from the temporary table into the result table after the job finishes.
3. Use a monotonic, exclusive field such as a timestamp or an auto-increment ID to bound the range of source data each run reads.
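Point 2(1) above, delete-then-insert in one transaction, is what makes a re-run safe after a failure. A minimal sketch using SQLite as a stand-in for the real result store (the `results` table and column names are hypothetical; in Spark you would express the same delete-then-write against your actual sink):

```python
import sqlite3

def write_daily_results(conn, day, rows):
    """Idempotent daily write: delete that day's old rows, then insert the new
    ones, all in one transaction, so re-running the job never duplicates data."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("DELETE FROM results WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO results (day, value) VALUES (?, ?)",
            [(day, v) for v in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (day TEXT, value INTEGER)")
write_daily_results(conn, "2023-01-01", [1, 2, 3])
write_daily_results(conn, "2023-01-01", [1, 2, 3])  # re-run: still 3 rows
```

Because the delete and insert share a transaction, a crash mid-write leaves the old rows intact, and a successful re-run replaces them exactly once.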