Spark Streaming join problem

A Spark Streaming program has two data sources, each coming from a different topic, and the two streams need to be joined by id. The problem is that matching records may not land in the same batch: the record from the second topic can arrive earlier or later than the one from the first topic, or never arrive at all. How can this be solved?

CodePudding user response:

Use a Redis cache: write the records from the first topic that did not match into Redis, then read Redis in every batch, and when a record is matched, delete it from Redis.
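
The cache-and-match idea above can be sketched in plain Python. This is only an illustration under assumptions: two dicts stand in for Redis, the function name `join_batch` and the record shape `(id, value)` are hypothetical, each id is assumed to occur at most once per topic, and both sides are buffered (not only the first topic) so that a record from the second topic that arrives early is kept as well. In a real job the same logic would run against a Redis client inside each batch.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]            # (id, value); hypothetical record shape
Joined = Tuple[str, str, str]       # (id, value from topic 1, value from topic 2)


def join_batch(
    first: List[Record],
    second: List[Record],
    pending_first: Dict[str, str],
    pending_second: Dict[str, str],
) -> List[Joined]:
    """Join one micro-batch by id, carrying unmatched records to later batches.

    pending_first / pending_second are stand-ins for the Redis cache:
    they hold records that have not yet found a partner.
    """
    joined: List[Joined] = []

    # Records from topic 1: match against buffered topic-2 records, else buffer.
    for rid, left in first:
        if rid in pending_second:
            joined.append((rid, left, pending_second.pop(rid)))  # matched: delete from cache
        else:
            pending_first[rid] = left

    # Records from topic 2: match against buffered topic-1 records
    # (including those buffered just above from this same batch), else buffer.
    for rid, right in second:
        if rid in pending_first:
            joined.append((rid, pending_first.pop(rid), right))  # matched: delete from cache
        else:
            pending_second[rid] = right

    return joined
```

Calling `join_batch` once per batch with the same two dicts lets a record wait across batches until its partner shows up. Records whose partner never arrives stay in the cache, so some form of expiry is still needed.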

CodePudding user response:

This scenario is better suited to offline processing: persist the data to storage first.

CodePudding user response:

Quoting the 2nd-floor reply from woloqun:

This scenario is better suited to offline processing: persist the data to storage first.

It used to be done with offline processing; now we want to make it real-time.

CodePudding user response:

Quoting the 1st-floor reply from u012540384:

Use a Redis cache: write the records from the first topic that did not match into Redis, then read Redis in every batch, and when a record is matched, delete it from Redis.

The data volume is too large: a single batch contains millions of records.
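
Since some ids never find a partner, any cache of unmatched records grows without bound unless old entries are dropped, which matters at this volume. A minimal sketch of time-based eviction follows, under assumptions: a dict of `id -> (value, stored_at)` stands in for the cache, and `evict_expired` and `ttl_seconds` are hypothetical names. With Redis itself, per-key expiry (for example the `EXPIRE` command) can serve the same purpose.

```python
from typing import Dict, List, Tuple


def evict_expired(
    pending: Dict[str, Tuple[str, float]],
    now: float,
    ttl_seconds: float,
) -> List[Tuple[str, str]]:
    """Remove entries that have waited longer than ttl_seconds.

    pending maps id -> (value, stored_at), where stored_at is a timestamp
    in seconds. Returns the evicted (id, value) pairs so the caller can log
    them or hand them to offline processing.
    """
    # Collect ids first so the dict is not modified while iterating over it.
    expired_ids = [
        rid for rid, (_, stored_at) in pending.items()
        if now - stored_at > ttl_seconds
    ]
    return [(rid, pending.pop(rid)[0]) for rid in expired_ids]
```

Running this once per batch bounds the cache to roughly the records received within the last `ttl_seconds`. The trade-off is that a partner arriving after the wait has elapsed will no longer match.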

CodePudding user response:

HBase could be used for storage, but that would hurt efficiency. OP, how did you handle this in the end?