Spark Streaming

Time:09-16

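Spark Streaming represents a stream as a continuous sequence of RDDs.
Reading data: data is pulled from a source such as a socket at regular intervals, and the data received during each interval forms one batch, which becomes an RDD (stream processing is in fact the processing of one batch of RDDs after another). It is like collecting running water with cups: each cup of water corresponds to one batch of data.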
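The "cup of water" idea can be sketched in plain Python, without Spark. This is only an illustration: the function name `to_micro_batches`, the sample records, and the one-second interval are assumptions, not part of any Spark API.

```python
from collections import defaultdict

def to_micro_batches(records, batch_interval):
    """Group (timestamp, value) records into batches of `batch_interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in records:
        # A record belongs to the batch whose time slot contains its timestamp.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # One list per interval, in time order. Intervals with no records are
    # simply omitted in this sketch.
    return [batches[i] for i in sorted(batches)]

# Hypothetical records arriving over roughly three seconds.
records = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = to_micro_batches(records, batch_interval=1.0)
```

Each inner list plays the role of one RDD: the stream is never processed record by record, only one batch at a time.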


In a Spark cluster, the worker nodes are what actually process the data.
Receiver mode
1. An executor process on a worker node creates a receiver object that pulls the data (for example, from Kafka).
2. A receiver can pull data from more than one topic at the same time, using multiple threads to read in parallel.
3. If more data arrives than memory can hold, enable checkpointing and set an HDFS directory for it; the received records are then also written to a log on HDFS, which avoids data loss.
4. In this mode the offsets are maintained by ZooKeeper (zk).
5. It is relatively slow.
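The logging described in step 3 follows the write-ahead pattern: a record is written to durable storage before it is buffered, so it can be replayed after a failure. The sketch below is a toy model of that idea, not Spark's implementation; the `WriteAheadLog` class is hypothetical and a local temporary directory stands in for the HDFS checkpoint directory.

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Toy write-ahead log: records are appended to a file, one JSON line each."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the record onto disk

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]

log_dir = tempfile.mkdtemp()  # assumption: stands in for the HDFS directory
wal = WriteAheadLog(os.path.join(log_dir, "received.log"))

in_memory_buffer = []
for record in [{"topic": "t", "value": 1}, {"topic": "t", "value": 2}]:
    wal.append(record)               # 1. log the record first
    in_memory_buffer.append(record)  # 2. only then keep it in memory

in_memory_buffer.clear()                    # simulate losing the executor
recovered = WriteAheadLog(wal.path).replay()  # rebuild from the log
```

In Spark itself this behaviour is enabled by setting a checkpoint directory on the streaming context and turning on `spark.streaming.receiver.writeAheadLog.enable`.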

Direct mode
1. There is no receiver.
2. An executor connects directly to the partitions on the broker (to one replica of each), so a correspondence has to be established: one Kafka partition maps to one RDD partition.
3. Data is pulled from Kafka at regular intervals, and you have to manage the offsets yourself: after the data has been processed, compute the offsets using a KafkaCluster instance and save them to zk (you can also record them through a checkpoint on HDFS, in a database, or in a file). Before the next pull, look up the offsets in zk.
4. The number of partitions in a Kafka topic can be kept equal to the number of RDD partitions, in one-to-one correspondence, which improves efficiency.
5. Compared with receiver mode, the ingestion rate can be controlled, so the data does not overflow.
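The offset bookkeeping in steps 3 to 5 can be modelled roughly as follows. This is a sketch under stated assumptions: a plain dict stands in for the offsets stored in zk, and the names `plan_offset_ranges`, `commit`, and `max_per_partition` are hypothetical rather than Spark or Kafka API.

```python
def plan_offset_ranges(stored_offsets, latest_offsets, max_per_partition):
    """Return {partition: (from_offset, until_offset)} for the next batch."""
    ranges = {}
    for partition, latest in latest_offsets.items():
        # Resume where the previous batch stopped (0 for a new partition).
        start = stored_offsets.get(partition, 0)
        # Cap the batch size so a backlog cannot flood the job.
        until = min(latest, start + max_per_partition)
        ranges[partition] = (start, until)
    return ranges

def commit(offset_store, ranges):
    """Record the end of each processed range as the new starting point."""
    for partition, (_, until) in ranges.items():
        offset_store[partition] = until

offset_store = {0: 100, 1: 40}  # assumption: stand-in for offsets kept in zk
latest = {0: 130, 1: 45}        # newest offsets currently available in Kafka

ranges = plan_offset_ranges(offset_store, latest, max_per_partition=20)
# ... process the records in each range here, one range per RDD partition ...
commit(offset_store, ranges)    # save offsets only after processing succeeded
```

Saving the offsets after processing rather than before means a failed batch is read again on restart instead of being skipped.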

If you need to compute over a certain period of time rather than a single batch, you can use a window function.
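A window combines the most recent batches into one result. The sketch below shows the idea on a list of per-batch counts in plain Python; `windowed_sums` is a hypothetical helper, and the window length and slide interval are given as numbers of batches (in Spark both must be multiples of the batch interval).

```python
def windowed_sums(batch_counts, window_length, slide_interval):
    """Sum the last `window_length` batches, evaluated every `slide_interval` batches."""
    results = []
    for end in range(slide_interval, len(batch_counts) + 1, slide_interval):
        # Early windows cover fewer batches because the stream has just started.
        start = max(0, end - window_length)
        results.append(sum(batch_counts[start:end]))
    return results

# Hypothetical record counts for six consecutive batches.
batch_counts = [3, 1, 4, 1, 5, 9]
sums = windowed_sums(batch_counts, window_length=3, slide_interval=2)
```

With a window of three batches that slides every two batches, neighbouring windows overlap by one batch, so a record can contribute to more than one result.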


Writing data from Spark to Kafka
1. Instantiate a KafkaProducer.
2. For synchronous and asynchronous writes, see the Kafka documentation.
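The difference between the two write styles can be sketched with two small helpers. They assume a producer whose `send` returns a future with `get`, `add_callback`, and `add_errback`, which is how the kafka-python client's `KafkaProducer` behaves; the helper names, the topic name, and the broker address in the comments are hypothetical.

```python
def send_sync(producer, topic, value, timeout=10):
    """Send one record and block until the broker acknowledges it."""
    future = producer.send(topic, value)
    # get() waits for the acknowledgement and raises if the send failed.
    return future.get(timeout=timeout)

def send_async(producer, topic, value, on_success, on_error):
    """Send one record without blocking; the callbacks report the outcome later."""
    future = producer.send(topic, value)
    future.add_callback(on_success)
    future.add_errback(on_error)
    return future

# Possible usage with kafka-python (requires a running broker):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   send_sync(producer, "events", b"hello")
#   send_async(producer, "events", b"hello", print, print)
#   producer.flush()
```

Synchronous sends are simpler to reason about but pay one round trip per record; asynchronous sends let the producer batch records, at the cost of handling failures in a callback.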

Kafka + Spark tuning
1. Set a reasonable batch interval, so that data that cannot be processed in time does not pile up.
2. Set the rate at which data is pulled from Kafka: if it is too fast, data accumulates; if it is too slow, throughput is poor.
3. Set a reasonable number of partitions.
4. Allocate a reasonable amount of CPU resources.
5. Use high-performance operators.
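Points 1 and 2 of the tuning list interact: in direct mode, the per-partition rate limit, the partition count, and the batch interval together bound how many records one batch can contain. The sketch below works through that arithmetic. The configuration keys are real Spark settings, but every value and both helper functions are assumptions chosen for illustration.

```python
# Real Spark configuration keys; the values are placeholders, not recommendations.
conf = {
    "spark.streaming.kafka.maxRatePerPartition": "1000",  # records/sec per partition
    "spark.streaming.backpressure.enabled": "true",       # adapt the rate automatically
    "spark.executor.cores": "2",
}

def max_records_per_batch(max_rate_per_partition, num_partitions, batch_seconds):
    """Upper bound on the records a single batch can contain."""
    return max_rate_per_partition * num_partitions * batch_seconds

def keeps_up(processing_seconds, batch_seconds):
    """A job is stable only if a batch is processed before the next one arrives."""
    return processing_seconds <= batch_seconds

rate = int(conf["spark.streaming.kafka.maxRatePerPartition"])
limit = max_records_per_batch(rate, num_partitions=4, batch_seconds=5)
```

If the measured processing time for a batch of that size exceeds the batch interval, either lower the rate, lengthen the interval, or add partitions and cores.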


