About Kafka
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system with a unique design, and we can use these features for log aggregation.
Kafka's basic messaging terminology is:
Topics: the categories to which messages are published.
Producers: the processes that publish messages to Kafka topics.
Consumers: the processes that subscribe to topics and process the published messages. Consumers are part of a consumer group, which is made up of many consumer instances for scalability and fault tolerance.
Brokers: each server in a Kafka cluster is called a broker.
Logs obtained from different sources can be fed by a few producers into various Kafka topics and then consumed by consumers.
Kafka provides several ways to push data to a topic:
From the command-line client: Kafka has a command-line client that takes input from a particular file or from standard input and pushes it as messages to the Kafka cluster.
Using Kafka Connect: Kafka provides a tool that uses connectors to implement custom logic for importing data into and exporting data out of the cluster.
By writing custom integration code: the last method is to write integration code between the data source and Kafka using the Java producer API.
In Kafka, each partition's log data is managed by one of the servers:
Kafka distributes the log's partitions among the multiple servers of the distributed system, and each partition is replicated across several servers for fault tolerance. With this partitioning scheme, Kafka provides parallelism in processing: multiple consumers in a consumer group can retrieve data simultaneously, in the same order in which the messages were stored.
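The partitioning scheme described above can be sketched in a few lines. This is a toy model, not Kafka's actual code: the hash function and the round-robin assignment below are simplifications, but they show how messages with the same key stay ordered within one partition while the partitions are split across the consumers of a group:

```python
# Toy model of Kafka's partition / consumer-group parallelism.
from collections import defaultdict

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    # Kafka hashes the message key to pick a partition; a simple
    # stand-in hash is used here.
    return sum(key.encode()) % NUM_PARTITIONS

def assign_partitions(num_partitions: int, consumers: list) -> dict:
    # Round-robin assignment: each partition is owned by exactly one
    # consumer in the group, which is how reads are parallelized.
    assignment = defaultdict(list)
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return dict(assignment)

# Same-key messages land in the same partition, preserving their order.
messages = [("host-a", "log line 1"), ("host-b", "log line 2"),
            ("host-a", "log line 3")]
partitions = defaultdict(list)
for key, value in messages:
    partitions[partition_for(key)].append(value)

assignment = assign_partitions(NUM_PARTITIONS,
                               ["consumer-1", "consumer-2", "consumer-3"])
print(assignment)
```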
In addition, Kafka lets you use as many servers as required. It uses disk for storage, so loading data may be slow; but thanks to disk capacity it can store a large amount of data (on the order of terabytes) with a longer retention period.
Redis is a bit different from Kafka in terms of its storage and various functions. At its core, Redis is an in-memory data store that can be used as a high-performance database, a cache, and a message broker, making it well suited for real-time data processing.
The data structures Redis supports are strings, hashes, lists, sets, and sorted sets. Redis also has clients in many languages, which can be used to write custom programs for inserting and retrieving data. The main similarity between the two is that both provide messaging services, but for the purposes of log aggregation we can use Redis's data structures to operate much more efficiently.
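To illustrate how a Redis list could aggregate log lines under a single key, here is a plain-Python stand-in that mimics the semantics of the LPUSH, LRANGE, and LPOP commands; with a real server you would issue the same commands through a client library such as redis-py:

```python
# Plain-Python stand-in mimicking Redis list commands, to show how
# a list under one key can collect log lines from several sources.

class FakeRedisLists:
    def __init__(self):
        self._data = {}

    def lpush(self, key, *values):
        # LPUSH prepends each value to the head of the list.
        lst = self._data.setdefault(key, [])
        for v in values:
            lst.insert(0, v)
        return len(lst)

    def lrange(self, key, start, stop):
        # LRANGE uses an inclusive stop index; -1 means the last element.
        lst = self._data.get(key, [])
        stop = len(lst) if stop == -1 else stop + 1
        return lst[start:stop]

    def lpop(self, key):
        lst = self._data.get(key, [])
        return lst.pop(0) if lst else None

r = FakeRedisLists()
r.lpush("logs:web", "GET /index 200")
r.lpush("logs:web", "GET /login 302")
print(r.lrange("logs:web", 0, -1))  # newest entries first
```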
When I tested the performance of Redis and Kafka, the results were very interesting.
Kafka
Kafka is a popular message queue system that has been tested extensively by major companies such as LinkedIn; in fact, LinkedIn's engineers wrote the first version of Kafka. In their tests, LinkedIn used Kafka in cluster mode on six machines, each with an Intel Xeon 2.5 GHz six-core processor, 32 GB of RAM, and six 7200 RPM SATA drives.
Producers
For the first test, a topic with six partitions and no replication was created. A single producer generated 50 million records in a single thread, each message 100 bytes in size. This setup produced a peak throughput of over 800k records per second, or 78 MB/sec. In a different test they used the same basic setup but with three producers running on three different machines. In this case we see a higher peak of about 2,000k records/sec, or 193.0 MB/sec.
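These figures can be sanity-checked with a bit of arithmetic: record rate times record size gives the data rate. A minimal sketch (the reported MB/sec are slightly lower than this round-number calculation because the actual peak record rates were not exactly 800k and 2,000k):

```python
# Back-of-the-envelope check: records/sec x bytes/record -> MB/sec.

def mb_per_sec(records_per_sec: float, record_bytes: int) -> float:
    return records_per_sec * record_bytes / 1_000_000

# Single producer: ~800k records/sec at 100 bytes each.
print(mb_per_sec(800_000, 100))    # ~80 MB/sec (reported: 78 MB/sec)
# Three producers: ~2,000k records/sec.
print(mb_per_sec(2_000_000, 100))  # ~200 MB/sec (reported: 193 MB/sec)
```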
Asynchronous replication and synchronous replication
The second batch of tests involved replication methods. Using the same number of records and message size, and a single producer similar to the previous test, there were three replicas with replication handled asynchronously. Throughput peaked at about 766k records per second, or 75 MB/sec.
However, when replication was synchronous, which means the master waits for acknowledgment from the replicas, the peak throughput was lower, at about 420k records/sec or 40 MB/sec. Although this is a reliable setup, since it guarantees that every message arrives, the time the master spends confirming receipt of each message reduces throughput considerably.
Consumers
In this case they used the same number and size of messages, with 6 partitions and 3 replicas, and applied the same scaling approach by increasing the number of consumers. In the first test, with a single consumer, the maximum throughput was 940k records per second, or 89 MB/sec; not surprisingly, with three consumers, throughput reached 2,615k records per second, or 249.5 MB/sec.
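As a rough check of how close that consumer scale-out is to linear, a quick calculation using the figures above (in thousands of records per second):

```python
# Compare three-consumer throughput against three times the
# single-consumer throughput (figures in k records/sec from the test).

single_consumer = 940    # k records/sec with 1 consumer
three_consumers = 2615   # k records/sec with 3 consumers

efficiency = three_consumers / (3 * single_consumer)
print(f"scaling efficiency: {efficiency:.0%}")  # ~93% of perfectly linear
```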
Kafka's throughput was also measured for combinations of producer count, consumer count, and replication method. One such test used a single producer, a single consumer, three replicas, and asynchronous replication; it peaked at 795k records/sec, or 75.8 MB/sec.
Message processing
As shown below, as record size increases, we can expect the number of records per second to decrease:
Message size vs. throughput (records/sec) (source)
But, as we can see in the chart below, as record size (in bytes) grows, throughput in MB/sec also rises. Smaller messages lead to lower throughput in these terms, because the overhead of queuing each message affects performance:
Message size vs. throughput (MB/sec) (source)
In addition, as shown in the figure below, the total amount of data already stored does not affect Kafka's performance:
Throughput vs. stored data size (source)
Kafka relies heavily on the machine's memory (RAM). As shown above, using memory together with its storage is the best way for it to maintain steady throughput. Its performance depends on the data consumption rate: if consumers do not consume data fast enough, Kafka has to read data from disk instead of memory, which degrades its performance.
Redis throughput
Let's examine Redis's performance in terms of message processing rate. We used a few very basic Redis commands to help us evaluate its performance: SET, GET, LPUSH, and LPOP. These are commonly used Redis commands for storing and retrieving Redis values and lists.
In this test, we generated 2 million requests, with the key range set to 0 to 999999 and a single value size of 100 bytes. We tested Redis using the redis-benchmark command.
Redis pipelining
As shown below, in our first test we found a significant difference in performance when using Redis pipelining. The reason is that with pipelining we can send multiple requests to the server without waiting for the replies, and then check the replies in a single final step.
Throughput per command, with and without Redis pipelining (2.6 GHz Intel Core i5, 8 GB RAM)
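A toy latency model makes the difference concrete: without pipelining, each command pays a full network round trip, while a pipelined batch shares one round trip. The numbers below (round-trip time, per-command processing time, batch size) are illustrative assumptions, not measurements:

```python
# Toy cost model of Redis pipelining: total time is (round trips x RTT)
# plus the server-side processing time of every command.

def total_time_ms(n_commands, rtt_ms, per_command_ms, batch_size=1):
    round_trips = -(-n_commands // batch_size)  # ceiling division
    return round_trips * rtt_ms + n_commands * per_command_ms

N, RTT, PROC = 10_000, 0.5, 0.002  # assumed values, not measured

no_pipeline = total_time_ms(N, RTT, PROC)                 # one RTT per command
pipelined = total_time_ms(N, RTT, PROC, batch_size=100)   # 100 commands per RTT
print(no_pipeline, pipelined)  # the pipelined run is dramatically faster
```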
Throughput can also differ depending on the size of the data stored in Redis. As you can see, the chart below shows throughput for different value (message) sizes. It is easy to see that as message size increases, throughput (the number of requests per second) decreases, and this behavior is consistent across all four commands.
Throughput with different message sizes (bytes)
In addition, as shown below, we measured the data written in bytes. We see that as the number of records increases, the number of bytes written to Redis increases as well. This is partly intuitive, and we noticed the same behavior in Kafka.
Throughput vs. value size for the GET command
Redis snapshotting supports the Redis persistence model: it generates point-in-time snapshots of the dataset according to the user's preferences, for example based on the time elapsed since the last snapshot or the number of writes. However, if a Redis instance restarts or crashes, all data between consecutive snapshots is lost. In such cases Redis persistence does not guarantee durability, and it is limited to applications where losing recent data is acceptable.
Kafka vs. Redis: summary
As mentioned above, Redis is an in-memory store, which means it uses its main memory for storage and processing, making it much faster than the disk-based Kafka. The only problem with Redis's in-memory storage is that we cannot store large amounts of data for long periods of time.
Since main memory is smaller than disk, we have to clear data regularly by automatically moving it from memory to disk to make room for new data. Redis also offers persistence, which lets us dump the dataset to disk if necessary. In addition, Redis follows a master-slave architecture, and replication is only useful when persistence is enabled on the master.
Furthermore, Redis does not have Kafka's concept of parallelism, where multiple processes can consume the data at the same time.
Based on the features of these two tools, and even though the tests run against Kafka and Redis were not exactly the same, we can still conclude: for real-time message processing with minimal latency, you should try Redis first; but if messages are large and data should be reused, you should first consider Kafka.