Configuring connectors for multiple topics on Kafka Connect Distributed Mode

Time:11-11

We have producers that are sending the following to Kafka:

  • topic=syslog, ~25,000 events per day
  • topic=nginx, ~5,000 events per day
  • topic=zeek.xxx.log, ~100,000 events per day (total). In this last case there are 20 distinct zeek topics, such as zeek.conn.log and zeek.http.log

kafka-connect-elasticsearch instances function as consumers to ship data from Kafka to Elasticsearch. The hello-world Sink configuration for kafka-connect-elasticsearch might look like this:

# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx,zeek.broker.log,zeek.capture_loss.log,zeek.conn.log,zeek.dhcp.log,zeek.dns.log,zeek.files.log,zeek.http.log,zeek.known_services.log,zeek.loaded_scripts.log,zeek.notice.log,zeek.ntp.log,zeek.packet_filtering.log,zeek.software.log,zeek.ssh.log,zeek.ssl.log,zeek.status.log,zeek.stderr.log,zeek.stdout.log,zeek.weird.log,zeek.x509.log
topic.creation.enable=true
key.ignore=true
schema.ignore=true
...

It can be invoked with bin/connect-standalone.sh. I realize that running tasks.max=24 when all of the work is performed in a single process is not ideal. I know that distributed mode would be a better alternative, but I am unclear on the performance-optimal way to submit connectors in distributed mode. Namely:

  • In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call? Or would it be best to break the configuration up into multiple connectors (e.g. one for syslog, one for nginx, one for zeek.*) and submit them separately?
  • I understand that the number of tasks should equal the number of topics x the number of partitions, but what dictates the number of workers?
  • Is there anywhere in the documentation that walks through best practices for a situation such as this where there is a noticeable imbalance of throughput for different topics?

CodePudding user response:

In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call?

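Yes, although in distributed mode it would be a JSON document submitted through the Connect REST API rather than a .properties file.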
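As a rough sketch of what that JSON looks like, the snippet below converts standalone-style properties text into the body expected by the Connect REST API's POST /connectors endpoint (a top-level "name" plus a "config" map of string values). The helper name properties_to_connector_payload and the shortened topics list are illustrative assumptions, not part of the original configuration, and the settings elided with "..." in the question remain elided here.

```python
import json


def properties_to_connector_payload(properties_text):
    """Turn key=value properties text into a POST /connectors request body."""
    config = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            # Not a key=value line (for example an elided "..."), so ignore it.
            continue
        config[key.strip()] = value.strip()
    # The REST API carries the connector name at the top level of the body.
    name = config.pop("name")
    return {"name": name, "config": config}


# Shortened, illustrative copy of the configuration from the question.
EXAMPLE_PROPERTIES = """\
# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx,zeek.conn.log,zeek.http.log
key.ignore=true
schema.ignore=true
"""

payload = properties_to_connector_payload(EXAMPLE_PROPERTIES)
print(json.dumps(payload, indent=2))
```

The printed JSON could then be saved to a file and sent to any worker in the cluster with an HTTP POST to /connectors using Content-Type: application/json. The worker address is deployment-specific; port 8083 is the usual default for the REST listener.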

what dictates the number of workers?

That is up to you. JVM usage is one factor that you can monitor and scale on.

As for best practices for this kind of throughput imbalance, there is not really any documentation that I am aware of.
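On the tasks.max part of the question: sink tasks consume as members of one consumer group, so the number of tasks that can actually receive partitions is bounded by the total partition count summed across the subscribed topics, not by topics multiplied by partitions. The sketch below works through that arithmetic. The partition counts are hypothetical (the question does not state them), and the function name is made up for illustration. The result is only an upper bound, since the real spread depends on the consumer partition assignor in use.

```python
def max_assignable_tasks(tasks_max, partitions_per_topic):
    """Upper bound on sink tasks that can be assigned at least one partition."""
    total_partitions = sum(partitions_per_topic.values())
    return min(tasks_max, total_partitions)


# Hypothetical layout: every topic has a single partition.
partitions = {"syslog": 1, "nginx": 1}
# Placeholder names standing in for the 20 zeek topics from the question.
partitions.update({f"zeek.topic{i}.log": 1 for i in range(20)})

busy_upper_bound = max_assignable_tasks(24, partitions)
print(busy_upper_bound)
```

Under these assumed counts there are 22 partitions in total, so at least two of the 24 configured tasks would have nothing to consume.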
