Compressing Avro data in Confluent S3 Kafka Connector


I have a Confluent S3 sink connector that takes data from a Kafka topic and ingests it into an S3 bucket.

The ingest works fine and all was well; however, I am now required to compress the Avro data before landing it in the bucket.

I have tried the following config:

{
  "name": "--private-v1-s3-sink",
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "tasks.max": "1",
  "s3.region": "eu-west-1",
  "partition.duration.ms": "3600000",
  "rotate.schedule.interval.ms": "3600000",
  "topics.dir": "svs",
  "flush.size": "2500",
  "schema.compatibility": "FULL",
  "file.delim": "_",
  "topics": "--",
  "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "--systems",
  "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "s3.bucket.name": "${S3_BUCKET}",
  "s3.acl.canned": "bucket-owner-full-control",
  "avro.codec": "snappy",
  "locale": "en-GB",
  "timezone": "GMT",
  "errors.tolerance": "all",
  "path.format": "'ingest_date'=yyyy-MM-dd",
  "timestamp.extractor": "Record"
}

I assumed 'avro.codec' would compress the data, but it does not. In its place I also tried '"s3.compression.type": "snappy"', still with no luck, although that setting does work with JSON and GZIP.
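
To spell out the two variants I tried (only the keys I changed; everything else as in the config above):

  "avro.codec": "snappy"

or, in its place:

  "s3.compression.type": "snappy"

and, for comparison, the combination that does compress for me (format class name as documented for the Confluent S3 connector):

  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "s3.compression.type": "gzip"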

I'm not quite sure what is going wrong.

CodePudding user response:

Those settings only apply to the files written by the S3 Avro writer, not to the in-flight data from the producer, which would have to be compressed at the producer or broker/topic level rather than through a Connect setting.

Refer to the compression.type topic config.
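
As a rough sketch, compressing the in-flight records would be a producer or topic setting rather than a connector one (the broker address and topic name below are placeholders):

  # Producer side: set in the producer's client configuration
  compression.type=snappy

  # Topic level: set with the kafka-configs CLI
  kafka-configs.sh --bootstrap-server localhost:9092 \
    --entity-type topics --entity-name my-topic \
    --alter --add-config compression.type=snappy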

CodePudding user response:

For those who may come across this in the future.

I ran a test comparing this setting (Snappy), the same setting with BZIP2 instead of Snappy, and no compression at all.

This was the result:

No compression: 58.2 MB / 406 total objects
BZIP2:          19.9 MB / 406 total objects
Snappy:         31.1 MB / 406 total objects

The test ran over a 24-hour period, with each connector pulling from the same topic and writing into its own bucket.

As you can see, the above config using Snappy was actually working.

BZIP2 offered a higher compression rate and also seemed quicker.

Ultimately, though, we had to use Snappy, as Redshift ingestion only accepts Avro compressed with Snappy, at least at this time.
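
For reference, the only difference between the two compressed runs was the avro.codec value in the connector config from the question, e.g.:

  "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
  "avro.codec": "bzip2"

versus "avro.codec": "snappy" for the Snappy run.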
