EventHub data bursty with long pauses


I'm seeing multi-second pauses in the event stream, even reading from the retention pool.

Here's the main nugget of EH setup:

    BlobContainerClient storageClient = new BlobContainerClient(blobcon, BLOB_NAME);
    RTMTest.eventProcessor = new EventProcessorClient(storageClient, consumerGroup, ehubcon, EVENTHUB_NAME);

And then the do-nothing processor:

    static async Task processEventHandler(ProcessEventArgs eventArgs)
    {
        // Track throughput counters for the per-second log output.
        RTMTest.eventsPerSecond++;
        RTMTest.eventCount++;

        // Checkpoint every 16th event.
        if ((RTMTest.eventCount % 16) == 0)
        {
            await eventArgs.UpdateCheckpointAsync(eventArgs.CancellationToken);
        }
    }

And then a typical execution:

15:02:23: no events
15:02:24: no events
15:02:25: reqs=643 
15:02:26: reqs=656 
15:02:27: reqs=1280 
15:02:28: reqs=2221 
15:02:29: no events
15:02:30: no events
15:02:31: no events
15:02:32: no events
15:02:33: no events
15:02:34: no events
15:02:35: no events
15:02:36: no events
15:02:37: no events
15:02:38: no events
15:02:39: no events
15:02:40: no events
15:02:41: no events
15:02:42: no events
15:02:43: no events
15:02:44: reqs=3027 
15:02:45: reqs=3440 
15:02:47: reqs=4320 
15:02:48: reqs=9232 
15:02:49: reqs=4064 
15:02:50: reqs=395 
15:02:51: no events
15:02:52: no events
15:02:53: no events

The event hub, blob storage, and RTMTest webjob are all in US West 2. The event hub has 16 partitions. The processor is correctly calling my handler, as evidenced by the bursts of data, and the error handler is never called.

Here are two applications side by side: the left using Redis, the right using Event Hub. The events drive the animations, so you can visually see the long stalls. Note: these are vaccines being reported around the US, either live or via batch reconciliations from the pharmacies.

vaccine reporting animations

Any idea why I see the multi-second stalls?

Thanks.

CodePudding user response:

Event Hubs consumers make use of a prefetch queue when reading. This is essentially a local cache of events that the consumer tries to keep full by streaming in continually from the service. To prioritize throughput and avoid waiting on the network, consumers read exclusively from prefetch.

The pattern that you're describing falls into the "many smaller events" category, which will often drain the prefetch quickly if event processing is also quick. If your application is reading more quickly than the prefetch can refill, reads will start to take longer and return fewer events, as it waits on network operations.

One thing that may help is to test using higher values for PrefetchCount and CacheEventCount in the options when creating your processor. These default to a prefetch of 300 and a cache event count of 100. You may want to try testing with something like 750/250 and see what happens. We recommend keeping at least a 3:1 ratio between them.
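
As a rough sketch of what that could look like, reusing the variables from the question (the 750/250 values are just the starting point suggested above, not tuned numbers):

    // Raise the prefetch and cache sizes via EventProcessorClientOptions.
    // 750/250 are illustrative starting values; adjust based on testing.
    var options = new EventProcessorClientOptions
    {
        PrefetchCount = 750,
        CacheEventCount = 250
    };

    RTMTest.eventProcessor = new EventProcessorClient(
        storageClient, consumerGroup, ehubcon, EVENTHUB_NAME, options);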

It is also possible that your processor is being asked to do more work than is recommended for consistent performance across all of the partitions it owns. There's good discussion of different behaviors in the Troubleshooting Guide, and ultimately, capturing a +/- 5-minute slice of the SDK logs described here would give us the best view of what's going on. That's more detail and more back-and-forth discussion than works well on StackOverflow; I'd invite you to open an issue in the Azure SDK repository if you go down that path.
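
If it helps, one common way to capture those SDK logs is an AzureEventSourceListener from Azure.Core.Diagnostics; a minimal sketch (writing to the console here, though a file target is more practical for a 5-minute capture):

    // Listen to Azure SDK event sources at verbose level for the capture window.
    using Azure.Core.Diagnostics;
    using System.Diagnostics.Tracing;

    using AzureEventSourceListener listener =
        AzureEventSourceListener.CreateConsoleLogger(EventLevel.Verbose);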

Something to keep in mind is that Event Hubs is optimized to maximize overall throughput and not for minimizing latency for individual events. The service offers no SLA for the time between when an event is received by the service and when it becomes available to be read from a partition.

When the service receives an event, it acknowledges receipt to the publisher and the send call completes. At this point, the event still needs to be committed to a partition. Until that process is complete, it isn't available to be read. Normally, this takes milliseconds but may occasionally take longer for the Standard tier because it is a shared instance. Transient failures, such as a partition node being rebooted/migrated, can also impact this.

With your near-real-time reading, you may be processing quickly enough that there's nothing client-side that will help. In that case, you'd need to consider adding more TUs, moving to a Premium/Dedicated tier, or using more partitions to increase concurrency.

Update:

For those interested without access to the chat: log analysis shows a pattern of errors indicating that either the host owns too many partitions and load balancing is unhealthy, or there is a rogue processor running in the same consumer group but not using the same storage container.

In either case, partition ownership is bouncing frequently, causing partitions to stop, move to a new host, reinitialize, and restart, only to stop and have to move again.

I've suggested reading through the Troubleshooting Guide, as this scenario and some of the other symptoms are discussed in detail.

I've also suggested reading through the samples for the processor - particularly Event Processor Configuration and Event Processor Handlers. Each has guidance around processor use and configuration that should be followed to maximize throughput.
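
As a rough illustration of what those handler samples cover, the wiring looks something like the following; the error handler shown here is illustrative and not taken from the question:

    // Wire both handlers before starting the processor; an error handler is required.
    RTMTest.eventProcessor.ProcessEventAsync += processEventHandler;
    RTMTest.eventProcessor.ProcessErrorAsync += processErrorHandler;

    await RTMTest.eventProcessor.StartProcessingAsync();
    // ... let the processor run ...
    await RTMTest.eventProcessor.StopProcessingAsync();

    static Task processErrorHandler(ProcessErrorEventArgs args)
    {
        // Surface load-balancing/ownership errors instead of swallowing them.
        Console.WriteLine($"Error in partition '{args.PartitionId}': {args.Exception.Message}");
        return Task.CompletedTask;
    }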

CodePudding user response:

@jesse very patiently examined my logs and led me to the "duh" moment of realizing I just needed a separate consumer group for this second application of the Event Hub data. Now things are rock solid. Thanks, Jesse!
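
For anyone hitting the same stalls, the change boils down to giving the second application its own consumer group (created on the Event Hub first) and, ideally, its own checkpoint container. A minimal sketch, with illustrative names:

    // The second application gets its own consumer group and checkpoint container,
    // so it never competes with the first app for partition ownership.
    // "app2-checkpoints" and "app2-consumer-group" are placeholder names.
    BlobContainerClient app2Checkpoints = new BlobContainerClient(blobcon, "app2-checkpoints");
    var app2Processor = new EventProcessorClient(
        app2Checkpoints, "app2-consumer-group", ehubcon, EVENTHUB_NAME);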
