Resilience between the storage and the Event Grid

Here is my flow for my imports:

When a new file is detected in Blob Storage, an event is sent to Event Grid.
Event Grid retries until it is able to call the Azure Function.
The Azure Function puts the event into a Service Bus queue.
A web app consumes the queue.
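The Azure Function step in the flow above could look roughly like the sketch below. It only shows the mapping from an incoming event to a queue message. The field names (`id`, `eventType`, `eventTime`, `data.url`) follow the Event Grid schema for Blob Storage events; the function name `to_queue_message`, the message shape, and the sample values are hypothetical. Actually sending the message to Service Bus is left out.

```python
import json
from typing import Any


def to_queue_message(event: dict[str, Any]) -> dict[str, Any]:
    """Map an Event Grid BlobCreated event to a queue message (hypothetical shape)."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        raise ValueError(f"unexpected event type: {event.get('eventType')!r}")
    data = event["data"]
    return {
        # Reusing the Event Grid event id lets downstream code spot redeliveries.
        "message_id": event["id"],
        "body": json.dumps(
            {"blob_url": data["url"], "event_time": event["eventTime"]}
        ),
    }


# Sample event with made-up values, trimmed to the fields used above.
sample_event = {
    "id": "831e1650-001e-001b-66ab-eeb76e069631",
    "eventType": "Microsoft.Storage.BlobCreated",
    "eventTime": "2017-06-26T18:41:00.9584103Z",
    "subject": "/blobServices/default/containers/imports/blobs/data.csv",
    "data": {"url": "https://example.blob.core.windows.net/imports/data.csv"},
}

message = to_queue_message(sample_event)
print(message["message_id"])
```

Carrying the event id through as the message id is a design choice, not a requirement: it gives the consumer something stable to deduplicate on if the same event is delivered more than once.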

So I guess this process is very resilient, because each message is either stored or retried. The only step that could fail is the connection between the storage and Event Grid. What if that connection is down when a file is created in the storage account? How can I be sure that the event will still be triggered?

CodePudding user response：

... So I guess this process is very resilient, because each message is either stored or retried. The only step that could fail is the connection between the storage and Event Grid. What if that connection is down when a file is created in the storage account? How can I be sure that the event will still be triggered?

You can't. Event Grid is an at-least-once delivery system, though, so it has internal mechanisms in place for this. Blob Storage is one of the Azure services that support system topics, so the publishing side is outside your control. An outage could still affect the service, and there is no service level agreement for server-side disaster recovery, which is what we are dealing with in such a scenario:

There is no service level agreement (SLA) for server-side disaster recovery. If the paired region has no extra capacity to take on the additional traffic, Event Grid cannot initiate failover. Service level objectives are best-effort only.

You can find the recovery point and recovery time objectives online:

Recovery point objective (RPO)

Metadata RPO: zero minutes. Anytime a resource is created in Event Grid, it's instantly replicated across regions. When a failover occurs, no metadata is lost.
Data RPO: If your system is healthy and caught up on existing traffic at the time of regional failover, the RPO for events is about 5 minutes.

Recovery time objective (RTO)

Metadata RTO: Though generally it happens much more quickly, within 60 minutes, Event Grid will begin to accept create/update/delete calls for topics and subscriptions.
Data RTO: Like metadata, it generally happens much more quickly, however within 60 minutes, Event Grid will begin accepting new traffic after a regional failover.

So data loss is extremely unlikely, but there is no 100% guarantee. Should you be worried? Probably not, given how low the chances are.
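If that small remaining risk matters for your imports, one possible safety net is a periodic sweep that compares what is in the container with what has already been processed. This is not something Event Grid provides, and whether it fits your setup is an assumption on my part. The sketch below is minimal: `ImportLedger` and `plan_sweep` are hypothetical names, the ledger is an in-memory stand-in for a durable store, and the blob listing is hard-coded instead of coming from the storage account. The same ledger also handles duplicates, which you have to expect with at-least-once delivery.

```python
def plan_sweep(listed_blobs: set[str], processed_blobs: set[str]) -> list[str]:
    """Return blobs present in storage that were never processed, sorted."""
    return sorted(listed_blobs - processed_blobs)


class ImportLedger:
    """In-memory stand-in for a durable record of processed blobs."""

    def __init__(self) -> None:
        self.processed: set[str] = set()

    def handle(self, blob_name: str) -> bool:
        """Process a blob once; return False for a duplicate delivery."""
        if blob_name in self.processed:
            return False
        self.processed.add(blob_name)
        return True


ledger = ImportLedger()
ledger.handle("imports/a.csv")
ledger.handle("imports/a.csv")  # redelivered event, ignored

# What a container listing would return; hard-coded here for illustration.
listed = {"imports/a.csv", "imports/b.csv"}

missed = plan_sweep(listed, ledger.processed)
print(missed)  # ['imports/b.csv']
```

Anything returned by the sweep can be put on the same queue as a regular event, so the web app does not need a separate code path for recovered files.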