AWS Multi region high availability architecture for serverless stack-CodePudding

I am in process of coming up with a multi-region high availability (active-active) architecture for my product. A simplified version of our stack is that we use Lambda to implement our micro services, which are exposed as APIs using API Gateway. These micro services integrate with downstream services or databases like DynamoDB, Aurora RDS. So, '

Route 53 >> Api gateway >> Lambda >> Downstream service/Database

I am trying to figure out what is the best mechanism to configure Route 53 such that it understands any of the services in the stack fails so that it routes the incoming requests to another region. Eg if Lambda service in region-1 fails, then it is easy because I would create Health Check records pointing to these Lambdas and once they are not reachable Route 53 will itself route to next requests to region-2. However, if the downstream resource eg RDS that Lambda is dependent on fails, how will Route 53 know this so as to route the next requests to region-2?

Appreciate any pointers on this.

CodePudding user response：

It depends a bit on your envision failover setup. Let us assume you have two regions: region1,region2

Now you could have two failure scenarios:

Lambda fails in region1 => you failover to Lambda in region2
RDS fails in region1 => you failover to RDS in region2

In both cases you need to ask yourself: What I want to do. If for example, in case 1 you connect from Lambda in region2 to RDS in region1 then high region transfer costs may occur, so you may want to trigger in any case a fail-over of RDS to region2.

Note: Generally it is very advisable to not connect directly with Lambda to RDS, but use instead RDS proxy (to avoid hammering the database with requests, slowing it down etc.): https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-proxy.html

Generally, with RDS these region failovers are much more complicated (can answer on that bit if needed). It is also not simply changing IP to another region, because usually you need to promote in the other region the database (cluster) as a designated node to allow write operations.

For the databases (DynamoDB, Aurora) you mentioned there is though a solution: Use Global Tables.

A simpler solution could be - depending on your application - to use DynamoDB Global Tables (see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.html). However, clearly DynamoDB is not a relational database so it may not fit all cases. Nevertheless, DynamoDB works generally very good with Lambda and is also easier for cross-region replication. Note: if you encrypt your data using AWS KMS CMK (recommended) then you need to have this key also available in all regions where you plan to use Global tables (see https://docs.aws.amazon.com/kms/latest/developerguide/multi-region-keys-overview.html).

Another solution could be AWS RDS Aurora Global Tables (https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html) - those are available in multiple region and failover is thus easier (cf. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database-disaster-recovery.html).

In the Aurora case you have to detect a region failure yourself (e.g. you could have a lambda in both regions that regularly tries to connect to the current active cluster for writing) and automatically promote the cluster in the new region as primary if it is not available in the original region.

Do not forget: You need to regularly test the failover otherwise it is almost ensured that it will not work when you need it.

Generally having databases cross-regions implies transfer costs and additional resource costs compared to a single region - not only during failover, but all the time data is written.

CodePudding user response：

With this configuration, I recommend failing over the entire stack (to another Region), rather than failing over individual tiers (components) of the architecture. (This is what you seem to be saying in your question, but just making sure we are on the same page).

Your question comes down to how to configure the health check, and specifically how to implement shallow versus deep (checking dependencies like RDS) health checks.

There is an AWS Well-Architected lab that covers these concepts Implementing Health Checks and Managing Dependencies to improve Reliability.