There are two answers to your question as AWS ElastiCache can be used in different ways:
- Using just the master node
- Using the master and replicas
Explanation
AWS ElastiCache (non-clustered) comes with its own failover mechanism that does not notify your application when a failover happens. Whether this is good or bad depends on how you use it:
Master-only use
If you want to rely on failover and you don't want to use your replicas for additional reads, then master-only use is the way to go. For master-only use, you point your client to the primary endpoint. If ElastiCache fails over, the client connection is reset. AWS updates the primary endpoint behind the scenes, and once the client successfully reconnects, you're talking to the (new) master node again.
Why is it not possible to use replicas in this scenario?
The only topology source is the AWS ElastiCache node itself. Lettuce does not connect to AWS's API (and this won't ever happen). Redis exposes connected replicas in the INFO REPLICATION section, but the ElastiCache Redis node reports replica IP addresses that are not reachable, hence it's not possible to connect to these nodes via topology discovery.
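To make concrete what topology discovery has to work with, here is a minimal sketch that extracts replica addresses from the text of an INFO REPLICATION reply. It is not Lettuce's implementation: the class name ReplicationInfoParser and the sample addresses are made up for illustration. The only assumption about Redis is the usual `slaveN:ip=...,port=...` line format of that section.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lettuce.
final class ReplicationInfoParser {

    /** Host and port of one replica as reported by the master. */
    static final class ReplicaAddress {
        final String host;
        final int port;

        ReplicaAddress(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    /** Collects the ip/port pairs from all "slaveN:" lines of an INFO REPLICATION reply. */
    static List<ReplicaAddress> parseReplicas(String info) {
        List<ReplicaAddress> replicas = new ArrayList<>();
        for (String rawLine : info.split("\\r?\\n")) {
            String line = rawLine.trim();
            // Replica entries look like: slave0:ip=10.0.0.12,port=6379,state=online,...
            if (!line.matches("slave\\d+:.*")) {
                continue;
            }
            Map<String, String> fields = new HashMap<>();
            for (String pair : line.substring(line.indexOf(':') + 1).split(",")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    fields.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
            if (fields.containsKey("ip") && fields.containsKey("port")) {
                replicas.add(new ReplicaAddress(fields.get("ip"),
                        Integer.parseInt(fields.get("port"))));
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        // Sample reply with made-up private addresses.
        String info = "# Replication\r\nrole:master\r\nconnected_slaves:1\r\n"
                + "slave0:ip=10.0.0.12,port=6379,state=online,offset=1234,lag=0\r\n";
        System.out.println(parseReplicas(info));
    }
}
```

On ElastiCache, the addresses recovered this way are the ones the node reports internally, which is exactly why a client cannot use them to connect.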
Using Master and Replicas
Although it's not possible to deduce the replica endpoints from an ElastiCache server, it's still possible to provide static endpoints. Lettuce connects to all nodes and determines the node roles on startup. This again allows routing according to the node role. If a failover happens (as in your case), Lettuce does not get notified about the failover and sticks to the initial topology.
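A static setup could look roughly like the following sketch. It assumes Lettuce 5.2 or newer on the classpath (where the API is named MasterReplica) and reachable nodes; the two endpoint host names are placeholders, not real ElastiCache endpoints.

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterreplica.MasterReplica;
import io.lettuce.core.masterreplica.StatefulRedisMasterReplicaConnection;

import java.util.Arrays;
import java.util.List;

final class StaticTopologyExample {

    public static void main(String[] args) {
        // Placeholder endpoints: replace with the individual node endpoints
        // of your replication group.
        List<RedisURI> nodes = Arrays.asList(
                RedisURI.create("redis://node-1.example.invalid:6379"),
                RedisURI.create("redis://node-2.example.invalid:6379"));

        RedisClient client = RedisClient.create();

        // Lettuce connects to every listed node and determines its role once, at startup.
        StatefulRedisMasterReplicaConnection<String, String> connection =
                MasterReplica.connect(client, StringCodec.UTF8, nodes);

        // Route reads to replicas when one is available, otherwise to the master.
        connection.setReadFrom(ReadFrom.REPLICA_PREFERRED);

        connection.sync().set("key", "value");          // writes go to the master
        String value = connection.sync().get("key");    // reads may hit a replica and lag behind
        System.out.println(value);

        connection.close();
        client.shutdown();
    }
}
```

Because the roles are only determined when the connection is created, a connection built this way keeps routing by the old roles after an ElastiCache failover until it is re-created.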
Failover Notifications
Failover notifications are the missing bit. While Redis Sentinel provides notifications that indicate a promotion/role change, there's no such mechanism for 'just' Master/Replica. You could say: OK, let's use a disconnect as a signal to trigger a topology update. That might work in some cases, but in many more cases (network partition between the application and the Redis nodes, connection timeouts) it would trigger updates needlessly. A periodic topology update is also just an attempt to discover changes.
The Third answer
I'm not happy with the AWS ElastiCache implementation. It works OK for Master-only use, but as soon as you want to use replicas, you're relying on a proprietary implementation of failover. Without AWS failover (i.e. in your own data center/Redis setup), you would be notified by some Ops people that Redis is down. They would either restart the Redis node or restart the application to restore operations. These signals are missing.
In the meantime, AWS provides Redis Cluster, which might be the better HA/failover setup, but Redis Cluster comes with severe limitations for applications. It might also be possible to poll the AWS ElastiCache API to discover the topology from the API side of things and then kick off a topology update (reconnect).
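The polling idea can be sketched without committing to any particular AWS client. Everything below is hypothetical: TopologyPoller is not a Lettuce or AWS class, the topology source stands in for whatever ElastiCache API call you would use to map node endpoints to roles, and the change callback is where you would re-create the Lettuce connection.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper: compares successive topology snapshots and reports changes.
final class TopologyPoller {

    // Assumed to wrap an ElastiCache API lookup returning endpoint -> role.
    private final Supplier<Map<String, String>> topologySource;
    // Assumed to reconnect the Redis client with the new topology.
    private final Consumer<Map<String, String>> onChange;
    private Map<String, String> lastSeen;

    TopologyPoller(Supplier<Map<String, String>> topologySource,
                   Consumer<Map<String, String>> onChange,
                   Map<String, String> initialTopology) {
        this.topologySource = topologySource;
        this.onChange = onChange;
        this.lastSeen = new HashMap<>(initialTopology);
    }

    /** Polls once; returns true if the topology changed and the callback was invoked. */
    synchronized boolean pollOnce() {
        Map<String, String> current;
        try {
            current = new HashMap<>(topologySource.get());
        } catch (RuntimeException e) {
            // A failed lookup is treated as "no change" to avoid needless reconnects.
            return false;
        }
        if (current.equals(lastSeen)) {
            return false;
        }
        lastSeen = current;
        onChange.accept(new HashMap<>(current));
        return true;
    }
}
```

To run it periodically, `pollOnce` could be handed to a `ScheduledExecutorService` via `scheduleWithFixedDelay`. The poll interval bounds how long the client keeps using a stale topology after a failover, so this remains an attempt to discover changes rather than a real notification.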
Lettuce's Master/Replica API for static topology use exists to provide at least some way to work with replicas. Everything else derives from this experience. Contributions in any form (experience, suggestions, documentation, code) are welcome.
Update: Aligned replica wording according to antirez/redis#5335