Reliability vs. Resiliency Design Strategies For Microservices

Kanika Modi
5 min read · Jan 31, 2023

What is Reliability vs. Resiliency?

Reliability, resiliency, and recoverability are words often used interchangeably; however, when designing software systems it is important to understand how they relate to and differ from each other.

Reliability is the ability of a service to remain healthy under normal conditions, while resilience is the ability of a service to mitigate, survive, and recover quickly from high-impact disruptions and to remain functional from the customer's perspective.

  • Reliability is the outcome and resilience is the way you achieve the outcome.
  • A service may be considered reliable simply because no component has ever failed, but it may not be considered resilient because the reliability-enhancing qualities may have never been put to the test.

The key takeaway here is that systems must be resilient and reliable at every stage of the software development lifecycle.

Strategies To Design Reliable Microservices

Reliability is “the probability of failure-free software operation for a specified period of time in a specified environment,” according to the IEEE Reliability Society.

Architect For Redundancy; Avoid Single Points Of Failure

The Single Responsibility Principle should be applied to the entire service, and logically unrelated system components should be isolated behind well-defined boundaries. Design so that scaling out with more compute and storage nodes requires minimal incremental work. To find potential issues in component interactions, use the DIAL framework (Discovery, Incorrectness, Authorization/Authentication, Limits/Latency).

Limit Queue-based Solutions

Avoid creating queue-based solutions that risk spiraling out of control. When queued items take longer to process than the latency SLA allows, discard requests with an error. Beware of design patterns that, from a high-level perspective, mimic a queue.
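
As a concrete illustration, here is a minimal sketch of a bounded queue that rejects work once the estimated wait would exceed the latency SLA. The SLA, the per-item processing time, and the Overloaded exception are illustrative assumptions, not values from this article.

```python
import queue

# Illustrative numbers; real values come from your latency SLA and the
# measured per-item processing time.
LATENCY_SLA_MS = 500
AVG_PROCESSING_MS = 50
MAX_DEPTH = LATENCY_SLA_MS // AVG_PROCESSING_MS  # beyond this depth the SLA is already blown

work_queue = queue.Queue(maxsize=MAX_DEPTH)

class Overloaded(Exception):
    """Raised when accepting more work would violate the latency SLA."""

def enqueue(request):
    # Estimate how long a new item would wait before being processed.
    estimated_wait_ms = work_queue.qsize() * AVG_PROCESSING_MS
    if estimated_wait_ms >= LATENCY_SLA_MS:
        # Fail fast instead of letting the backlog spiral out of control.
        raise Overloaded("estimated queue wait exceeds latency SLA")
    try:
        work_queue.put_nowait(request)
    except queue.Full:
        raise Overloaded("queue is full")
```

The point is not the exact formula but the behavior: once the backlog can no longer be served within the SLA, new requests are rejected cheaply instead of being queued.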

Automate Scaling Solutions & Deployment Risks

Autoscaling lets the service respond to load changes more quickly, whereas manual provisioning incurs operational overhead and is prone to operator error. Reduce the blast radius of any infrastructure or configuration issue by deploying the service across multiple regions and multiple Availability Zones (AZs). Automate rollbacks, validations, and deployments, because every deployment carries the risk of introducing new issues, uncovering hidden bugs, causing unforeseen hardware usage, and so on.
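
One way to picture automated rollback is a bake period that watches a key metric after each deployment and reverts on regression. This is only a sketch of the idea; deploy, rollback, and error_rate are hypothetical stand-ins for your deployment pipeline and monitoring system, and the thresholds are illustrative.

```python
import time

def deploy(version: str) -> None:
    print(f"deploying {version}")           # replace with your pipeline call

def rollback(to_version: str) -> None:
    print(f"rolling back to {to_version}")  # replace with your pipeline call

def error_rate() -> float:
    return 0.0                              # replace with a real metric query

def deploy_with_auto_rollback(new_version: str, previous_version: str,
                              bake_minutes: int = 10,
                              max_error_rate: float = 0.01) -> bool:
    """Deploy, watch a key metric during a bake period, and roll back on regression."""
    deploy(new_version)
    deadline = time.time() + bake_minutes * 60
    while time.time() < deadline:
        if error_rate() > max_error_rate:
            rollback(previous_version)   # automated, no human in the loop
            return False
        time.sleep(30)                   # poll the metric periodically
    return True
```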

Clear Known OE/Design Backlog

Most problems could have been prevented if known risks had been addressed promptly. Track down and fix operational excellence (OE) and design issues, as well as the root causes of health reboots. Every service should have comprehensive monitors, alarms, and dashboards. Conduct annual operational readiness reviews.

Periodic Log Cleanup

Fix the underlying issues that cause exceptions and warnings to appear frequently in logs. Review disk utilization trends and other logging patterns. Avoid excessive logging on failure paths.

Strategies To Design Resilient Microservices

It is impossible to prevent every failure. It is, however, possible to design a system that can mitigate and recover from failures at every stage of development.

Automate Recovery Strategy

Applications must be able to isolate partition failures, identify unhealthy partitions, and transfer workloads from unhealthy to healthy partitions in a secure and reliable manner. Use managed services such as S3, Lambda, and DynamoDB that handle much of this resilience on the client's behalf. For recovery, databases and logs must be backed up.
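
A minimal sketch of automated workload shifting is shown below. The partition names and the is_healthy/shift_traffic helpers are hypothetical placeholders for your service discovery and routing layer, not a real API.

```python
def is_healthy(partition: str) -> bool:
    return True                                # replace with a real health check

def shift_traffic(source: str, targets: list[str]) -> None:
    print(f"shifting {source} -> {targets}")   # replace with your routing layer

def rebalance(partitions: list[str]) -> None:
    healthy = [p for p in partitions if is_healthy(p)]
    unhealthy = [p for p in partitions if p not in healthy]
    if not healthy:
        # Nothing safe to shift to; escalate rather than make things worse.
        raise RuntimeError("no healthy partitions available")
    for partition in unhealthy:
        # Drain the unhealthy partition and spread its workload over healthy ones.
        shift_traffic(partition, healthy)
```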

Degrade Gracefully

Instead of a total shutdown, plan for graceful degradation so the service addresses the problem and offers a reduced client experience or temporary noncompliance. For example: 1) add a cache layer in front of downstream calls so that cached objects can still be served even when the downstream system is down; to keep the service running during a prolonged outage, it can be acceptable to serve old data with a warning to the consumer; 2) lengthen the cache TTL and stop calling failing dependencies. The sketch below illustrates this stale-cache fallback.
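
Here is one way that fallback could look, assuming an in-process cache and a placeholder fetch_from_downstream call; the TTL values are illustrative, not prescriptive.

```python
import time

FRESH_TTL_SECONDS = 60          # normal cache TTL
STALE_TTL_SECONDS = 60 * 60     # extended TTL used only while the dependency is failing

_cache: dict[str, tuple[float, object]] = {}   # key -> (stored_at, value)

def fetch_from_downstream(key: str) -> object:
    raise ConnectionError("downstream unavailable")   # placeholder dependency call

def get(key: str) -> object:
    now = time.time()
    cached = _cache.get(key)
    if cached and now - cached[0] < FRESH_TTL_SECONDS:
        return cached[1]                               # fresh cache hit
    try:
        value = fetch_from_downstream(key)
        _cache[key] = (now, value)
        return value
    except Exception:
        # Degrade gracefully: serve stale data (ideally flagged to the consumer)
        # rather than failing outright, as long as it is not too old.
        if cached and now - cached[0] < STALE_TTL_SECONDS:
            return cached[1]
        raise
```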

Infrastructure Resilience

Ensure that servers, load balancers, and databases are resilient to failures by using scaling configuration, backups, retention policies, limited user rights, avoiding console operations for updates, and managing infrastructure as code. Separating read, limited-write, and admin permissions is necessary to avoid errors such as fat-finger mistakes causing outages.

Prevent Retry Storms

Define a limited, rate-based retry budget by aggregating retries at the process, host, or service level. Once the budget is exceeded, fail quickly and cheaply with an error instead of retrying. The sketch below shows one way to implement such a budget.
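
This is a minimal, process-level sketch of a token-style retry budget: normal traffic earns retry tokens, and retries spend them. The class name, ratio, and token counts are illustrative assumptions.

```python
import threading

class RetryBudget:
    """Allow retries only in proportion to overall traffic (e.g. ~10%)."""

    def __init__(self, retry_ratio: float = 0.1, max_tokens: float = 100.0):
        self.retry_ratio = retry_ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens
        self._lock = threading.Lock()

    def record_call(self) -> None:
        # Every first-attempt call earns back a fraction of a retry token.
        with self._lock:
            self.tokens = min(self.max_tokens, self.tokens + self.retry_ratio)

    def try_acquire_retry(self) -> bool:
        # A retry spends a whole token; once the budget is gone, fail fast.
        with self._lock:
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False
```

A caller would invoke record_call() on every request and only retry when try_acquire_retry() returns True; otherwise it surfaces the error immediately, which prevents retries from amplifying an outage.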

Set “Low” Timeouts On Dependency Calls

Set a service-wide SLA that defines a response-time budget and accounts for dependency timeouts. The deeper your service sits in the call graph, the lower its timeout values should be. Limit retries and avoid long exponential backoff that delays recovery.
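
One way to enforce a response-time budget is to compute a deadline once and give each downstream call only the time that remains. The sketch below assumes the requests HTTP client; the SLA value and URLs are illustrative.

```python
import time
import requests   # assumed HTTP client; any client that accepts a timeout works

SERVICE_SLA_SECONDS = 1.0   # illustrative service-wide response-time budget

def call_dependencies(urls: list[str]) -> list[dict]:
    """Call dependencies sequentially, giving each only the time left in the budget."""
    deadline = time.monotonic() + SERVICE_SLA_SECONDS
    results = []
    for url in urls:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("response-time budget exhausted")
        # Each downstream call gets whatever budget remains, never more.
        response = requests.get(url, timeout=remaining)
        results.append(response.json())
    return results
```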

Have Levers To Limit Service Functionality

Establish levers to disable functionality and restrict load to dependencies. When these levers are engaged, serve responses built from fewer dependencies and avoid calling degraded ones. Keep it simple and avoid bimodal behavior (acting one way when dependencies are healthy and another when they are not).
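
A kill switch for an optional dependency is one such lever. In this sketch the flag store is an in-memory dict and load_product/load_recommendations are hypothetical placeholders; in practice the flag would live in a dynamic configuration service that operators can flip at runtime.

```python
FLAGS = {"call_recommendations": True}   # hypothetical lever for an optional dependency

def load_product(product_id: str) -> dict:
    return {"id": product_id}            # placeholder for the core data source

def load_recommendations(product_id: str) -> list:
    return ["rec-1", "rec-2"]            # placeholder for the optional dependency

def get_product_page(product_id: str) -> dict:
    page = {"product": load_product(product_id)}   # core functionality always served
    if FLAGS["call_recommendations"]:
        page["recommendations"] = load_recommendations(product_id)
    else:
        # Lever engaged: skip the degraded dependency and return a simple,
        # well-tested fallback instead.
        page["recommendations"] = []
    return page
```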

Routinely Test For/With Failure: Chaos/Stress/Soak Tests

  • Soak tests: long-duration tests that uncover resource leaks.
  • Stress tests: load tests with enough load to cause failure.
  • Chaos tests: tests that inject failures from dependencies.

These tests identify malfunctions and show how the system reacts to them, for example, shedding load or throttling once a high call rate is detected.
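
As a toy illustration of chaos injection in a test environment, a dependency call can be wrapped so that some fraction of calls fail or slow down; the wrapper name and rates below are made up for the example.

```python
import random
import time

def chaotic(func, failure_rate: float = 0.2, extra_latency_seconds: float = 2.0):
    """Wrap a dependency call so that, in tests, some calls fail or slow down."""
    def wrapper(*args, **kwargs):
        roll = random.random()
        if roll < failure_rate:
            raise ConnectionError("injected dependency failure")
        if roll < failure_rate * 2:
            time.sleep(extra_latency_seconds)   # injected latency
        return func(*args, **kwargs)
    return wrapper
```

Wrapping a client call this way in a test harness lets you observe whether the service sheds load, stays within its retry budget, or degrades gracefully when the dependency misbehaves.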

What Defines Success?

A full measure of resilience requires a) static analysis of the system, which verifies architectural characteristics like redundancy, caching, timeouts, and retries, and b) dynamic analysis of the system, which assesses its capacity to maintain resilient performance using standard techniques such as stress and chaos testing.

Some measures of success can be:

  • Google’s Four Golden Signals — latency, traffic, errors, and saturation for measuring infrastructure reliability.
  • Recovery Time Objective (RTO) — the duration for a service to return to normal operating levels after a failure.
  • Recovery Point Objective (RPO) — the tolerance to data loss in terms of time duration.
  • Availability incidents/risks related to resiliency or scaling issues — this metric should decrease over time.
  • Resilience Score — Create an internal metric tracking a system’s adoption of resilience best practices.

Reliable microservices should be designed with evidence-based resilience and the quickest fully automated recovery!

Thank you for reading! If you found this helpful, here are some next steps you can take:

  1. This blog is part of my System Design Series. Check out the other blogs of the series!
  2. Send some claps my way! 👏
  3. Follow me on Medium, Connect on LinkedIn & Subscribe below to get a notification whenever I publish! 📨
