Canary Release Patterns for APIs, Applications, and Services with Ambassador

Daniel Bryant
Published in Ambassador Labs · 10 min read · Apr 4, 2019


With an increasing number of organizations adopting the Ambassador Edge Stack API Gateway as their main ingress solution on Kubernetes, we are starting to see a series of patterns emerge around continuous delivery and the testing and releasing of new functionality. Canary releasing is one of the most popular, so in this article I will outline a series of related sub-patterns, implementation details, and things to watch out for with this approach.

Canary Releasing Prerequisites

The canary releasing pattern has been made popular by the cloud native vanguard, such as Netflix, Amazon, and Google. Obviously, not every organization operates with this much experience or at this scale. In my experience, canary releasing is a genuinely useful pattern, but there are several prerequisites:

  • Basic automated (continuous) delivery pipelines: canary releasing relies on deploying and running multiple versions of a service, which is challenging to do (and not much fun) without some form of automation. Implementing a delivery pipeline that performs basic continuous integration and quality assurance on a service will also reduce the number of issues you see within your canary experiments.
  • Basic observability: to decide whether a canary release is causing problems, you will need basic monitoring (observability) that surfaces the golden signals (latency, traffic, errors, and saturation). For example, you don’t want to continue migrating traffic to a canary service that is causing CPU or I/O to spike above tolerable levels.
  • A codified definition of business success: similar to basic observability, you will need to identify metrics that indicate whether a business hypothesis or goal is being met. These are often referred to as key performance indicators (KPIs). For example, when rolling out new functionality you will typically have goals in mind, such as increasing customer spending or lowering user complaints. Even if you aren’t experimenting with a specific business goal (for example, you are simply releasing code with architectural improvements), you will still want to monitor for negative impact on your metrics of success and roll back the release if this occurs (a minimal sketch of codifying such a check follows this list).
  • A sufficiently large volume of representative user requests: although the likes of Netflix receive millions of user requests each day, many of us are not building systems at this scale, and it may not be easy to push enough traffic through a canary release to validate it. In addition to a sufficiently large traffic volume, you will also have to ensure that your canary sees diverse and representative types of user requests in order to exercise all of the relevant execution paths. For example, if you are canary testing a feature that each customer uses at most once a year, you will need a very large number of users to generate a statistically significant success rate.
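
As a flavor of what a codified success (or failure) definition can look like, the sketch below shows a minimal Prometheus-style alerting rule that fires when a canary’s 5XX error rate crosses a threshold. It is illustrative only: the http_requests_total metric, the service label value, and the 1% threshold are placeholder assumptions that will differ depending on your own monitoring stack.

groups:
- name: canary-golden-signals
  rules:
  - alert: CanaryHighErrorRate
    # Fire if more than 1% of requests handled by the canary return a 5xx
    # over a five-minute window (metric name, labels, and threshold are illustrative).
    expr: |
      sum(rate(http_requests_total{service="checkout-canary", status=~"5.."}[5m]))
        /
      sum(rate(http_requests_total{service="checkout-canary"}[5m])) > 0.01
    for: 2m
    labels:
      severity: page
    annotations:
      summary: Canary 5xx error rate is above 1%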

I’m sure there might be a bit of nuance or disagreement with these prerequisites, so please do reach out to me if you have any comments.

Canary Releasing Patterns

I’ve written about the fundamentals of canary releasing in several other blog posts, but I often get asked specific questions via the Ambassador Labs Slack community about how to implement canary releases with Ambassador. The general approach follows a path from manual canaries, to semi-automated canaries, to fully automated canaries (often by way of shadow canarying).

So, assuming you have the prerequisites mentioned above, let’s examine the typical evolution of canary releasing with Ambassador in more detail.

Step 1: Manual Canary

This is typically the easiest approach to canary releasing and often the best place to begin your journey. When adopting canaries, the key mindset change is to separate the deployment and the release of new functionality within a service. You may also want to split your deployment (build or delivery time) and release (runtime) configuration files and scripts into separate version control repositories. This leads to a clearer separation of concerns and potentially a cleaner VCS history.

Assuming that you use Kubernetes as your deployment platform of choice, the implementation is as simple as deploying a new “canary” service alongside your existing service and initially routing no traffic to it. At this point in the canary process you have not “released” the canary; you have only modified your deployment configuration.

It’s also worth mentioning that you don’t necessarily need to perform a rolling update of the service via a Kubernetes Deployment at this stage, as this is a completely new service being deployed. You will, however, need to name and label this service appropriately, for example, appending a version number or UUID to the service name to make it unique.
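
As a rough sketch of what this could look like (assuming a simple Deployment-plus-Service setup; the labels, replica count, image tag, and ports below are illustrative), the canary sits alongside the existing service with no Ambassador Mapping pointing at it yet:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shopping-basket-v0023
spec:
  replicas: 2
  selector:
    matchLabels:
      app: shopping-basket
      version: v0023
  template:
    metadata:
      labels:
        app: shopping-basket
        version: v0023
    spec:
      containers:
      - name: shopping-basket
        image: registry.example.com/shopping-basket:0.23   # illustrative image tag
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: shopping-basket-v0023
spec:
  selector:
    app: shopping-basket
    version: v0023
  ports:
  - port: 80
    targetPort: 8080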

At this point in the process, you add a new Mapping for the canary service to your Ambassador release configuration (Kubernetes YAML), along with a traffic shifting “weight” property that routes the specified percentage of traffic away from the original service to the canary (note that the prefix target URL path is identical for the canary and the existing service):

---
apiVersion: ambassador/v1
kind: Mapping
name: shopping-basket-canary
prefix: /basket/
service: shopping-basket-v0023
weight: 10
---
apiVersion: ambassador/v1
kind: Mapping
name: shopping-basket
prefix: /basket/
service: shopping-basket-v0010

You then apply the updated YAML to your cluster via kubectl, or ideally have this triggered by a delivery pipeline for your release configuration. The emerging best practice of automating all aspects of cluster configuration in this way, using a version control repository as the single source of truth, is referred to as “GitOps”.

As the specified percentage of user traffic is routed through the canary, you then begin watching your service dashboards and observing top-line metrics (e.g., the number of requests being routed to the canary, request handling latency, 5XX errors, resource usage, etc.) and related business KPIs (e.g., whether the new shopping basket improves checkout conversion). If the metrics look favorable (including the volume and type of requests being sent to the canary), then you increase the traffic shifting weight in the canary service Mapping and iterate on this process.

If the metrics show a negative impact, then you can change the traffic shifting weight to 0 and investigate the issue. You can then either begin the experiment again or potentially delete the canary service to deploy an improved version later.
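
For example, a favorable iteration might simply bump the weight in the canary Mapping (the value below is illustrative), while a rollback sets it back to 0:

---
apiVersion: ambassador/v1
kind: Mapping
name: shopping-basket-canary
prefix: /basket/
service: shopping-basket-v0023
weight: 50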

It’s worth mentioning that you will need to account for the time lag between incrementing a shift in traffic and being able to observe the effects in your metrics. For example, if you only scrape metrics from your services every 30 seconds, there could be a 30-second window in which your new release is causing an issue that you won’t know about unless you have more fine-grained alerting in place. You will also typically want to keep an eye on other (supposedly unrelated) alerts that you have configured, as it’s not unknown for a subtle change in a canary service to cause an unintended issue in an upstream or downstream service.

Once you are routing 100% of the traffic to the canary service and are happy with the result, you can delete the original service (and its Mapping) in Kubernetes. You can then change the canary Mapping name, removing the word “canary” and switching to the appropriate “production” name. (The alternative is to use a “blue/green” style naming convention, which doesn’t require the Mapping name to be changed upon a successful release):

---
apiVersion: ambassador/v1
kind: Mapping
name: shopping-basket
prefix: /basket/
service: shopping-basket-v0023

You will also need to make sure that all of these configuration tidy-up actions are synchronized between your VCS and the state of the cluster before you begin further experiments or releases.

The canary release cycle then begins again with the testing of the next version of the service.

Step 2: Semi-automated Canary

The next evolutionary step in canary releasing is to automate the shifting of traffic to a new canary candidate from 0 to 100 percent. This can be implemented at a crude level with bash scripts, or ideally triggered via updated configuration files that are then actioned by a continuous delivery pipeline (for example, using Jenkins pipeline-as-code scripting). It can be as simple as incrementing the percentage of traffic shifted within the Ambassador Mapping at a specified time interval, e.g., every 60 seconds. You will still have to monitor the associated top-line and business metrics manually, and abort the release if you identify anything negative.

Some organizations also incorporate a simple automated canary rollback into the script. You can implement something similar, provided that you can programmatically access and test the new functionality or the associated metrics.

I’ve included a rough pseudo-script below, which provides a (very raw) template for how this can be implemented:

#!/usr/bin/env bash
# Shift traffic to the canary Mapping in 10% increments, pausing at each step.
for weight in $(seq 10 10 100); do
  # Apply the updated canary weight (surrounding release config elided)
  cat <<EOF | kubectl apply -f -
...
---
apiVersion: ambassador/v1
kind: Mapping
name: shopping-basket-canary
prefix: /basket/
service: shopping-basket-v0023
weight: ${weight}
...
EOF
  # Record the release config change in version control (GitOps audit trail)
  git commit -m "Increment canary of service X to ${weight}"
  git push
  # Wait for metrics to accumulate before the next increment
  sleep 60
  # Optional: curl a check against the canary endpoint here and abort on failure
done

When automating the rollout of a canary, you will need to ensure that enough user requests have actually been shifted to the canary release. This can be a particular issue with time-based traffic shifting increments, where an unexpected reduction in traffic can mean you do not gather enough data to spot a problem (a false negative). For example, if your user traffic drops off greatly at off-peak hours, then releasing a canary during this period could mean that it does not see a representative load, even with a high percentage of traffic being shifted.
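
One way to guard against this, assuming you already export per-service request counts to a Prometheus-style monitoring system, is to codify a minimum-volume check and hold promotion while it fires. Again, the metric name, label, threshold, and window below are placeholders for whatever your own stack exposes:

groups:
- name: canary-traffic-volume
  rules:
  - alert: CanaryTrafficTooLowToJudge
    # Hold promotion if the canary has handled fewer than 1000 requests over
    # the last 30 minutes (metric name and threshold are illustrative).
    expr: sum(increase(http_requests_total{service="shopping-basket-v0023"}[30m])) < 1000
    labels:
      severity: warning
    annotations:
      summary: Canary has not received enough traffic to support a promotion decision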

Step 3: The Fully Automated “Robo” Canary

The ultimate goal that many organizations are aiming for is the fully-automated canary release, where developers simply flag an updated service as ready for deployment, define associated operational and KPI metric thresholds, and then rely on the deployment pipeline to autonomously manage the entire canary release/rollback lifecycle. This is what the Netflix team have talked about with Spinnaker and Kayenta, and also what the Weaveworks team are working on with Flagger.
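
To give a flavor of the declarative style these tools aim for, below is a rough sketch of a Flagger Canary resource. Treat it as an approximation rather than a copy-paste example: the exact apiVersion, field names, and built-in metrics vary across Flagger versions and ingress/service mesh integrations, so check the Flagger documentation for the schema that matches your setup.

---
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: shopping-basket
spec:
  # The workload whose new versions should be canaried automatically
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shopping-basket
  service:
    port: 80
  analysis:
    interval: 1m      # how often to evaluate the canary
    threshold: 5      # roll back after this many failed checks
    maxWeight: 50     # maximum percentage of traffic shifted to the canary
    stepWeight: 10    # traffic increment per successful check
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99       # minimum percentage of successful requests
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500      # maximum request duration in milliseconds
      interval: 1m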

While the fundamentals of this approach are not much different from the semi-automated canary mentioned above, a significant amount of extra sophistication is required. This is typically focused on automated metrics collection and analysis, and “outlier detection”. Humans are typically very good at eyeballing a series of dashboards and deciding whether a new canary is performing positively or negatively. However, outside of the obvious cases, such as a service crashing or a massive increase in latency, it can be challenging for a computer to determine whether a canary rollout should be continued or halted.

Another often underestimated problem is identifying the traffic and data thresholds that are sufficient to make a decision. During a canary release, every tick of the clock provides incrementally more information about the canary. At some point, though, enough data has been gathered and a decision should be made. How exactly do you identify that point?

We’ve encountered very few organizations that have successfully implemented fully automated canary testing in the real world, and I see several barriers to widespread adoption. The first is the lack of general adoption of sophisticated traffic shaping systems such as Ambassador, although the prevalence of this issue is rapidly decreasing as organizations adopt more cloud-native infrastructure.

The second is the actual metrics analysis technology, which tends to be domain-specific, and is typically a challenging engineering problem to implement successfully. The final barrier we see is cultural — getting organizations to accept the notion of testing in production and trusting automated systems to get it right.

Alternative Patterns

Not every organization follows the evolution described above. For example, some opt to begin their journey with the semi-automated canary. Others encounter different challenges and may instead adopt some of these alternative patterns.

Not Comfortable Testing In Production? Use a Shadow Canary

Several of the organizations we have been working with are not yet fully comfortable with testing against production traffic, or can’t do this for regulatory or compliance reasons. An alternative pattern that still provides some of the benefits is the “shadow canary”. With this approach you deploy your canary service either into production or into a staging environment, but you shadow or “mirror” a percentage of production traffic to the canary. Although you observe the canary, you do not return any of its results to the end user, and all of the related side effects are enacted on non-production systems, mocked, or implemented as no-ops.
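
Ambassador supports this style of traffic mirroring directly via the shadow attribute on a Mapping. A minimal sketch, reusing the shopping-basket example from earlier, might look like the following; responses from the shadowed service are discarded rather than returned to the end user, and depending on your Ambassador version you may also be able to combine this with a weight to mirror only a percentage of requests:

---
apiVersion: ambassador/v1
kind: Mapping
name: shopping-basket-shadow
prefix: /basket/
service: shopping-basket-v0023
shadow: true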

With this pattern, 100% of the production user requests are served by the original service running in production at all times. This often satisfies concerns around testing in production, but in some cases, you may also have to anonymize data within user requests before it can be sent to a non-production environment. Ambassador supports the addition of custom “filters” to the request handling process, which can be used to implement data cleansing and transformation.

Upon a successful shadow canary release test, you can then deploy and release the newly canaried service using your existing processes.

Advanced Shadowing: Tap Compare

You can implement a shadow canary in a manual, semi-automated, or fully automated manner. The fully automated approach enables additional techniques such as “tap compare” testing, where you verify that the response of a new canary candidate does not deviate inappropriately from that of the existing version (often by also routing to a secondary instance of the existing service in order to identify the “non-deterministic noise” in responses, such as UUIDs and dates).

Diagram showing how Diffy operates (taken from Twitter Engineering Blog)

This approach was popularised by Twitter and their Diffy tool, and there are also additional implementations such as Diferencia.

Limited by Traffic Volume? Use Synthetic Canarying

It is not uncommon for organizations to have important services that receive only a small number of user requests overall, or only a small number of a specific type of request, which means canary releasing isn’t an effective testing strategy on its own. An alternative pattern that helps in this situation is the “synthetic canary”, which relies on automatically generating synthetic user requests, or “synthetic transactions”, that are sent to the canary.

This pattern is often combined with the “shadow canary” pattern discussed above. You can deploy a canary release into production or a staging environment and only allow access to it for requests with a specific header set or with specific authentication/authorization credentials, i.e., no production traffic is sent to the canary. You can then run a series of tests, synthetic transactions, or a load test targeting the canary release and observe the results. This can be a manual process, or you can semi- or fully automate it.
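
With Ambassador, restricting the canary to synthetic traffic can be implemented with header-based routing on the Mapping. In the sketch below, only requests carrying the (hypothetical) x-synthetic-canary header are routed to the canary service, while ordinary production requests continue to match the existing Mapping for the same prefix:

---
apiVersion: ambassador/v1
kind: Mapping
name: shopping-basket-synthetic
prefix: /basket/
service: shopping-basket-v0023
headers:
  x-synthetic-canary: "true"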

Conclusion

This article has presented a whistle-stop tour of how I see organizations implementing the canary releasing of functionality with the Ambassador API gateway. There is no single “right way” to begin the journey to testing functionality with real user traffic, but there are definitely several gotchas to watch out for, such as the prerequisites mentioned above. The Ambassador Labs team is working on related solutions with several Ambassador users. Get in touch if you want to share any experiences or comments about implementing automated canary releasing.

As usual, you can also ask any questions you may have via Twitter @ambassadorlabs, Slack or via GitHub.
