How to Safely Ship Changes to Production


We all know the drill, right? After we finish working on a code change, and [hopefully] testing it on our local machine, we push the change to the next stage in the cycle. Local testing is inherently biased, and ideally we’d like to validate the change in a more stable environment, with perspectives beyond that of the engineer who implemented the change.

A natural next step: push the changes to a reliable staging environment and have partners (QAs, PMs, other engineers) help with the validation before moving the changes forward. This would be followed by bug fixing and re-validation until we believe it’s good enough to push to production. Great!

In most contexts, however, this simply doesn’t happen. The reasons vary, but the consequence is the same: we often have to promote changes to production servers before they are tested or validated well enough.

The problem is… What if something breaks? And how can we detect issues earlier? Good news: it’s possible to adopt tools and practices that make testing and validating in production not only a safe practice for you and your company, but maybe even a good idea.

The baseline: Metrics

Before we jump to testing in production, we have to talk about metrics: we need them to validate that the change we’re shipping produces the desired effect, doesn’t cause unwanted side effects, and keeps the product stable. Without well-established metrics, we are basically blind when rolling out changes. We’ll refer to metrics throughout the article, so let’s take a look at two different types of metrics that we should be mindful of.

Business Metrics

Business-related metrics like KPIs, goals, and user behaviour should be monitored after implementing changes to evaluate impact. Before any change, identify metrics expected to be affected. Equally important are guardrail metrics, indicators of what shouldn't change. Unpredicted shifts in these guardrails can signify issues with the new change, necessitating a review.

Technical Metrics

Once the business metrics are defined, it’s also important to understand the technical metrics. These are fundamental to keeping our systems healthy as changes are introduced over time. Here we’re talking about system stability, error rates, traffic volume, machine capacity constraints, etc.

Good technical metrics are also useful for explaining issues observed in business metrics, or quickly finding the root cause of regressions. For example, let’s say we observe users engaging much less with a particular feature after the last version rollout. An increase in request timeouts or error rates could quickly show which services/endpoints are causing the issue.


We have business and technical metrics well-defined, good! Now, we have to monitor them. There are many ways to do it, but a common first step is to build dashboards that track metrics over time, making unusual spikes easy to spot. Even better if the dashboard allows quick filtering of data based on specific segments that may be especially relevant for the business. Actively monitoring dashboards is a good way to quickly visualise the effects a new change has introduced in the system. Some companies consider active monitoring so important that they even have 24/7 monitoring shifts to detect and address issues as early as possible.

Another good way to monitor metrics is through automatic detection and alerts. For key metrics, alerts can provide real-time notification when something looks wrong. Let’s say we start to roll out a feature, and a few minutes after the process starts we receive an alert saying the error rate is increasing above a specific threshold. This early notification can prevent us from propagating the change further in production and save us from a lot of problems!
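The error-rate alert described above can be sketched as a rolling-window threshold check. This is a minimal illustration, not a production alerting system; the class name, window size, and notification hook are all assumptions:

```python
# Minimal sketch of a threshold alert: track the last N request
# outcomes and fire when the rolling error rate crosses a limit.
# All names here (ErrorRateAlert, record) are illustrative.
from collections import deque


class ErrorRateAlert:
    def __init__(self, threshold: float, window: int = 100):
        self.threshold = threshold            # e.g. 0.05 for a 5% error rate
        self.outcomes = deque(maxlen=window)  # rolling window of results

    def record(self, success: bool) -> bool:
        """Record one request outcome; return True if the alert should fire."""
        self.outcomes.append(success)
        failures = self.outcomes.count(False)
        error_rate = failures / len(self.outcomes)
        return error_rate > self.threshold
```

In a real setup the `record` result would trigger a pager or chat notification, and the threshold would be tuned per metric to avoid the alert fatigue discussed below.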

Lastly, it’s important to be mindful of how much information we need and under what circumstances. While dashboards are very useful in providing a visual glimpse of the product and system performance, adding 1,000 different charts is going to bring more confusion than clarity. Similarly, if we receive 1,000 alerts per day, it’s impossible to investigate and act on them, and they’ll end up being ignored.

Safer landing

Metrics defined, monitoring in place, great! Now let’s take a look at some tools and strategies to help us avoid problems, detect issues earlier and minimise impacts in production. Depending on how the production environment is set up, some of these will be harder to implement than others, and maybe won’t even make much sense when combined. However, each item here could help us to move closer to a safe and stable production environment.

Automated Tests

Automated tests, often sidelined when projects fall off track, can expedite development and make changes to production safer and quicker. The earlier issues are caught, the quicker they can be fixed, thus reducing the overall time spent in the process. The process of reverting changes, fixing and pushing them again is usually very stressful and can take away precious time.

Aiming for 100% test coverage with unit, integration, and end-to-end tests may be idealistic for most projects. Instead, prioritise tests based on effort versus benefit. Metrics can guide this: covering core business features is likely more crucial than lesser-impact niche features, right? Begin with core features, expanding as the system evolves.

The publish-to-production process should include running the test suite before deploying to production. Test failures should pause publishing, preventing production issues. It's preferable to delay a feature's release than to discover it's entirely malfunctioning the next day.
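A publish gate like the one above can be as simple as running each test stage in order and blocking the deploy on the first failure. This is a hedged sketch under the assumption that stages are cheap-first callables; real pipelines would invoke a CI system instead:

```python
# Sketch of a publish-to-production gate: run each test stage in
# order (cheapest first) and pause the release on the first failure.
# Stage names and the deploy callable are hypothetical.
def publish(stages: dict, deploy) -> bool:
    """stages maps a stage name to a zero-arg callable returning bool."""
    for name, run in stages.items():
        if not run():
            print(f"{name} tests failed - release paused")
            return False
    deploy()  # all stages green: safe to promote the build
    return True
```

Ordering stages from fastest (unit) to slowest (end-to-end) gives the earliest possible failure signal, which is the whole point of the gate.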


Dogfooding

Dogfooding is the process of releasing a feature for internal testing before it reaches final users. During dogfooding, the feature is made available in production, but only to internal users (employees, team members, etc.). This way, we can test and validate whether the new feature is working as expected, using real production data, without impacting external users.

There are different strategies for dogfooding. For a simplified overview, we could group them into two bigger buckets:

  1. Full artefact dogfooding: This is common, for example, on iOS/Android apps, where we have built-in tools to release a new app version to specific users, and then make this same version available for the general public in the stores.
  2. Selective dogfooding: Sometimes, it’s not possible (or even desired) to dogfood the whole artefact, but we can still allow dogfooding based on specific user information. Let’s say, for example, we are able to identify employees by crossing some data. The application could then be configured to enable/disable a particular feature by making a check to this data and branching the user to the desired behaviour. The application then contains both features, but only some users would be affected by the new change. We’ll come back to some of these concepts in the next topics.
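The selective-dogfooding branch check in item 2 can be sketched as below. Identifying employees by email domain is just one hypothetical way of "crossing some data"; the domain and function names are assumptions:

```python
# Sketch of selective dogfooding: a feature still in dogfooding is
# only enabled for internal users, identified here (illustratively)
# by their email domain.
INTERNAL_DOMAINS = {"ourcompany.com"}  # assumption: employee email domains


def is_internal(email: str) -> bool:
    return email.rsplit("@", 1)[-1] in INTERNAL_DOMAINS


def feature_enabled(feature: str, email: str, dogfooding: set) -> bool:
    """Features in dogfooding are visible to internal users only."""
    if feature in dogfooding:
        return is_internal(email)
    return True  # fully released features are on for everyone
```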

Canary Release

Canary release is a release process where, instead of rolling out the changes to all production servers at once, the change is made available to a small subset of them and monitored for some time. Only after the change is certified as stable is it pushed to the rest of the production environment.


This is one of the most powerful tools for testing new features and risky changes, reducing the chances of breaking something in production. By exposing the change to a small group of users, we can stop or revert the rollout process if any issue is detected, sparing most users from the impact.
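One common way to pick the canary group is hashing the user ID into a stable bucket, so the same user always lands on the same side while we monitor metrics. A minimal sketch, with the routing function name assumed:

```python
# Sketch of sticky canary routing: hash the user ID into a bucket in
# [0, 100) and send only the first canary_percent buckets to the new
# version. The hash makes the assignment deterministic per user.
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < canary_percent
```

Ramping up is then just raising `canary_percent` (say 1% → 5% → 50% → 100%) while watching the metrics defined earlier.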

Blue Green Deployment

Blue Green Deployment, a DevOps practice, aims to prevent downtimes by using two server clusters (Blue and Green) and switching production traffic between them. During feature rollout, changes are published to one set (Green) while keeping the other (Blue) unchanged. If problems arise, traffic can be swiftly reverted to the Blue servers, as they were kept running with the previous version.
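The switch-and-rollback mechanic can be reduced to a tiny router that tracks which cluster is live. This is a conceptual sketch (cluster names and methods are illustrative); in practice the flip happens at the load balancer or DNS layer:

```python
# Sketch of a blue-green switch: all traffic points at one cluster,
# and flipping back is instant because the other cluster keeps
# running the previous version.
class BlueGreenRouter:
    def __init__(self):
        self.live = "blue"  # cluster currently serving production

    def deploy_and_switch(self):
        """Publish to the idle cluster, then move traffic onto it."""
        self.live = "green" if self.live == "blue" else "blue"

    def rollback(self):
        """Problems detected: flip traffic back to the previous cluster."""
        self.deploy_and_switch()  # switching again restores the old side
```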


Blue Green Deployment is often contrasted with the Canary Release that we discussed earlier. We won’t dive into the details of this discussion, but it’s important to mention this to help us when deciding which tools are more suitable for our work.

Kill Switches and Feature Toggles

Kill switches did not originate in software engineering, and the best way to understand their use is to look back at the original intent and design. In industrial machinery, kill switches are safety mechanisms that shut equipment off as quickly as possible through a very simple interaction (usually a button or an on/off switch). They exist for emergencies, to prevent one incident (a machine malfunction, for example) from causing an even worse one (injuries or death).

In software engineering, kill switches serve a similar purpose: We accept losing (or killing) a particular feature in an attempt to keep the system up and running. The implementation is, on a high level, a condition check (see code snippet below), usually added in the entry point of a particular change or feature.

if (feature_is_enabled('feature_x')) {
  // New behaviour goes here
  run_new_feature();
} else {
  // Old behaviour, kept as a safe fallback
  run_old_feature();
}

Let’s say, for example, we’re shipping a migration to a new third-party API. Everything is alright in the tests, stable in the canary release and then the change is 100% rolled out to production. After some time, the new API starts to struggle with the volume and requests start to fail (remember the technical metrics?). Because we have a kill switch, API requests can be instantaneously reverted to the old API, and we don’t need to revert to a previous version, or quickly ship a hotfix.
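The API-migration scenario above can be sketched as a flag check around the client call. The flag store, function names, and clients are all hypothetical stand-ins; real systems would read the flag from a remote configuration service so it can flip without a redeploy:

```python
# Sketch of a kill switch for an API migration: requests go to the
# new API while the toggle is on, and can be reverted to the old API
# instantly by flipping the flag. FLAGS stands in for a remote
# feature-flag store.
FLAGS = {"use_new_api": True}


def fetch_profile(user_id, new_client, old_client):
    if FLAGS["use_new_api"]:
        return new_client(user_id)
    return old_client(user_id)  # old integration kept as a fallback
```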

Technically speaking, kill switches are actually a particular use case of feature toggles (aka feature flags). As we’re on the topic, it’s worth mentioning another great benefit of feature toggles: Enabling trunk-based development. Thanks to feature toggles, new code can be safely pushed to production, even if it’s incomplete or not yet tested.

Keeping old behaviour accessible

The code exemplified above probably left some of us wondering if that’s actually a good pattern, with both old and new behaviours living in the application at the same time. I agree that this is likely not the end state we want for our codebase, otherwise, every single piece of code would end up surrounded by if/else clauses, making the code unreadable in no time.

However, we shouldn’t always rush to delete the old behaviour. Yes, it’s very tempting to clean up the code as soon as it stops being used and avoid technical debts. But it’s also fine to leave it there for some time under a feature toggle. Sometimes, it may take a while until the new feature is stabilised, and having a backup option is a safe mechanism in case we need to revert to it, even if only for a short time.

The life cycle of each release is different, and it’s a good practice to keep track of when it’s a good time to get rid of old code. Keeping the code clean and reducing maintenance overhead also avoids the opposite situation: an old code path that is still in the codebase but has been disabled for so long that it’s probably broken anyway.

Shadow testing

One of my favourite techniques for implementing safer changes is known as shadow testing, or shadow mode. It consists of executing both the old and new behaviours and comparing the results, while disabling the new behaviour’s side effects as applicable. Let’s take a look at this simple example:

int sum(int a, int b) {
  int currentResult = currentMathLib.sum(a, b);
  int newResult = newMathLib.sum(a, b);
  logSumDivergences(a, b, currentResult, newResult);
  return currentResult;
}

void logSumDivergences(int a, int b, int currentResult, int newResult) {
  if (currentResult != newResult) {
    log('Divergence detected when executing {0} + {1}: {2} != {3}',
        a, b, currentResult, newResult);
  }
}
Although both sum operations are executed, the new one is only used to compare and log divergences. This technique is particularly useful for monitoring complex system changes where we expect parity between the old and new behaviours. Another great use case is when we have to make changes in products we’re not very familiar with, or when we don’t know which edge cases may be affected by the intended change.

In more complex scenarios, we may need to disable some side effects before enabling shadow testing. For example, let’s say we’re implementing a new backend API to sign up users and save to the DB, returning the user ID. We could even have a shadow DB in place, to execute the full process, but it’s definitely not a good idea to send the “Registration successful” email twice, one for each backend API. Also in the same example, we would need a deeper comparison logic, as simply comparing the returned user IDs wouldn’t be very useful.
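The deeper comparison mentioned for the sign-up example can be sketched as a field-by-field diff that ignores values expected to differ by design (like generated user IDs). The function and field names are assumptions for illustration:

```python
# Sketch of a shadow-test comparison for the sign-up API: generated
# user IDs will differ between the current and shadow backends by
# design, so we compare the rest of the response and treat ID
# mismatches as expected.
def signup_diverges(current: dict, shadow: dict, ignored=("user_id",)) -> bool:
    """True if the responses differ on any field we actually care about."""
    strip = lambda r: {k: v for k, v in r.items() if k not in ignored}
    return strip(current) != strip(shadow)
```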

Lastly, it’s important to understand what needs to be monitored and tested, and what criteria will be applied if parity is not achieved. In some critical scenarios, we will have to iterate on the shadow testing until the results are exactly the same. In others, it may be okay to have some % of divergence when the new implementation offers additional benefits that outweigh the loss.


Logs

Even with robust safeguards, systems can falter. When that happens, we need to be able to understand what’s going on, with the proper level of detail; otherwise, it may be extremely hard to land an efficient fix. Here’s where logs come to save the day.

While logging isn't a new concept and many easy-to-implement solutions exist, ensuring effective logs is challenging. Often, logs are unclear, overly complex, lacking, or flooded with irrelevant entries, making troubleshooting difficult. However, logs aren't solely for addressing issues. Proper logging aids in verifying the effectiveness of new features and changes. By sampling log entries, one can trace user journeys and confirm systems function as intended.
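Tracing a user journey through sampled log entries is much easier when every entry carries a common correlation ID. A minimal sketch of structured logging under that assumption; field names are illustrative:

```python
# Sketch of structured logging: emit one JSON line per event, each
# tagged with a request_id, so a single user journey can later be
# reconstructed by filtering on that ID.
import json
import logging

logger = logging.getLogger("app")


def log_event(request_id: str, event: str, **fields) -> str:
    """Emit one JSON log line and return it."""
    entry = json.dumps({"request_id": request_id, "event": event, **fields})
    logger.info(entry)
    return entry
```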

Final Thoughts

Shipping code to production is sometimes dangerous, but we have many strategies to make the process much safer. Even when we identify a problem, it’s important to know what’s acceptable and what isn’t; not all failures have to result in a rollback. What if we’re trying to fix a serious security flaw, or comply with a new regulation? Having clear criteria and understanding how critical the change is are essential when deciding whether to abort or proceed in case of issues. Going back to the beginning, the main metrics are there to help us in the decision process.

Safe landing, everyone!
