Big Bang or Canaries — How to Choose the Right Change Deployment Strategy?
CrowdStrike: a name that already upsets people in IT. Its failed update deployment crashed millions of Windows machines, servers and workstations alike. A truly global problem.
But how could something like this happen? It was just a software update. First, in IT we don’t use the word “just”. Nothing is ever just something. Second, any software change can cause big problems, even the smallest one, so we need to be careful even with seemingly minor updates. Third, for some reason, the company’s team ignored the basic principles of safe change. And fourth, given the role CrowdStrike’s software plays (it is part of cybersecurity), an appropriate deployment strategy has to be chosen for its changes.
Let’s take a closer look at the last two points. First, we’ll talk a little about deployment principles, and then we’ll introduce strategies that can significantly help with updates and upgrades.
Strategies Are Optional, Principles Are Not 🎯
That is, if we are talking about a responsible approach to changes. Of course, the company can decide to ditch everything and simply deploy a new version. The consequences are clear, however.
What should never be missing from the repertoire of programmers and corporate IT administrators is the testing process. Feedback on functionality and problems is critical for maintaining operation. And given that almost all companies are dependent on IT, it is not only about maintaining the operation of IT systems, but also the entire company.
It should also be almost forbidden to make changes on Fridays, even worse on Friday afternoons or evenings. If problems arise on the weekend, who will resolve them? Who will want to resolve them? This is a challenge especially for managers, who may not always be able to fully assess how dangerous any major interventions in operations are prior to the weekend.
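The Friday rule is simple enough to automate. Here is a minimal sketch in Python, assuming a hypothetical policy that blocks deployments from Friday through Sunday. The function name `is_deploy_allowed` and the set of blocked days are illustrative and not taken from any particular tool:

```python
from datetime import datetime

# Hypothetical policy: weekday() returns 0 for Monday through 6 for Sunday.
BLOCKED_WEEKDAYS = {4, 5, 6}  # Friday, Saturday, Sunday


def is_deploy_allowed(now: datetime) -> bool:
    """Return True only when a deployment may start at the given time."""
    return now.weekday() not in BLOCKED_WEEKDAYS


print(is_deploy_allowed(datetime(2024, 7, 16, 10, 0)))  # a Tuesday morning
print(is_deploy_allowed(datetime(2024, 7, 19, 16, 0)))  # a Friday afternoon
```

A check like this could run as the first step of a deployment pipeline, so that overriding it becomes a deliberate decision rather than an oversight.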
And speaking of optional strategies, they are rather compulsorily optional. Not every strategy suits every application or type of change. It is, therefore, necessary to carefully evaluate the application’s features, its impact on the company’s operations, and the requirements it must meet.
What to Choose From? 🤔
We will now focus on 5 types of strategies. The first will be a one-time strategy. It will serve as a benchmark which we will compare the other 4 against.
🟦 One-time Deployment Strategy
As its nickname, the big bang, suggests, this is a process in which a change is rolled out completely, for everyone, at once.
That simplicity has its advantages. It is fast, cheap, simple and does not require any extra measures. But all this is true only when there is no problem. If a problem occurs, it is difficult to return to the original stable version. It can also overload the infrastructure and the support team. Everyone will want to find out why something is not working. Everyone at once.
This strategy is very risky and, as can be seen from the situation that CrowdStrike faced, when a problem occurs, the costs are astronomical.
🟦 Segment Deployment Strategy
The company can also choose to divide its user base into several predefined segments. Segments can be defined by whatever criteria make sense, for example region, department, or customer tier. Then, the individual segments gradually switch to the new version.
This makes the approach less risky. Not only is it easier to return to the stable version for the segment where the problem occurred, but it is also possible to make adjustments based on feedback and reduce the risk further. The impact on operations is thus limited to one segment at a time.
However, it is true that managing segments at different stages can be challenging for the project team. Moreover, individual segments may not accurately reflect the behavior of all users or the problems they will face. So it does not guarantee 100% protection, but this strategy provides one of the greatest protections for the company.
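The segment-by-segment flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: `deploy`, `rollback` and `is_healthy` are hypothetical callbacks standing in for whatever tooling the company actually uses.

```python
from typing import Callable, Dict, List


def rollout_by_segment(
    segments: Dict[str, List[str]],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[List[str]], bool],
) -> List[str]:
    """Deploy one segment at a time; return the segments left on the new version."""
    completed = []
    for name, users in segments.items():
        for user in users:
            deploy(user)
        if not is_healthy(users):
            # Only the failing segment is reverted; earlier segments stay upgraded
            # and later segments are never touched.
            for user in users:
                rollback(user)
            break
        completed.append(name)
    return completed
```

Stopping at the first unhealthy segment is a design choice made for the sketch. A real rollout might instead pause, apply a fix based on the feedback, and resume with the same segment.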
🟦 A-B Deployment Strategy
This procedure divides users into two equal groups. One stays with the original version A, the other gets the new version B. This allows you to compare which one better suits your company’s goals.
By running both versions simultaneously, the team can track metrics and collect data. This can then be used to make a well-founded decision. Given that only half of the user base has version B, it is still easier to go back to the original version A than after a one-time deployment.
However, such a process is very demanding to operate. Not only are there two versions running at the same time, but resources are also needed to analyze the metrics and collected data. The infrastructure is also more complicated, because members of each group must be directed to the correct version. So it takes longer, but the result is a decision based on real data.
You just need to be careful that group B is appropriately selected and represents the entire user base as best as possible.
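Directing each user to the correct version is often done by hashing a stable user identifier. The following is a minimal sketch in Python, assuming users have a string ID; the function name `assign_variant` and the experiment label are hypothetical:

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "new-release") -> str:
    """Deterministically place a user in group A or group B."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # The first byte of the hash is even or odd with roughly equal probability,
    # which yields two groups of approximately the same size.
    return "A" if digest[0] % 2 == 0 else "B"


print(assign_variant("alice"))  # the same user always gets the same answer
```

Because the assignment depends only on the ID and the experiment label, a user sees the same version on every visit, and no lookup table of group membership has to be stored. Hash-based splitting says nothing about whether group B is representative, so that still has to be checked separately.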
🟦 Canary Deployment Strategy
This process involves testing the new version on a small group of “guinea pigs.” The users are randomly selected.
The group is kept small on purpose. This speeds up the feedback loop, so that problems can be solved quickly. It also minimizes risks, because it is easier to revert a small group to the original version than a large group or all users.
However, this still increases the cost of operation, because two versions are again running side by side. At the same time, of course, this caution increases the time needed for complete deployment.
This strategy is an excellent way to test the readiness of a new version.
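A canary rollout has two parts: picking the small group and deciding whether its results are good enough to continue. Here is a minimal sketch in Python. The 5% fraction, the error-rate comparison and the tolerance are assumptions chosen for illustration, and both function names are hypothetical:

```python
import random
from typing import List


def pick_canaries(users: List[str], fraction: float = 0.05, seed: int = 0) -> List[str]:
    """Randomly select a small canary group (at least one user)."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    size = max(1, int(len(users) * fraction))
    return rng.sample(users, size)


def canary_passes(
    canary_errors: int,
    canary_requests: int,
    baseline_error_rate: float,
    tolerance: float = 0.01,
) -> bool:
    """Continue only if the canary error rate stays close to the baseline."""
    if canary_requests == 0:
        return False  # no traffic means no evidence, so do not promote
    return canary_errors / canary_requests <= baseline_error_rate + tolerance
```

If `canary_passes` returns False, only the canary group has to be reverted, which is the whole point of keeping it small.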
🟦 Phased Deployment Strategy
As the name suggests, this strategy divides the rollout into phases planned in advance. Each phase covers a larger portion of users than the previous one. The deployment of changes is therefore gradual.
Small beginnings minimize the risk of problems, because changes are first tested on a small sample. This also shortens the feedback loop and makes it easier to fix problems and roll back. In addition, this approach has the advantage of also testing scalability.
Of course, this makes deployment longer. It also requires more complicated management of the transition solution and communications. At the same time, it is necessary to plan well for infrastructure expansion.
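One common way to implement growing phases is to give every user a stable bucket number and raise a percentage threshold from phase to phase. The sketch below is in Python; the schedule in `PHASES` and the function names are hypothetical, chosen only to illustrate the idea:

```python
import hashlib

# Hypothetical schedule: percentage of users on the new version in each phase.
PHASES = [1, 5, 25, 50, 100]


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_phase(user_id: str, phase_index: int) -> bool:
    """Return True if the user receives the new version in the given phase."""
    return bucket(user_id) < PHASES[phase_index]
```

Because the bucket never changes, anyone included in an early phase stays included in every later one, so no user flips back and forth between versions as the rollout widens.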
Alternative History 🕰️
Now, looking at the incident with CrowdStrike’s Falcon software, we can see where the mistakes were made. We might also be able to guess which strategy would have been a better choice for such a critical piece of software. But there’s no need to cry over spilled milk. It’s better to learn from this big fiasco.