Article series: Code review, part 3 of 3

Lead time from fix to production

Description

Imagine there is a market opportunity that requires a change of ten lines of code. How fast can that change make it into production?

Whenever we change the code, there is a process to put the changed functionality into production. The code needs to be compiled, built into a deployable artifact, tested, deployed, and so on. If this is a manual process, it can be onerous and error-prone, and may require several attempts to get right. It is not unusual for the process to take hours or days. It becomes even worse if there are toll gates with bureaucratic sign-offs, especially if those sign-offs require calling a meeting.

Now imagine that the ten lines that need to change are one line each in ten different systems. In these situations there is also a need for integration testing, and sometimes synchronised/coupled deploys. Lead times can then range from hours or days to months.

The security issue

If the reason for the code change is not a market opportunity but a detected vulnerability, there is a security problem. Long lead times make you slow and leave security holes open for a longer period of time.

The security hole can be in code your organisation has written, and might have been detected through security testing. This calls for mending the code and deploying a new version. Until that is done, you are vulnerable.

The security hole can also be due to a weakness in a third-party library you use. In this case the vulnerability is probably publicly known, and if it is possible to detect that you are using a specific version of a specific library, you are an open target. The time it takes to upgrade the library and get the fix out to production is time during which you are vulnerable.

Time to patch your own code and time to upgrade libraries directly affect security. But long lead times also have dynamic effects. If lead times are long, then patching operating systems, web servers, containers, and so on is a cumbersome process, and as a result it tends to happen more seldom. This might leave insecure components deployed in production for months.

Finally, if lead times are long due to manual maintenance of production servers, there is a risk that deploys and updates are done subtly differently from time to time. Over time this leads to environments whose exact setup no one knows. If such an environment is infected, it takes a lot of courage to tear it down and rebuild it from scratch.

Remedies

To get lead times down to minutes or seconds, we need to rely heavily on an automated build pipeline and to define the platform as code.

All the steps needed to get a new version into production should be fully automated and executed by a build pipeline (powered by e.g. Jenkins, TeamCity, or a similar product). Preferably, only the build pipeline should have access to the production servers, to discourage manual changes directly on them.
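
As a rough illustration, here is a minimal sketch of such a pipeline expressed as a Python script. The build commands, artifact name, and deploy target are hypothetical placeholders for your own tooling; a real pipeline product such as Jenkins or TeamCity adds logging, history, and access control on top of the same idea.

    #!/usr/bin/env python3
    """Minimal sketch of an automated build pipeline.

    Assumptions: the shell commands, artifact name, and deploy
    target below are hypothetical placeholders.
    """
    import subprocess
    import sys

    # Each stage is a shell command. A pipeline tool would also
    # handle retries, notifications, and approvals.
    STAGES = [
        ("compile", "mvn -q compile"),
        ("test",    "mvn -q test"),
        ("package", "mvn -q package"),  # produces target/app.jar (assumed name)
        ("deploy",  "scp target/app.jar deploy@prod-server:/opt/app/"),  # hypothetical target
    ]

    def run_pipeline():
        for name, command in STAGES:
            print(f"--- stage: {name} ---")
            result = subprocess.run(command, shell=True)
            if result.returncode != 0:
                # Fail fast: a broken stage stops the release.
                sys.exit(f"stage '{name}' failed, aborting pipeline")
        print("new version deployed")

    if __name__ == "__main__":
        run_pipeline()

The point is that every step, including the deploy itself, runs from the same automated script with no manual intervention in between.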

Virtual machines or container techniques such as Docker or BoxFuse can be used to create production (and test) environments in a repeatable and quick way. Environments are “baked from a recipe” which is kept updated. When a new environment is needed, for test or production, it is baked from the latest version of the recipe. This makes it fast to create a new environment, and it removes the uncertainty about “what has been done on the server”.
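
The following sketch shows the “baked from a recipe” idea using the Docker SDK for Python (installed with pip install docker). The recipe path and image tag are invented for illustration.

    """Sketch: bake a fresh environment from a versioned recipe
    (a Dockerfile). Paths and names are hypothetical."""
    import docker

    client = docker.from_env()

    # Bake: build an image from the recipe. Nothing is patched in
    # place; every environment starts from the same recipe.
    image, build_log = client.images.build(
        path="./recipe", tag="myapp-env:2024-06-01"
    )

    # Serve: start a container from the freshly baked image.
    container = client.containers.run(
        "myapp-env:2024-06-01", detach=True, ports={"8080/tcp": 8080}
    )
    print(f"started {container.short_id} from the recipe, nothing hand-configured")

Because every environment is built from the same version-controlled recipe, tearing down a suspect server and baking a new one becomes a routine operation rather than an act of courage.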

To avoid coupled deploys, all components (applications, databases, etc.) should support a two-generation design: when a change is made, a component should support both the new API/functionality and the previous version for a while. This makes it possible to roll out even a large change system by system.
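
As a minimal sketch of a two-generation component, consider a handler that accepts both the old and the new shape of a request during a rollout; the payload fields here are invented for illustration.

    """Sketch: a component that accepts both the old (v1) and new
    (v2) shape of a request. Field names are hypothetical."""

    def handle_order(payload: dict) -> dict:
        # New generation: callers send a structured "customer" object.
        if "customer" in payload:
            customer_id = payload["customer"]["id"]
        # Old generation: still accepted for a while, so callers can
        # be upgraded one system at a time, not in one coupled deploy.
        elif "customer_id" in payload:
            customer_id = payload["customer_id"]
        else:
            raise ValueError("unknown payload generation")
        return {"status": "accepted", "customer_id": customer_id}

    # Both generations work during the transition window:
    print(handle_order({"customer_id": 42}))       # old callers
    print(handle_order({"customer": {"id": 42}}))  # upgraded callers

Once all calling systems have been upgraded, the old branch can be removed in a later release.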

Read the other parts of the article series