Security Considerations in Blue-Green Deployments
tl;dr: Blue-green deployment is a strong strategy for applications with critical uptime requirements. But if a deployment fixes a critical security issue, make sure the definition of “deployment complete” is the decommissioning of the “blue” environment, not just the successful deployment of “green”.
Organizations have gotten used to following Continuous Integration/Continuous Deployment (CI/CD) for software releases. Cloud tooling such as the AWS Code* utilities, Azure DevOps, or Google Cloud Source Repositories lets enterprises run CI/CD quickly and securely across their release cycles. Software upgrades go through the same CI/CD tooling, and development teams need to choose how to roll those upgrades out.
There are a few different ways development teams upgrade their applications: full cut-over (same infrastructure, new codebase deployed directly and migrated in one go), rolling deployments (same or new infrastructure, with all instances upgraded gradually), immutable deployments (brand new infrastructure and code for each deployment, migrated in one go), and blue-green deployments (old and new versions running simultaneously in production, with the old instances phased out once tests on a slice of traffic succeed). The blue-green strategy has become quite popular for modern software deployment.
What is a Blue-Green deployment?
When you release an application via a blue-green deployment strategy, you gradually shift traffic as tests succeed and as long as your observability (CloudWatch alarms, etc.) does not indicate any problems. You can do that with containers (ECS/EKS/AKS/GKE) or with AWS Lambda/Azure Functions/Google Cloud Functions. Traffic shifting can be done with DNS solutions (Route 53/Azure DNS/Google Cloud DNS) or load balancing solutions (AWS Elastic Load Balancing/Azure Load Balancer/Cloud Load Balancing). Put simply: take your current (blue) deployment, create a second full stack (green), and use either DNS or load balancers to slice off a section of traffic and test the “green” stack. This is all happening in production, by the way. Once everything looks good, direct all the traffic to green and decommission the “blue”. Because this helps maintain operational resilience, it is a popular deployment strategy. AWS has a solid whitepaper that I recommend reviewing if you want to dive in from a solution architecture standpoint.
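The gradual traffic shift described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not how any particular cloud service implements it: the function name shift_traffic, the step percentages, and the green_is_healthy callback (a stand-in for your tests and alarms) are all hypothetical, and a real setup would push the weights to a DNS or load balancer service instead of just recording them.

```python
from typing import Callable, List, Tuple


def shift_traffic(
    green_steps: List[int],
    green_is_healthy: Callable[[int], bool],
) -> List[Tuple[int, int]]:
    """Walk through the planned green percentages in order.

    Returns the (blue_pct, green_pct) weights applied at each step.
    If the hypothetical health check fails at any step, all traffic
    is sent back to blue and the rollout stops.
    """
    applied: List[Tuple[int, int]] = []
    for green_pct in green_steps:
        # Apply the new split: whatever green does not serve stays on blue.
        applied.append((100 - green_pct, green_pct))
        if not green_is_healthy(green_pct):
            # Tests or alarms flagged a problem: roll back to blue.
            applied.append((100, 0))
            break
    return applied
```

For example, shift_traffic([10, 50, 100], lambda pct: True) walks through all three steps and ends with green serving 100% of traffic, while a health check that fails at the 50% step ends with everything back on blue. Note that in both cases blue still exists when the function returns; decommissioning it is a separate step.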
Security Considerations
Some critical security issues (e.g., remote code execution via Log4j, remote code execution via Apache Struts, etc.) demand immediate fixes because of their severity. If your blue-green deployments take days because your tests run over a long period, then any security fix you ship only takes effect after a successful “green” deployment and the decommissioning of “blue”. If an attacker managed to get a foothold in the impacted “blue” environment during that window or before it, decommissioning “blue” becomes critical before you can claim the issue is fully remediated. Typically, incident responders and security operations professionals breathe a sigh of relief when the fix is deployed. Typically, software engineering teams consider the fix deployed once “green” is handling all traffic, regardless of whether the “blue” environment has been decommissioned. In this case, incident responders need to remember that the risk is not truly mitigated at deployment time; it is mitigated only after the cut-over to green is complete and blue is decommissioned. The difference is subtle, yet important, and it really comes down to shared vocabulary. As long as security operations and software development teams share the same definition of what “deployment” means, there are no misunderstandings.
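One way to make that shared definition explicit is to encode both meanings of “done” side by side. The sketch below is a minimal illustration: the BlueGreenStatus type, its fields, and the two function names are hypothetical and would need to be mapped onto whatever your pipeline and inventory actually report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlueGreenStatus:
    """Hypothetical snapshot of a blue-green rollout."""
    green_traffic_pct: int     # share of production traffic served by green
    blue_decommissioned: bool  # True once the blue stack has been torn down


def engineering_deployed(status: BlueGreenStatus) -> bool:
    # The engineering view: the fix is "deployed" once green takes all traffic.
    return status.green_traffic_pct == 100


def security_remediated(status: BlueGreenStatus) -> bool:
    # The security view: the vulnerable blue stack must also be gone.
    return engineering_deployed(status) and status.blue_decommissioned
```

With green at 100% and blue still running, engineering_deployed returns True while security_remediated returns False. That gap is exactly the window in which the two teams can mean different things by “deployed”.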