Canaries are Great!
(cross-posted from the Cloud Foundry blog)
First a little background, and then a story. As Matt described here, Cloud Foundry BOSH has a great capability to perform rolling updates automatically to an entire set of servers in a cluster, and there is a defensive aspect to this feature called a “canary” that is at the center of this tale. When a whole lot of servers are going to be upgraded, BOSH will first try to upgrade a small number of them (usually 1), the “canary”, and only if that is successful will the remaining servers in the cluster be upgraded. If the canary upgrade succeeds, then BOSH will parallelize up to a “max in flight” number of remaining server upgrades until all are completed.
And now the story.
For the last few weeks I’ve been pairing on the Cloud Foundry development team here at Pivotal. I’ve had a chance to work on lots of cool things, I’ve seen the continuous integration (CI) pipeline in action, and today I got to be one half of the pair that did a deploy to production – that is, to run.pivotal.io. As a brief aside, the CI pipeline is way cool, with a number of different systems automatically running test suites with passing tests automatically promoting things. But when it comes to production deploys there is still a person that pushes the “go” button. Of course, just as any other thing we do here at Pivotal, that production process has been tooled so that, it really is usually a matter of pushing a metaphorical button. Today, however, we did do a bit of an upgrade to the tooling before we used it.
Yes, we had tests for the tooling code. Yes, the tests were passing. But when we ran things with the production manifests as input, an authorization token was wrong. BOSH did tell us of the change, but we got a bit overzealous and didn’t catch that change until after we said “yes” to the “are you sure you want to deploy this” question. Once we realized our problem, BOSH was already upgrading our marketplace services broker. Doh.
Could have been a very bad day. But, thanks to canaries, my heart didn’t even skip a beat.
Our production deployment runs multiple instances of the marketplace services broker. Enter the sacrificial canary. When BOSH started the upgrade it only took down one of the brokers, upgraded the bits and tried to restart it. In this case the air in the coal mine was toxic and our little bird did not survive :-(. As a result, we sent no additional souls in, leaving the other service brokers fully functional :-).
We fixed our problem, pushed the “go” button again and this time the canary came up singing, BOSH upgraded the remaining service brokers and our production deploy was complete.
It was a good day. A really, really good day.