App Restarting in Cloud Foundry
A couple of weeks ago, right before going on what turned out to be a glorious vacation in the sun, I stood up a local Cloud Foundry on my laptop using the cf-vagrant-installer from Altoros. It turns out there was a bug in a couple of the configuration files (the pull request has already been merged), which offered a beautiful learning opportunity that I want to share.
Here’s what kicked it all off. I went through the cf-vagrant-installer setup and pushed an app. Sure enough, it all worked great. Then I shut down my vagrant machine, started it back up and expected my app would similarly restart, but it didn’t. Instead, when I ran cf apps it showed my app with 0% of its instances started.
cdavis@ubuntu:$ cf apps
Getting applications in myspace... OK

name    status   usage      url
hello   0%       1 x 256M   hello.vcap.me
Hmm. Okay, so let me try something simpler – who knows what the cf-vagrant-installer startup scripts are doing, maybe the left hand can no longer see the right after a restart. So I cleaned everything up, pushed the app and it was running fine. I then went and killed the warden container that was running the app (a separate post on how I did that coming soon) – and again, the app didn’t restart. And it stayed that way. It never restarted. Yes, it’s supposed to. So I dug in and figured out how this is supposed to work:
There are four Cloud Foundry components involved in the process: the Cloud Controller, the DEA, the Health Manager and NATS. The Cloud Controller (CC) knows everything about the apps that are supposed to be running – it knows this because every app is pushed through the CC and it hangs on to that information. To put it simply, the CC knows the desired state of the apps running in Cloud Foundry. The DEA is, of course, running the application. The Health Manager (HM) does three things: it keeps an up-to-date picture of the apps that are actually running in Cloud Foundry, it compares that to the desired state (which it gets from the CC via an HTTP request) and, if there is a discrepancy, it asks the CC to fix things. And finally, NATS facilitates all of the communication between these components. (BTW, Matthew Kocher posted a nice list of the responsibilities of the Cloud Foundry components on the vcap-dev mailing list.)
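To make that division of labor concrete, here’s a toy sketch in Go of the comparison at the heart of the HM’s job. The types and function are made up for illustration only; the real Health Manager tracks far more detail per app instance.

package main

import "fmt"

// AppState is an illustrative stand-in for the state the HM tracks:
// how many instances of each app should be (or actually are) running.
type AppState map[string]int // app name -> instance count

// findMissing returns, per app, how many instances need to be started:
// the gap between what the CC says should run (desired) and what the
// DEA heartbeats say is running (actual).
func findMissing(desired, actual AppState) AppState {
	missing := AppState{}
	for app, want := range desired {
		if have := actual[app]; have < want {
			missing[app] = want - have
		}
	}
	return missing
}

func main() {
	desired := AppState{"hello": 1}
	actual := AppState{"hello": 0} // the lone instance has crashed
	for app, n := range findMissing(desired, actual) {
		fmt.Printf("ask the CC to start %d instance(s) of %s\n", n, app)
	}
}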
Here is what happens.
The DEAs are constantly monitoring what is happening on them – they do this in a variety of ways: checking whether process IDs still exist, pinging URLs, and so on. If a DEA realizes that an app has become unavailable, it sends a message out onto NATS on the droplet.exited channel with the details. The HM subscribes to that channel and, when it gets that message, does the comparison to the desired state. Note that an app instance could have become unavailable because the CC asked for it to be shut down – in which case the desired state would match the actual state after the app instance became unavailable. Right? Assuming, however, that the app crashed, there would be a discrepancy and the HM would tell the CC that another instance of the app needed to be started. The CC would then decide which DEA the app should start on (that is (part of) its job) and let that DEA know. The DEA starts the app and all is good.
That’s a bit confusing, so here’s a picture that roughly shows this flow. You shouldn’t take it too literally, especially the sequencing; for example, the HM asking the CC for the desired state happens asynchronously, not as a result of the DEA reporting a crashed app. The picture is just intended to clarify the responsibilities of the components.
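To make the messaging side of that flow concrete, here’s a hypothetical Go sketch of a subscriber listening on the droplet.exited channel, roughly the way the HM does. It uses the nats.go client against a local NATS server; the message struct and its field names are my own guesses for illustration, not the HM’s actual schema.

package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// dropletExited holds a few of the fields visible in the droplet.exited
// trace below; the names here are assumptions made for this sketch.
type dropletExited struct {
	Droplet string `json:"droplet"`
	Reason  string `json:"reason"`
	Index   int    `json:"index"`
}

func main() {
	// Assumes NATS is reachable on the default local URL (nats://127.0.0.1:4222).
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// React to crash notifications the way the HM conceptually does:
	// check the desired state and, if there is a gap, ask the CC to restart.
	if _, err := nc.Subscribe("droplet.exited", func(m *nats.Msg) {
		var ev dropletExited
		if err := json.Unmarshal(m.Data, &ev); err != nil {
			log.Printf("could not parse droplet.exited message: %v", err)
			return
		}
		if ev.Reason == "CRASHED" {
			log.Printf("instance %d of app %s crashed; compare to desired state and ask the CC to restart it",
				ev.Index, ev.Droplet)
		}
	}); err != nil {
		log.Fatal(err)
	}
	select {} // keep listening
}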
Another place to see this in action is by watching the NATS traffic for a given app. (I’m writing another post to talk about this and the other tools I used in my investigation, but for now, just enjoy what you see.) What this shows are the heartbeat messages sent out by a DEA listing the apps that are running. Then we get a droplet.exited message that starts the whole thing going. Eventually you see the heartbeat messages again showing the app as running.
06:03:53 PM  router.register    app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:03:58 PM  router.register    app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:03:58 PM  dea.heartbeat      dea: 0, crashed: 0, running: 1
06:04:03 PM  router.register    app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:04:08 PM  router.register    app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:04:08 PM  dea.heartbeat      dea: 0, crashed: 0, running: 1
06:04:13 PM  router.register    app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:04:13 PM  router.unregister  app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007
06:04:13 PM  droplet.exited     app: hello, reason: CRASHED, index: 0, version: b48c1871
06:04:14 PM  health.start       app: hello, version: b48c1871, indices: 0, running: 0 x b48c1871
06:04:14 PM  dea.0.start        app: hello, dea: 0, index: 0, version: b48c1871, uris: hello.172.16.106.130.xip.io
06:04:16 PM  dea.heartbeat      dea: 0, running: 1
06:04:16 PM  router.register    app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011
06:04:18 PM  router.register    app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011
06:04:18 PM  dea.heartbeat      dea: 0, crashed: 1, running: 1
06:04:23 PM  router.register    app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011
06:04:28 PM  router.register    app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011
06:04:28 PM  dea.heartbeat      dea: 0, crashed: 1, running: 1
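If you want to watch this traffic yourself, a subscriber on the NATS wildcard subject ">" will print every message as it goes by. Here’s a minimal Go sketch using the nats.go client; it assumes the Cloud Foundry NATS server is reachable on the default local URL, so adjust the connection string for your deployment.

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Adjust the URL to point at your Cloud Foundry NATS server.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// ">" matches every subject, so this prints router.register,
	// dea.heartbeat, droplet.exited and everything else as it flies by.
	if _, err := nc.Subscribe(">", func(m *nats.Msg) {
		log.Printf("%s %s", m.Subject, string(m.Data))
	}); err != nil {
		log.Fatal(err)
	}
	select {} // keep watching
}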
One last thing you might ask is this: what if the message sent by the DEA that an app has crashed somehow goes missing? We are NOT depending on durable subscriptions (which would be a drag on performance), so what is our mechanism for ensuring eventual consistency? Remember that I said the HM does three things, including keeping track of the actual state of the system. It can do this because every 10 seconds each DEA sends a heartbeat message (as you can see above) out onto NATS reporting how the apps running on it are doing. If the HM doesn’t get the direct message that an app has crashed, it will eventually see, from the heartbeat messages, an actual state that doesn’t match the desired state. At that point it will contact the CC just the same as described above.
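Just to illustrate that cadence, here’s a toy Go sketch of the DEA side of the heartbeat: publish a summary of what’s running onto the dea.heartbeat subject every 10 seconds. The struct is a drastic simplification of the real message, which lists every instance with its own details.

package main

import (
	"encoding/json"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

// heartbeat is a simplified stand-in for what a DEA reports; the real
// message carries per-instance information such as app, index and version.
type heartbeat struct {
	DEA     string `json:"dea"`
	Running int    `json:"running"`
	Crashed int    `json:"crashed"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // assumes a local NATS server
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Publish a heartbeat every 10 seconds, the cadence described above.
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		payload, err := json.Marshal(heartbeat{DEA: "0", Running: 1, Crashed: 0})
		if err != nil {
			log.Fatal(err)
		}
		if err := nc.Publish("dea.heartbeat", payload); err != nil {
			log.Printf("failed to publish heartbeat: %v", err)
		}
	}
}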
I’ve not yet grown tired of killing apps in every which way, destroying the warden container, going into the warden container and killing the app process, restarting the DEA and so on, and watching the state of the system eventually (within a few seconds) come back into equilibrium. Way cool. So very way cool!