{"id":399,"date":"2013-07-18T17:19:30","date_gmt":"2013-07-18T17:19:30","guid":{"rendered":"http:\/\/corneliadavis.com\/blog\/?p=399"},"modified":"2013-07-18T17:19:30","modified_gmt":"2013-07-18T17:19:30","slug":"app-restarting-in-cloud-foundry","status":"publish","type":"post","link":"https:\/\/corneliadavis.com\/blog\/2013\/07\/18\/app-restarting-in-cloud-foundry\/","title":{"rendered":"App Restarting in Cloud Foundry"},"content":{"rendered":"<p>A couple of weeks ago, right before going on what turned out to be a glorious vacation in the sun, I stood up a local Cloud Foundry on my laptop using the <a href=\"https:\/\/github.com\/Altoros\/cf-vagrant-installer\">cf-vagrant-installer<\/a> from Altoros.\u00a0 Turns out there was a <a href=\"https:\/\/groups.google.com\/a\/cloudfoundry.org\/forum\/#!topic\/vcap-dev\/i7j8XAlJ2Eg\">bug<\/a> in a couple of the configuration files (the <a href=\"https:\/\/github.com\/Altoros\/cf-vagrant-installer\/pull\/35\">pull request<\/a> has already been merged), which offered me a beautiful learning opportunity that I want to share.<\/p>\n<p>Here\u2019s what kicked it all off.\u00a0 I went through the cf-vagrant-installer setup and pushed an app.\u00a0 Sure enough, it all worked great.\u00a0 Then I shut down my Vagrant machine, started it back up and expected that my app would similarly restart, but it didn\u2019t.\u00a0 Instead, when I ran cf apps, it showed my app with 0% of its instances started.<\/p>\n<pre>cdavis@ubuntu:$ cf apps\nGetting applications in myspace... 
OK\nname \u00a0 \u00a0status \u00a0 usage \u00a0 \u00a0 \u00a0url\nhello \u00a0 0% \u00a0 \u00a0 \u00a0 1 x 256M \u00a0 hello.vcap.me<\/pre>\n<p>Hmm.\u00a0 Okay, so let me try something simpler \u2013 who knows what the cf-vagrant-installer startup scripts are doing; maybe the left hand can no longer see the right after a restart.\u00a0 So I cleaned everything up, pushed the app and it was running fine.\u00a0 I then went and killed the warden container that was running the app (a separate post on how I did that is coming soon) \u2013 and again, the app didn\u2019t restart. And it stayed that way. It never restarted. Yes, it\u2019s supposed to. So I dug in and figured out how this is supposed to work:<\/p>\n<p>There are four Cloud Foundry components involved in the process: the Cloud Controller, the DEA (Droplet Execution Agent), the Health Manager and NATS.\u00a0 The Cloud Controller (CC) knows everything about the apps that are supposed to be running \u2013 it knows this because every app is pushed through the CC and it hangs on to that information. To put it simply, the CC knows the <em>desired state<\/em> of the apps running in Cloud Foundry. The DEA is, of course, running the application. The Health Manager (HM) does three things \u2013 it keeps an up-to-date picture of the apps that are actually running in Cloud Foundry, it compares that to the desired state (which it gets from the CC via an HTTP request) and, if there is a discrepancy, it asks the CC to fix things. And finally, NATS facilitates all of the communication between these components. 
(BTW, Matthew Kocher posted a nice <a href=\"https:\/\/groups.google.com\/a\/cloudfoundry.org\/forum\/#!topic\/vcap-dev\/Z9OcnFWGqBA\">list of the responsibilities of the Cloud Foundry components<\/a> on the <a href=\"https:\/\/groups.google.com\/a\/cloudfoundry.org\/forum\/#!forum\/vcap-dev\">vcap-dev mailing list<\/a>).<\/p>\n<p>Here is what happens.<\/p>\n<p>The DEAs are constantly monitoring what is happening on them \u2013 they do this in a variety of ways, checking if process IDs still exist, pinging URLs, etc. If a DEA realizes that an app has become unavailable, it sends a message with the details out onto NATS, on the droplet.exited channel.\u00a0 The HM subscribes to that channel and, when it gets that message, compares the actual state to the desired state. Note that an app instance could have become unavailable because the CC asked for it to be shut down \u2013 in which case the <em>desired state<\/em> would match the <em>actual state<\/em> after the app instance became unavailable. Right? Assuming, however, that the app crashed, there is a discrepancy and the HM tells the CC that another instance of the app needs to be started.\u00a0 The CC then decides which DEA the app should start on (that is part of its job) and lets that DEA know. The DEA starts the app and all is good.<\/p>\n<p>That\u2019s a bit confusing so here\u2019s a picture that roughly shows this flow \u2013 you shouldn\u2019t take this too literally, especially the sequencing. For example, the HM asking the CC for the desired state is something that happens asynchronously, not as a result of the DEA reporting a crashed app. 
This picture is just intended to clarify the responsibilities of the components.<\/p>\n<p><a rel=\"attachment wp-att-400\" href=\"http:\/\/corneliadavis.com\/blog\/2013\/07\/18\/app-restarting-in-cloud-foundry\/apprestart\/\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-400 alignnone\" title=\"AppRestart\" src=\"https:\/\/i0.wp.com\/corneliadavis.com\/blog\/wp-content\/uploads\/2013\/07\/AppRestart.png?resize=289%2C336\" alt=\"\" width=\"289\" height=\"336\" \/><\/a><\/p>\n<p>Another way to see this in action is to watch the NATS traffic for a given app.\u00a0 (I\u2019m writing another post to talk about this and other tools I used in my investigations, but for now, just enjoy what you see.) What this shows are the heartbeat messages sent out by a DEA showing the apps that are running. Then we get a droplet.exited message that sets the whole thing going. Eventually you see the heartbeat messages again showing the app as running.<\/p>\n<pre>06:03:53 PM router.register\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007\n06:03:58 PM router.register\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007\n06:03:58 PM dea.heartbeat\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0 dea: 0, crashed: 0, running: 1\n06:04:03 PM router.register\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007\n06:04:08 PM router.register\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007\n06:04:08 PM dea.heartbeat\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0 dea: 0, crashed: 0, running: 1\n06:04:13 PM router.register\u00a0\u00a0\u00a0\u00a0\u00a0 
\u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007\n06:04:13 PM router.unregister\u00a0\u00a0\u00a0 \u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61007\n06:04:13 PM <span style=\"color: #ff0000;\">droplet.exited<\/span>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 app: hello, reason: CRASHED, index: 0, version: b48c1871\n06:04:14 PM <span style=\"color: #339966;\">health.start<\/span>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 app: hello, version: b48c1871, indices: 0, running: 0 x b48c1871\n06:04:14 PM dea.0.start\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, index: 0, version: b48c1871, uris: hello.172.16.106.130.xip.io\n06:04:16 PM dea.heartbeat\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0 dea: 0, running: 1\n06:04:16 PM router.register\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011\n06:04:18 PM router.register\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011\n06:04:18 PM dea.heartbeat\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0 dea: 0, crashed: 1, running: 1\n06:04:23 PM router.register\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011\n06:04:28 PM router.register\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 app: hello, dea: 0, uris: hello.172.16.106.130.xip.io, host: 172.16.106.130, port: 61011\n06:04:28 PM dea.heartbeat\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0 dea: 0, crashed: 1, running: 1<\/pre>\n
<p>One last thing you might ask is this. What if somehow the message sent by the DEA that an app has crashed goes missing? We are NOT depending on durable subscriptions (which would be a grind on performance) so what is our mechanism for ensuring eventual consistency?\u00a0 Remember that I said that the HM does three things, including keeping track of the actual state of the system. It can do this because every 10 seconds each DEA sends a heartbeat message (as you can see above) out onto NATS reporting how the apps running on it are doing.\u00a0 If the HM doesn\u2019t get the direct message that an app has crashed, it will eventually see, from the heartbeat messages, an actual state that doesn\u2019t match the desired state. At that point it will contact the CC just the same as described above.<\/p>\n<p>I\u2019ve not yet grown tired of killing apps in every which way (destroying the warden container, going into the warden container and killing the app process, restarting the DEA and so on) and watching the state of the system eventually (within a few seconds) come back into equilibrium. Way cool. 
So very way cool!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A couple of weeks ago, right before going on what turned out to be a glorious vacation in the sun, I stood up a local Cloud Foundry on my laptop using the cf-vagrant-installer from Altoros.\u00a0 Turns out there was a bug in a couple of the configuration files (pull request has already been merged) which [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[5,21],"tags":[53],"class_list":["post-399","post","type-post","status-publish","format-standard","hentry","category-cloud","category-paas","tag-cloudfoundry"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/corneliadavis.com\/blog\/wp-json\/wp\/v2\/posts\/399","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/corneliadavis.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/corneliadavis.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/corneliadavis.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/corneliadavis.com\/blog\/wp-json\/wp\/v2\/comments?post=399"}],"version-history":[{"count":0,"href":"https:\/\/corneliadavis.com\/blog\/wp-json\/wp\/v2\/posts\/399\/revisions"}],"wp:attachment":[{"href":"https:\/\/corneliadavis.com\/blog\/wp-json\/wp\/v2\/media?parent=399"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/corneliadavis.com\/blog\/wp-json\/wp\/v2\/categories?post=399"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/corneliadavis.com\/blog\/wp-json\/wp\/v2\/tags?post=399"}],"c
uries":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}