Tools for Troubleshooting Application Deployment Issues in Cloud Foundry
Our standard demo for Cloud Foundry has us in a directory containing either some source code or an application package (war, zip, etc.), and then we do a
cf push
A handful of messages will appear on the screen, like:
Uploading hello... OK
Preparing to start hello... OK
-----> Downloaded app package (4.0K)
-----> Using Ruby version: ruby-1.9.3
-----> Installing dependencies using Bundler version 1.3.2
       Running: bundle install --without development:test --path vendor/bundle --binstubs vendor/bundle/bin --deployment
       Fetching gem metadata from http://rubygems.org/..........
       Fetching gem metadata from http://rubygems.org/..
       Installing rack (1.5.2)
       Installing rack-protection (1.5.0)
       Installing tilt (1.4.1)
       Installing sinatra (1.4.3)
       Using bundler (1.3.2)
       Your bundle is complete! It was installed into ./vendor/bundle
       Cleaning up the bundler cache.
-----> Uploading droplet (23M)
Checking status of app 'hello'....
  1 of 1 instances running (1 running)
Push successful! App 'hello' available at http://hello.cdavisafc.cf-app.com
This shows that the application was uploaded, dependencies were downloaded, a droplet was uploaded, and the application was started. And that is all fine and good, but what happens when something goes wrong? How can the application developer troubleshoot it?
The answer is multi-faceted, and in this note I will try to organize things a bit.
First, let me list the different tools someone might have at their disposal, and briefly what each offers for app troubleshooting:
- the cf cli
  - the cf apps command – This should be very familiar; it simply shows you the apps you have deployed and an indication of their health.
  - the cf logs command – This will show you the contents of the files found in the logs directory of the warden container; these contents will vary depending on where in the app deployment process you are when investigating.
  - the cf files command – This will show you the filesystem contents of the warden container; again, what you see will vary depending on where in the deployment process you are. (A sample session follows this list.)
- the bosh cli
  - the bosh logs command – This will tar up and download the files found in the /var/vcap/sys/log directory on the targeted VM. In general, the logs from the dea will probably be the most helpful (dea logs and warden logs), with perhaps something of note in the cloud controller logs.
- ssh into CF VMs
  - there is a trick to this when running in the AWS VPC – see this thread: https://groups.google.com/a/cloudfoundry.org/forum/#!topic/bosh-users/Zc0IHbPC47k
  - in most cases this probably won’t bring you anything that the bosh logs command doesn’t already, except for this next thing…
- wsh (warden shell) into the warden container for the application
  - this is only possible if the application was fully staged and is up and running. If the application is “flapping,” the warden containers are likely being killed and recreated on some pretty short interval, and it will be hard to get much from wsh-ing in.
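To make this concrete, here is a rough first troubleshooting pass with these tools. The app name “hello” comes from the push above; the bosh job name and index are illustrative (use bosh vms to see the jobs in your own deployment), and exact cf syntax varies a bit across CLI versions:

cf apps                # is the app listed, and are its instances running?
cf logs hello          # the files in the logs directory of the warden container
cf files hello logs/   # browse the container filesystem by hand
cf files hello app/    # ... including the unpacked application itself

bosh logs dea_next 0   # tar up and download /var/vcap/sys/log from the dea VM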
Here’s the thing… ultimately your application developer will only have access to the first of these (the cf cli), and once your cloud is stable, that should be sufficient. While you are getting the kinks worked out of your PaaS deployment, however, the other tools can be very helpful. One other thing to note: if your developers have some type of micro Cloud Foundry on their workstations, then while they may not have bosh, they can ssh into that machine and poke around – for example, getting to the dea logs directly. I do this all the time on my laptop, as sketched below.
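On my laptop that looks roughly like the following; the address, the vcap credentials, and the exact dea log path are all specific to my install, so treat them as illustrative:

ssh vcap@my-micro-cf.local
tail -f /var/vcap/sys/log/dea_next/dea_next.log   # watch the dea logs directly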
Okay, so now with this list of tools, I’ve crafted the following diagram to give some guidance on what tools will help when investigating things during different stages of the application deployment process. There is definitely a bit of a trick to figuring out where in the lifecycle something went wrong, but even trying to use a prescribed tool for something will give you a hint.
Glyn Normington
Really good summary. Thanks for drawing all this information together, Cornelia.
Some further tips…
“cf events” can tell you if warden shut down the app, e.g. because it exceeded its memory limit.
“cf crashlogs” is sometimes useful, but it’s not obvious what files are included and when.
The new Java buildpack, released into production a few hours ago, has a logging framework which you can exploit by issuing “cf set-env JBP_LOG_LEVEL DEBUG”. Logs are generally written during staging, but also if the JVM needs to be killed because of an OutOfMemoryError. The logs are written to standard error, but are also kept in a file which may be accessed using “cf files --app --path app/.buildpack-diagnostics/buildpack.log” if standard error has gone AWOL for some reason.
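For example, applying that tip to the sample “hello” app might look like this (the app name is illustrative; the cf files flag syntax is as above, while newer CLIs take the app and path as positional arguments):

cf set-env hello JBP_LOG_LEVEL DEBUG
cf push hello    # re-push so staging runs again with debug logging
cf files --app hello --path app/.buildpack-diagnostics/buildpack.log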
Guillaume Berche
Thanks for this blog and the diagnostic steps!
I’d add “cf crashlogs” when staging fails, “cf events” for flapping apps, and “cf stats” for understanding out-of-memory errors.
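Concretely, against the sample “hello” app (app name illustrative; command names as they existed in the cf CLI at the time):

cf crashlogs hello   # logs captured from crashed instances
cf events hello      # e.g. warden stopping the app for exceeding its memory limit
cf stats hello       # per-instance memory and disk usage versus the quota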
satch
Got cf-173 deployed using bosh-lite, and any app I deploy gives me the error below. I tried all the above troubleshooting steps but they don’t give any more info. Help please?
Starting app cf-scale in org me / space development as admin…
REQUEST: [2014-06-18T22:40:59-07:00]
PUT /v2/apps/d08aeec4-4d8e-4d69-8a91-58b2f6c5205c?async=true&inline-relations-depth=1 HTTP/1.1
Host: api.10.244.0.34.xip.io
Accept: application/json
Authorization: [PRIVATE DATA HIDDEN]
Content-Type: application/json
User-Agent: go-cli 6.1.2-6a013ca / linux
{"state":"STARTED"}
RESPONSE: [2014-06-18T22:41:33-07:00]
HTTP/1.1 400 Bad Request
Connection: close
Content-Length: 154
Content-Type: application/json;charset=utf-8
Date: Thu, 19 Jun 2014 05:41:33 GMT
Keep-Alive: timeout=20
Server: nginx
X-Content-Type-Options: nosniff
X-Vcap-Request-Id: 15567140-8f74-4f12-7185-bcce985a4cb3::76a1e0f7-3db6-46b6-8d90-b751f21b07e0
{"code":170001,"description":"Staging error: failed to stage application:\nError downloading: Response status: unknown\n","error_code":"CF-StagingError"}
FAILED
Server error, status code: 400, error code: 170001, message: Staging error: failed to stage application:
Error downloading: Response status: unknown