Saturday, March 10, 2018

Today's deployment: outage/downtime postmortem

Normally I don't like to deploy new releases on Saturdays. 

Generally speaking, I like to push code on Sunday, or even better, on Monday mornings. Mostly this is because those are low-usage times when there are only one or two active users (if any) in production.

As regular readers of this blog and users of this website know, I am bad at software. And, like most people who are bad at software (but insist on doing it anyway), I expect my deployments to be total shit-shows and I like to give myself a nice, long window to untangle whatever messes I inevitably make when I do end up having to deploy code.

At any rate, I went against my better judgment and decided to roll today's release out while there were a dozen or so users in production and, of course, I ended up causing an hour-long outage. 

And so I just took ten minutes to write a postmortem that explains exactly what happened.

Mostly I did this to publicly shame myself (because Shame is the Great Educator), but some of you may find it illuminating. 

So hit the jump for the play-by-play of how I broke the website this morning, if you're into that kind of thing.

And thanks for using the Manager!

Alright, so here's how it went down:
  • First, when I tried to pull the latest release down, I ended up having to merge in some hot-fix (monkey patch) code that I forgot to push from the production host back to git. Of course, I screwed this up and had to stash some things. (There's a sketch of the pre-deploy sanity check I should have run at the end of this post.)
  • While merging in production, I ended up causing problems and having to restart the API a bunch of times. Normally the API is fine if you restart it once and just let it live its life. Restarting it a bunch of times in a row while people are making requests, however, tends to piss it off.
  • So, during the numerous restarts of the API that I had to do to monkey patch the issues I created by merging in production, a handful of MongoDB connections got left open and, during one (fateful) API restart, db errors started cascading. (The second sketch at the end of this post shows the health check I should have been running.)
  • I did not realize this at the time, of course, so for about 10 minutes, overall application performance (i.e. anything that needed to use the database) was SEVERELY degraded. I looked in the logs while writing this post and saw HTML render times (not API response times, but just the HTML renders) of 11-15 seconds, which is insane, since those usually take a second or less.
  • All of which is to say that it took me a minute, but, once I realized that the DB was tits-up, I stopped the API, hoping that open connections would close and things would stabilize.
  • With the API stopped, I then spent about five minutes trying to manually untangle Mongo by closing ports, killing processes, etc.
  • Needless to say, my "click stuff and see if it fixes it" approach to DB maintenance failed spectacularly and resulted in a MongoDB service that refused to start.
  • At that point, I realized I was going to have to do a reboot (since I hadn't done one in almost 100 days and needed kernel fixes for Meltdown/Spectre that I had been putting off applying, because I am bad at system administration, much like I am bad at software). The last sketch at the end of this post shows the check that would have warned me about this before the deploy.
  • Turns out that, even though the production host is kind of a hotrod, building all the kernel code it needs to run a) pegs both cores of the processor and b) takes, like, half an hour. For that half-hour dist-upgrade/kernel-build window, during which I couldn't restart Mongo, I basically just had to sit on my hands.
Eventually, about 50-something minutes later, the dist-upgrade finished; I restarted the machine and everything came back clean.
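
For the curious, here's roughly the kind of pre-deploy sanity check that would have caught the unpushed hot-fix code before I started merging in production. It's just a Python sketch; the repo path and branch name are placeholders, not the Manager's actual layout.

```python
#!/usr/bin/env python3
"""Pre-deploy sanity check (hypothetical): refuse to pull a release onto a
production checkout that still has uncommitted changes or unpushed hot-fix
commits. The path and branch below are placeholders."""

import subprocess
import sys

REPO_DIR = "/srv/manager"   # placeholder for the production checkout
BRANCH = "master"           # placeholder for the deploy branch


def git(*args):
    """Run a git command in the production checkout and return its stdout."""
    result = subprocess.run(
        ["git", "-C", REPO_DIR, *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def main():
    # Uncommitted (or half-stashed) changes show up here.
    dirty = git("status", "--porcelain")
    if dirty:
        sys.exit("Working tree is dirty; commit or stash first:\n" + dirty)

    # Hot-fix commits made on the host but never pushed back to origin.
    git("fetch", "origin", BRANCH)
    unpushed = git("log", "--oneline", f"origin/{BRANCH}..HEAD")
    if unpushed:
        sys.exit("Unpushed commits on the production host:\n" + unpushed)

    print("Clean tree, nothing unpushed; safe to pull the release.")


if __name__ == "__main__":
    main()
```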
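
Similarly, here's a rough sketch of the MongoDB health check I could have run instead of closing ports and killing processes at random. The connection URI is a placeholder, and it assumes a mongod new enough to answer the serverStatus and currentOp admin commands.

```python
#!/usr/bin/env python3
"""Quick MongoDB health check (hypothetical): how many connections are open,
and what operations are still running. The URI is a placeholder."""

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)

# serverStatus reports how many client connections mongod is currently holding.
conns = client.admin.command("serverStatus")["connections"]
print(f"open connections: {conns['current']}, available: {conns['available']}")

# currentOp lists in-progress operations; anything still running long after an
# API restart is a good candidate for the cascading db errors.
for op in client.admin.command("currentOp").get("inprog", []):
    secs = op.get("secs_running", 0)
    if secs > 5:
        print(f"slow op ({secs}s): {op.get('op')} on {op.get('ns')}")
```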
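
And finally, a little sketch of the uptime/mitigation check that would have told me, before the deploy, that a long kernel build was in my future. It assumes a Linux host whose kernel is new enough to expose /sys/devices/system/cpu/vulnerabilities (patched kernels do; mine clearly wasn't, which is what the missing-directory branch is for).

```python
#!/usr/bin/env python3
"""Check host uptime and whether the running kernel reports Meltdown/Spectre
mitigations, so the reboot decision gets made before a deploy, not mid-outage."""

from pathlib import Path

# Days since the last reboot, straight from /proc/uptime.
uptime_seconds = float(Path("/proc/uptime").read_text().split()[0])
print(f"up for {uptime_seconds / 86400:.0f} days")

# Patched kernels expose their mitigation status in this sysfs directory;
# if it's missing, the running kernel predates the Meltdown/Spectre fixes.
vuln_dir = Path("/sys/devices/system/cpu/vulnerabilities")
if not vuln_dir.is_dir():
    print("no vulnerabilities directory: kernel predates the mitigations")
else:
    for entry in sorted(vuln_dir.iterdir()):
        print(f"{entry.name}: {entry.read_text().strip()}")
```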
