Wade Armstrong

Did you get called for an issue? This checklist will help you make it through. Do it in order.

  1. If you were woken up, they can wait 2 minutes while you start some coffee. Don't make them wait much longer, but get that coffee started!
  2. Join chat. Announce your presence and which team you're here for.
  3. Slow down. Take a few deep breaths.
  4. If this is an incident, make sure the Incident Manager (IM) explains to you exactly what the current understanding of the situation is.
  5. If this is not an incident, feel free to ask the NOC to clarify what they see going on.
  6. Remember that you may be able to fix the problem, but your biggest job is to route the problem to other people who can fix it.
  7. Ask questions to confirm that they're actually dealing with your technology, and not some other part of the stack.
  8. Once you know it's your tech, ask questions to make sure you understand the details of the issue:
  9. If, at this time, you don't understand what's going on or don't understand the technology, call your backup (or the tertiary).
  10. Make sure that you agree this is a real emergency. If you don't think the degradation is that serious, speak up and explain why. You may be right; you're the expert!
  11. Take some more deep breaths. Look away from the screen. Make sure you have water/caffeinated beverage/etc. Take a sip. Slow down.
  12. Remember that you may be able to fix the problem, but your biggest job is to route the problem to other people who can fix it.
  13. If it is an emergency, identify specifically which component is the problem.
  14. As you're working, if you're unsure about anything, ask the IM, or anyone else whom you recognize/trust, offline. Everyone wants to help!
  15. When looking for causes, note that, during an emergency, only 2 things can be done, so focus on solutions involving these:
  16. If you have a fix you're confident in, speak up.
  17. If you don't have a fix you're comfortable with, it's OK to ask for help. It's OK to act as a router rather than fix the problem directly.
  18. Write up what you did in the on-call diary.