[Image: fire against a dark background]

My First Time Taking Down Production

“Still missing permissions. Can you add these?”
“Added… hey guys, is there any way that what we’re doing here could cause production to go down?”
“No, shouldn’t be, why?”
“CS is reporting an outage, and… fuck.”

Allow me to describe for you the feeling of realizing you are responsible for knocking out the major SaaS application that pays your bills. When people describe harrowing experiences as feeling like ice in your veins, they aren’t lying. I got cold and started to tremble. A background track of OhGodOhGodOhGodOhGod played in my brain, underneath which a current of I’mGoingToLoseMyJobAndBeHomeless ran swift and strong. I felt like I was going to throw up or pass out, maybe both. I’ve had a close call or two in my life; I know the feeling of primal afraid-for-your-life terror, and this wasn’t too far off.

What Happened

“I broke it oh God I don’t know what I did but this thing just appeared out of nowhere-” I said at a million miles per hour in the DevOps Slack huddle.
“You… don’t know what you did,” my teammate, A, replied slowly and much too calmly for the situation at hand.
“No, yes, I mean-” I took a deep breath. Enough panic, time to handle business. “I know what I did, but I don’t know how it caused this to happen,” I said. I shared my screen and explained as quickly and clearly as I could (and more succinctly than I’m about to explain it to you). Let’s back up an hour…

A team was trying to run their code, but the database user we had created didn’t have enough permissions for what they needed to do. On a call with my manager and our senior DevOps engineer, I drafted a quick script. They approved it. I tried to run it, but got a syntax error. They had to go, so I said I’d do some more research and hopped on the call with the team I needed to help, to let them know what was going on. Turns out their Tech Lead had the answer (as she always does): my script had one word too many. I deleted that word, ran the script, updated the user. They tried to run their code again – permissions error. The user needed to be able to perform a few more actions. I modified my script with those actions and ran it again.

At this point in the story, you may have a few questions, such as: Why in God’s name were you testing on production? Or perhaps, How did anyone think it was a good idea to modify a user that is responsible for performing multiple automated processes on the platform in the middle of the day on a Thursday? Or maybe just, WTF is wrong with you?? All valid questions, dear reader. To be honest, I’d been on DevOps for six months, and in that time nothing I did felt like it made an impact. Sure, there were little quality-of-life things for the team, but mostly it was small improvements, icing on a cake that was already delicious. I was so excited to finally feel like I was making an actual, tangible contribution that I didn’t pause to think about anything else. I flew too close to the sun and got burned.

I don’t want to get too specific since this is related to my job, but essentially, the third time I ran the script, something happened that hadn’t happened the first two times. It barred this very important user from performing its duties, which in turn stopped the automated processes for which it was responsible. Identifying and fixing the problem did not take long, but it felt like an eternity.

[Image: “This is fine” meme of a dog in a hat sitting at a table surrounded by flames, saying “This is fine”]

The Aftermath

As I was owning up to my grave error and taking responsibility with my manager and teammates, the team I had been helping chimed in on Slack to come to my defense. They, too, had seen my script and hadn’t found anything amiss with it, certainly nothing that would have caused what happened to happen. They also attested to the fact that the script had been run twice without incident.

Meanwhile, the DevOps team held a postmortem to determine how we would avoid a recurrence of this issue. My manager initially put the blame on my script and recommended we institute a stricter review process, but several others on the call pointed out that he himself had reviewed the script before it was run even once, along with the senior DevOps engineer and everyone on the team I was helping; no amount of prior review would have detected an issue, and we would have had the same problem. Besides, none of us (to this day) could figure out how my dinky little 5-line script had the result it had, and only on the third time it was run. Our SRE wisely called attention to the fact that testing on production is never a good idea, and I added that I shouldn’t have been touching a critical user’s permissions during regular business hours.

No one was shouting, or angry, or hostile in any way. In fact, everyone was so kind and understanding, and I truly didn’t feel like I deserved it. I was beating myself up, and I expected everyone else to beat me up, too. I think it was their kindness more than anything (maybe also a little adrenaline crash) that made me cry, as quietly and discreetly as possible. In a Slack huddle with my manager and two more senior DevOps teammates, I mustered up the courage to ask, “Since you guys have experience with this type of thing… do you think I’m going to lose my job?”

I feel like I need to interject here, because I know I’m coming off pretty dramatic. Allow me to explain a few key points informing my experience:
1. At a previous job, people were fired all the time with no warning, no PIPs, no indication that they were on thin ice whatsoever, and then the next thing you knew they’d be standing in the parking lot with a box of their personal effects and a case of emotional whiplash. I spent six years at this job and there were maybe a handful of days I didn’t think I was going to get fired, even though I was an exemplary employee.
2. Since entering the workforce full time, the longest stretch of time for which I have been unemployed is three weeks, when the owner of the small company I was working for decided to close up shop to focus on her family. I spent those weeks very much panicking that I would never find another job ever again, because my previous two job searches (for my first job after college and then this job that had just come to an end) had taken several months and literally hundreds of applications each. From those I only got a handful of interviews, and I accepted the first offer I got each time.
3. You know that safety net you have, that person you can borrow money from or crash on their couch or call for a ride at 2am on a Wednesday because everything is going wrong? That’s me. I’m that person. Losing my job, my security and stability, means letting down people who depend on me to be their security and stability when things go sideways. Even if everyone is doing okay right now, there’s just no telling when that could change.

My worst-case-scenario-obsessed brain already had me convinced that I was going to get fired, never find another job in tech, and live out the rest of my days as a disappointment to my friends, family, and boot camp instructors. In response, my teammates both literally laughed out loud and then launched into tales of “I remember my first time taking down production”, which eased my mind greatly. As I’m sure you guessed, I did not get fired.

I hesitate to call this a rite of passage, because obviously it is not something to strive for, and you can definitely be a top-notch DevOps Engineer without ever taking down production. However, there is a part of me that feels like I crossed into new territory this day. With great ability to help others comes great potential to cause an outage, and brand newbies don’t have either of those, so… maybe, just maybe, it indicates… progress?

In any event, I learned that I truly work with the kindest and most supportive humans on the planet, who have my back and support my growth even when that means learning from a mistake. I’m choosing to see the silver lining on this one. Hopefully the next time you make a mistake you can remember this post and think “At least I didn’t take down production”, and if you did take down production… welcome to the club!

</ XOXO>

[Photo credit: Ricardo Gomez Angel via Unsplash]