DevOps Day in the Life: On Call Edition
Early last year I wrote a post about one specific, randomly-chosen day in my life as a Jr. DevOps Engineer, as a (hopefully) more useful take on the question “What’s a day in your role like?” than the typical “Every day is different” or “Coffee, code, meeting, more coffee, more code, rinse, repeat”. Today I’m back with another day in the life, this time an on-call Tuesday. As a reminder, I don’t get too specific when discussing topics directly related to my job in order to maintain confidentiality/privacy/all that good stuff.
How on-call works at my company: There are five of us on the DevOps team and we are assigned to be on call for one-week stretches on a rotating basis. The on-call person is responsible for granting access requests, addressing requests that come in to the DevOps Slack channel, and responding to after-hours alarms.
I start my day at 7:30am. No requests yet, so I have some time to work on my own tickets. I’m working on a project related to role-based access, specifically a command line tool to make it quick and easy for developers to generate a token to use as a password for connecting to the database. It sounds simple, and it is until you get into the intersections with other tools we use for access and authentication. Sometimes it’s enough just to have working code; in this case, it needs to work, be foolproof to use, and not require a bunch of steps that will be a hassle for our dev team. When I need a quick break from banging my head against the same issue, I take ten minutes to embrace my millennialism and whip up some avocado toast for breakfast. I take it back to my desk and munch while working.
When the first request comes in to the DevOps Slack channel I pause what I’m working on and switch over to that. It’s an easy request for something I’ve done before, which is a nice way to ease into an on-call day. Then we have our team standup, in which our manager shares his screen and walks through the in-progress tickets on our kanban board and everyone gives their updates, and then we discuss tickets in the backlog that are on deck once the current in-progress tickets are complete.
The next request that comes in for DevOps is tougher and not something I’ve encountered before, and when I check logs I’m not seeing what I would expect to see. I tag one of my fellow DevOps team members for help, but before they can respond, a tech lead pops into the thread with the good old “turn it off and back on” solution. After some monitoring it’s clear that the solution was successful, so I make a note in my “Lessons From On Call” Google doc about what the error was and how the tech lead solved it so I can do it next time.
As I mentioned in my last “day in the life” post, I prefer to eat at my desk and use my lunch break for a workout or errands or to go outside and touch grass. Today I head to the gym in my apartment complex and take a brisk walk on the treadmill while watching YouTube videos. Sometimes I watch educational videos on tools like Docker and Ansible, but when I’m on call I prefer to just completely clear my head and watch fashion, mixology, travel vlogs, hip hop dance choreography… anything to distract myself from how sweaty and out of breath I am. Afterwards I make a salad and take it back to my desk to resume the workday.
Some access request tickets have been approved, so I tackle those first. One of them is for a server I don’t recognize, so I grab one of my team members and he teaches me how to use a different server as a jump server to connect to the one I need. I update our access-granting documentation so other team members will be aware of this workaround as well if they also can’t SSH directly to the server in question.
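For anyone unfamiliar with the jump server pattern, here is a minimal sketch in Python. It assumes OpenSSH's `-J` (ProxyJump) option is available, and the username and hostnames are hypothetical placeholders; it illustrates the general idea rather than the actual setup described above.

```python
import shlex


def build_jump_ssh_command(user, jump_host, target_host):
    """Return the argv list for reaching target_host by hopping through jump_host."""
    # -J is OpenSSH's ProxyJump option: ssh connects to the jump host first,
    # then tunnels the session through it to the final target.
    return ["ssh", "-J", f"{user}@{jump_host}", f"{user}@{target_host}"]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmd = build_jump_ssh_command(
        "alice", "jump.example.internal", "restricted.example.internal"
    )
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(cmd))
```

Building the command as a list and only printing it keeps the sketch safe to run anywhere; passing the same list to `subprocess.run` would open the connection for real.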
Next up is a request having to do with missing logs. I check the place where I think those logs should be coming from and see green lights across the board, but when I try to search them, I see the same error the developer who reported the issue saw. I reach out to another more experienced teammate to ask for guidance, and he shows me how to investigate the issue. I had been looking in the wrong place originally, but when I look where he shows me, I can see the problem. We fix it together and I take copious notes in my “Lessons” doc.
The next request comes to my DMs, which happens sometimes because the on-call person is listed in the Slack channel description. New secret environment variables need to be added to several environments. I put up a PR and request a review, and a question comes up in the review, so I have to go back to the requestor for some discussion, then back to my reviewer with the answer. The PR is approved and merged.
Another DM request, the final request of the day, is an access issue. It takes a little bit of back-and-forth to figure out the exact permissions needed and then testing to be sure those permissions are sufficient to perform the tasks the user needs to perform. Once the user confirms they have what they need, that’s a wrap on the day… for now.
The Middle of the Night
My phone chimes. Glasses-less and bleary-eyed I check the time: 12:35am. There’s a text message from OpsGenie showing a high CPU alert on one of our database clusters. I acknowledge the alert in Slack on my phone, then climb out of bed as carefully as I can. My husband sleepily asks what’s going on, to which I respond “Just on-call things”, and I hear him settle back down in bed. Meanwhile, I go to my desk and check out the cluster in question. Sure enough, CPU utilization is high as a kite. I take some screenshots of database metrics and drop them in the alerts Slack channel, then hunt down the long-running SELECT queries and start killing them. Once CPU usage is back at an acceptable level I make note of the offending queries in the channel and return to bed, where I do word searches on my phone until I can fall asleep again in preparation to tackle tomorrow.
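The "hunt down the long-running SELECT queries" step can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual procedure: the post does not name the database engine, so the sketch assumes PostgreSQL, where active sessions are visible in `pg_stat_activity` and a backend can be ended with `pg_terminate_backend(pid)`. The `ActiveQuery` type, the function names, and the five-minute threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ActiveQuery:
    """One row of session data, e.g. as read from pg_stat_activity (assumed)."""
    pid: int
    runtime_seconds: float
    query: str


def find_long_running_selects(rows, threshold_seconds=300):
    """Return SELECT queries at or above the threshold, longest-running first."""
    offenders = [
        row for row in rows
        if row.runtime_seconds >= threshold_seconds
        and row.query.lstrip().upper().startswith("SELECT")
    ]
    return sorted(offenders, key=lambda row: row.runtime_seconds, reverse=True)


def terminate_statement(pid):
    """Build the SQL that would end the given backend (PostgreSQL assumed)."""
    # int() guards against anything other than a numeric pid reaching the SQL.
    return f"SELECT pg_terminate_backend({int(pid)});"


if __name__ == "__main__":
    # Made-up sample rows; a real run would read these from the database.
    sample = [
        ActiveQuery(101, 1200.0, "SELECT * FROM orders"),
        ActiveQuery(102, 12.0, "SELECT 1"),
        ActiveQuery(103, 900.0, "UPDATE users SET active = true"),
        ActiveQuery(104, 4000.0, "  select * from events"),
    ]
    for offender in find_long_running_selects(sample):
        # Print for review rather than executing anything.
        print(offender.pid, offender.runtime_seconds, terminate_statement(offender.pid))
```

Printing the statements for review before running them also leaves a record of the offending queries, which fits the habit of noting them in the alerts channel afterwards.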
[Photo credit: Claudia Mañas via Unsplash]