You probably already know that this Monday many Google services went down worldwide for about 45 minutes. Notably, the affected services were among Google's most important and most widely used, including Gmail, YouTube, Google Cloud and the Play Store. I don’t think Google Search was down though, because that would have caused a different level of outrage; to many people it would be like saying the World Wide Web is down.
The outage seems to have run approximately from 5:20 PM IST to 5:50 PM IST (3:50 AM PST to 4:20 AM PST). That would not be a big deal for a regional website, but since it was Google, it was all over the news.
In this post, I’m going to go through the things that software engineers supporting the application do in the case of an outage. But before we cover that, let me go through who exactly handles these outages.
"On-call is the practice of designating specific people to be available at specific times to respond in the event of an urgent service issue, even though they are not formally on duty." (atlassian.com)
The above sentence sums up the crux of it. To add to that, the people on on-call support are usually members of the team that works on the application. In some cases there is a dedicated team, also called Managed Services, whose job is to support the application and ensure it stays up-to-date; in that case, on-call support is handled by them.
So now that we know who actually handles these outages, let us go through the lifecycle of an outage.
Step 1: Incident is Reported
An incident could occur at any time of the day, and people may or may not be awake at that time. Some incidents are first reported by the monitors that are in place, which notify the on-call team. In other cases, someone (internal or external) notices the issue and reports it on some platform. I believe big tech companies rarely go through the latter case, but I might be wrong.
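To make the monitoring case a bit more concrete, here is a minimal sketch of the kind of rule a monitor might apply before waking someone up. The function name `should_page` and the "three failed checks in a row" rule are my own assumptions for illustration, not how any particular monitoring tool works.

```python
def should_page(check_results, failures_needed=3):
    """Decide whether to notify the on-call team.

    check_results: list of booleans, oldest first, True meaning the
    health check passed. The rule (hypothetical) is to page only when
    the most recent `failures_needed` checks all failed, so that a
    single blip does not wake anyone up.
    """
    if len(check_results) < failures_needed:
        return False
    return not any(check_results[-failures_needed:])


# Example: two passes followed by three failures triggers a page.
history = [True, True, False, False, False]
print(should_page(history))
```

Requiring several consecutive failures trades a little detection time for fewer false alarms at 3 AM.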
Anyway, the project manager and the on-call team for the project go through the report and acknowledge it. Each severity level has its own time limits for acknowledging and for resolving an issue, and these limits are based on the SLAs signed.
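As a rough illustration of severity-based time limits, the sketch below computes the acknowledge and resolve deadlines for an incident. The severity names and the minute values in `SLA_MINUTES` are made up for the example; real numbers come from whatever SLA has actually been signed.

```python
from datetime import datetime, timedelta

# Hypothetical SLA table: minutes allowed to acknowledge and to resolve,
# per severity. These values are placeholders, not from any real SLA.
SLA_MINUTES = {
    "SEV1": {"acknowledge": 15, "resolve": 240},
    "SEV2": {"acknowledge": 30, "resolve": 480},
    "SEV3": {"acknowledge": 120, "resolve": 2880},
}


def sla_deadlines(severity, reported_at):
    """Return the acknowledge/resolve deadlines for an incident."""
    limits = SLA_MINUTES[severity]
    return {
        step: reported_at + timedelta(minutes=minutes)
        for step, minutes in limits.items()
    }


def is_breached(deadline, now):
    """True once the current time has passed the deadline."""
    return now > deadline


reported = datetime(2020, 12, 14, 17, 20)
deadlines = sla_deadlines("SEV1", reported)
print(deadlines["acknowledge"], deadlines["resolve"])
```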
Step 2: Investigation begins
Now that the incident has been acknowledged, the on-call developer takes a look at the issue and tries to see what the exact impact is. If the impact is in fact huge, like login not working, then the developer gets right into finding the root cause and the solution.
If the impact is very small, the incident is de-escalated and left to be looked at during work hours. Everybody goes back to sleep.
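The decision in this step can be sketched as a tiny triage function. Everything here is an assumption for illustration: the `Impact` fields, the 5% threshold and the two outcome labels are invented, and in practice this call is made by a person, not a script.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    core_flow_broken: bool     # e.g. login not working
    affected_user_pct: float   # rough estimate, 0 to 100


def triage(impact, pct_threshold=5.0):
    """Return what to do next based on the estimated impact.

    A broken core flow, or a large enough share of affected users,
    means the investigation continues right away; anything else is
    deferred to work hours.
    """
    if impact.core_flow_broken or impact.affected_user_pct >= pct_threshold:
        return "investigate_now"
    return "defer_to_work_hours"


print(triage(Impact(core_flow_broken=True, affected_user_pct=80.0)))
print(triage(Impact(core_flow_broken=False, affected_user_pct=0.5)))
```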
Step 3: Find root cause and possible solution
Seeing that it is a big deal, the developer tries to brainstorm what could be causing the issue. It could be a patching job that was run on the infrastructure shortly before the incident was reported. It could also be a bug in code that went live some time ago but is coming into view only now, because the feature had not been used before.
This step is the most important piece of the handling: the faster you find the root cause, the faster you can resolve the issue and limit the impact on users. Once the developer has found the root cause and a solution, they work with the lead and manager to push the fix.
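One common starting point for that brainstorming is simply listing what changed shortly before the incident. The sketch below does that over an in-memory list; the `Change` record and the 24-hour window are assumptions for the example, and a real setup would pull this from a deployment or change log.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    kind: str            # e.g. "patch", "deploy", "config"
    description: str
    applied_at: datetime


def recent_suspects(changes, incident_at, window_hours=24):
    """Return changes applied within the window before the incident,
    most recent first, since the latest change is often the first suspect."""
    start = incident_at - timedelta(hours=window_hours)
    hits = [c for c in changes if start <= c.applied_at <= incident_at]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)


incident = datetime(2020, 12, 14, 17, 20)
log = [
    Change("deploy", "release of new feature", datetime(2020, 12, 1, 10, 0)),
    Change("patch", "OS patching job", datetime(2020, 12, 14, 16, 0)),
    Change("config", "index maintenance", datetime(2020, 12, 14, 9, 30)),
]
for change in recent_suspects(log, incident):
    print(change.kind, change.description)
```

Note that the old deploy falls outside the window here. For the "code went live earlier but the feature was never used" case, the window has to be widened, which is part of why that kind of bug is harder to track down.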
Step 4: Push the fix
Pushing the fix could be as simple as adding back an index that got deleted by mistake, or it could mean making a code fix, which has to go through the entire software lifecycle quickly. If the issue is severe, some of the steps in the lifecycle might be skipped.
If it is a simple fix that does not need a code release, it can usually be done by the developer, provided they have the required access. If not, they will have to contact the person with the access to make the fix.
In the case of a code fix, it will have to go out as a hotfix release, which is out-of-cycle, unlike the other planned releases.
Step 5: Verify and close incident
Now that the fix has been made, the developer verifies the fix and closes the incident, which in turn notifies the incident reporter and all other stakeholders. Simple as that.
Step 6: Prepare RCA document
This step is not done while handling the incident; it is done after the incident has been fixed and closed.
An RCA (Root Cause Analysis) document answers questions such as when the issue occurred, who reported it, what the cause and impact were, what mitigations were applied and what was learnt.
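To show how those questions map onto a document, here is a minimal sketch of an RCA record that renders itself as plain text. The class name `RCAReport`, its fields and the sample values are all made up for illustration; teams usually have their own template.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class RCAReport:
    started_at: datetime
    resolved_at: datetime
    reported_by: str
    cause: str
    impact: str
    mitigations: list = field(default_factory=list)
    lessons: list = field(default_factory=list)

    def duration_minutes(self):
        """Whole minutes between start and resolution."""
        return int((self.resolved_at - self.started_at).total_seconds() // 60)

    def render(self):
        """Render the report as plain text, one question per line."""
        lines = [
            f"Started:     {self.started_at:%Y-%m-%d %H:%M}",
            f"Resolved:    {self.resolved_at:%Y-%m-%d %H:%M}",
            f"Duration:    {self.duration_minutes()} minutes",
            f"Reported by: {self.reported_by}",
            f"Cause:       {self.cause}",
            f"Impact:      {self.impact}",
            "Mitigations:",
            *[f"  - {m}" for m in self.mitigations],
            "Lessons learnt:",
            *[f"  - {item}" for item in self.lessons],
        ]
        return "\n".join(lines)


# Sample values are invented for the example.
report = RCAReport(
    started_at=datetime(2020, 12, 14, 17, 20),
    resolved_at=datetime(2020, 12, 14, 17, 50),
    reported_by="monitoring alert",
    cause="index deleted by mistake",
    impact="login not working",
    mitigations=["added the index back"],
    lessons=["restrict who can drop indexes"],
)
print(report.render())
```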
And this is how software engineers handle outages.
Hit the like button if you liked it, and write a comment about your experiences with handling incidents.
Until then, adios!