Postmortem Error 500 — General

Jean Victoria
2 min read · Feb 27, 2020

Issue Summary:

  • Start time: 20/02/2020 11:21 AM (GMT−5), End time: 20/02/2020 11:53 AM (GMT−5).
  • The page was returning a 500 status code, so it was down for 100% of users.
  • Root cause: a typo in the settings file, in the name of another script that it requires.

Timeline:

  • 20/02/2020 11:23 AM (GMT−5): The issue was detected by several users, who contacted the customer service department.
  • 20/02/2020 11:25 AM (GMT−5): The issue was escalated to the System Engineering team and the SRE.
  • 20/02/2020 11:27 AM (GMT−5): The team looked at the running processes on the server using ‘ps auxf’ to see whether any unwanted child process was running in the background and keeping the server from responding.
  • 20/02/2020 11:29 AM (GMT−5): Since the processes looked fine, the team ran ‘strace’ on several process IDs, including those of the web server hosting the page.
  • 20/02/2020 11:30 AM (GMT−5): strace on one of the server processes showed an infinite loop of system calls, so the team looked at the second server process, which was blocked in the accept4() system call, waiting for an incoming connection.
  • 20/02/2020 11:35 AM (GMT−5): When the team ran curl against the page’s IP while strace was attached to that second server process, strace displayed many errors. One of them said that the file index.html didn’t exist, but that was a misleading clue: adding the file to the folder did not make the page work.
  • 20/02/2020 11:40 AM (GMT−5): After carefully reading all the errors returned by strace, the team saw that another one also reported a missing file: the path the server was trying to open ended in a different, uncommon file extension.
  • 20/02/2020 11:47 AM (GMT−5): In the settings file, /var/www/html/settings.php, line 89 was trying to require that faulty file name. The team replaced the wrong extension with the correct one.
  • 20/02/2020 11:53 AM (GMT−5): The team then only had to restart the server, and the page was back up as normal.
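
The strace step in the timeline came down to reading through many errors to find the one missing file that mattered. As a rough sketch of how that reading could be narrowed, the snippet below pulls the paths out of strace lines that failed with ENOENT. The sample lines and paths are made up for illustration and are not taken from the incident; the function name `missing_paths` is also hypothetical.

```python
import re

# Matches a syscall line that failed with ENOENT and captures its first quoted path.
ENOENT_LINE = re.compile(r'\w+\(.*?"(?P<path>[^"]+)".*\)\s+=\s+-1 ENOENT')


def missing_paths(strace_output: str) -> list[str]:
    """Return the paths strace reported as missing, in first-seen order."""
    seen: list[str] = []
    for line in strace_output.splitlines():
        match = ENOENT_LINE.search(line)
        if match and match.group("path") not in seen:
            seen.append(match.group("path"))
    return seen


# Made-up sample in strace's usual output format (assumption, not incident data).
SAMPLE = '''\
accept4(4, 0x7ffd0a1b2b90, [128], SOCK_CLOEXEC) = 12
lstat("/srv/example/index.html", 0x7ffd0a1b2c30) = -1 ENOENT (No such file or directory)
open("/srv/example/includes/loader.xyz", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/srv/example/settings.php", O_RDONLY) = 13
'''

if __name__ == "__main__":
    for path in missing_paths(SAMPLE):
        print(path)
```

A short list of missing paths makes it easier to tell a harmless miss, like the index.html one, from the file the server actually needs.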

Root cause and resolution:

  • A typo was found in the settings file, /var/www/html/settings.php, which kept the web server from serving the page properly.
  • The issue was solved by fixing that typo and restarting the server.
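
A typo of this kind, where a settings file requires a file name that does not exist, can in principle be caught before it reaches the server. The following is a minimal sketch under stated assumptions, not the team's tooling: it only understands require/include statements whose argument is a plain quoted string, it resolves relative paths next to the settings file (PHP's real include_path lookup is more involved), and the names `find_missing_requires` and `demo` are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Literal-string require/include statements only; expressions are skipped.
REQUIRE_LITERAL = re.compile(
    r"""\b(?:require|include)(?:_once)?\s*\(?\s*(['"])(?P<target>.+?)\1"""
)


def find_missing_requires(php_file: Path) -> list[tuple[int, str]]:
    """Return (line number, target) for each literal require whose file is absent."""
    missing = []
    text = php_file.read_text(encoding="utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        for match in REQUIRE_LITERAL.finditer(line):
            target = Path(match.group("target"))
            if not target.is_absolute():
                # Simplification: resolve relative targets next to the file itself.
                target = php_file.parent / target
            if not target.is_file():
                missing.append((number, match.group("target")))
    return missing


def demo() -> list[tuple[int, str]]:
    """Build a throwaway settings file with one good and one broken require."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "present.php").write_text("<?php\n")
        settings = root / "settings.php"
        settings.write_text(
            "<?php\n"
            "require_once 'present.php';\n"
            "require('absent.xyz');\n"
        )
        return find_missing_requires(settings)


if __name__ == "__main__":
    print(demo())
```

Run against a real settings file, a check like this would report the line number of each broken require, which is the information the team had to dig out of strace by hand.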

Corrective and preventative measures:

  • Settings files should not be writable by anyone other than the SRE, to keep small typos like the one in this incident from being introduced.
  • Change permissions on /var/www/html/settings.php to read-only for the team.
  • Carefully review all settings files for other typos of this kind.
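
The permission change above could be sketched as follows, assuming a POSIX system and assuming the SRE is the owner of the file, so that only the group and other write bits need to be cleared. The function names are hypothetical, and the demo works on a temporary file rather than on /var/www/html/settings.php.

```python
import os
import stat
import tempfile


def drop_group_other_write(path: str) -> int:
    """Clear the group and other write bits on path; return the new mode bits."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    updated = current & ~(stat.S_IWGRP | stat.S_IWOTH)
    os.chmod(path, updated)
    return stat.S_IMODE(os.stat(path).st_mode)


def demo_permissions() -> str:
    """Start from a world-writable temp file and show the resulting mode."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "settings.php")
        with open(path, "w") as handle:
            handle.write("<?php\n")
        os.chmod(path, 0o666)  # deliberately too permissive for the demo
        return oct(drop_group_other_write(path))


if __name__ == "__main__":
    print(demo_permissions())
```

Masking out only the write bits leaves the existing read permissions untouched, so the team can still read the file while only its owner can edit it.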
