Debriefing and timeline of recent outage (March 31)

Updated by Edcel Ceniza

We have recently confirmed that there was an outage that caused a fraction of our DD-US customers' instances to be unreachable. Below is the timeline and the details of the actions we took on our end.

25th of March 2022

  • Received the first report of the issue from one customer. The performance degradation lasted one hour, starting at 5AM NZT.
    • Confirmed that both the TECH portal and the admin portal experienced slowness. Ruled out the PSA API as the culprit.
    • Checked the database server's CPU and memory graphs for the past 24 hours; nothing obvious was detected. AWS is unable to report accurate memory usage, and CPU stayed under 60% (a sketch of this kind of check follows this list).
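
For reference, here is a minimal sketch of how this kind of check can be scripted against CloudWatch, assuming the database runs on EC2 and boto3 is available; the region and instance ID are placeholders. CloudWatch only exposes memory metrics when the CloudWatch agent is installed on the instance, which is why memory usage was not visible from AWS.

```python
# Hypothetical sketch: pull 24 hours of CPU utilisation for the database EC2
# instance from CloudWatch. Memory is not shown because CloudWatch does not
# collect it unless the CloudWatch agent is installed on the instance.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")  # assumed region

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=24)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=300,  # 5-minute datapoints
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"avg={point['Average']:.1f}%", f"max={point['Maximum']:.1f}%")
```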

30th of March 2022

  • The same customer and a few other customers reported slowness. The degradation lasted from 5AM NZT until 11:30AM NZT.
    • We noticed high CPU and memory usage on one of the IIS VMs, which pointed to high usage from two of our customers.
    • Checked queries per second and ruled out the possibility of high usage by automation; there were little to no queries made by automation via API key for these two instances.
    • Checked all queries against those two instances and ruled out the possibility of high usage by clients or the TECH portal; their usage was around 13 requests per second.
    • Confirmed the database web studio was slow.
    • Confirmed database queries were slow.
    • The database was handling 700~800 requests per second. This is expected; no issue here.
    • Restarting the database did not resolve the issue.
    • Upgraded the database server from 8 cores and 64GB of memory to 16 cores and 127GB of memory at 4:30PM NZT (around 9PM US West Coast time).
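
For context, resizing an EC2-hosted database in place looks roughly like the sketch below; the region, instance ID, and instance types are assumptions (an 8-core/64GB class such as r5.2xlarge moving to a 16-core/128GB class such as r5.4xlarge), not our actual configuration.

```python
# Hypothetical sketch: resize the database EC2 instance in place.
# The instance ID and instance types are placeholders for illustration only.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region
instance_id = "i-0123456789abcdef0"                 # placeholder

# The instance type can only be changed while the instance is stopped.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# 8 cores / 64GB -> 16 cores / 128GB (assumed r5.2xlarge -> r5.4xlarge).
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "r5.4xlarge"},
)

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```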

31st of March 2022

  • Several more customers reported slowness. For the majority of them, their instances were now almost unusable.
    • Ruled out the under-resourced server theory, as the hardware upgrade did not resolve the issue.
    • Checked disk I/O; no slowness detected.
    • Checked the event log; nothing obvious detected, no errors.
    • The database web studio was still slow.
  • Called the database provider and had an online session with them.
    • Noticed an error report about the database being unable to write to its journal file, but that happened the previous Sunday.
    • Created a memory dump of the database.
    • The provider recommended upgrading the database from 5.1 to 5.3.
    • Restarted the database; the issue persisted.
  • 30 minutes after the call with the database provider, we decided to upgrade the database, since several DD instances were affected and unusable.
    • Shut down the database VM and took an AWS snapshot of the disk. Since this is a large region, the snapshot took almost 2 hours (see the snapshot sketch after this list).
    • Upgraded the database and restarted it.
    • Disabled the database's daily backup.
    • The slowness was gone.
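
For reference, a minimal sketch of the pre-upgrade disk snapshot, assuming the data disk is an EBS volume; the region and volume ID are placeholders. Snapshot duration grows with the amount of data to copy, which is consistent with the roughly two hours observed here.

```python
# Hypothetical sketch: snapshot the database's EBS data volume before upgrading.
# The region and volume ID are placeholders for illustration only.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # placeholder data volume
    Description="Pre-upgrade snapshot of DD-US database disk, 31 March 2022",
)

# Wait for completion; a large volume can take hours, so widen the waiter limits.
ec2.get_waiter("snapshot_completed").wait(
    SnapshotIds=[snapshot["SnapshotId"]],
    WaiterConfig={"Delay": 60, "MaxAttempts": 240},
)
print("Snapshot complete:", snapshot["SnapshotId"])
```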

Further diagnosis:

  • From the database's historical error logs, we ruled out a relationship between the errors and the incident. The write-lock error on the journal file is related to the weekly Windows update.
  • The database's daily backup setting displayed incorrectly after the upgrade to 5.3, but the database provider confirmed it is only a UI error.
  • From the database log, the daily backup only took around 1 hour each day, starting at 6AM UTC. The slowness was not caused by the backup.

We received the first diagnosis from the database provider (9PM NZT): based on the memory dump, there seems to be corruption in the database's heap memory. They cannot 100% confirm whether it is related to the incident, and they also cannot reproduce it on the latest version.

1st of April, 5AM NZT

  • Verified that the database is no longer suffering performance degradation. The database studio also loads very fast. No other issues have been reported so far. We will monitor further.

Plan

  • Re-enable the daily backup on the 2nd of April, 4PM NZT (Saturday).
  • Monitor the database through automation and check whether we still suffer from slowness (a monitoring sketch follows this list).
  • Set up a schedule to upgrade all other databases to the latest version.
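
A minimal sketch of what such automated monitoring could look like: a periodic probe that times a lightweight request against the database's HTTP endpoint and flags slow responses. The URL, threshold, and interval below are assumptions for illustration, not our actual configuration.

```python
# Hypothetical sketch: periodically probe the database endpoint and flag slowness.
# The URL, threshold, and interval are illustrative placeholders, not real settings.
import time
import urllib.request

PROBE_URL = "https://db.example.internal/ping"  # placeholder health endpoint
THRESHOLD_SECONDS = 2.0                         # assumed "slow" threshold
INTERVAL_SECONDS = 60

while True:
    started = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=30) as response:
            response.read()
        elapsed = time.monotonic() - started
        status = "SLOW" if elapsed > THRESHOLD_SECONDS else "ok"
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {status} {elapsed:.2f}s")
    except Exception as exc:  # timeouts and connection errors also indicate trouble
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} ERROR {exc}")
    time.sleep(INTERVAL_SECONDS)
```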

---------------

2nd of April, 2022

  • We tried to re-enable the database backup; as soon as it was enabled, the database suffered a huge performance penalty.
  • Disabled the backup; the performance penalty disappeared.
  • Contacted the database provider with our findings and scheduled further diagnosis.

4th of April, 2022 NZT (USA Sunday).

  • Spent the whole day diagnosing the database.
  • Created many diagnostic packages based on the provider's instructions, both with the backup enabled and with it disabled.
  • Verified how long it takes for performance to hit the bottleneck after enabling the backup.
  • Disabled the backup and are waiting on further instructions from the provider.

5th of April 2022, night NZT (USA Monday night)

  • Spent 1.5 hours with the provider diagnosing the live system. The conclusion is that the bottleneck might not be the database's fault but rather related to disk I/O starvation.
  • Reduced concurrent backups to 1; no more performance issues.
  • Preparing for further diagnosis of the I/O.

6th of April 2022

  • Checking the disk I/O configuration and preparing a solution.

6th of April 2022 night NZT (USA Tue night).

  • The performance issue started to arise again. Completely disabled the backup.

7th of April 2022, morning (USA Wednesday morning)

  • The performance issue recurred even without the backup enabled. We decided to switch the disk to a different type with higher consistent throughput (double the throughput on US-2).

7th of April 2022, 3PM NZT (11PM New York time, 8PM Los Angeles)

  • Stopped the US-1 region database and started a whole-disk backup.
  • Updated the US-1 region database to the latest version.
  • Switched the US-1 region's disk to a newer type, doubling the throughput compared to before (see the disk sketch after this list).
  • Moved 7 customers from the US-2 region to the US-1 region to decrease the load on US-2.
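
For reference, a hedged sketch of what such a disk switch can look like on AWS, assuming the data disk is an EBS volume being moved to gp3, whose throughput can be provisioned independently of size; the region, volume ID, and figures are placeholders, not our real values.

```python
# Hypothetical sketch: move the database's EBS data volume to gp3 and raise its
# provisioned throughput. Region, volume ID, and figures are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # placeholder data volume
    VolumeType="gp3",
    Iops=6000,        # assumed target IOPS
    Throughput=500,   # MiB/s, assumed to be roughly double the previous figure
)

# The change proceeds while the volume stays attached; progress can be polled
# until the modification state reaches "optimizing" or "completed".
resp = ec2.describe_volumes_modifications(VolumeIds=["vol-0123456789abcdef0"])
print(resp["VolumesModifications"][0]["ModificationState"])
```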

8th of April 2022, 8AM NZT (5PM)

  • Verified the last 12 hours of I/O load and database performance; no impact, no performance issues.

8th of April 2022, 3PM NZT (11PM New York time, 8PM Los Angeles)

  • The US-1, US-2, EU-1, and CA-1 regions now use an additional disk for database temp files, decreasing the I/O stress on the existing disk (see the sketch after this list).
  • The EU, CA, and AU regions' I/O throughput has now been doubled.
  • The US-1 region's I/O throughput has now been quadrupled compared to last week.
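
A rough sketch of adding such a dedicated temp-file disk, again assuming EBS; the region, size, IDs, and device name are placeholders. After attaching, the disk still has to be initialised in Windows and the database's temp-file path pointed at it, which is product-specific and not shown here.

```python
# Hypothetical sketch: create and attach a dedicated EBS volume for database
# temp files. Region, IDs, size, and device name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

volume = ec2.create_volume(
    AvailabilityZone="us-west-2a",  # must match the database instance's AZ
    VolumeType="gp3",
    Size=200,                       # GiB, assumed size for temp files
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # placeholder database instance
    Device="xvdf",                     # placeholder device name
)
# Next steps (not shown): initialise the disk in Windows and point the
# database's temp-file path at the new volume.
```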

Future Plan:

  • On the afternoon of the 9th of April, the AU region database should get its new temp-file disk to decrease database load.
  • On the 9th of April, we will re-enable the backup on the US-2 region's database and monitor it closely.
  • Closely monitor the US regions over the next 4 weeks to see whether the problem occurs again now that I/O throughput has been increased.
  • Based on this event, we will set up a process to review our infrastructure once a month, to make sure our hardware can sustain growth in customers and usage.
