Postmortem - ldap outage 10th Jan 2018

Cintia Del Rio and Sparsha.


Each timeline entry below lists the time as UTC / Nepal / Australia, followed by comments.



The Ubuntu Meltdown kernel patch was released that same day:

https://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/SpectreAndMeltdown

4:00 UTC / 9:45 Nepal / 15:00 Australia
Sparsha was migrating the crowd database to salima.
4:25 UTC / 10:10 Nepal / 15:27 Australia
12 new patches showing up in datadog for ako (the ldap VM).
5:00 UTC / 10:45 Nepal / 16:00 Australia
Sparsha detected the error in ako.
The error log showed: "docker: error creating overlay mount to invalid argument".
The /data folder could be seen, but no tests were done to write files to it.
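
A quick write test would have shown whether the volume was already failing at this point. A minimal sketch (not taken from the incident, just the kind of check that could have been run):

    # try writing a small file on the data volume and flushing it to disk;
    # an "Input/output error" here points straight at the underlying disk
    touch /data/.write-test && sync && rm /data/.write-test && echo "write OK"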

5:09 UTC / 10:54 Nepal / 16:09 Australia
No more data sent to datadog from ako.
5:28 UTC / 11:13 Nepal / 16:28 Australia (?)
Sparsha attempted to restart ako. No ssh access was available afterwards.
Quite possibly the restart applied the new kernel.
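
For reference, a reboot switches to whichever kernel was installed most recently; a sketch of how to compare the running kernel with what is on disk (standard Ubuntu commands, not from the incident):

    # kernel the machine is currently running
    uname -r
    # kernels installed on disk; if a newer one is present,
    # the next reboot will boot into it
    dpkg -l 'linux-image-*' | grep '^ii'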

5:29 UTC / 11:14 Nepal / 16:29 Australia
ako reported as down in datadog.
5:49 UTC / 11:34 Nepal / 16:49 Australia
Skype call and comms sent. Backups were checked and had been successfully uploaded to S3.
6:15 UTC / 12:00 Nepal / 17:15 Australia
The VM didn't respond to a reboot or a hard reboot from openstack.
Cintia decided to recreate the VM, keeping the data volume (to avoid data loss).
The working belief was that the meltdown kernel patch had been applied and caused the trouble.
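
The reboot attempts were roughly of this shape (a sketch using the openstack CLI; "ako" is the VM name as used above):

    # soft reboot first, then a hard reboot when the guest doesn't come back
    openstack server reboot ako
    openstack server reboot --hard ako
    # check what openstack thinks the server state is
    openstack server show ako -c status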

6:45 UTC / 12:30 Nepal / 17:45 Australia
VM recreated, but the data partition was corrupted and couldn't be mounted:


    [  275.419650] sd 2:0:0:1: [sdb] tag#9 CDB: Write(10) 2a 00 00 80 09 08 00 00 08 00
    [  275.419651] blk_update_request: I/O error, dev sdb, sector 8390920
    [  275.422618] sd 2:0:0:1: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [  275.422621] sd 2:0:0:1: [sdb] tag#10 Sense Key : Aborted Command [current]
    [  275.422623] sd 2:0:0:1: [sdb] tag#10 Add. Sense: I/O process terminated
    [  275.422625] sd 2:0:0:1: [sdb] tag#10 CDB: Write(10) 2a 00 00 80 08 08 00 00 08 00
    [  275.422627] blk_update_request: I/O error, dev sdb, sector 8390664
    [  275.492766] JBD2: recovery failed
    [  275.492774] EXT4-fs (sdb1): error loading journal
    [  275.498303] VFS: Dirty inode writeback failed for block device sdb1 (err=-5).

After several different attempts, Cintia decided the filesystem was beyond repair and tried to reformat the partition (even though it meant the data would be lost).
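
The repair attempts were roughly along these lines (a sketch, assuming the data partition is /dev/sdb1 as in the dmesg output above and is normally mounted on /data):

    # try to replay the journal and repair the filesystem
    # (/dev/sdb1 taken from the dmesg excerpt above)
    fsck.ext4 -f -y /dev/sdb1
    # if that works, retry the mount
    mount /dev/sdb1 /data
    # last resort attempted here: reformat the partition, accepting the data loss
    mkfs.ext4 /dev/sdb1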

6:55 UTC / 12:40 Nepal / 17:55 Australia
Cintia decided the volume should be deleted instead, because it was not possible to repartition it.
Cintia attempted to convince terraform to recreate the volume in OpenStack.

7:20 UTC / 13:05 Nepal / 18:20 Australia
Even after several attempts, the old data volume couldn't be deleted from OpenStack (neither via the openstack CLI nor via terraform).
Cintia removed the disk from the terraform state file and forced a new volume.
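
The terraform part looked roughly like this (a sketch; the resource address is illustrative, not the real one from our configuration):

    # stop terraform from tracking the broken volume
    # (resource address below is a placeholder)
    terraform state rm openstack_blockstorage_volume_v2.ldap_data
    # a fresh plan/apply then creates a new, empty volume in its place
    terraform plan
    terraform apply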

7:40 UTC / 13:25 Nepal / 18:40 Australia
Machine reporting to datadog again.

8:18 UTC / 14:03 Nepal / 19:18 Australia
Backup files being copied to /data after ansible finished.

8:27 UTC / 14:12 Nepal / 19:27 Australia
Backups restored, but crowd refused to connect to ldap; telnet from the crowd and ID dashboard machines failed as well.

8:40 UTC / 14:25 Nepal / 19:40 Australia
We discovered UFW is configured incorrectly in ansible for the ldap server. How was that even working before?

9:00 UTC / 14:45 Nepal / 20:00 Australia
UFW reconfigured and reloaded. Telnet appears to be working again.
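
The fix amounted to a UFW rule of roughly this shape on the ldap server, followed by a connectivity check from the client side (a sketch; 389 is the standard LDAP port and the source range is a placeholder):

    # on ako: allow LDAP traffic from the crowd / ID dashboard hosts
    # (10.0.0.0/24 is a placeholder source range)
    ufw allow from 10.0.0.0/24 to any port 389 proto tcp
    ufw reload
    ufw status numbered
    # from crowd / the ID dashboard: confirm the port answers again
    telnet ako 389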

9:08 UTC / 14:53 Nepal / 20:08 Australia
Comms sent: logins to JIRA and Confluence are working again.

 



Actions:

  • Raise a ticket to Jetstream about the old data volume (to be deleted/investigated).
  • Add pingdom alerts for ldap being down.
  • Work out how we could have known the disk had failed (see the note below).
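
On the last point: the kernel logs I/O errors when a volume starts failing (as in the dmesg excerpt above), so one cheap option is to alert on those. A sketch of the kind of check that could be wired into datadog or cron (the alerting side is left out):

    # any error-level I/O messages since boot are a strong signal the volume is failing
    dmesg --level=err,crit | grep -i 'I/O error' && echo "disk errors detected, page someone"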