Postmortem - ldap outage 10th Jan 2018

Cintia Del Rio and Sparsha.


Each timeline entry below lists the time as UTC / Nepal / Australia, followed by comments.



The Ubuntu Meltdown kernel patch was released that same day:

https://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/SpectreAndMeltdown

4:00 UTC / 9:45 Nepal / 15:00 Australia
Sparsha was migrating the crowd database to salima.
4:25 UTC / 10:10 Nepal / 15:27 Australia
12 new patches showing up in datadog for ako (the ldap VM).
5:00 UTC / 10:45 Nepal / 16:00 Australia
Sparsha detected the error in ako.
The error log showed: "docker: error creating overlay mount to invalid argument".
The /data folder could be seen, but no tests were done to write files to it.
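
A quick write test would have shown whether the volume was already failing at this point. A minimal sketch (not taken from the incident, just the kind of check that could have been run):

    # try writing a small file on the data volume and flushing it to disk;
    # an "Input/output error" here points straight at the underlying disk
    touch /data/.write-test && sync && rm /data/.write-test && echo "write OK"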

5:09 UTC / 10:54 Nepal / 16:09 Australia
No more data sent to datadog from ako.
5:28 UTC / 11:13 Nepal / 16:28 Australia (?)
Sparsha attempted to restart ako. No ssh access was available afterwards.
Quite possibly the restart applied the new kernel.
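
For reference, a reboot switches to whichever kernel was installed most recently; a sketch of how to compare the running kernel with what is on disk (standard Ubuntu commands, not from the incident):

    # kernel the machine is currently running
    uname -r
    # kernels installed on disk; if a newer one is present,
    # the next reboot will boot into it
    dpkg -l 'linux-image-*' | grep '^ii'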

5:29 UTC / 11:14 Nepal / 16:29 Australia
ako reported as down in datadog.
5:49 UTC / 11:34 Nepal / 16:49 Australia
Skype call and comms sent. Backups were checked and had been successfully uploaded to S3.
6:15 UTC / 12:00 Nepal / 17:15 Australia
The VM didn't respond to a reboot or a hard reboot from openstack.
Cintia decided to recreate the VM, keeping the data volume (to avoid data loss).
The working belief was that the meltdown kernel patch had been applied and caused the trouble.
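
The reboot attempts were roughly of this shape (a sketch using the openstack CLI; "ako" is the VM name as used above):

    # soft reboot first, then a hard reboot when the guest doesn't come back
    openstack server reboot ako
    openstack server reboot --hard ako
    # check what openstack thinks the server state is
    openstack server show ako -c status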

6:45 UTC / 12:30 Nepal / 17:45 Australia
VM recreated, but the data partition was corrupted and couldn't be mounted:


    [  275.419650] sd 2:0:0:1: [sdb] tag#9 CDB: Write(10) 2a 00 00 80 09 08 00 00 08 00
    [  275.419651] blk_update_request: I/O error, dev sdb, sector 8390920
    [  275.422618] sd 2:0:0:1: [sdb] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [  275.422621] sd 2:0:0:1: [sdb] tag#10 Sense Key : Aborted Command [current]
    [  275.422623] sd 2:0:0:1: [sdb] tag#10 Add. Sense: I/O process terminated
    [  275.422625] sd 2:0:0:1: [sdb] tag#10 CDB: Write(10) 2a 00 00 80 08 08 00 00 08 00
    [  275.422627] blk_update_request: I/O error, dev sdb, sector 8390664
    [  275.492766] JBD2: recovery failed
    [  275.492774] EXT4-fs (sdb1): error loading journal
    [  275.498303] VFS: Dirty inode writeback failed for block device sdb1 (err=-5).

After several different attempts, Cintia decided the filesystem was beyond repair and tried to reformat the partition (even though it meant the data would be lost).
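
The repair attempts were roughly along these lines (a sketch, assuming the data partition is /dev/sdb1 as in the dmesg output above and is normally mounted on /data):

    # try to replay the journal and repair the filesystem
    # (/dev/sdb1 taken from the dmesg excerpt above)
    fsck.ext4 -f -y /dev/sdb1
    # if that works, retry the mount
    mount /dev/sdb1 /data
    # last resort attempted here: reformat the partition, accepting the data loss
    mkfs.ext4 /dev/sdb1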

6:55 UTC / 12:40 Nepal / 17:55 Australia
Cintia decided the volume should be deleted instead, because it was not possible to repartition it.
Cintia attempted to convince terraform to recreate the volume in OpenStack.

7:20 UTC / 13:05 Nepal / 18:20 Australia
Even after several attempts, the old data volume couldn't be deleted from OpenStack (neither via the openstack CLI nor via terraform).
Cintia removed the disk from the terraform state file and forced a new volume.
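
The terraform part looked roughly like this (a sketch; the resource address is illustrative, not the real one from our configuration):

    # stop terraform from tracking the broken volume
    # (resource address below is a placeholder)
    terraform state rm openstack_blockstorage_volume_v2.ldap_data
    # a fresh plan/apply then creates a new, empty volume in its place
    terraform plan
    terraform apply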

7:40 UTC / 13:25 Nepal / 18:40 Australia
Machine reporting to datadog again.

8:18 UTC / 14:03 Nepal / 19:18 Australia
Backup files being copied to /data after ansible finished.

8:27 UTC / 14:12 Nepal / 19:27 Australia
Backups restored, but crowd refused to connect to ldap; telnet from the crowd and ID dashboard machines failed as well.

8:40 UTC / 14:25 Nepal / 19:40 Australia
We discovered UFW is configured incorrectly in ansible for the ldap server. How was that even working before?

9:00 UTC / 14:45 Nepal / 20:00 Australia
UFW reconfigured and reloaded. Telnet appears to be working again.
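
The fix amounted to a UFW rule of roughly this shape on the ldap server, followed by a connectivity check from the client side (a sketch; 389 is the standard LDAP port and the source range is a placeholder):

    # on ako: allow LDAP traffic from the crowd / ID dashboard hosts
    # (10.0.0.0/24 is a placeholder source range)
    ufw allow from 10.0.0.0/24 to any port 389 proto tcp
    ufw reload
    ufw status numbered
    # from crowd / the ID dashboard: confirm the port answers again
    telnet ako 389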

9:08 UTC / 14:53 Nepal / 20:08 Australia
Comms sent: logins to JIRA and Confluence are working again.

 



Actions:

  • Raise a ticket to Jetstream about the old data volume (to be deleted/investigated).
  • Add pingdom alerts for ldap being down.
  • Work out how we could have known the disk had failed (see the note below).
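
On the last point: the kernel logs I/O errors when a volume starts failing (as in the dmesg excerpt above), so one cheap option is to alert on those. A sketch of the kind of check that could be wired into datadog or cron (the alerting side is left out):

    # any error-level I/O messages since boot are a strong signal the volume is failing
    dmesg --level=err,crit | grep -i 'I/O error' && echo "disk errors detected, page someone"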