Oh crap, what fun the last few days have been

Posted: February 6, 2013

I should have seen it coming.

First, I’m on the phone to Microsoft *again* because it turns out the issue I raised to return the replacement Touch Cover wasn’t actually created. So the whole process of doing that again.

Then an email from Twitter telling me my account was one of the ones that had its details compromised and that I needed to change my password. A fairly simple process I guess, but with all the various devices and apps that hook into Twitter it seemed more painful than it should have been.

All fairly tedious issues, but sorted out in the end.

Then Monday morning I get to work and notice I’m not picking up email from home. I check the “site status” monitor and it reports no connectivity. OK, probably just the ADSL dropping out or something. So I called my wife and asked her to check… things went downhill very quickly as the reason became clear.

My server at home is a self-built box running Hyper-V with lots of memory and disk. After issues in the past with disks failing I went with a 4 disk RAID10 setup this time around. Lots of speed and redundancy as well. The theory goes you can “lose” two disks (as long as they are not both from the same mirrored pair) and the array will keep working. Those odds seemed pretty good.

So what happens when the RAID controller decides to set *THREE* of the disks to offline? Indeed.
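To put numbers on those odds, here is a minimal sketch that enumerates every combination of offline disks for a 4 disk RAID10. It assumes the usual layout of two mirrored pairs striped together; the disk numbering and the function names are made up for illustration and have nothing to do with any particular controller.

```python
from itertools import combinations

# Assumed layout: disks 0+1 form one mirror, disks 2+3 form the other,
# and the two mirrors are striped together.
MIRROR_PAIRS = [(0, 1), (2, 3)]


def array_survives(failed):
    """The array survives if every mirror pair still has one working disk."""
    return all(any(d not in failed for d in pair) for pair in MIRROR_PAIRS)


def survival_counts(n_failed):
    """Return (surviving combinations, total combinations) for n_failed disks offline."""
    disks = [d for pair in MIRROR_PAIRS for d in pair]
    combos = list(combinations(disks, n_failed))
    ok = sum(1 for c in combos if array_survives(set(c)))
    return ok, len(combos)


if __name__ == "__main__":
    for n in (1, 2, 3):
        ok, total = survival_counts(n)
        print(f"{n} disk(s) offline: array survives in {ok} of {total} cases")
```

Under that assumption, two disks offline is survivable in 4 of the 6 possible combinations, while any three disks offline always takes out a complete mirror pair.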

There were no errors reported to indicate why the disks went offline, no disk failures or anything obvious. It’s a mystery, but it happened, so let’s deal with the aftermath.

The disks all went back online when asked, no drama. The array came alive again and all the files appeared intact. As it turns out though, the virtual machines that had been running on those disks had other ideas.

All the VMs that were running reported various “unexpected shutdown” errors of one kind or another, and all *except* for the Windows Server 2012 servers sorted themselves out.

For some reason, the 2012 servers took it really hard. The biggest bummer is that one of them is my Domain Controller. (Yes, I only have one, perhaps a story for another time.)

So the DC didn’t restart. It needed to run a disk check to fix errors, but regardless kept stopping with a bluescreen and restarting. I could start in the recovery console, and see the files, but nothing seemed to get it booting again.

A few more disk checks later and the BSOD error changed. Change is good, right? So I tried starting in AD restore mode again and it worked. It’s now looking like a problem with the AD database. Damn, but should be recoverable.

All attempts to repair the DB failed. Messages along the lines of “the Jet database is corrupted and has to be restored from backup”. OK, backup from previous night reported as good, so system restore it is.

Tried to load Windows Backup, it tells me there are no backups available!?!? Hmm, further checking determines that the Backup Catalogue is corrupted. Trying to recreate it fails as well. Not good. Try to open the backup Windows Image (.vhdx) directly and it reports as corrupted as well and won’t open. This is getting nasty.

Lots and lots and lots of things attempted and nothing worked. By this time the stress, anxiety, etc. were well and truly in control.

This DC is one I built about 3.5 months ago, and I still had the old DC VM (although demoted) turned off. I just hadn’t deleted it yet. I have been able to power it up, but a simple restore from its (working!) backups hasn’t worked yet because those backups are older than the tombstone lifetime for the DC (>60 days). I’m now in the process of setting the system date and time back to trick it into thinking it is still last year. Hopefully then I at least get the AD back up in some capacity.
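The date arithmetic behind that trick is simple enough to sketch. This is only an illustration of the age check, not how AD itself implements it: the 60 day figure is the one quoted above, and the function name and example dates are hypothetical stand-ins for a backup roughly 3.5 months old.

```python
from datetime import date, timedelta

# Tombstone lifetime as quoted in this post; the real value depends on the forest.
TOMBSTONE_LIFETIME = timedelta(days=60)


def backup_is_usable(backup_taken, clock_date, lifetime=TOMBSTONE_LIFETIME):
    """True if the backup is no older than the tombstone lifetime,
    measured against whatever date the system clock reports."""
    return clock_date - backup_taken <= lifetime


if __name__ == "__main__":
    backup_taken = date(2012, 10, 20)  # hypothetical date of the old DC backup
    real_today = date(2013, 2, 4)      # hypothetical "now"
    wound_back = date(2012, 11, 1)     # hypothetical clock setting after winding it back

    print(backup_is_usable(backup_taken, real_today))
    print(backup_is_usable(backup_taken, wound_back))
```

With the real date the backup is well past the limit; with the clock wound back to shortly after the backup was taken, the same check passes, which is the whole idea.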

I have avoided thinking what’s going to happen regarding the Exchange server, but let’s just get the DC sorted out first.

