Disaster Recovery – A personal experience

Posted: February 7, 2013 in HyperV, Solved, Windows General

Let me say this first of all – do not try this at home. The following is not a “strategy”; it was a last desperate act to recover *something* from *nothing*. In this case it seems to have paid off, but there is still a lot of work to do.

Following on from my previous post about the serious disk crash experience…

After spending more time than I should have curled in a ball, rocking back and forth and hoping something would just start working by itself, I finally realised that I didn’t really have any proper options left. I was at the point where *anything* was an option; there was no longer a wrong way to do things.

So to summarise:

The Hyper-V host machine decided to drop three of the four disks from the RAID10 array that all the virtual machines’ hard disks reside on. I still can’t find any reason for this. Re-activating the disks started the array again and everything was visible. However, the sudden loss of the virtual disks really messed up the virtual machines themselves.

Interestingly enough, the 2008, 2008R2 and even the single 2003 servers all managed to recover after a simple “Unexpected shutdown” disk check on startup.

The two Windows Server 2012 machines on the other hand really did not handle the disk drop well at all. One of them was the Domain controller.

  1. The AD database on the domain controller was corrupted, so the server wouldn’t start (blue screen)
  2. Several recovery console chkdsk and sfc scans later, I was finally able to boot it into AD recovery mode (Directory Services Restore Mode)
  3. In recovery mode, I tried to repair the AD database. No luck. Not even a forced repair would work
  4. I loaded up Windows Server Backup to do a system state restore – the backup catalogue was corrupted!
  5. I tried to restore the catalogue from the backup – no valid catalogue found, so the backup had been corrupted as well
  6. I tried to open the disk image (vhdx) that Windows Backup makes when it does a state capture – invalid disk

So maybe there was some deep-dive type magic a guru could do on the database, but it was way beyond my abilities, and spending $400 for a Microsoft Incident on the extremely slim hope they could recover it didn’t appeal.

A quick diversion into recent history – this 2012 DC was a new one I built last October (a bit over 4 months ago), and I migrated all the roles and services over to it from my then 2008R2 DC. The 2008R2 DC was then demoted and shut down. It was still sitting there in an off state, and I hadn’t “got around” to deleting it yet.

So back to the present. I figured there hadn’t been a lot of changes (more on that later) since I swapped the DCs, so maybe it would be possible to start up the old one and restore one of its backups from before I swapped it all over to the new server. As this machine hadn’t been running at the time, all of its systems and backups were perfectly fine.

The main challenge with this revolved around the Active Directory tombstone lifetime. In my domain it was 60 days, and I hadn’t changed it at all (60 days is the default for forests first created on older versions of Windows Server; newer forests generally default to 180 days). This is effectively a protection mechanism that AD has to prevent old objects being accidentally re-introduced into the directory. In this case I didn’t care, but because the backups I needed to use were older than 60 days, some smoke and mirrors were required.
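To put some numbers on that, here is a minimal Python sketch of the age check. It is only date arithmetic to illustrate the problem, not anything AD itself runs. The 60-day figure is the value quoted in this post, the 5 Oct backup date is the illustrative one used in the steps below, and the function name and the exact behaviour at the 60-day boundary are my own assumptions.

```python
from datetime import date, timedelta

# Assumption: 60 days, the value quoted in this post. Check your own forest.
TOMBSTONE_LIFETIME = timedelta(days=60)


def backup_is_too_old(backup_date: date, today: date,
                      lifetime: timedelta = TOMBSTONE_LIFETIME) -> bool:
    """Return True if the backup is older than the tombstone lifetime.

    Hypothetical helper for illustration only.
    """
    return today - backup_date > lifetime


# Illustrative dates: backup from 5 Oct 2012, recovery attempted 7 Feb 2013.
backup = date(2012, 10, 5)
today = date(2013, 2, 7)

print((today - backup).days)             # 125
print(backup_is_too_old(backup, today))  # True
```

At 125 days the backup is roughly double the 60-day limit, which is why the clock trickery in the next section was needed.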

SOLUTION

So the steps I ended up doing:

  1. Boot the old server into AD recovery mode
  2. Set the Guest VM date to the day after the backup I want to restore (e.g. backup=5 Oct, set date to 6 Oct)
  3. Make sure the settings for the Hyper-V guest have the Host Time synchronisation NOT SELECTED
  4. Do the system state restore with the Authoritative AD restore – DO NOT AUTOMATICALLY RESTART the server
  5. Shut down any other virtual machines that are running (don’t just pause or suspend them, shut them down)
  6. Change the date on the Hyper-V HOST computer to the day after the guest machine (7 Oct). The reason for this is that Hyper-V will ALWAYS set the guest computer’s “BIOS” clock to match the host whenever the guest restarts, EVEN IF THE GUEST TIME SYNC OPTION IS DISABLED! If you don’t do this, the restore will fail: the guest picks up the host’s date, sees that the AD it is trying to restore is older than the tombstone timeout (60 days), and fails to restore correctly
  7. Restart the Virtual DC
  8. When it finishes booting, you will see a message saying the restore has completed successfully. This is good
  9. Set the date on the HOST back to the correct date. You can now restart any other Guest machine
  10. Wait 15 minutes for the “new” AD to settle a bit, then jump the date forward 1 month
  11. Wait another 15 minutes and then jump the date another month
  12. Repeat until you get to the correct date
  13. Do an NTP time sync in the guest to a valid internet time source

So that all seemed to work fine. I don’t know if the 1 month date jumps I did were really necessary, but I felt it safer to at least give the AD some time to adjust, even if 1 month at a time is still a huge jump.
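For what it’s worth, the clock juggling in the steps above can be sketched as a small Python helper that works out which dates to set and when. This is just my reading of the procedure expressed as date arithmetic; the function names are made up for the example, and it does not touch any real clocks.

```python
import calendar
from datetime import date, timedelta


def add_one_month(d: date) -> date:
    """Move forward one calendar month, clamping the day to the month's length."""
    year = d.year + (d.month // 12)
    month = d.month % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def clock_plan(backup_date: date, real_date: date):
    """Return (guest_date, host_date, jumps) for the restore procedure.

    Hypothetical helper: guest_date is set before the restore (step 2),
    host_date is set before restarting the guest (step 6), and jumps are
    the dates to step the guest through afterwards (steps 10 to 12).
    """
    guest_date = backup_date + timedelta(days=1)
    host_date = guest_date + timedelta(days=1)

    jumps = []
    current = host_date  # the guest takes the host's date when it restarts
    while True:
        nxt = add_one_month(current)
        if nxt >= real_date:
            break
        jumps.append(nxt)
        current = nxt
    jumps.append(real_date)  # final jump lands on the real date
    return guest_date, host_date, jumps


guest, host, jumps = clock_plan(date(2012, 10, 5), date(2013, 2, 7))
print(guest)  # 2012-10-06
print(host)   # 2012-10-07
for j in jumps:
    print(j)  # wait about 15 minutes between each of these
```

With the illustrative 5 Oct backup and a real date of 7 Feb, that works out to four jumps: 7 Nov, 7 Dec, 7 Jan and finally 7 Feb, followed by the NTP sync.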

So now we have a working DC, but its database is 4 months behind what all the member servers and workstations are at. This means the secure channel between all the machines is likely broken.

The simplest fix is to just “rejoin” all the computers to the domain again. This will refresh their connection. There is a neat trick I use here to save needing to switch to a workgroup and back to the domain:

  1. Open the System properties on each Member computer to the Domain/Workgroup setting
  2. Change the Domain property from the FQDN to the NetBIOS name of the domain (e.g. Change “CONTOSO.LOCAL” to “CONTOSO”)
  3. The wizard sees this as a change of domain and will then prompt for the account to connect and join with
  4. Reboot the member server

All sorted. The computer will just refresh the existing object in AD, and it’s like nothing ever went wrong.

So now I’m about 80% recovered. A few issues with corrupt SQL Master databases which I need to investigate, but I’ll just rebuild from scratch and reimport my databases if required.

The last BIG challenge is my Exchange server. I just realised that I had also moved my Exchange 2010 to a new server *after* I had demoted the old DC, so my newly recovered domain controller has no knowledge of my new Exchange at all. That will be a new thing to figure out, and a new blog post I think. Fingers crossed.

Comments
  1. Joe Hickman says:

    Beautiful. I was trying to troubleshoot a computer I was remote connected with 1500 miles away. SQL Server service wouldn’t start. I wasn’t looking forward to removing and reinstalling SQL Server over long distance. Your solution worked perfectly. After copying in new versions of the master db files, I just had to attach the application databases and setup security for the user. Thanks.

    By the way, I hate computers too🙂
