I recently worked in a situation where an entire forest root domain had to be recovered. The structure itself was relatively simple. It consisted of two domains: an empty forest root and a child domain with all the users, computers, etc. It also had only about 4,000 users.
But there were two (almost fatal) problems. First, the organization had built only one domain controller (DC) in the root domain. Second, to make matters worse, that DC had not been backed up in more than 10 months. Although the root DC had a RAID-5 disk configuration, disaster struck and two of the drives failed on the same day.
This type of configuration likely resulted from a best practice Microsoft espoused in the early days of Windows 2000. The recommendation at that time was to create an empty root domain so that child domains could be added and removed if the names changed (a domain could not be renamed once it was defined).
This philosophy is no longer followed, however, since multiple domain forests have other complexities: restoring back links between groups and users in cross-domain groups, lingering objects held in the read-only context of Global Catalog servers, and other related issues. To avoid these issues, some organizations collapse their multiple domain structure into a single domain.
In this example, the two domains are Corp.com and EMEA.Corp.com, with Corp-DC1 the domain controller in the root domain and EMEA-DC1 and EMEA-DC2 in the child domain.
Note that all clients -- including users, workstations, and servers -- were unaffected by the issue, giving us time to develop and enact an action plan.
This situation presented several questions and challenges, including:
- I had never seen a case where the forest root domain had to be recovered -- and I couldn't find anyone who had.
- Recovering from a 10-month-old backup would inject lingering objects into the forest on the good child DCs.
- What problems would we face in changing the system time on the root DC when recovering the January backup?
- Would the trust between the Corp.com and EMEA.corp.com domains have to be repaired? Likewise, would secure channel passwords have to be reset?
- Would it be necessary to perform an authoritative restore or not?
- What kind of replication issues would be experienced when the January backup of Corp.com was restored?
Still, there were some positive factors in this disaster as well:
- There were no users or workstations in the root domain – only administrative accounts and domain controllers. As a result, there was little danger of lingering objects being injected when the 10-month-old backup was restored.
- No new AD objects had been created on the root DC since the backup (though changes in the configuration container would be of concern).
- DNS was delegated to the child domain. Therefore, as far as the clients were concerned, EMEA.corp.com was DNS-independent and there were no resources in the parent domain.
The recovery plan
An initial idea was to roll the EMEA DCs back to their January backups, restore the Corp DC, roll the child DCs forward, and then let it all catch up. This 20-step process required several days of downtime, and it was rejected because of its complexity and destructive nature.
We ended up using the following, simpler plan:
- Restore current backups from two child DCs -- and the January backup of the root domain DC -- to three computers on a private network.
- Solve the problems, and then repeat the steps in the production forest.
- Add a second DC in the Corp.com domain.
- Back up all DCs in both domains.
The process took about three weeks and most of the time was spent studying logs, doing the restore, etc. It was deliberate and methodical, as we wanted to make sure everything was done properly. In addition, users experienced no downtime. This means that although the forest seemed precarious without a root domain, it functioned well for user authentication and so forth while we worked on the restore. When we did the production restore, we did it during business hours without affecting the users.
The recovery process
The recovery process consisted of the following steps:
- Obtain three computers and configure them on a private subnet.
- Restore the current system state backups from EMEA-DC1 and EMEA-DC2 to the test computers.
- Restore the January backup from Corp-DC1 to the test computer.
- Set the system time on the January backup of Corp-DC1 to the current date/time.
- Set the tombstone lifetime to 365 (max) to eliminate problems with lingering objects for the time being. Modify the tombstoneLifetime attribute with ADSIEdit at cn=Directory Service,cn=Windows NT,cn=Services,cn=Configuration,dc=pp.
- Set the strict replication consistency registry key to "1" (strict) to prevent lingering objects from replicating.
- Uncheck the global catalog option on Corp-DC1. Re-enable it after replication settles down.
- Health-check the DCs using HPSReports. Work through any errors one by one until the reports are clean:
  - Netdom Trust /verify, to verify the trust between the Corp and EMEA domains.
  - Repadmin /replsum /bysrc /bydest /sort:delta, to test replication on all DCs in the forest.
  - DCDiag /test:DNS /e /v, to test for DNS problems on all DNS name servers in the forest.
  - All event logs.
- Make sure you get 1704 (SCECLI) events in the Application event log indicating Group Policy is being applied. Also, check GPResult output from each machine to check GPO health.
- Ensure you can log on to a computer in the EMEA domain with a Corp.com account -- and vice-versa -- to further verify the trust.
- Add clients from the production EMEA domain to the test EMEA domain, and see if they authenticate.
- Add users and sites on a domain controller in each domain and see if they replicate to all DCs. This tests the domain and configuration NC replication.
- Once all problems are worked out, repeat the steps for the production forest.
- After the production root domain DC (Corp-DC1) is restored, promote a second DC in that domain. (A second DC in the root domain would have prevented this issue.)
- Schedule backups for all four DCs.
- Reset the tombstone lifetime attribute to a minimum of 120 to 180 days. Make sure the strict replication consistency value remains at 1.
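The reasoning behind temporarily raising the tombstone lifetime can be illustrated with a little date arithmetic. The following is a minimal sketch, not part of the original procedure: the function name and the dates are hypothetical, and the 60-day and 180-day figures are the commonly documented Active Directory defaults rather than values taken from this forest.

```python
from datetime import date


def backup_exceeds_tombstone(backup_date, restore_date, tombstone_days):
    """Return True when the backup is older than the tombstone lifetime.

    A backup older than the tombstone lifetime can reintroduce objects
    that the other DCs have already deleted and garbage-collected,
    which is how lingering objects arise.
    """
    age_days = (restore_date - backup_date).days
    return age_days > tombstone_days


# Hypothetical dates: a January backup restored roughly 10 months later.
backup_taken = date(2007, 1, 15)
restore_day = date(2007, 11, 15)

for lifetime in (60, 180, 365):
    too_old = backup_exceeds_tombstone(backup_taken, restore_day, lifetime)
    print(f"tombstone lifetime {lifetime:>3} days: backup too old = {too_old}")
```

Under these assumed dates the backup is older than either default lifetime but still falls inside a 365-day window, which is why the lifetime was raised for the duration of the recovery and lowered again afterward.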
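The strict replication consistency registry value is set on each DC under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters:
ValueName = Strict Replication Consistency
Data Type = REG_DWORD
Value Data = 1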
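For anyone who prefers to script the registry change rather than edit it by hand, one possible approach is to build a `reg add` command line. This is a sketch under assumptions: the helper name `strict_replication_command` is hypothetical, and the script only prints the command so it can be reviewed before being run in an elevated prompt on a DC.

```python
from typing import List

# Registry key that holds the NTDS replication parameters on a DC.
NTDS_PARAMETERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"


def strict_replication_command(enabled: bool = True) -> List[str]:
    """Build the argument list for a reg.exe call that sets the value.

    Nothing is executed here; the caller decides whether to run it.
    """
    return [
        "reg", "add", NTDS_PARAMETERS_KEY,
        "/v", "Strict Replication Consistency",
        "/t", "REG_DWORD",
        "/d", "1" if enabled else "0",
        "/f",  # overwrite an existing value without prompting
    ]


command = strict_replication_command()
print(command)
```

Keeping the command as a list, rather than a single string, avoids quoting problems with the spaces in the value name if it is later passed to `subprocess.run`.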
Verifying the trust with Netdom returned the following:
C:\>netdom trust Corp /domain:EMEA.corp.com /verify
The trust between Corp and EMEA.corp.com has been successfully verified.
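Because the health checks were repeated many times in both the test and production forests, they could be collected into a small script. The sketch below is an illustration only: the `HEALTH_CHECKS` table and `run_health_checks` function are hypothetical names, the command lines are the ones quoted in this article, and the script assumes the tools are on the PATH of the machine it runs on.

```python
import shutil
import subprocess
from typing import Dict, List, Optional

# Command lines taken from the recovery steps; the labels are illustrative.
HEALTH_CHECKS: Dict[str, List[str]] = {
    "trust": ["netdom", "trust", "Corp", "/domain:EMEA.corp.com", "/verify"],
    "replication": ["repadmin", "/replsum", "/bysrc", "/bydest", "/sort:delta"],
    "dns": ["dcdiag", "/test:DNS", "/e", "/v"],
}


def run_health_checks(checks: Dict[str, List[str]]) -> Dict[str, Optional[int]]:
    """Run each check and record its exit code, or None if the tool is missing."""
    results: Dict[str, Optional[int]] = {}
    for name, argv in checks.items():
        if shutil.which(argv[0]) is None:
            results[name] = None  # tool not installed on this machine
            continue
        completed = subprocess.run(argv, capture_output=True, text=True)
        print(completed.stdout)  # the report text still needs to be read
        results[name] = completed.returncode
    return results


if __name__ == "__main__":
    for check, code in run_health_checks(HEALTH_CHECKS).items():
        print(check, "not available" if code is None else f"exit code {code}")
```

The exit code is only a first hint; as described below, most of the real work was reading the reports and event logs and resolving each error in turn.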
The initial results showed a number of errors and warnings in the event logs and some errors in the Repadmin /showrepl reports. Many of these errors occurred because the system was still settling, and after running it overnight, most of the errors fixed themselves. We then worked on the remaining events until all of them were resolved. The test and production environments yielded similar results.
- Since dynamic registration was not enabled, there were some DNS issues. As a result, we had to manually configure some DNS records.
- After the initial restore of the Corp-DC1 DC in the root domain (from the old backup), an assortment of events were found in the directory services event log, including:
- 1869 -- Found a GC in Site-LAN (refers to EMEA-DC1).
- 1655 -- Can't find GC in one of the sites (refers to the EMEA-DC).
- Events 1869 and 1655 were logged for the EMEA then the Corp-DC1 server.
- Some 1311 events.
- Several replication failures related to DNS Lookup failure.
- Using the DCDiag /test:DNS /e /v report, we found DNS worked as expected.
- There were a number of W32Time events -- Event IDs 29, 24 and 22 -- that, without further action, disappeared in time.
- After the old restored Corp-DC1 was brought online, there were a lot of warning and error events at first. We let it go for 12 hours, however, and they all fixed themselves.
A number of 1869 and 1655 events reported difficulty finding a global catalog. In spite of all the events, replication worked, which we confirmed by running Repadmin /replsum /bysrc /bydest /sort:delta.
Overall, the restoration worked extremely well and was relatively error-free. It was accomplished with no downtime and very little risk to the environment. An authoritative restore did not have to be used and the trust did not have to be repaired. We had the confidence to put this plan into production because we had tested it in a test environment. Still, this is one of those situations where you think "this should work" -- but you don't really know until you try it.
Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers. Gary is a Microsoft MVP for Directory Services and formerly for Windows File Systems.