Debugging Windows client logon delays: Narrowing the scope
There are several potential causes of slow client logons. The question is, once you find the source, how do you fix it?
In my previous article, I described the basics of troubleshooting poor client logon performance in Windows. I will now dig a little deeper into how to develop an action plan to eliminate possible causes and, hopefully, find the problem.
Performance, of course, is always a challenge to write about because 1) everyone has a different view of acceptable performance and 2) there are many variables – hardware and software – that can affect performance. I do Active Directory-related troubleshooting for my day job, so that's the context in which I've put this article. I have worked on a number of these issues and will rely on that experience to describe how to attack these problems.
The first thing you need to do is prepare a list of possible causes for slow client logon in general. This could probably be developed into a flow chart, but for now we'll use a couple of lists and refer to them as we diagnose the problem.
Known causes of slow client logon performance
As I wrote in my previous article, here is a quick summary of what I've found can cause client logon delays in Windows. These are not listed in any particular order, and each could be at fault for any given situation:
- Domain controller is unavailable or very busy
- DC overwhelmed by LDAP traffic
- DC also runs Exchange, SQL Server, File/print, etc.
- Client is getting Group Policy from an out-of-site DC
- Network traffic (startup/logon traffic is directly tied to the number of groups the computer and user belong to and the number of GPOs that apply to them -- very predictable)
- Roaming profiles are slow to load
- Inefficient logon scripts
- Inefficient GPOs (filtering, restrictions)
- Large number of GPOs and/or security group memberships
- Network components (drivers, switches, link speeds, dual-homed, network cables, etc.)
- Applications and services are starting on the client at boot
- Antivirus updates, Windows Update downloads
- Faulty images
There are probably more possibilities, but this is a good list to start with.
Now let's examine some questions to ask in order to narrow the scope. This list is in the order that I would ask the questions. Each question is followed by a list of troubleshooting steps to resolve the issue. You will likely find more than one of these will apply, so organize the steps into a logical sequence for an action plan.
- When did this start?
This is tough to pin down because you are relying on calls to the help desk, which may not be entirely accurate: users often just learn to live with these issues. Interview the user and pin down when the problem started. Then look at what changed around that time, such as software installations, network changes, GPO changes or perhaps another problem that was solved with a hotfix. The answer will affect the rest of the questions you ask. (For example, it might be time to move to 64-bit DCs!)
- Who is affected?
This is difficult because once again you have to rely on help desk complaints.
- One user – Investigate other users that are in the same location and security groups, using the same hardware, etc. to make sure the problem is affecting only one user. Focus on local settings, profiles, workstation configurations, groups, and so on.
- Users in only one site – Look for problems at the domain controller or networking issues in the subnet(s). Examine domain controller performance to see if the DC is overwhelmed and can't handle the load. Check the LogonServer environment variable on each client to determine which DC is authenticating it -- don't assume clients are authenticating to a DC in their own site, as this can change. See if the "problem users" are all authenticating to one DC.
- Users across sites – This could be the result of a network issue, new patch installed, etc. Look for something in common among affected users, including when the problem was first seen.
- New clients installed since a certain date – Perhaps these users have a new image or OS?
- Terminal Services users – Look into local vs. roaming profile issues and terminal server load.
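To see whether the "problem users" cluster on one DC, the LogonServer values collected from affected clients can be tallied. The following is a minimal Python sketch, not a finished tool: the function names and the (user, LogonServer) input format are assumptions made for illustration, and the sample names are hypothetical.

```python
import os
from collections import Counter


def current_logon_server(env=os.environ):
    """Return the DC name from LOGONSERVER without the leading backslashes.

    Returns None when the variable is not set (for example, off Windows).
    """
    value = env.get("LOGONSERVER", "")
    return value.lstrip("\\").upper() or None


def tally_by_dc(reports):
    """Count affected users per DC.

    `reports` is an iterable of (user, logon_server) pairs gathered from
    the affected clients; this input format is an assumption of the sketch.
    """
    counts = Counter()
    for _user, server in reports:
        counts[server.lstrip("\\").upper()] += 1
    return counts.most_common()


if __name__ == "__main__":
    print("This client authenticated to:", current_logon_server())
    # Hypothetical sample data
    sample = [("alice", r"\\DC01"), ("bob", r"\\dc01"), ("carol", r"\\DC02")]
    print(tally_by_dc(sample))
```

If one DC dominates the tally, that DC is the first place to check performance and connectivity.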
- Does this happen at the same time every day?
Have the user log on and off at different times during the day, such as 10 a.m., 2 p.m., 7 p.m. or any other time when logon traffic is light. If the problem goes away, then you can focus on network traffic and DC performance during peak logon periods.
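If the test logons are timed, grouping the results by hour makes a peak-period pattern easy to spot. This is a small Python sketch under stated assumptions: the timings are recorded by hand as (ISO timestamp, seconds) pairs, and the function name and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mean_logon_seconds_by_hour(samples):
    """Average logon duration per hour of day.

    `samples` is an iterable of (ISO 8601 timestamp string, seconds) pairs,
    an input format assumed for this sketch.
    """
    buckets = defaultdict(list)
    for stamp, seconds in samples:
        buckets[datetime.fromisoformat(stamp).hour].append(seconds)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


if __name__ == "__main__":
    # Hypothetical measurements
    timings = [
        ("2009-03-02T08:05:00", 180.0),
        ("2009-03-02T08:40:00", 220.0),
        ("2009-03-02T14:00:00", 40.0),
    ]
    for hour, avg in mean_logon_seconds_by_hour(timings).items():
        print(f"{hour:02d}:00  {avg:.0f} s")
```

A large gap between the morning average and the off-peak average supports focusing on network traffic and DC load during peak logon periods.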
- Do you have sites across slow-linked networks?
It is possible – and even common – for clients to authenticate to a local domain controller and get policy from another DC, because SYSVOL is located through DFS, which can pick a server at random. A client can therefore get policy from a DC in a poorly connected site, and the DC chosen can change, so the problem may be intermittent.
I don't know of a fix for this, but I have heard that a possible workaround is to hard code the LogonServer environment variable to a specific DC. I have not done this myself, but it is worth considering. If it works in a test, implement it only on problem clients. To find the DC used for GPO loading, run GPRESULT /v on the client and look in the output.
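Comparing the policy source with the authenticating DC can be scripted once the GPRESULT /v output is saved to a text file. The sketch below is in Python and rests on an assumption: that the output contains lines of the form "Group Policy was applied from: <server>", as seen in English-language output. Verify the label on your own clients before relying on it. The function names are hypothetical.

```python
import re

# Assumed label text; check it against real GPRESULT /v output.
APPLIED_FROM = re.compile(
    r"^\s*Group Policy was applied from:\s*(\S+)",
    re.IGNORECASE | re.MULTILINE,
)


def policy_sources(gpresult_text):
    """Return every server named on an 'applied from' line, uppercased."""
    return [host.upper() for host in APPLIED_FROM.findall(gpresult_text)]


def short_name(host):
    """Reduce '\\\\DC01' or 'dc01.example.com' to 'DC01' for comparison."""
    return host.lstrip("\\").split(".")[0].upper()


def policy_from_other_dc(gpresult_text, logon_server):
    """List policy sources whose short name differs from the logon server."""
    expected = short_name(logon_server)
    return [
        host for host in policy_sources(gpresult_text)
        if short_name(host) != expected
    ]
```

To use it, read the saved output into a string and pass it along with the client's LogonServer value; a non-empty result means at least part of the policy came from a DC other than the one that authenticated the client.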
- What did you change when this started?
The most common response to this is "nothing". After some digging, however, you'll usually find something.
- Can the affected user reproduce the problem by logging on to another computer?
In other words, does the problem follow the user? Or can another user who doesn't have the problem log on to the affected computer and experience the same issue? If you can determine that the problem is tied to the computer itself, it will narrow your attack.
- Are you using roaming profiles (perhaps on some users and not others)?
Check the network share and look for roaming profile issues. Also, follow the steps in part one of this article to enable verbose logging for Userenv logs and examine them for more information.
- Is the user having long delays when logging off?
Long logoff delays often point to a bloated profile and registry, which can also slow logon. For Windows XP and earlier versions, consider implementing the Microsoft User Profile Hive Cleanup Service (UPHClean) to clean up local profiles and registry. In Windows Vista, the UPHClean functionality is built into the operating system.
- Are the affected users remote access clients?
Perhaps the users only have a logon problem when using a remote access connection. Look at your remote connection software or VPN setup and try building a generic Windows connection rather than using your custom connection software. Your ISP and network performance could also be an issue here.
Here are some additional tips for finding the cause of these delays. You can find more details on some of these in my previous article.
- Find a test client. Ideally, you should be able to get a workstation and reproduce the problem without bothering a user.
- Download and run MPSreports from Microsoft on the client and DC. This collects data for all event logs, MSINFO32, NetDiag, drivers, hotfixes and more. Remember, the more data you have, the easier it will be to track down the problem.
- Run PerfMon on the DC and client, and see if you can match the time of the client problem with some performance spike on the domain controller.
- Run a network trace and try to determine what is happening during the logon process that causes the delay.
- See if the problem happens at a specific time of day and if so, examine what is happening at that time. Suspects include AV and Windows updates, scheduled jobs, and client survey software.
- Review GPO settings. Known performance hits can come from ACL and WMI filters, loopback processing, etc. Determine if any GPO settings were implemented at the time this problem started. Check out my article on using Userenv logs to debug Group Policy and profile issues.
- If the problem follows the user (see the sixth question in the list above), first try deleting the user's profile to see if that fixes the issue. If it does not, copy the user account to create a new user; the new account also gets a fresh profile, so this test eliminates the profile as a cause. If the new account has no problem, recreate the original account. I have seen this work in some cases.
- Review the logon scripts. They can grow little by little until they become unwieldy and inefficient.
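Matching client delays against DC performance spikes, as suggested in the PerfMon tip above, can be done by hand or with a few lines of code. This Python sketch assumes the DC counter has already been exported as (ISO timestamp, value) pairs and that slow logons were noted as timestamps. The function name, the 80 percent threshold and the 60-second window are all illustrative assumptions.

```python
from datetime import datetime, timedelta


def logons_near_spikes(slow_logons, dc_samples, threshold=80.0, window_seconds=60):
    """Return the slow-logon timestamps that fall close to a DC spike.

    slow_logons: ISO 8601 timestamp strings of slow client logons.
    dc_samples:  (ISO 8601 timestamp string, counter value) pairs from the DC.
    A sample counts as a spike when its value is at or above `threshold`.
    """
    window = timedelta(seconds=window_seconds)
    spikes = [
        datetime.fromisoformat(stamp)
        for stamp, value in dc_samples
        if value >= threshold
    ]
    matched = []
    for stamp in slow_logons:
        when = datetime.fromisoformat(stamp)
        if any(abs(when - spike) <= window for spike in spikes):
            matched.append(stamp)
    return matched


if __name__ == "__main__":
    # Hypothetical data: DC processor time and two slow logons
    dc = [("2009-03-02T08:00:00", 35.0), ("2009-03-02T08:05:00", 95.0)]
    slow = ["2009-03-02T08:05:30", "2009-03-02T14:00:00"]
    print(logons_near_spikes(slow, dc))
```

Slow logons that do not line up with any spike suggest the cause is on the client or the network rather than the DC.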
As I stated before, there are no easy solutions to this problem and it can take a lot of time to debug. The best attack is to review the possible causes, ask the right questions to narrow the scope, and use the tools noted here to gather and analyze data to locate the cause.
Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers. Gary is a Microsoft MVP for Directory Services and formerly for Windows File Systems.