WSUS SUP causes high CPU and clients fail updates scan

Posted: August 13, 2017 in Configuration Manager, IIS, Server 2012, System Center, Windows 10, Windows Update, Work in Progress

NOTE: The configuration suggestions I mention in this post won’t fix the underlying issue. Depending on the size of your environment, they may be enough to get things working for you again. Microsoft is currently working on releasing a hotfix that I have tested and found to resolve this problem.

 

Microsoft have released the WSUS server hotfix, details here: https://blogs.technet.microsoft.com/configurationmgr/2017/08/18/high-cpuhigh-memory-in-wsus-following-update-tuesdays/

NOTE2: It turns out there is a new issue from the August 2017 updates that “clears” the update history on a computer, triggering a full client scan again. This will also cause high load on your WSUS server, although for slightly different reasons; the suggestions here and the coming updates will help to resolve the load issue from that problem as well.

Microsoft have updated the August cumulative updates to resolve this issue, details here: https://support.microsoft.com/en-us/help/4039396/windows-10-update-kb4039396

 

NOTE3: Microsoft has now published some additional official guidance here: https://blogs.technet.microsoft.com/askcore/2017/08/18/high-cpuhigh-memory-in-wsus-following-update-tuesdays/

This issue is one I first encountered on only a couple of our WSUS servers (2 or 3 of 15) last year, in November 2016, after the new cumulative update process was introduced for patching. At first I assumed it to be a failure on my part to do more regular cleanup, or a result of the recent upgrade to ConfigMgr 1610, or an “end of year” rush of activity on the network. This isn’t unusual for the environment I currently manage (Education, with approx. 370,000+ devices).

At first I looked at server bottlenecks (we run everything in VMware) and even SQL DB corruption. I tried WSUS resets, even recreating the database (a last resort in a large environment). I then thought maybe it was a Server 2012 WSUS issue, as we had other Server 2012 related cases open with Microsoft. To test, I rebuilt one server as 2012 R2, but the problems persisted. Given it was only happening on a couple of servers, I assumed it was an issue with those servers in particular and didn’t suspect a larger issue.

Over the Christmas holidays things went quiet, so there was nothing more I could do until school returned the following February.

Then everything basically exploded.

The first patch cycle we ran saw the WSUS server rocket to 100% CPU and stay there. Nothing I did could stop it recurring. I found ways to bring things under control for a few hours at a time. Endpoint definitions started falling behind because clients couldn’t scan for updates. Then it started happening on a couple more of the servers. At this point I conceded defeat and called in Microsoft. Unfortunately it was another 6 months before they finally identified that it was a “function” of WSUS causing the grief, and not the configuration or size of our environment.

The Problem

The most obvious symptoms are clients failing to scan for updates and very high CPU on the WSUS server (the w3wp.exe worker process). Some clients get through; many will fail. The main cause is Windows 10 clients and the way WSUS has to process the Cumulative Updates.

When a client performs a scan, WSUS will generate an XML response to the client with the update metadata. This will vary in size depending on what products and categories you are syncing in WSUS and what updates you have that are not declined or expired.

When WSUS generates the XML for a Windows 10 device it may trigger a state that causes the WSUS IIS worker process to consume all available CPU, which can lead to other issues on the WSUS server. Given time (minutes) it may complete the request and continue; other times it just never finishes.

Configuration Manager clients have a “special” feature that plain WSUS clients don’t have. When a request to scan fails they will retry 4 times before waiting to try again the next day. WSUS clients don’t retry. This has the effect of amplifying the issues as clients hit the server multiple times.

Symptoms

These are based on my own observations and have proven a reliable way to see the issue starting to happen. I now have a live perfmon window connected to each WSUS server, watching the counters described below, so I can identify the problem starting almost immediately (a scripted way to collect the same counters is sketched after this list).

  • CPU on the WSUS server will hit 100% (if IIS CPU throttling is not enabled)
  • Clients fail when connecting to WSUS to perform update scan
    • Clients report timeouts in the WindowsUpdate log/events
    • ConfigMgr causes clients to retry 4 times when an error occurs which will increase the load on the system and make it worse
  • Built-in ConfigMgr report for last scan states will show increasing number of clients failing or pending retry while “Completed” count drops
    • Software Updates – D Scan > Scan 2 – Last scan states by site
    • [Screenshot: lastscanstate]
  • WSUS AppPool worker process (w3wp.exe) shows very high CPU usage
    • Specifically related to the ClientWebService web app
  • Perfmon
    • Counter: Processor > _Total > %Processor Time
      • Shows high CPU usage
    • Counter: Web Service > WSUS Administration > Current Anonymous Users
      • Indications of internal thread processing by WSUS
      • A very high value (>50) for extended periods, or a rapidly increasing value, indicates internal processing is not completing
    • Counter: Web Service > WSUS Administration > Current Connections
      • Shows total external clients making connections to WSUS
    • Current Anonymous Users shows a “sawtooth” pattern as the count climbs and is reset by the AppPool hitting limits and recycling. Note: Current Anonymous Users refers to internal processes within the IIS WSUS apps and is not related to authentication of incoming client connections
    • “Normal” processing will rarely show Average Current Anonymous Users greater than single digits
      • [Screenshot: currentanonymoususers]
        • Blue – IIS current connections (connection limit set to 1200 here)
        • Brown – CPU with throttling enabled at 70%
        • Black – Current Anonymous Users (climbing to nearly 600 at peak)
    • Counter: WSUS: Client Web Service > spgetcorexml > Average execution time
    • Counter: WSUS: Client Web Service > spgetcorexml > Cache hit ratio
    • Counter: WSUS: Client Web Service > spgetcorexml > Cache size
    • “Bad Server”
      • [Screenshot: cachehit-bad]
      • Cache Size (black) will vary widely as the system attempts to process the client requests to generate XML responses
      • Cache hit Ratio (blue) also varies wildly as the system is unable to find optimisations
      • Average Execution Time varies or shows constantly high values
    • “Good Server”
      • [Screenshot: cachehit-good]
      • Cache Size (black) will reach optimum level and remain fairly consistent
      • Cache Hit Ratio (blue) will be near 1.00 showing very good optimisation when processing client requests
      • Average Execution time will be near 0
  • Problem related to cumulative updates
    • The WSUS server generates an XML file to reply to client scan requests, and generating that file creates massive load on the worker process
    • In some cases leaving it to run at high CPU would eventually resolve; more often it never stopped until the process was terminated
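
The counters above can also be collected remotely with PowerShell rather than keeping an interactive perfmon window open per server. Below is a minimal monitoring sketch of that approach; the server names and the alert threshold are placeholders, and it assumes the default “WSUS Administration” IIS site name. The “WSUS: Client Web Service” counters (spgetcorexml and friends) can be added once you confirm their exact paths with Get-Counter -ListSet on the server.

    # Minimal monitoring sketch - server names and threshold are placeholders.
    $servers   = 'WSUS01', 'WSUS02'   # replace with your WSUS/SUP server names
    $threshold = 50                   # sustained values above this suggest internal processing is backing up

    $counters = @(
        '\Processor(_Total)\% Processor Time',
        '\Web Service(WSUS Administration)\Current Anonymous Users',
        '\Web Service(WSUS Administration)\Current Connections'
    )

    while ($true) {
        foreach ($server in $servers) {
            $samples = (Get-Counter -ComputerName $server -Counter $counters).CounterSamples

            # Print the current value of each counter for this server
            foreach ($s in $samples) {
                '{0}  {1,-45} {2,10:N1}' -f $server, $s.Path.Split('\')[-1], $s.CookedValue
            }

            # The "sawtooth" climb described above shows up here long before clients start complaining
            $anon = ($samples | Where-Object { $_.Path -like '*current anonymous users' }).CookedValue
            if ($anon -gt $threshold) {
                Write-Warning "$server - Current Anonymous Users is $anon; WSUS internal processing may not be completing"
            }
        }
        Start-Sleep -Seconds 30
    }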

Remediation

NOTE: These suggestions will not fix or prevent the actual issue, but might help keep things running until clients catch up and settle down again. Results will vary depending on the size of your environment.

  • Throttle the worker process (do this first to get CPU under control so the server will be responsive again; a scripted sketch of these IIS settings follows this list)
    • Create new Application Pool (Copy settings on current WsusPool)
    • Name: ClientWebService
    • Queue Length: 2000
    • Identity: NetworkService
    • Limit: Server 2012 = 70000, Server 2012 R2 = 70 (both equate to a 70% CPU cap; the older IIS Manager takes the value in 1/1000ths of a percent, the 2012 R2 one in whole percent)
    • Maximum Worker Processes: 2
    • Assign the WSUS ClientWebService web app to the new AppPool
      • [Screenshot: clientwebservice-workerprocess]
    • Restart IIS to give everything a fresh start
    • In theory, if a client request comes in that triggers a worker process to be consumed, the remaining worker process(es) can continue servicing new client requests
    • When a “bad” request is being processed you will see one (or more) of the w3wp.exe processes using CPU. If a process remains high for a long time, terminate that w3wp.exe process; it will recycle while leaving the remaining worker process(es) running
    • [Screenshot: clientwebservice-workerprocess2]
      • Current Worker Processes can be seen under IIS > Server node > Worker Processes (showing 4 worker processes for ClientWebService here)
    • Task Manager will show the w3wp.exe processes, and the Process IDs can be matched against IIS
      • [Screenshot: clientwebservice-workerprocess3]
  • Private memory limit
    • Set to 0 for unlimited, or as large as you are comfortable with if the server is shared
    • A minimum of 8192000 KB (8 GB) is recommended, but adjust to suit your environment
  • Add additional server memory and CPUs
    • In a virtual environment this will be easier and may help
  • Clean up WSUS
    • While this did not make any difference for us in this case, it is still good hygiene 🙂
    • Decline previous cumulative updates that are superseded by more recently approved ones
    • Decline any superseded updates where a newer update is approved
    • Don’t sync updates in WSUS for products you don’t need (reduce the overall size of the update metadata catalogue)
    • Check only the language(s) required have been enabled on the WSUS server
    • Check this post, and the articles it references, for additional clean up activity: Remove obsolete/old updates (a scripted decline of superseded updates is sketched after this list)
  • Reduce the frequency of ConfigMgr client update scan cycles. Remember to restore these once everything is working again
    • This is to reduce the number of “hits” the WSUS server will need to deal with and help reduce retry backlogs
    • Reduce the frequency of the “check for definition updates” client setting (e.g. to every 2 days)
    • Reduce the frequency of the check for updates agent schedule (e.g. to every 7 days)
    • Reduce the frequency of the updates re-evaluation agent schedule (e.g. to every 7 days)
  • Set the IIS connection limit to 50 (an arbitrarily low number, not chosen for any specific reason other than it seems to work most times)
    • [Screenshot: iislimits]
    • Slowly increase in increments of 25 a few minutes after CPU returns to normal
    • Each time the value is increased it will trigger a recycle of the AppPool
    • This can help to get back up to full function, but it only takes a couple of machines making catalogue requests to break it again
    • While IIS connections are limited, clients will receive a response from the IIS server saying it is unavailable, which still causes them to retry; however, the IIS/WSUS server itself is spared from attempting to process as many client connections
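
For reference, the Application Pool changes above can be scripted with the WebAdministration module instead of being clicked through IIS Manager. This is a rough sketch only, assuming the default “WSUS Administration” site name and the pool name used in the steps above; verify each value against the list before running it, and note the CPU limit value differs between Server 2012 and 2012 R2 as described.

    # Rough sketch - verify each value against the steps above before running.
    Import-Module WebAdministration

    $pool = 'ClientWebService'   # the new dedicated app pool from the steps above

    # Create the dedicated pool and apply the suggested settings
    New-WebAppPool -Name $pool
    Set-ItemProperty "IIS:\AppPools\$pool" -Name queueLength -Value 2000
    Set-ItemProperty "IIS:\AppPools\$pool" -Name processModel.identityType -Value 'NetworkService'
    Set-ItemProperty "IIS:\AppPools\$pool" -Name processModel.maxProcesses -Value 2
    Set-ItemProperty "IIS:\AppPools\$pool" -Name cpu.limit -Value 70000        # Server 2012; use 70 on 2012 R2
    Set-ItemProperty "IIS:\AppPools\$pool" -Name cpu.action -Value 'Throttle'  # throttle rather than kill when the limit is hit
    Set-ItemProperty "IIS:\AppPools\$pool" -Name recycling.periodicRestart.privateMemory -Value 0   # KB; 0 = unlimited

    # Move the WSUS ClientWebService web app onto the new pool
    Set-ItemProperty 'IIS:\Sites\WSUS Administration\ClientWebService' -Name applicationPool -Value $pool

    # Optional: the temporary "50 connection" limit on the WSUS site while things recover
    Set-ItemProperty 'IIS:\Sites\WSUS Administration' -Name limits.maxConnections -Value 50

    # Restart IIS to give everything a fresh start
    iisreset

The superseded-update cleanup can likewise be scripted with the UpdateServices module rather than paging through the console. A minimal sketch, run on the WSUS server itself; note it declines everything WSUS reports as superseded, which is broader than the “only where a newer update is approved” rule above, so review the list first.

    # Minimal cleanup sketch - review what comes back before declining in production.
    Import-Module UpdateServices

    # Find every non-declined update that WSUS marks as superseded
    $superseded = Get-WsusUpdate -Classification All -Approval AnyExceptDeclined -Status Any |
        Where-Object { $_.Update.IsSuperseded }

    # Decline them and report how many were processed
    $superseded | Deny-WsusUpdate
    'Declined {0} superseded updates' -f $superseded.Count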

There is a hotfix from Microsoft on the way that will resolve this. Currently estimated to be released late September if all goes well.

Comments
  1. Hi! I have the same issue as you described in my environment (>5000 clients and 1 SUP/WSUS), starting from January. I have reinstalled WSUS and resynced it, but no results.
    I have managed client connections manually using temporary deny firewall rules, which kept the service working somehow, but now it doesn’t help. I also tried editing the IIS connection limits and increasing the request limits in the ClientWebService web.config, with no result. The w3wp process is back to using the full 70% (throttled) within 3-4 minutes of iisreset being executed.

  2. Superb article, thank you.

  3. Kevin says:

    I don’t pretend to know all that is involved in creating a hotfix but late September is a bit long to wait for something that is having such a huge impact. Thank you though for the info above!

    • Scott says:

      Simply a timing issue. I got the engineering hotfix but had to wait for the next update cycle to test if it worked. It then went back to MS for more packaging and testing to be ready for release in the soonest release cycle. This hotfix is being given very high priority and is being moved through the system much faster than any hotfix I’ve worked on testing with them before.

  4. Shane Curtis says:

    We were experiencing this same issue last week. We opened a ticket with Microsoft and tried all kinds of things (playing with the app pool, doing maintenance on the SUSDB, etc.). The thing that seemed to make the difference was when we were assigned a new engineer and he said that WSUS is having a problem with superseded Windows 10 cumulative updates. We opened the WSUS console and declined all of the superseded Windows 10 cumulative updates and the problem went away. We also upgraded .NET framework to 4.7 but I’m not convinced that did it. I thought the engineer said this was just a problem on Windows Server 2008 R2 (customer is still running that as the OS on the primary site server) but it sounds like you are having this problem on a Windows Server 2012 machine also.

    • Scott says:

      Doing cleanup helps to reduce the load, and in my experience the issue does pass once the clients that have been able to connect stop adding to the load. The problem will then likely stay away until the next cycle.
      You do need to upgrade your 2008 R2 server though. I would be very surprised if they release an update for it to fix the original issue I was investigating. The August update issue is client-side, so an update for that will go out to machines, which may be the problem you are having if it only just started recently for you.

  5. John Panicci says:

    Scott, I’m currently using an Eval version of SCCM; I needed to get the environment up and running. I had no issues when I started out with a smaller group of workstations and 80 servers (2012 R2 mostly), but once I rolled out to 300 workstations the problem you describe above reared its ugly head. If I call MS will they give me the hotfix?
    Thanks
    John

    • Scott says:

      This problem isn’t related to ConfigMgr itself; it’s a problem with the WSUS service, or the August client updates (depending on when this started for you).
      No, I doubt they will give you the hotfix, as it is already in the internal testing phase and will most likely be released before your support call even got close to reaching the stage where they would give it out.
      For an environment your size, try the suggestions in my post and they should help to keep things under control until then.

  6. AP says:

    Outstanding article; we discovered this issue and have been attempting to resolve it. How certain are you that MS won’t release immediately? Is there any value in us chasing our MS rep to get this patch earlier? This article needs more attention. Thanks again.

    • Scott says:

      At this stage Microsoft will already have the patches in the testing phase to get into the next patch release cycle next month if things go well. Given the impact they would probably release out of band if they could.
      You can certainly raise it with your rep to make sure Microsoft is aware of the impact this has had, but I wouldn’t see any point trying to apply pressure to get the hotfix as it will most likely be available before any escalation could get it to you anyway… assuming you have the ability to put that much pressure on MS in the first place 🙂

  7. Ned says:

    It also affects WSUS on SBS 2011 – Windows 10 anniversary edition clients can’t get updates, with the same issues and errors.

  8. Nikolay Hristozov says:

    When the patch is ready, do you know which systems it should be deployed to in order to stop the issue – the clients or the WSUS servers? Bearing in mind that our main WSUS is crap at the moment, the clients wouldn’t be able to obtain it.

    • Scott says:

      There are two issues. The main one I was discussing in this post will be a patch on the WSUS server itself.
      The more recent issue is due to the August update deployed to workstations, and that should be resolved by the next update cycle.
      Hopefully the WSUS server patch will also help improve clients being able to connect to get their updates as well.

  9. AP says:

    Sadly kb4039396 is not available for SCCM to deploy yet. No way I’m manually installing this. Anyone else not seeing this show up? I have applied the server side patch and things have returned to normal for the most part.

    • Scott says:

      Both updates are in “preview” mode which is why they aren’t in WSUS yet. I don’t know the release cycle they have planned, but the server side update should help in both cases anyway, and it really depends on how hard these issues affect you as to which remedies you need.
      The server side update being released before the next patching cycle is very welcome though.

  10. Cesar says:

    Dude! You were a great help.
    Thank you so much
