I recently experienced an issue with a very large number of ConfigMgr 2007 package updates (400) going out to a large number of sites (1700). It turned out there was already a distribution job “stuck” in the queue, and when the large update went out it resulted in a massive backlog. The end result was over 1.3 million files in the Replication Manager inbox that just weren’t being processed, and the number was increasing.
The only option in this sort of situation is to stop the services, move the files out of the inbox, let normal inbox processing resume and then copy the files back in a block at a time. In this case doing this manually wasn’t an option due to the number of files, so I resorted to a quick script based on the one found here: https://tricksntreats.wordpress.com/2010/05/30/sccm-backlog-fighting/
It’s just a CMD script, and I was thinking of rewriting it in VBScript or PowerShell, but it’s just not something I’ve needed enough to warrant spending the time on.
The original version of the script would just move a set number of files into the inbox at a set time interval. This presents a problem in a large environment, as typically in the morning there is a large influx of messages from the remote sites, and dumping a bunch more on top is enough to overwhelm Replication Manager and cause the queue to start to backlog again.
My change was to have the script check the inbox every 10 seconds and, if there are fewer than a defined number (150) of files in the queue, move in a batch from the backlog directory. This way there is enough to keep replmgr busy, but the script also backs off automatically so it doesn’t keep dumping new files in when things are getting busy. I found that a threshold of 150 with a 10 second check keeps the queue nicely populated and prevents it being overwhelmed or having moments of no activity.
This script could of course be adapted to trickle copy any type of files as needed.
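As a rough illustration of that adaptation, here is a minimal Python sketch of a single check-and-move pass. It is not the script I ran; the function name `trickle_once` and its parameters are made up for this example, and the defaults simply mirror the 150 file threshold and a batch of 30.

```python
import fnmatch
import os
import shutil


def trickle_once(delay_dir, inbox_dir, pattern="*.rp?", max_queued=150, batch_size=30):
    """Run one check: move up to batch_size files if the inbox is below max_queued.

    Returns the number of files moved (0 when the inbox is busy or the
    backlog folder has nothing left that matches the pattern).
    """
    def matching(folder):
        # Only regular files whose names match the pattern, in a stable order.
        # Subfolders (such as the delay folder itself) are skipped.
        return sorted(
            name for name in os.listdir(folder)
            if fnmatch.fnmatch(name, pattern)
            and os.path.isfile(os.path.join(folder, name))
        )

    if len(matching(inbox_dir)) >= max_queued:
        return 0  # back off: the inbox already has enough work queued

    moved = 0
    for name in matching(delay_dir)[:batch_size]:
        shutil.move(os.path.join(delay_dir, name), os.path.join(inbox_dir, name))
        moved += 1
    return moved
```

Calling this in a loop with `time.sleep(10)` between calls would give the same 10 second check. Changing `pattern` is all that should be needed to trickle other file types.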
As per the instructions from the original site:
- Create a subdirectory under the inbox folder called “delay”
- Place this script in the “delay” folder
- Place all the backlog files in the delay folder
- Run the script to trickle the files from “delay” back into the parent folder
- You can change the values while the script is running without needing to restart it
@echo off & setlocal EnableDelayedExpansion
:start
cls
::*******************************************************
:: Do not add or remove lines while script is running
:: These values can be changed dynamically while running
::maximum number of files in dir already to allow move
set max=150
::limit number of files to copy
set LIM=30
::timeout between checks
set TIMEOUT=10
::*******************************************************
set n=0
set count=0
:: COUNT FILES IN INBOX DIR
for %%j in (..\*.rp?) do (
  set /a count+=1
  if !count! geq !max! echo Too many files already in queue - %time% & goto :wait
)
:: MOVE FILES
echo Found %count% files in queue. Moving files - %time%
echo.
for %%i in (*.rp?) do (
  set /A N+=1
  move "%%i" ".." >NUL
  echo moving file !N! - %%i
  if !N! geq !LIM! goto :wait
)
echo No more files to move - %time%
:wait
timeout %timeout%
goto :start
:end
Some observations regarding the replmgr.box files.
As mentioned in the “tricksntreats” blog, RPT files take longer to process than RPL files. In my case I moved all the RPT files out into further subfolders (a, b, c, d, e, f etc.). I then moved blocks of the 10,000 most recent files at a time from those folders into the delay folder for the script to process.
The reason for this is Windows struggles when trying to handle more than 100,000 or so files in a directory, so splitting them out made it much easier to work with.
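Moving the newest block across can be scripted too. The sketch below is illustrative only: `move_newest` is a hypothetical helper, and it assumes that “most recent” means the latest last-modified time, which may not match how you want to order the files.

```python
import os
import shutil


def move_newest(source_dir, dest_dir, block_size=10000):
    """Move the block_size most recently modified files from source_dir to dest_dir.

    Returns the moved file names, newest first.
    """
    with os.scandir(source_dir) as entries:
        files = [entry for entry in entries if entry.is_file()]

    # Assumption: recency is judged by last-modified time, newest first.
    files.sort(key=lambda entry: entry.stat().st_mtime, reverse=True)

    block = files[:block_size]
    for entry in block:
        shutil.move(entry.path, os.path.join(dest_dir, entry.name))
    return [entry.name for entry in block]
```

Each call takes one block from a subfolder and drops it into the delay folder, so the trickle script always has the newest outstanding files to work on.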
- RPL files typically seem to process at a rate of several hundred per second
- RPT files process at around 2 per second. RPT files require additional transaction processing, so they trigger a process that then generates RPL (or other) files
- When the Replication Manager component runs, it builds a list of ALL the files currently in the inbox and works its way through them all. Any new files that come in won’t be processed until it has finished the current list. This can mean that as it hits RPT files, there will be many hundreds of new RPL files being generated that won’t be cleared until the current list is complete, even though it would take only a second to clear them
- When Replication Manager runs, it processes the “easy” RPL files first, so clears them out quickly. Keeping the number of files in the folder low means it “restarts” its processing list more often and gets those RPL files out of the way
- At some point, the size of the list is such that even when it does finish it, there are now even more files waiting, resulting in a bigger list, and so the issue expands exponentially
- Using this script approach allows normal replication to resume, and slowly allows the backlog to process without letting Replication Manager get overwhelmed again
- Be aware that it can take days to weeks for the backlog to clear out correctly, depending on how big it is. Having the newer backlog files process first means you get the latest status messages etc. going straight into the database. When an older message for the same record is found later, it is ignored. If you copy the files from oldest to newest, Replication Manager needs to fully process the old record and then, some time later, process the newer record as well
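To put rough numbers on why a short list matters, here is a back-of-the-envelope sketch using the rates observed above. The 300 per second figure for RPL files is an assumed stand-in for “several hundred”, and the file mixes are hypothetical, not measurements from my environment.

```python
def pass_seconds(rpl_count, rpt_count, rpl_per_sec=300.0, rpt_per_sec=2.0):
    """Estimate how long one pass over a fixed file list takes, in seconds.

    rpl_per_sec is an assumed value; rpt_per_sec is the rough rate noted above.
    """
    return rpl_count / rpl_per_sec + rpt_count / rpt_per_sec


# Hypothetical small list: 150 files, 20 of them RPT.
small = pass_seconds(rpl_count=130, rpt_count=20)

# The same hypothetical mix scaled up 1,000 times.
large = pass_seconds(rpl_count=130_000, rpt_count=20_000)

print(f"small list: {small:.1f} s")          # small list: 10.4 s
print(f"large list: {large / 3600:.1f} h")   # large list: 2.9 h
```

Under these assumptions the RPT files dominate the pass time, and anything arriving during a large pass waits hours before it is even looked at, whereas the small list restarts every few seconds.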