Inspired by a recent forum thread (and John Marcum), I put together a little test to verify whether ConfigMgr does indeed automatically retry advertised programs that fail. I created a simple one-line batch file and advertised it on my test client:
exit 999
This one line simply exits the batch file and returns the error code 999, which in this context is meaningless except that it is not a success code.
The results from execmgr.log on the client pretty much speak for themselves but do in fact verify that ConfigMgr will automatically retry a failed program:
Notice that it first sets the program status to FailureRetry and then WaitingRetry after the failure and that it actually tracks how many times the program has failed. This is important because as we’ll see, the number of times it will retry a program is fixed so that it doesn’t go on trying forever.
And right on cue, 15 minutes (and one second) later, ConfigMgr retries the program, with the same result in this case, of course:
Another important thing to notice is this phrase “Non fatal execution error”. This too is very important because it suggests that there are also “Fatal execution errors” and that ConfigMgr treats them differently. Why would it do this? Because you don’t want it to simply retry every failure. If an installation executable or MSI is broken and throwing an error, re-running doesn’t help or change the resulting failure.
Below is the result of another simple script that returns 1 as an exit/error code. As you can see, ConfigMgr set the status to FailureNonRetry; it didn’t explicitly call the failure a “Fatal execution error”, but the implication based on comparing these scenarios is there. And of course, ConfigMgr will not retry this failed program.
So where are things like the retry interval defined and what constitutes a “Non fatal execution error”? In the site control file of course. Here’s a snippet from the sitectrl.ct0 file in my lab (this is default as I have made no changes):
PROPERTY <Execution Failure Retry Count><REG_DWORD><><1008>
PROPERTY <Execution Failure Retry Interval><REG_DWORD><><600>
PROPERTY <Execution Failure Retry Error Codes><REG_SZ><{4,5,8,13,14,39,51,53,54,55,59,64,65,67,70,71,85,86,87,112,128,170,267,999,1003,1203,1219,1220,1222,1231,1232,1238,1265,1311,1323,1326,1330,1618,1622,2250}><0>
The meaning of these properties is self-explanatory, but one small comment concerns the Retry Interval. Notice that it is defined as 600 [seconds], or 10 minutes, but the retry in my example above came after 15 minutes. This is because of the (default) 5-minute notification given to users when a mandatory advertisement is about to run: 10 + 5 = 15.
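To make the arithmetic concrete, here is a small Python sketch using the two values from the site control file. The constant and function names are my own, purely for illustration. The last calculation assumes that only the configured interval separates attempts, which is an inference from the numbers rather than documented behavior.

```python
# Values copied from the sitectrl.ct0 properties shown earlier.
RETRY_INTERVAL_SECONDS = 600   # Execution Failure Retry Interval
RETRY_COUNT = 1008             # Execution Failure Retry Count

# Assumption: the default 5-minute user notification precedes the retry.
NOTIFICATION_SECONDS = 5 * 60


def observed_retry_gap_minutes(interval_s=RETRY_INTERVAL_SECONDS,
                               notification_s=NOTIFICATION_SECONDS):
    """Minutes between a failure and the retry as seen in execmgr.log."""
    return (interval_s + notification_s) / 60


def retry_window_days(count=RETRY_COUNT, interval_s=RETRY_INTERVAL_SECONDS):
    """Days spanned by all retries if only the interval separates them."""
    return count * interval_s / 86400


if __name__ == "__main__":
    print(observed_retry_gap_minutes())  # 15.0
    print(retry_window_days())           # 7.0
```

Under that assumption, 1008 retries at 10-minute intervals span exactly one week, which may be why the default count looks so odd at first glance.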
That still leaves the question of why these specific codes are defined as “Non fatal”, and automatically retried, while others are not. The answer reveals itself when you look at the meaning of each of these codes (I’ll leave that as an exercise for you). They all have to do with infrastructure status or configuration and not directly with an internal failure of the command line being run; they are essentially external to the command line. Things like “Access is denied” (error code 5), “An attempt was made to establish a session to a network server, but there are already too many sessions established to that server.” (error code 1220), and “Another installation is already in progress. Complete that installation before proceeding with this install.” (error code 1618) are all outside the control of the command line being run and thus may change the next time the command line is tried. However, errors like “Incorrect function” (error code 1) and “The system cannot find the file specified.” (error code 2) are clearly issues with the command line itself; no amount of retrying will change the result, so these are never retried automatically.
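If you want to check quickly whether a given exit code is on the default retry list, something like the following Python sketch would do. It only mirrors the behavior observed in the logs (FailureRetry versus FailureNonRetry); it is not how the client is actually implemented, and the function names are hypothetical. The code list is copied from my lab's site control file, so yours may differ if it has been changed.

```python
# Value of "Execution Failure Retry Error Codes" from the default sitectrl.ct0.
RETRY_ERROR_CODES_PROPERTY = (
    "{4,5,8,13,14,39,51,53,54,55,59,64,65,67,70,71,85,86,87,112,128,170,267,"
    "999,1003,1203,1219,1220,1222,1231,1232,1238,1265,1311,1323,1326,1330,"
    "1618,1622,2250}"
)


def parse_retry_codes(raw):
    """Turn the '{4,5,...}' property string into a set of integers."""
    body = raw.strip().strip("{}")
    return frozenset(int(code) for code in body.split(",") if code.strip())


RETRY_CODES = parse_retry_codes(RETRY_ERROR_CODES_PROPERTY)


def status_for_failure(exit_code, retry_codes=RETRY_CODES):
    """Return the status name seen in execmgr.log for a failing exit code."""
    return "FailureRetry" if exit_code in retry_codes else "FailureNonRetry"


if __name__ == "__main__":
    # 999 and 1 are the two test cases from this post; the others are
    # the example codes discussed above.
    for code in (999, 1, 2, 5, 1220, 1618):
        print(code, status_for_failure(code))
```

Running it shows 999, 5, 1220, and 1618 as FailureRetry, and 1 and 2 as FailureNonRetry, matching what the two test batch files produced.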
One last note is that simply removing an advertisement that is in the WaitingRetry status will indeed remove it from the target system (once the system receives the policy update of course) as evidenced by the log file snippet below.