Saturday, April 9, 2011

Don't Ignore your FDC Probes.

I cannot tell you how many client sites I have been to where they seem to have inexplicable behavior in their WebSphere MQ environments.  The first question I ask them is "Have you looked at your FDC probes?"  Most of the time they do not know what I am talking about.  So I explain that MQ will tell you when it does not like something in the environment, or is having some type of trouble.  In some of my engagements, the MQ servers are running fine, at least there is no apparent problem, but when I look....I usually find some area where tuning is needed.

What I am referring to are the FDC files that MQ cuts when it has experienced a failure.  On the distributed platforms (i.e. Unix, Linux, etc.) these are stored in /var/mqm/errors.  Most of the time when an FDC event happens, the programs, services and channels will keep running, so it is not readily noticeable that a problem has occurred.

As MQ administrators, one of our functions is to monitor these files, trying to detect a pattern, or at least cataloging and quantifying what has happened.  Armed with this information, we can make adjustments to our environments and get them running to full capacity.
So let's look at getting at the probes...and counting how many of each type occurred.  To do this, we will use some simple Unix/Linux commands.

First off, change to the /var/mqm/errors directory:

$ cd /var/mqm/errors

Now, let's get the probe number, and a count of how many times each has happened:

$ grep "Probe Id" *.FDC|cut -c 38-47 | sort | uniq -c

This is just an example, but your output might look something like this:

      9 HL047028
      4 HL049110
      9 HL077070
     47 XC307070
     43 XC308010

You might have some lines that are offset or duplicated...that is because cut uses fixed character columns, and the position of the Probe Id shifts with the length of the process ID in the FDC file name. You can add the duplicates together for a total occurrence count.
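One way to sidestep that offset problem entirely is to pick out the probe by field position rather than by character column. Here is a minimal sketch, assuming the FDC header line has the form `| Probe Id          :- XC307070 |` (making the probe ID the fifth whitespace-separated field...verify that against your own files first). The printf below just simulates three matching header lines so the pipeline can be demonstrated anywhere:

```shell
# Live use: point this at the real files instead of the sample data:
#   grep -h "Probe Id" /var/mqm/errors/*.FDC | awk '{print $5}' | sort | uniq -c | sort -rn
# Simulated FDC header lines piped through the same field-based pipeline:
printf '| Probe Id          :- XC307070 |\n| Probe Id          :- HL047028 |\n| Probe Id          :- XC307070 |\n' |
  awk '{print $5}' | sort | uniq -c | sort -rn
```

Because awk splits on runs of whitespace, the count stays correct no matter how long the process ID is.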

Okay...these "Probe Ids" are coded so that the first two characters identify which MQ component had the failure.  In our case, we have HL and XC.  HL is the "Logger" component and XC is "Common Services".  A partial list of these probe codes can be found on my website, http://www.tpmq-experts.com/FDC-Probes.html.

Okay...as we can see, we had 47 XC307070 (Long Lock Wait) probes and 43 XC308010 (Mutex Release) probes.  This is not a catastrophic failure...but it does indicate that you are using linear logging and are taking media images at a time when your queues are at a high depth.  This affects the processing speed of your applications, so it would be good to have your record image script check the queue depth first...if it is high, it would be wise to wait before performing the record image.
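A minimal sketch of that depth check, in shell. The queue name, the threshold, and the hard-coded sample line are all illustrative...in a live script you would capture the real output of `echo "DISPLAY QLOCAL($QUEUE) CURDEPTH" | runmqsc $QMGR` instead of the SAMPLE variable:

```shell
QUEUE=APP.QUEUE   # hypothetical queue name
MAXDEPTH=100      # assumed threshold; tune for your environment

# Illustrative stand-in for the CURDEPTH line runmqsc returns; a live script
# would pipe "DISPLAY QLOCAL($QUEUE) CURDEPTH" through runmqsc to get this.
SAMPLE='   QUEUE(APP.QUEUE)   TYPE(QLOCAL)   CURDEPTH(4711)'

# Pull the number out of CURDEPTH(...)
DEPTH=$(echo "$SAMPLE" | sed -n 's/.*CURDEPTH(\([0-9]*\)).*/\1/p')

if [ "${DEPTH:-0}" -le "$MAXDEPTH" ]; then
    echo "rcdmqimg -m QM1 -t ql $QUEUE"   # depth is low; safe to record the image
else
    echo "depth $DEPTH on $QUEUE exceeds $MAXDEPTH; deferring media image"
fi
```

The same pattern works for a whole list of queues...loop over them and defer only the deep ones.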

Looking at the other probes, we had 9 HL047028 (Logger problem), 4 HL049110 (Logger problem) and 9 HL077070 (disk full).  The last probe is self-explanatory...the disk was full.  So if you are using linear logging...your clean-up script needs to be adjusted, or more disk space allocated.  The other two probes point at the kernel parameters msgmnb, msgtql and msgseg, which can all be adjusted to handle the load that your current environment needs.
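Two quick checks along those lines, sketched for Linux (on Solaris or AIX the msgtql/msgseg equivalents are tuned elsewhere, e.g. /etc/system, so treat the parameter names as platform-dependent; the 90% threshold and the /tmp fallback are illustrative):

```shell
# 1. Current System V message-queue limit (Linux exposes this in /proc;
#    msgtql and msgseg are not Linux tunables, so only msgmnb is shown here).
MSGMNB=$(cat /proc/sys/kernel/msgmnb)
echo "kernel.msgmnb = $MSGMNB"

# 2. How full is the MQ filesystem?  Point FS at your log filesystem in
#    practice; the /tmp fallback just lets the sketch run on machines without MQ.
FS=/var/mqm
[ -d "$FS" ] || FS=/tmp
USED=$(df -P "$FS" | awk 'NR==2 {sub(/%/, "", $5); print $5}')
echo "$FS is ${USED}% full"
if [ "$USED" -ge 90 ]; then
    echo "WARNING: clean up old linear logs or allocate more disk"
fi
```

Drop a check like this into cron and you will hear about a filling log disk before MQ cuts the HL077070 probe for you.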

These are just a few examples of probes that can be corrected, which will have a positive impact on the running of your environment and the throughput of your applications.

Have a look at your probes...and really see what's going on.