Tuesday, May 24, 2016

IBM Business Process Manager Event Manager - Common Symptoms and Solutions


Correct definition sample:


With the incorrect definition sample, the Process Admin Console shows a NullPointerException when you click Event Manager > Monitor as shown in the following screen shot:
To solve that problem, correct the 100Custom.xml file and restart the server. After the restart, make sure that the TeamWorksConfiguration.running.xml file contains the complete section and the event manager is shown as active in the Process Admin Console.

Note: There is a list of sample configuration files to adapt the IBM Business Process Manager configuration, including some samples for the event manager. You can access these files here.
b) The event manager might have been paused manually by using the Process Admin Console. In this case, you can resume its activity as mentioned previously. Even when it is paused, the connect expiration time stamp is renewed every 15s (default).
c) The event manager is configured to be started as "paused"
The 80EventManager.xml BPM server configuration file contains a parameter called , which is set to false, by default. If it is configured to true by overwriting the parameter with a 100Custom.xml file, then the heartbeat thread to set the event manager 'connect expiration' is active, but the event manager will not process any work.
To check for that situation, look at your TeamWorksConfiguration.running.xml BPM server configuration file. Search in that file for the string in the section. If that parameter is set to true, this setting explains why the event manager did not become active after server start up.

In case the event manager is configured to be started as "paused," the SystemOut.log file will only contain the following messages during start up. They show that the heartbeat thread started and continuously updates the connect expiration time stamp, but the event manager did not acquire the synchronous queues.

wle_scheduler I   CWLLG0570I: Heartbeat paused.
wle_scheduler I   CWLLG0561I: Heartbeat thread starting...
wle_scheduler I   CWLLG0615I: Heartbeat resumed.

To resume the event manager, use the Process Admin Console as shown previously. A successful resume action will result in the following messages in the SystemOut.log file:

wle_scheduler I   CWLLG0615I: Heartbeat resumed.
wle_scheduler I   CWLLG0597I: Trying to acquire synchronous queue SYNC_QUEUE_1.
wle_scheduler I   CWLLG0581I: Acquired synchronous queue SYNC_QUEUE_1.
wle_scheduler I   CWLLG0597I: Trying to acquire synchronous queue SYNC_QUEUE_2.
wle_scheduler I   CWLLG0581I: Acquired synchronous queue SYNC_QUEUE_2.
wle_scheduler I   CWLLG0597I: Trying to acquire synchronous queue SYNC_QUEUE_3.
wle_scheduler I   CWLLG0581I: Acquired synchronous queue SYNC_QUEUE_3.
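
The two message sequences above can also be told apart by script. The following is a minimal sketch, assuming the log lines carry the message IDs exactly as in the excerpts above. The function name and the state labels ("active", "heartbeat-only", "no-messages") are illustrative and not product terms.

```python
import re

# Message IDs taken from the SystemOut.log excerpts in this article.
HEARTBEAT_STARTING = "CWLLG0561I"
QUEUE_ACQUIRED = "CWLLG0581I"

MESSAGE_ID = re.compile(r"\b(CWLLG\d{4}[IWE])\b")


def summarize_startup(lines):
    """Return (state, acquired_queues) for the given SystemOut.log lines.

    The state labels are hypothetical names used only by this sketch.
    """
    seen = set()
    acquired = []
    for line in lines:
        match = MESSAGE_ID.search(line)
        if not match:
            continue
        msg_id = match.group(1)
        seen.add(msg_id)
        if msg_id == QUEUE_ACQUIRED:
            # The queue name is the last word of the message, e.g. "SYNC_QUEUE_1."
            acquired.append(line.rstrip().rstrip(".").split()[-1])
    if acquired:
        state = "active"
    elif HEARTBEAT_STARTING in seen:
        state = "heartbeat-only"  # heartbeat runs, but no queue was acquired
    else:
        state = "no-messages"  # nothing from the event manager at all
    return state, acquired
```

A "heartbeat-only" result matches the "started as paused" pattern, while "no-messages" suggests checking whether the event manager is disabled or failed during start up.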

Keep in mind that the parameter in the event manager configuration needs to be set back to false. Otherwise, it will still be inactive after the next server restart.

d) Event manager is not enabled in the configuration.
The 80EventManager.xml BPM server configuration file contains a parameter called enabled, which is set to true, by default.
To check for that situation, look at your TeamWorksConfiguration.running.xml BPM server configuration file. Search in that file for the parameter in the section . If that parameter is set to false, then the event manager will not be active after the server start up. In contrast to being started as "paused," it will show none of the previous messages in the SystemOut.log file and the connect expiration time stamp will not be updated!

You cannot resume the event manager through the Process Admin Console in such a case. Instead, you need to change your configuration to set the enabled parameter back to true and restart your server.
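
Checking the running configuration for such a parameter can be scripted. The sketch below lists every element with a given name together with its path, so that you can pick the occurrence in the event manager section yourself. It assumes only that the file is well-formed XML; the function name is illustrative, and nothing about the real layout of TeamWorksConfiguration.running.xml is assumed.

```python
import xml.etree.ElementTree as ET


def find_parameter(xml_text, name):
    """Return a list of (path, value) pairs for every element called `name`."""
    root = ET.fromstring(xml_text)
    hits = []

    def walk(element, path):
        current = path + "/" + element.tag
        if element.tag == name:
            hits.append((current, (element.text or "").strip()))
        for child in element:
            walk(child, current)

    walk(root, "")
    return hits
```

For example, reading the file content into a string and calling find_parameter(content, "enabled") returns all enabled elements. Because the same element name can appear in several sections, the path is what tells you which one belongs to the event manager.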

e) Blackout period is active
Administrators establish blackout periods to specify times when events cannot be scheduled. For example, you might schedule a blackout period due to a holiday or for regular system maintenance windows. The event manager takes blackout periods into account when scheduling and queuing events, event subscriptions, and undercover agents (UCAs). The following screenshot shows whether blackout periods are configured, and which ones. This data is persisted in the LSW_BLACKOUT_CALENDAR DB table.
If a blackout period is active, the event manager monitor in the Process Admin Console lists a scheduled job named End blackout period where the scheduled time column shows when the blackout period ends. Event manager jobs created during the blackout period show a job status of Blacked out.

The following screenshot shows that scenario:
The SystemOut.log file does not show any applicable messages when the blackout period is entered.

f) Exceptions during the event manager start up
If the event manager is not running after start up, but was not configured as paused or disabled, the SystemOut.log file might show one or more exceptions.

There could be various reasons why the event manager failed during startup or resume. Gather the documents as mentioned in the event manager mustgather technote.
The following section shows a few examples:
  • Event manager configuration is broken

    This problem is caused by an incomplete fix pack installation where required post-installation steps to upgrade the profile were not executed.

    The SystemOut.log file will have exceptions that have the following signatures.

    CWLLG0144E: Exception in init(): schedule cannot be started. com.lombardisoftware.core.TeamWorksException: Message: SCHEDULER_CONFIG_BROKEN Arguments: loader-acquire-sync-queue-query: com.lombardisoftware.core.config.eventmanager.SchedulerConfig checkAndReplace Message: SCHEDULER_CONFIG_REPLACEMENT_PARAMETER_NOT_FOUND Arguments: %executing% loader-acquire-tasks-query UPDATE LSW_EM_TASK SET TASK_STATUS = %acquired%, TASK_OWNER = ? WHERE TASK_ID IN (%task-ids%)

    To fix this problem, review the documented post-installation (interim fix/fix pack) steps and rerun the missing steps.
  • Event manager start up problem due to a problem in the BPM embedded document store (applies to IBM Business Process Manager V8.5 and later)

    Important note: If the embedded BPM document store cannot be started due to configuration or authorization problems, the event manager will also not start!

    The SystemOut.log file will not show any of the event manager-related start up messages as shown previously, but you will see, for example, the following exception, which is related to the embedded document store:

    CWTDS1100E: An error occurred while validating or creating the default configuration for the IBM BPM document store.
                                     com.ibm.bpm.embeddedecm.exception.UserMissesWritePermissionException: CWTDS0022E: The configuration was changed in a way that the technical user 'deadmin' of the IBM BPM document store fails to change the object 'Domain'.
    Explanation: The technical user defined in the BPM role type 'EmbeddedECMTechnicalUser' is not permitted to perform changes on an object.
    Action: Revert the recent configuration changes. Ensure that the user defined by the BPM role type 'EmbeddedECMTechnicalUser' has access to the object. Verify this using the admin task 'getDocumentStoreStatus'.
        at com.ibm.bpm.embeddedecm.internal.DomainConfiguration$2.run(DomainConfiguration.java:264)
        at com.ibm.bpm.embeddedecm.internal.DomainConfiguration$2.run(DomainConfiguration.java:207)
        at java.security.AccessController.doPrivileged(AccessController.java:362)

    To fix that problem, the configuration error with the document store must be resolved as shown here: http://www.ibm.com/support/docview.wss?uid=swg21673250

A.2 - Event manager is active but it is not processing any jobs

When the event manager is active (Process Admin Console shows it as active and connect expiration is not outdated) but is not processing any tasks, this could be caused by:
  • A problem in the event manager configuration file 80EventManager.xml or in the global BPM server configuration file TeamWorksConfiguration.running.xml, which contains all of the parameters at run time. The following technote shows where to find these files and how they relate: http://www.ibm.com/support/docview.wss?uid=swg21439614
  • Event manager blocked due to orphaned transactions in Microsoft SQL Server holding locks on its tables:
    If you use Microsoft SQL Server as the process server database, so-called 'orphaned transactions' in the database system might be the cause. The following technote shows how to resolve such a problem: http://www.ibm.com/support/docview.wss?uid=swg21633692
  • System time or time zone of BPM and of the remote DB system that hosts the BPM DB is out of sync:
    To fix that, make sure that the system time on the BPM node and the DB node are in sync. It is a best practice to have both use the same network time protocol (NTP) server.

B - Event manager shows jobs with a scheduled date of 2099

If the execution of an event manager job fails, it is retried up to the number of times defined by the re-execute-limit configuration parameter (default = 5) in the 80EventManager.xml file. The behavior in such a case changed fundamentally with APAR JR47860:
  • Pre-JR47860 behavior: when the re-execute-limit is reached, the corresponding event manager job is discarded! There is no way to re-execute this job.
  • Post-JR47860 behavior: when the re-execute-limit is reached, the event manager job is rescheduled for 2099.
The interim fix for the APAR also provides a new administrative command called BPMReplayOnHoldEMTasks, which was introduced to resubmit these failed jobs. Check the APAR description for more details or see the product documentation in the IBM Knowledge Center.
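
The difference between the two behaviors can be summarized in a few lines. This is only a toy model of the description above, not product code: the function name and the returned labels are made up for illustration, and it assumes that a job stops being retried once its number of failed executions reaches the re-execute-limit.

```python
def outcome_after_failures(failures, re_execute_limit=5, has_jr47860=True):
    """Describe what happens to a job that has failed `failures` times.

    Assumption of this sketch: the job is retried while the number of
    failed executions is still below re-execute-limit.
    """
    if failures < re_execute_limit:
        return "retry"
    # Limit reached: the behavior depends on whether JR47860 is applied.
    return "rescheduled for 2099" if has_jr47860 else "discarded"
```

With the default limit of 5, a job that failed three times is still retried. After the fifth failure it is either put on hold with a 2099 date (with JR47860) or lost (without it).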

Important note: Before resubmitting an event manager job, it is important to eliminate the root cause! Otherwise, you might run into the same problem again. To find the root cause, check your SystemOut.log file for message CWLLG0197W. This message indicates that the event manager tried to execute a task 5 times and failed each time. Note the thread ID and walk back through the thread history in the SystemOut.log file, which will most probably tell you the exception with which the execution of this event manager task failed.

Example for an event manager task that executes a UCA:
1. Searching the SystemOut.log file for CWLLG0197W shows the following line. Note the thread ID 00011779.
[2/4/14 5:54:18:395 GMT] 00011779 wle_ucaexcept E   CWLLG0197W: Task Notify BPD 202738 of notification failed 5  times.  The task will not be re-executed.

The previous messages for thread 00011779 will show this error message:
[2/4/14 5:54:18:337 GMT] 00011779 wle_ucaexcept E   CWLLG0181E: An exception occurred during execution of task 4,425,203.  Error: PreparedStatementCallback; SQL [update LSW_BPD_INSTANCE_DATA set DATA = ? where BPD_INSTANCE_ID = ?]; Error for batch element #1: DB2 SQL Error: SQLCODE=-1476, SQLSTATE=40506, SQLERRMC=-968, DRIVER=3.61.65; nested exception is com.ibm.db2.jcc.am.SqlTransactionRollbackException: Error for batch element #1: DB2 SQL Error: SQLCODE=-1476, SQLSTATE=40506, SQLERRMC=-968, DRIVER=3.61.65
com.lombardisoftware.core.TeamWorksException: PreparedStatementCallback; SQL [update LSW_BPD_INSTANCE_DATA set DATA = ? where BPD_INSTANCE_ID = ?]; Error for batch         element #1: DB2 SQL Error: SQLCODE=-1476, SQLSTATE=40506, SQLERRMC=-968, DRIVER=3.61.65; nested exception is com.ibm.db2.jcc.am.SqlTransactionRollbackException: Error for batch element #1: DB2 SQL Error: SQLCODE=-1476, SQLSTATE=40506, SQLERRMC=-968, DRIVER=3.61.65
    at com.lombardisoftware.core.TeamWorksException.asTeamWorksException(TeamWorksException.java:130 ...

In this particular case, the execution of the event manager task failed due to an SQL exception whose underlying error is sqlcode -968 (reported in SQLERRMC), which means that the database filesystem is out of space.

2. Fix the problem that caused the exception. In the previous example, resolve the out-of-space condition in the database filesystem.

3. Resubmit the applicable event manager task by using the BPMReplayOnHoldEMTasks command.
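
Step 1 of this procedure, finding the CWLLG0197W line and walking back through the history of its thread, can be sketched as follows. The line layout (time stamp in brackets, 8-digit thread ID, component, level, message ID) is assumed from the excerpts above. The function name and the lookback parameter are illustrative.

```python
import re

# Layout assumed from the excerpts above:
# [timestamp] threadId component level messageId: text
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<thread>[0-9a-fA-F]{8})\s+\S+\s+\w\s+"
    r"(?P<msg_id>[A-Z]{5}\d{4}[A-Z]):"
)


def failures_with_history(lines, marker="CWLLG0197W", lookback=20):
    """For each `marker` line, collect up to `lookback` earlier lines of the same thread.

    Continuation lines without a thread ID (stack trace lines) are skipped.
    """
    parsed = []
    for line in lines:
        match = LOG_LINE.match(line)
        parsed.append((match.group("thread") if match else None, line))

    results = []
    for index, (thread, line) in enumerate(parsed):
        if thread is None or marker not in line:
            continue
        same_thread = [text for t, text in parsed[:index] if t == thread]
        history = same_thread[-lookback:] if lookback > 0 else []
        results.append((thread, line, history))
    return results
```

Each result holds the thread ID, the CWLLG0197W line, and the preceding lines of that thread, which is where a message such as CWLLG0181E with the underlying exception would be expected.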

C, D, E, F - Event manager is active, but throughput problems exist

Throughput problems might be caused by a wide range of reasons. In terms of the event manager, the potential throughput is limited by the capacity of its queues.

For a comprehensive summary of all event manager-related configuration parameters including the different queues, check this product documentation in the IBM Knowledge Center.
To analyze and fix this problem, you need to understand the involved configuration parameters and how to monitor and adapt them.
a) Find out the event manager queue capacities
The event manager maintains a number of internal queues. The capacity of each queue is limited by a configuration parameter that is specified in the 80EventManager.xml configuration file and limits the number of jobs that can be in the execution state simultaneously. The following table shows the different queues, the applicable configuration parameter, and the default capacity (as of IBM Business Process Manager 8.5.5):
Event Manager Queue | Configuration Parameter in 80EventManager.xml | Default capacity
Async Queue (UCA) | async-queue-capacity | 10
Sync Queue (UCA) | sync-queue-capacity | 10
BPD Async Queue (BPD notification, system lane tasks, timer execution) | bpd-queue-capacity | 40
System Queue | system-queue-capacity | 10
The default values might have been overwritten by using a 100Custom.xml file. To find out which values are currently in use, look in the TeamWorksConfiguration.running.xml file.
b) Determine the event manager queue usage and adapt the event manager queue sizes
To monitor the number of executing jobs on each event manager queue, use the Process Admin Console event manager monitor and count the number of rows for each 'Job Queue' with job status 'Executing'. Alternatively you could use this SQL statement:

when '-100' then 'UCA Async Queue'
when '-101' then 'BPD Async Queue'
when '-102' then 'EM System Queue'
else 'UCA Sync Queue' END as QUEUE
from LSW_EM_TASK where TASK_STATUS = 3 group by QUEUE_ID WITH UR;

If the number of executing event manager tasks for a queue has reached the capacity limit and there are more tasks on that queue waiting to be executed (time to be scheduled has already passed), then there might be a performance problem or the queue capacity is too low for the workload and needs to be increased.
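
The comparison of executing jobs against the capacity limit can be sketched as shown below. The queue IDs come from the SQL statement above and the default capacities from the table above. The function name is illustrative, and each result row is compared on its own, so how sync-queue-capacity applies when several sync queues exist is deliberately left open.

```python
# Queue IDs as used in the SQL statement above; any other ID is a UCA sync queue.
QUEUE_NAMES = {
    "-100": "UCA Async Queue",
    "-101": "BPD Async Queue",
    "-102": "EM System Queue",
}

# Default capacities from the table above (IBM Business Process Manager 8.5.5).
# Replace them with the values from your TeamWorksConfiguration.running.xml file.
DEFAULT_CAPACITY = {
    "UCA Async Queue": 10,
    "UCA Sync Queue": 10,
    "BPD Async Queue": 40,
    "EM System Queue": 10,
}


def queues_at_capacity(rows, capacity=DEFAULT_CAPACITY):
    """rows: iterable of (queue_id, executing_count), for example the SQL result set.

    Returns (queue_name, executing_count, capacity) for every row at its limit.
    """
    full = []
    for queue_id, executing in rows:
        name = QUEUE_NAMES.get(str(queue_id).strip(), "UCA Sync Queue")
        if executing >= capacity[name]:
            full.append((name, executing, capacity[name]))
    return full
```

A queue that shows up in the result is only a candidate. As described above, it is a problem only if more tasks whose scheduled time has already passed are waiting on that queue.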

The BPD async queue is of special interest because its capacity is shared between the execution of system lane tasks, timer executions, and BPD notifications. If the complete capacity is already occupied by currently executing, long-running system lane tasks, no other job can be executed on that queue. The screen shot shown previously for Symptom C is an example from a system with bpd-queue-capacity set to 5 and the complete capacity is occupied by five executing system tasks. To eliminate a problem related to long running system tasks:
  1. Find out why the system lane tasks have such a long execution time and try to fix that. There might be various reasons like back-end response time, excessive JVM garbage collection, CPU and memory constraints, network delays, and so on.
  2. If the system lane tasks are expected to be long-running, think about splitting them into smaller pieces or increase the capacity of the BPD async queue as shown in the next paragraph.

c) Increase the event manager queue capacities

To increase the event manager queue sizes, specify the applicable parameter as shown in the previous table in a 100Custom.xml file. For example:

      <async-queue-capacity merge="replace">10</async-queue-capacity>
      <sync-queue-capacity merge="replace">10</sync-queue-capacity>
      <system-queue-capacity merge="replace">10</system-queue-capacity>


Important: When increasing the capacity of event manager queues, keep in mind that this requires not only additional threads in your JVM but also additional JDBC connections. Thus, the connection pool of the JDBC data source (jdbc/TeamworksDB) also needs to be increased.
As a general rule, increase the number of database connections by two times the value by which you increased the queue capacity. Apart from database connections, more JVM heap is also needed.
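
The rule of thumb for the connection pool translates into simple arithmetic. The sketch below only restates that rule; the function name and the example pool size of 200 are made up for illustration.

```python
def suggested_pool_size(current_pool_max, old_capacity, new_capacity):
    """Apply the rule of thumb: two extra connections per added queue slot."""
    increase = max(0, new_capacity - old_capacity)
    return current_pool_max + 2 * increase
```

For example, raising bpd-queue-capacity from 40 to 60 with a hypothetical current pool maximum of 200 gives suggested_pool_size(200, 40, 60), which is 240.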

If there is a mismatch between the queue capacities and the number of available connections for the data source, IBM Business Process Manager tries to scale down the queue size. That issue will be indicated in the log by the warning messages shown under Symptom F.

d) Understand the event manager queue capacity and the related thread pool size
The event manager configuration also shows a parameter named max-thread-pool-size. By default, the value for the max-thread-pool-size parameter is the sum of the individual queue capacities (70). It is important to understand that its size does not limit the overall number of event manager tasks that can be executed simultaneously. So even if you set max-thread-pool-size to 5 and bpd-queue-capacity to 10, you will be able to execute 10 system lane tasks simultaneously. This is possible because the thread pool is defined as 'growable', which means that it temporarily allows the number of threads to exceed the defined limit. Such an extra thread is discarded directly after it finishes instead of being returned to the pool. Therefore, these threads are a bit more expensive.

Starting in IBM Business Process Manager, the event manager no longer uses its own internal thread pool. Instead, it uses a WebSphere Application Server work manager thread pool. This function is configured by these two event manager parameters:
  • <use-was-work-manager>true</use-was-work-manager>
  • <was-work-manager>wm/BPMEventManagerWorkManager</was-work-manager>

When you use the WebSphere Application Server work manager thread pool, the maximum pool size is configured in the WebSphere Application Server Administrative Console as shown in the following screen shot:
In a default configuration, the work manager thread pool for the event manager is defined with a maximum of 70 threads, but also as 'growable'. When sizing the work manager thread pool for the event manager, also make sure that its size is at least equal to the sum of queue capacities.
If you modified the thread pool properties and cleared the "Growable" check box, then the maximum number of threads implicitly also limits the number of event manager jobs that can be executed simultaneously! See this screen shot.
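
The interplay between queue capacities, the thread pool maximum, and the "Growable" setting can be expressed as a small calculation. This is a simplified model of the description above that yields an upper bound across all queues. It is not the product's scheduling logic, and the function name is illustrative.

```python
def max_parallel_jobs(queue_capacities, max_threads, growable=True):
    """Upper bound on simultaneously executing event manager jobs.

    queue_capacities: mapping of queue name to its configured capacity.
    """
    total_capacity = sum(queue_capacities.values())
    if growable:
        # A growable pool may exceed max_threads, so only the queues limit.
        return total_capacity
    # A fixed pool caps the total, whatever the queue capacities are.
    return min(total_capacity, max_threads)
```

With the default capacities (10, 10, 40, and 10) the bound is 70 for a growable pool of any size, whereas a non-growable pool with a maximum of 5 threads reduces it to 5.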

One of the advantages of using a WebSphere Application Server work manager thread pool for the event manager is that you can use the Tivoli Performance Viewer. It is available from the WebSphere Application Server Administrative Console to monitor the thread pool activity. See this screen shot:

G - Event manager tasks fail when the LombardiEventEmitterInputQueue reaches the maximum threshold

Check the current queue depth and the threshold. You can use the service integration bus browser, which is integrated into the WebSphere Application Server Administrative Console, to easily check the queue depth and the high message threshold.
Make sure that there is a message consumer active to read from that queue, which would typically be the Business Monitor infrastructure. If it is not started, start the message consumer and the queue depth should decrease.
If the Business Monitor environment is started and consuming messages, but the queue depth is still at the limit, then perform tuning actions for the Business Monitor server or increase the high message threshold for the involved queues.
If the Business Monitor environment no longer exists, but you did not revert the BPM server configuration, messages are still generated and put on the queue, but not consumed!
To solve it, you need to:
  1. Pause the event manager so that no new messages are created.
  2. Manually delete the messages on the queue destination as shown below.
  3. Disable the event emission on the IBM Business Process Manager server according to instructions below.
  4. Restart your IBM Business Process Manager server
To delete the messages from the queue destination using the WebSphere Application Server Administrative Console, complete these steps:
  1. In the SIBUS section, navigate to the queue point for the LombardiEventEmitterInputQueue.
  2. Select the runtime tab.
  3. Click messages, which displays the DeleteAll option to delete all of the messages on that queue.
The event emission for Business Monitor has been explicitly enabled by the following entry. To disable it, set the value for parameter 'enabled' to false as shown here:

Part III - Known APARs related to the event manager

Issue, error, or problem | Addressed in APAR | Fix included in
Duplicate execution of event manager task under high load when using Oracle DB | JR49359 | 8.5.5
Change handling of failed event manager tasks and introduction of admin command to replay these failed tasks | JR47860 | 8.5.5
Posting message to event manager only starts TIP snapshot | JR45615 and JR45616 |
Blackout calendar not respected for timer events | JR45899 |
Task processing threadpool initialized with wrong user, error message CWLLG0326E or CWLLG0179E | JR46484 | 8.5.5
IllegalStateException when starting/stopping teamworks.ear | JR47360 | 8.5.5
Delayed communication between BPD and Service engine | JR47915 | 8.5.5
DB2 error "bad SQL grammar" with DB2 9.5 after upgrading to or installation of JR46470 | JR48878 |
com.lombardisoftware.core.TeamWorksException: Numeric Overflow on event manager task | JR49172 | 8.5.5
Double execution of event manager tasks in heavily loaded environments with Oracle DB | JR46470 |
UCA message corrupted when larger than 1000 Bytes and using multibyte characters | JR47265 | 8.5.5
UCA input/output parameters corrupted when containing unicode characters | JR46993 | 8.5.5
Cleanup of duplicate UCAs entries created before JR41966 had been applied | JR47574 | 8.5.5
Time based UCAs disappear due to incomplete event manager task | JR50384 |
Scheduling a time elapsed UCA task causes exception when Oracle DB is used | JR46249 |
Time elapsed UCA not executed when schedule contains 'FIRST', 'LAST' or multiple weekdays selected | JR46122 |
Time elapsed UCAs executed at wrong time when DB and process server in different timezone | JR43099 |
Time elapsed UCAs fired multiple times | JR41966 |
CWLLG0181E:  Error: [ssage:com.lombardisoftware.server.scheduler.TaskDeath: Task killed by stopping scheduler at server stop or DB failover | JR49523 | 8.5.5
Exception when using BPMReplayOnHoldEMTasks command and DB2 on z/OS is used | JR50490 |