A client has been reporting that calls with customers are dropped in the middle of the conversation ever since the system was implemented.
Initially we suspected problems with the server or the ISDN PRI links, but after implementing extensive monitoring, the hangups appear to originate from Vicidial:
* At peak hour, only 30 agents are logged on, but the problem occurs even when only 5 agents are logged in. Most of the calls are Manual Dial Next Calls.
* We've defined a new status (CRT) for calls dropped INCALL, that is, while the agent is in the middle of a conversation with a customer, so that we can investigate every case reported by agents. The term_reason in vicidial_log is CALLER and the hangup_cause in vicidial_carrier_log is 16 (normal call clearing) for all dropped calls. We've listened to the recording of each call reported as dropped INCALL to ensure that the call was not dispositioned erroneously by the agent and to identify the cases where the call clearly was not hung up by the customer. After monitoring the calls over a period of 4 weeks, it appears that between 0.5 and 1 percent of the calls are dropped in the middle of the conversation, anywhere between 5 and 90 seconds after the customer has answered.
* At this point we are sure that the problem is not the server hardware. We've tested the solution on other servers and the 1% drop ratio persisted. The servers are over-powered for the load; both the database server and the telephony server are HP ProLiant DL180 G6 servers with 2 x quad-core Xeon processors and 6 GB of RAM. We are constantly monitoring the servers using Munin. The load average of the database server is below 1 at peak hour and the highest load registered in the last weeks is 1.56. The load average of the Asterisk server is 0.3 at peak hour and the highest load registered in the last weeks is 1.10.
* We've not detected any problem with interrupt processing on the telephony server. Tests with zttest indicate quite predictable and reliable performance:
Best: 100.000 -- Worst: 99.924 -- Average: 99.995337, Difference: 99.997412
* The telephony server uses a Sangoma A102D card with the latest firmware and driver updates. Astguiclient version 2.2.1, Asterisk version 1.4.21.2-vici, zaptel version 1.4.12.1 and libpri version 1.4.10.1 are used. The Wanpipe interfaces are monitored constantly using custom-developed Munin plugins and do not report any errors.
* Finally, we've consulted Sangoma Technical Support on the issue and, based on their recommendation, PRI Intense Debug was enabled on the PRI span for one week to detect the origin of the hangups. The calls dispositioned by agents as dropped INCALL were matched with the entries in the PRI Intense Debug log, and on checking a sample of 10 manual outbound calls from the last week, in all of the cases (10/10) one could observe the following:
1. The term_reason in vicidial_log is CALLER and the hangup_cause in vicidial_carrier_log is 16 for all dropped calls.
2. No hangup request or any other message from the remote end of the PRI link can be observed in the debug log prior to hangup.
3. Just before the hangup, a connection to AMI with user sendcron can be observed.
* I am not an expert in interpreting PRI Intense Debug logs, so I've sent some logs to Sangoma Technical Support to see if they can confirm my findings, but it seems like the hangups are generated by Vicidial.
* I've checked the action_send log to confirm that the hangup is sent by Vicidial. For every "Disconnect Request" message in the Intense Debug log for calls dropped INCALL, there is a matching record in the action_send log with the same timestamp (to the second), like the following:
2010-12-27 9:12:38|1|1387918|
Action: Hangup
Channel: Zap/10-1
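
In case it helps anyone reproduce the matching step, here is a rough Python sketch of how the action_send entries could be checked against a disconnect timestamp. It assumes the log layout is exactly as in the excerpt above (a pipe-delimited timestamp line followed by "Key: Value" lines); the function names parse_action_send and hangups_near and the one-second tolerance are made up for illustration and are not part of Vicidial.

```python
import re
from datetime import datetime, timedelta

# Header line as seen in the excerpt, e.g. "2010-12-27 9:12:38|1|1387918|".
# Assumption: every entry in the log starts with a line of this shape.
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2})\|")


def parse_action_send(text):
    """Split action_send log text into entries with a timestamp and fields."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            current = {"time": stamp, "fields": {}}
            entries.append(current)
        elif current is not None and ":" in line:
            # "Action: Hangup" -> fields["Action"] = "Hangup"
            key, _, value = line.partition(":")
            current["fields"][key.strip()] = value.strip()
    return entries


def hangups_near(entries, channel, when, tolerance_s=1):
    """Return Hangup actions for `channel` within tolerance_s seconds of `when`."""
    window = timedelta(seconds=tolerance_s)
    return [
        e for e in entries
        if e["fields"].get("Action") == "Hangup"
        and e["fields"].get("Channel") == channel
        and abs(e["time"] - when) <= window
    ]


sample = """2010-12-27 9:12:38|1|1387918|
Action: Hangup
Channel: Zap/10-1
"""

entries = parse_action_send(sample)
# Time of the "Disconnect Request" taken from the PRI debug log (entered by hand here).
matches = hangups_near(entries, "Zap/10-1", datetime(2010, 12, 27, 9, 12, 38))
print(len(matches), "matching Hangup action(s)")
```

The time of each "Disconnect Request" still has to be taken from the PRI Intense Debug log by hand or with a separate parser, since I have not tried to describe that log's format here.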
The following issue was registered in Mantis and some sample PRI Intense Debug logs have been attached:
http://www.vicidial.org/VICIDIALmantis/view.php?id=420
The problem seems to be caused either by a bug in Vicidial or by an erroneous configuration. It has a huge impact on the operation of the call center and is a major source of annoyance for the agents, but it is especially difficult to identify and solve because it only occurs in 0.5 to 1% of the calls. I would be glad if anyone could offer some insight on how to pinpoint the root cause of the issue so we can proceed with a solution.