1% of calls dropped during conversation with customer

Postby aouyar » Tue Dec 28, 2010 9:47 am

A client had been reporting calls with customers dropped in the middle of conversation since the system was implemented.

Initially we suspected problems with the server or the ISDN PRI links, but after implementing some extensive monitoring it seems that the hangups originate from Vicidial:

* At peak hour, only 30 agents are logged on, but the problem occurs even when only 5 agents are logged in. Most of the calls are Manual Dial Next Calls.

* We've defined a new status (CRT) for calls dropped INCALL, that is, while the agent is in the middle of a conversation with a customer, so that we can investigate every case reported by agents. The term_reason in vicidial_log is CALLER and the hangup_cause in vicidial_carrier_log is 16 for all dropped calls. We've listened to the recording of each call reported as dropped INCALL to make sure that the call was not dispositioned erroneously by the agent and to identify the cases where the call was clearly not hung up by the customer. After monitoring the calls over a period of 4 weeks, it seems that between 0.5 and 1 percent of the calls are being dropped in the middle of the conversation, anywhere between 5 and 90 seconds after the call has been answered by the customer.

* At this point we are sure that the problem is not the server hardware. We've done some testing by running the solution on other servers and the 1% drop ratio was maintained. The servers for the solution are over-powered; both the database server and the telephony server are HP ProLiant DL180 G6 servers with 2 x Quadcore Xeon processors and 6 GB of RAM. We are constantly monitoring the servers using Munin. The load average of the database server is below 1 at peak hour and the highest load registered in the last weeks is 1.56. The load average of the Asterisk server is 0.3 at peak hour and the highest load registered in the last weeks is 1.10.

* We've not detected any problem in interrupt processing on telephony server. Tests with zttest indicate quite predictable and reliable performance:
Best: 100.000 -- Worst: 99.924 -- Average: 99.995337, Difference: 99.997412

* The telephony server uses a Sangoma A102D card with the latest firmware and driver update. Astguiclient version 2.2.1, Asterisk version 1.4.21.2-vici, zaptel version 1.4.12.1 and libpri version 1.4.10.1 are used. The Wanpipe interfaces are monitored constantly using custom-developed Munin plugins and do not report any errors.

* Finally, we've consulted Sangoma Technical Support on the issue and, based on their recommendation, PRI Intense Debug was enabled on the PRI span for one week to detect the origin of the hangups. The calls dispositioned by agents as dropped INCALL were matched with the entries in the PRI Intense Debug log. On checking a sample of 10 manual outbound calls from the last week, in all of the cases (10/10) one could observe the following:
1. The term_reason in vicidial_log is CALLER and the hangup_cause in vicidial_carrier_log is 16 for all dropped calls.
2. No hangup request or any other message from the remote end of the PRI link can be observed in the debug log prior to hangup.
3. Just before the hangup, a connection to AMI with user sendcron can be observed.

* I am not an expert in interpreting PRI Intense Debug logs, so I've sent some logs to Sangoma Technical Support to see if they can confirm my findings, but it seems that the hangups are generated by Vicidial.

* I've checked the action_send log to confirm that the hangup is sent by Vicidial. For every "Disconnect Request" message in the Intense Debug Log for calls dropped INCALL, there is a matching record in action_send log at the same hour / minute / second like the following:

2010-12-27 9:12:38|1|1387918|
Action: Hangup
Channel: Zap/10-1
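
The matching between dropped calls and action_send entries could be scripted roughly as follows. This is only a minimal sketch, assuming every action_send entry follows the layout of the excerpt above (a `timestamp|...|...|` header line followed by `Key: Value` lines). The meaning of the two numeric fields in the header is not stated here, so they are ignored. The function names and the one-second tolerance are hypothetical choices, not part of Vicidial.

```python
from datetime import datetime, timedelta

# Sample text copied from the action_send excerpt above.
SAMPLE_LOG = """\
2010-12-27 9:12:38|1|1387918|
Action: Hangup
Channel: Zap/10-1
"""


def _parse_header(line):
    """Return the timestamp if the line looks like an entry header, else None."""
    stamp = line.split("|", 1)[0]
    try:
        return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def parse_action_send(text):
    """Parse action_send-style text into a list of dicts (layout assumed)."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        stamp = _parse_header(line)
        if stamp is not None:
            # A new entry starts with a timestamp header line.
            current = {"time": stamp}
            records.append(current)
        elif current is not None and ": " in line:
            # "Key: Value" lines belong to the most recent header.
            key, value = line.split(": ", 1)
            current[key] = value
    return records


def find_hangups(records, channel, dropped_at, tolerance_s=1):
    """Return Hangup actions sent to `channel` within the tolerance of `dropped_at`."""
    window = timedelta(seconds=tolerance_s)
    return [
        r for r in records
        if r.get("Action") == "Hangup"
        and r.get("Channel") == channel
        and abs(r["time"] - dropped_at) <= window
    ]


records = parse_action_send(SAMPLE_LOG)
matches = find_hangups(records, "Zap/10-1", datetime(2010, 12, 27, 9, 12, 38))
print(len(matches), "matching hangup action(s)")
```

The time and channel of each dropped call would come from the calls dispositioned as CRT; how those are exported is left open here.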


The following issue was registered in Mantis and some sample PRI Intense Debug logs have been attached:
http://www.vicidial.org/VICIDIALmantis/view.php?id=420

The problem seems to be caused either by a bug in Vicidial or by an erroneous configuration. It has a huge impact on the operation of the Call Center and is a major source of annoyance for the agents, but it is especially difficult to identify and solve because it only shows up in 0.5 - 1 % of the calls. I would be glad if anyone could give some insight on how to pinpoint the root cause of the issue so that we can proceed with a solution.
aouyar
 
Posts: 124
Joined: Fri Jan 30, 2009 12:49 pm

Postby DarknessBBB » Tue Dec 28, 2010 11:14 am

Are you sure that is not a telco-related issue?
DarknessBBB
 
Posts: 314
Joined: Mon Jul 16, 2007 10:14 am

Postby aouyar » Tue Dec 28, 2010 11:36 am

We've been struggling with this problem for more than a month now, and our first reaction was also to suspect a Telco problem, but the client has insisted right from the beginning that the same trunks, from the same Telco, did not have calls dropped during conversation when they were connected to the former Panasonic PBX.

The client has two ISDN PRI trunks and both exhibit the same problem using Vicidial. The client has made some testing by switching back one of the trunks to the Panasonic PBX for more than a week and the agents did not have any complaints of dropped calls.

We have been monitoring the PRI for a long time and it seems to be totally free of link level errors. It does not exhibit this familiar error where all active calls are dropped at once when there is a link level problem.

It might still be a Telco issue, but I have not managed to identify anything in the logs that indicates a message being received from the Telco side, and neither have I been able to identify any error message in the logs. On the other hand, there are messages in the logs indicating a hangup being sent to Asterisk through AMI with the sendcron user at the very instant that the call is dropped, and to the very same Zap channel that was handling the active call.

I would really like to know if anyone else has had similar issues and any recommendation which might help to identify which process in Vicidial might be responsible with the hangup is more than welcome.
aouyar
 
Posts: 124
Joined: Fri Jan 30, 2009 12:49 pm

Postby williamconley » Tue Dec 28, 2010 1:37 pm

Honestly: I would hire The Vicidial Group or a seriously deep Asterisk professional (Digium technician) to locate the issue solidly.

Failing that: You'll need debug information for a call that actually experienced this issue.

I would start with agi debugging to see if this was generated by Vicidial in some way, but debugging the actual PRI call is certainly a place to start (obviously if the PRI debug indicates that the call was terminated by your asterisk server, that will lead back to asterisk debugging which should result in Vicidial issuing a command to terminate the call for some reason). But it will be a debug trace rather than a "begin checking all the hardware connections and rebuild the system" solution that resolves it. Actual troubleshooting (as opposed to hip-shooting "it could be ..." scenarios that can result in months of guessing and finger pointing).

How often does this issue occur? If you turn on "massive logging" for an hour will it likely catch one?

Is it the soft phones? (are they using soft phones? I didn't read the details word for word, no time today)
Vicidial Installation - SugarCRM integration - Customization and Add-ons
We Bring It All Together.
www.PoundTeam.com
williamconley
 
Posts: 12552
Joined: Wed Oct 31, 2007 4:17 pm
Location: Auburn, NY (Upstate)

Postby aouyar » Tue Dec 28, 2010 7:54 pm

Hi William, if you check my first post you will see that I already enabled massive logging ;-) on the Asterisk side (core set verbose 10 / pri intense debug span 1).

We've been monitoring the dropped calls for more than a month now and around 1% of the calls seem to get dropped. We've managed to identify the calls in vicidial_log and match them with corresponding entries in the PRI debug. From what I could decipher from the PRI logs the calls do not seem to be dropped by the telco, but by one of the vicidial scripts. One can consistently find a matching hangup command in the vicidial log action_send for each dropped call.

If the problem is really related to a bug in Vicidial it might potentially be affecting other installations, but the problem is quite difficult to trace unless some procedures for tracing dropped calls are implemented, because the issue affects only 1% of the calls and one tends to suspect Telco problems or customer themselves hanging up the calls, because the Termination Cause reported in the vicidial_log is CALLER.
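
To give a rough idea of how many calls have to be traced before a dropped one is likely to show up, here is a small back-of-the-envelope sketch. It assumes that drops are independent events with a fixed probability per call (0.5 - 1 %, as observed above), which may well not hold in practice. The function names are made up for illustration.

```python
import math


def prob_at_least_one(p_drop, n_calls):
    """Probability of seeing at least one dropped call in n_calls calls."""
    return 1.0 - (1.0 - p_drop) ** n_calls


def calls_needed(p_drop, confidence):
    """Smallest number of calls for which P(at least one drop) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_drop))


for p in (0.005, 0.01):
    print(
        f"drop rate {p:.1%}: "
        f"P(>=1 drop in 100 calls) = {prob_at_least_one(p, 100):.3f}, "
        f"calls needed for 95% = {calls_needed(p, 0.95)}"
    )
```

Under these assumptions, at a 1 % drop rate roughly 300 calls have to be logged to be 95 % sure of catching at least one drop, so a short logging window at low call volume can easily miss the problem.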

I've managed to get detailed debugging logs from the PRI port, and the logs seem to indicate that the calls are hung up by Vicidial. What remains to be done is to get more detailed debugging logs from Vicidial to identify why the calls are being hung up. The big question is why and when ongoing calls might be hung up by Vicidial.
aouyar
 
Posts: 124
Joined: Fri Jan 30, 2009 12:49 pm

Postby williamconley » Tue Dec 28, 2010 8:24 pm

You say you have massive logging in the beginning ... but then you end with you need debugging.

I think i see the issue.

You'll need to turn on agi debugging (in asterisk) and possibly use the debug or even debugX switches for your vicidial keepalive scripts. This will add a LOT to your logs, so be sure you have the hard drive and processor power available to support this. (Perhaps limit it to non-peak moments)

First find a way to "identify" which calls are affected (if this is possible) so you can trace only specific instances.

And DO post your results if you need help (or even if you don't, if there is a reproducible bug in need of squishing, use the Vicidial Tracker link above)
williamconley
 
Posts: 12552
Joined: Wed Oct 31, 2007 4:17 pm
Location: Auburn, NY (Upstate)

Postby aouyar » Wed Dec 29, 2010 12:29 pm

Hi William,

I am waiting for a definitive response on the interpretation of the PRI debug trace from Sangoma before proceeding with more detailed debugging on the Vicidial side.

I'll keep you informed of the results,
Bye
aouyar
 
Posts: 124
Joined: Fri Jan 30, 2009 12:49 pm

Postby aouyar » Sat Jan 15, 2011 1:38 pm

After checking some PRI intense debug logs that I had sent, Sangoma Technical Support indicates that the hangups seem to be generated on the Asterisk side, either due to an issue in libpri or due to an issue higher up in the stack, possibly in the application.

On the recommendation of Sangoma Technical Support I will be upgrading libpri to the latest available version and sending them some more detailed debug traces.

In the meantime I am trying to figure out which processes in Vicidial might be generating a hangup of in-progress calls, and my prime suspect is the AST_manager_kill_hung_congested.pl script. To enable logging for this script I checked the code and discovered that the script is supposed to record each hangup request in the congest.YYYY-MM-DD log, but the log file never gets created and nothing is logged due to a trivial bug in the script. I documented the bug in Mantis and attached a trivial patch that fixes it:
http://eflo.net/VICIDIALmantis/view.php?id=441

We will be doing some more testing during the week, and I will be checking the congest.YYYY-MM-DD log to see whether the AST_manager_kill_hung_congested.pl script is responsible for the hangups.

Another piece of information that might be relevant to identifying the problem is that most of the calls in the Call Center in question are Manual Dial calls.
aouyar
 
Posts: 124
Joined: Fri Jan 30, 2009 12:49 pm

Postby williamconley » Sat Jan 15, 2011 7:15 pm

manual dial from the asterisk agent gui screen or manual dial from the phone without a web interface?

brilliant post back, by the way, and with a mantis entry no less :)

Way to go!
williamconley
 
Posts: 12552
Joined: Wed Oct 31, 2007 4:17 pm
Location: Auburn, NY (Upstate)

Postby aouyar » Sun Jan 16, 2011 11:51 am

Most of the calls are either Manual Dial Next or Fast Dial calls from the Agent Interface.
aouyar
 
Posts: 124
Joined: Fri Jan 30, 2009 12:49 pm

Postby williamconley » Sun Jan 16, 2011 12:00 pm

so turn on logging, change the script(s) you are investigating to "--debugX" in the keepalives and/or crons and identify ONE call and nitpick the debug logs and find your answer. :) (then turn logging back off, it takes resources and lots of room)
williamconley
 
Posts: 12552
Joined: Wed Oct 31, 2007 4:17 pm
Location: Auburn, NY (Upstate)

Postby kelvin » Sat Mar 24, 2012 8:02 pm

What's the result? It seems that we have the same problem.
kelvin
 
Posts: 15
Joined: Tue Aug 31, 2010 4:27 am

