Review Board 1.7.16

res_pjsip_session: Fix double re-INVITE collision crash.

Review Request #4414 - Created Feb. 11, 2015 and submitted

A multi-asterisk box setup with direct media enabled would occasionally
crash when two re-INVITE collisions on a call leg happen in a row.

The re-INVITE logic only had one timer struct to defer the re-INVITE.
When the second collision happens the timer struct is overwritten and put
into the timer heap again.  Resources for the first timer are leaked and
the heap has two positions occupied by the same timer struct.  Now the
heap ordering is potentially corrupted, the timer will fire twice, and any
resources allocated for the second timer will be released twice.

* The solution is to put the collided re-INVITE into the delayed requests
queue with all the other delayed requests and cherry pick the next request
that can come off the queue when an event happens.

* Changed to put delayed BYE requests at the head of the delayed queue.
There is no sense in processing delayed UPDATEs and re-INVITEs when a BYE
has been requested.

* Made the start of a BYE request flush the delayed requests queue to
prevent a delayed request from overlaping the BYE transaction.  I saw a
few cases were a delayed re-INVITE got started after the BYE transaction

* Changed the delayed_request struct to use an enum instead of a string
for the request method.  Cherry picking the queue is easier with an enum
than string comparisons and the compiler can warn if a switch statement
does not cover all defined enum values.

* Improved the debug output to give more information.  It helps to know
which channel is involved with an endpoint.  Trunks can have many channels
associated with the endpoint at the same time.
Ran the full testsuite with the patch.  All but one test passes and the failing test fails with or without the patch.

Setup a test environment with two asterisk boxes and direct media enabled for all legs to perform this call repeatedly:
PJSIP/100 --> box 1 --> box 2 --> box 1 --> PJSIP/200

1) 100 calls 200
2) 200 answers
3) 100 hangs up
4) repeat call

A crash no longer happens on either Asterisk box with the patch.
Review request changed
Updated (Feb. 13, 2015, 12:24 p.m.)
  • changed from pending to submitted
Committed in revision 431750 runs on a server provided by Digium, Inc. and uses bandwidth donated to the open source Asterisk community by API Digital Communications in Huntsville, AL USA.
Please report problems with this site to