Review Board 1.7.16

Fix crashes in ast_rtcp_write()

Review Request #1444 - Created Sept. 18, 2011 and submitted

Russell Bryant
I opened ASTERISK-18570 for this issue.  The rest of the issues are ones that I have found so far that appear to be the same problem.

This patch addresses crashes related to RTCP handling.  The backtraces just show a crash in ast_rtcp_write() where it appears that the RTP instance is no longer valid.  There is a race condition with scheduled RTCP transmissions and the destruction of the RTP instance.  This patch utilizes the fact that ast_rtp_instance is a reference counted object and ensures that it will not get destroyed while a reference is still around due to scheduled RTCP transmissions.

RTCP transmissions are scheduled and executed from the chan_sip scheduler context.  This scheduler context is processed in the SIP monitor thread.  The destruction of an RTP instance occurs when the associated sip_pvt gets destroyed (which happens when the sip_pvt reference count reaches 0).  However, the SIP monitor thread is not the only thread that can cause a sip_pvt to get destroyed.  The sip_hangup function, executed from a channel thread, also decrements the reference count on a sip_pvt and could cause it to get destroyed.

While this is being changed anyway, the patch also removes calling ast_sched_del() from within the RTCP scheduler callback.  It's not helpful.  Simply returning 0 prevents the callback from being rescheduled.
This patch has been applied to a test environment with a couple of servers running Asterisk  The servers have processed over 1 million calls without hitting this crash.
Posted (Sept. 18, 2011, 7:01 p.m.)


/branches/1.8/res/res_rtp_asterisk.c (Diff revision 1)
Why are we checking schedid == -1 here.   This implies that the thread removing the scheduled item is not the same as the thread executing the scheduled event.  If this is the case, how are we guaranteed schedid won't get set to -1 after this check occurs while still in the callback?   Would this cause a ref leak or anything?
  1. Excellent observations.  You are correct that it is not guaranteed that the thread canceling the scheduled item is not the same one executing the scheduler.  Specifically, the SIP monitor thread is executing the scheduler and it is usually a channel thread canceling the scheduled item.  (sip_hangup -> stop_media_flows -> ast_rtp_instance_stop).
    There will not be a ref leak.  The reference is only released when either the scheduled item is successfully canceled or if the scheduler callback is returning 0 indicating that it is not going to run again.
    The other question, how are we guaranteed schedid won't get set to -1 after the check occurs, is more complicated.  The direct answer is we're not guaranteed.  The next question would be, is it harmful that there is no synchronization here?  I went back and forth with myself a bit on whether or not locking should be added here.  I obviously didn't want to add it unless it was absolutely necessary and ended up going without it.
    If schedid gets set to -1 (because the RTP instance got stopped) while this function is running, it's just going to do its thing like usual and transmit one last RTCP packet.  That doesn't seem harmful.  It's just going to catch the -1 the next time it runs and then bail out and unref the RTP instance.  So, it all works out in the end.  It's a little less pretty, but at least no more locks get added.
    As I wrote this out, the one down side I thought of is that if this edge case gets hit, it would delay the destruction of the RTP instance until the next RTCP packet is scheduled to be sent.  By default, this is every 5 seconds, but it's configurable.  I'm ok with that personally ...
    Now, for the lulz, here is some code in the handling of the rtcpinterval configuraiton option in res_rtp_asterisk.c:
                            if (rtcpinterval == 0)
                                    rtcpinterval = 0; /* Just so we're clear... it's zero */
/branches/1.8/res/res_rtp_asterisk.c (Diff revision 1)
There are macros for handling the ref counting of stuff associated with a scheduled event.
  1. Yeah ... I avoided it because AST_SCHED_DEL_UNREF() is even more evil than AST_SCHED_DEL().  It does the "loop 10 times" thing and also makes some assumptions about how the sched ID variable is handled and when it gets set to -1.  I would rather either leave it this way or start writing a new set of helper macros that are less evil.
  2. Best way to make it less evil is to change the assumption in the scheduler code that when a callback is running, it's removed from the list.  It's a basic assumption that has led to the number of scheduler macros that work on a principle of "scheduler jobs shouldn't take more than 1 second to run, so sometime in this sequence, it should eventually cancel the task".  It's probably past time to break that assumption in the scheduler
    code and get rid of the race condition between a running task that always reschedules itself and another thread trying to cancel that task.
Ship it!
Posted (Sept. 19, 2011, 5:06 a.m.)
A 5 second delayed destruction isn't that big of a deal.  I'm fine with this and completely agree that introducing any sort of synchronization here would just make this more complicated than is necessary.  Great Work! runs on a server provided by Digium, Inc. and uses bandwidth donated to the open source Asterisk community by API Digital Communications in Huntsville, AL USA.
Please report problems with this site to