Fix crashes in ast_rtcp_write()
Review Request #1444 - Created Sept. 18, 2011 and submitted
|ASTERISK-13334, ASTERISK-15257, ASTERISK-15406, ASTERISK-17560, ASTERISK-18570, ASTERISK-9716, ASTERISK-9977|
I opened ASTERISK-18570 for this issue. The rest of the issues are ones that I have found so far that appear to be the same problem. This patch addresses crashes related to RTCP handling. The backtraces just show a crash in ast_rtcp_write() where it appears that the RTP instance is no longer valid. There is a race condition with scheduled RTCP transmissions and the destruction of the RTP instance. This patch utilizes the fact that ast_rtp_instance is a reference counted object and ensures that it will not get destroyed while a reference is still around due to scheduled RTCP transmissions. RTCP transmissions are scheduled and executed from the chan_sip scheduler context. This scheduler context is processed in the SIP monitor thread. The destruction of an RTP instance occurs when the associated sip_pvt gets destroyed (which happens when the sip_pvt reference count reaches 0). However, the SIP monitor thread is not the only thread that can cause a sip_pvt to get destroyed. The sip_hangup function, executed from a channel thread, also decrements the reference count on a sip_pvt and could cause it to get destroyed. While this is being changed anyway, the patch also removes calling ast_sched_del() from within the RTCP scheduler callback. It's not helpful. Simply returning 0 prevents the callback from being rescheduled.
This patch has been applied to a test environment with a couple of servers running Asterisk 188.8.131.52-rc1. The servers have processed over 1 million calls without hitting this crash.
Posted (Sept. 18, 2011, 7:01 p.m.)
Why are we checking schedid == -1 here. This implies that the thread removing the scheduled item is not the same as the thread executing the scheduled event. If this is the case, how are we guaranteed schedid won't get set to -1 after this check occurs while still in the callback? Would this cause a ref leak or anything?
There are macros for handling the ref counting of stuff associated with a scheduled event.
Posted (Sept. 19, 2011, 5:06 a.m.)
A 5 second delayed destruction isn't that big of a deal. I'm fine with this and completely agree that introducing any sort of synchronization here would just make this more complicated than is necessary. Great Work!