Review Board 1.7.16


Improve scheduler performance under high load

Review Request #160 - Created Feb. 15, 2009 and submitted

Russell Bryant
/trunk
Reviewers
asterisk-dev
Asterisk
This patch implements the binary max-heap data structure and makes use of it in the Asterisk scheduler.  There are also two new test modules which we were used for development and testing of this code.  The heap test module verifies that the heap is behaving as expected.  The scheduler test module times how long it takes to do adds and deletes from a scheduler context.

More information about the heap data structure can be found here: http://en.wikipedia.org/wiki/Heap_(data_structure)

In the rest of the description and testing discussion, let N be the number of entries in the scheduler.

The data structure that has been replaced in sched.c is a sorted doubly linked list (DLL).  

With the DLL, adding an entry into the scheduler runs in O(N) time.  If the existing entries have an even distribution over time, then the average case is that N/2 element comparisons will be required to find the proper place to insert the new entry.  In the worst case, it's N comparisons.  Removing an element from the scheduler runs in O(1) time, since it simply requires unlinking it from its current position in the list.

With the heap, adding an entry into the scheduler runs in O(log_2 N) time.  Removing an entry is also O(log_2 N).

So, when switching to the heap implementation, you theoretically gain performance on additions, but take a hit on deletes.  But, how much?  See the "Testing Done" section to find out!
Since it may not be immediately clear what the benefits would be of this, testing has been done to measure the performance benefit for different values of N.

Useful things to note:

1) An addition into the scheduler also requires a malloc().  A deletion requires a call to free().  

2) Both operations require a mutex lock() and unlock().

So, the testing done was to figure out:

A) Do the benefits in add() outweigh the penalty in del() ?

B) At what value of N do these changes make enough of an impact to be noticed compared to how much time it takes for the malloc(), free(), and mutex lock()/unlock() calls?

The following tests were done on my primary development machine, which has an Intel Core 2 Duo @ 2.33 GHz.  The results were gathered using the "sched test <N>" CLI command from the test_sched_module.

Test results:

Average running time for N calls to ast_sched_add():

             trunk            team/russell/heap

N=1 million  686.7 seconds    2.7 seconds          
N=100k       53.8 seconds     .278 seconds
N=10k        149.8 ms.        29.9 ms.
N=5k         47.0 ms.         15.0 ms.
N=1k         3.9 ms.          2.9 ms.
N=500        1.6 ms.          1.5 ms.
N=100        0.3 ms.          0.3 ms.

Average running time for N calls to ast_sched_del():

             trunk            team/russell/heap

N=1 million  436.3 ms.        603.0 ms.
N=100k       42.0 ms.         60.3 ms.
N=10k        3.4 ms.          4.4 ms.
N=5k         1.8 ms.          2.2 ms.
N=1k         0.38 ms.         0.44 ms.
N=500        0.19 ms.         0.21 ms.
N=100        0.034 ms.        0.040 ms.

Total average running time for N calls to ast_sched_add() and ast_sched_del():

             trunk            team/russell/heap

N=1 million  686.1 seconds    3.3 seconds
N=100k       53.8 seconds     .338 seconds
N=10k        153.2 ms.        34.2 ms.
N=5k         48.8 ms.         17.2 ms.
N=1k         4.28 ms.         3.34 ms.
N=500        1.79 ms.         1.71 ms.
N=100        0.334 ms.        0.340 ms.

Average running time of team/russell/heap as a percentage of the average running time from trunk:

              Percentage of time
N=1 million   0.4 %
N=100k        0.6 %
N=10k         22.3 %
N=5k          35.2 %
N=1k          78.0 %
N=500         95.5 %
N=100         101.8 %

Analysis of results:

It appears that in this test environment, somewhere near N=100 and below, the code in trunk will perform better.  However, at that point, we're talking about very small fractions of a second.  As N gets larger, the code in team/russell/heap clearly starts to perform much better.

With how the scheduler API is used today, a module such as chan_sip or chan_iax2 could easily have a scheduler context with N in the thousands.  Every registration will result in a scheduler entry.  Also, every active channel will result in at least one, if not more scheduler entries.

Given these results, I do believe that the changes have proven to be beneficial overall.
Review request changed
Updated (Feb. 16, 2009, 9:07 a.m.)
Updated with changes to address Kevin's comments
Ship it!
Posted (Feb. 17, 2009, 8:12 a.m.)

   

  
Posted (Feb. 24, 2009, 9:15 a.m.)
Perhaps this is a little late, but what the heck, I am a little curious....

I'm wondering why you chose the heap as the base for the priority queue... (I see simplicity, perhaps, as a really good reason!) It's not a big deal, but as I said, I'm curious of why you chose the heap.

I experimented last year with rb trees for the scheduler, and I know Tilghman was playing with a O(1) sort of a cross between a hash table and an array based ring sort of approach...

So here is a question:

Is it possible for the heap to degrade to a list in some situations? What is the likelihood of any of these situations arising? (the consequence of not using self-balancing algorithms).

Also, you didn't go into the details of how you tested and compared speeds. Did you use sipp? Did you have a standalone test program? How did you sequence the calls, how long did they last, how many scheduler calls were involved per call, etc?

Also, unrelated to your implementation, but still interesting: in my own perf testing of asterisk, I was noticing that the cpu load near saturation was quite 'wavy' in my sipp tests, where I was looking at very short duration calls (maximal call setup/takedown exercise), and I wondered if it could be due to some sort of 'harmonic' in the scheduler; never had time to see if that was causing it, or what might be done about it. Felt more like a 'beat frequency'... something to think about, anyway.


  1. I chose the heap because it is a very common data structure for priority queues, and it's very easy to implement.  Not only is it easy, but it asymptotically destroys the performance of the list.  You also get some benefit out of the fact that it's not a fully sorted container.  It doesn't need to be.  The only thing that matters is that you know what the highest priority entry is.
    
    The performance testing was by using a simple benchmark test that lives in test_sched.c.  The code is in the diff posted here.
    
    It is not possible for this to degrade to a list.  A heap is a balanced data structure.  After every operation, you still have an essentially complete binary tree.

https://reviewboard.asterisk.org/ runs on a server provided by Digium, Inc. and uses bandwidth donated to the open source Asterisk community by API Digital Communications in Huntsville, AL USA.
Please report problems with this site to asteriskteam@digium.com.