Saturday, December 23, 2017

Sometimes being stubborn does not pay off

From the GNU coreutils TODO file:

sort: Investigate better sorting algorithms; see Knuth vol. 3.

We tried list merge sort, but it was about 50% slower than the recursive algorithm currently used by sortlines, and it used more comparisons. We're not sure why this was, as the theory suggests it should do fewer comparisons, so perhaps this should be revisited.

List merge sort was implemented in the style of Knuth algorithm 5.2.4L, with the optimization suggested by exercise 5.2.4-22. The test case was 140,213,394 bytes, 426,4424 lines, text taken from the GCC 3.3 distribution, sort.c compiled with GCC 2.95.4 and running on Debian 3.0r1 GNU/Linux, 2.4GHz Pentium 4, single pass with no temporary files and plenty of RAM.

Since comparisons seem to be the bottleneck, perhaps the best algorithm to try next should be merge insertion.

$ git log -S 'sort: Investigate better sorting algorithms' --source --all
commit 93f9ffc6141ad028223a33e5ab0614aebbd4aa2e refs/tags/CPPI-1_11
Author: Jim Meyering 
Date:   Sat Aug 2 06:27:13 2003 +0000

    Document in TODO Paul's desire to make sort faster (and how he
    was foiled this time around).
    from Paul Eggert.