# Travis Hance: Suffix Arrays in Linear Time

I think splay trees are really surprising, but I’m not sure if that “counts”, since it’s more of a data structure than an algorithm.

After splay trees, I think the next most surprising algorithm is the O(n) construction of a suffix array of a string. (The suffix array is an ordering of the suffixes of a string over a fixed alphabet in lexicographic order). It’s surprising to me that the suffix array can be computed that fast, plus I think the algorithm is kind of cool. There are a few ways to do it, I think, but I’ll briefly describe the one I learned. The idea is that to find the suffix array of the string x_1,…,x_n, you first recursively compute the suffix arrays of the (2n/3)-length string

<x_1, x_2, x_3>, <x_4, x_5, x_6>, …, <x_{n-2}, x_{n-1}, x_n>, $, <x_2, x_3, x_4>, <x_5, x_6, x_7>, …, <x_{n-1}, x_n, $>

After recursively computing the suffix array of this, we now have sorted the suffixes which start at position i where i = 1 or 2 mod 3. Again by radix sorting, you can then sort suffixes which start at positions 0 mod 3. Now, we have two “partial” suffix arrays, and we just need to merge these two arrays together. To compare a suffix which starts at a 0 mod 3 position with a suffix starting at a 1 mod 3 position, we can we compare the first characters, and if they are equal, we end up needing to compare some suffix starting at a 1 mod 3 position and a suffix starting at a 2 mod 3 position, which we can already do. Similarly, we can compare a 0 mod 3 suffix with a 2 mod 3 suffix, allowing all suffixes between our two partial suffix arrays to be compared, and so they can merged.

Again, the reason I find this surprising is simply that when I first heard about the problem, it was surprising to me that it could be done in linear time.