Saturday, April 18, 2009

Suffix Trees: Java Ukkonen's Algorithm

Suffix Tree is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations. Details...

Ukkonen's algorithm begins with an implicit suffix tree containing the first character of the string. Then it steps through the string adding successive characters until the tree is complete. This order addition of characters gives Ukkonen's algorithm its "on-line" property; earlier algorithms proceeded backward from the last character. The implementation of this algorithm requires O(n) (linear time).

Check Suffix Tree Java Source Code
This is a Java-port of Mark Nelson's C++ implementation of Ukkonen's algorithm.

Test run and validation is moved to SuffixTreeTest JUnit Test Case.
Could be tested with Maven or from IDE.

mvn test
...
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running com.blogspot.illyakeeplearning.suffixtree.SuffixTreeTest
Start End Suf First Last String
0 5 0 1 1 a
0 3 5 0 1 ca
0 7 -1 4 4 o
3 1 -1 2 4 cao
3 4 -1 4 4 o
5 2 -1 2 4 cao
5 6 -1 4 4 o
Suffix : acao
comparing: acao to acao
Suffix : ao
comparing: ao to ao
Suffix : cacao
comparing: cacao to cacao
Suffix : cao
comparing: cao to cao
Suffix : o
comparing: o to o
All Suffixes present!
Leaf count : 5 OK
Branch count : 7 OK
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.032 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

Future plans:
  • refactor to have less global variables
  • reuse java hash
References:

8 comments:

Illya Havsiyevych said...

see also Suffix Trees: Refactored Java Code

dkirasic said...

Does it work for "banana"?

pbsl said...
This comment has been removed by a blog administrator.
Audrius said...

Under which license this code is available (GPL, LGPL, BSD, try before you buy, non commercial) ? This matters a lot if somebody wants to try to use the code in some non-hobby and non-research project.

Audrius said...

I decided to check the algorithm more seriously by extending automated test that initially passes. For that, I replaced the single pre-defined string with the loop over multiple, randomly generated strings, the beginning of the test now looking as

import static java.lang.System.out;
import static java.lang.System.err;
import java.util.Random;

public class SuffixTreeTest extends TestCase {
String line;
Random r = new Random(System.currentTimeMillis());

@Override
protected void setUp() throws Exception {
super.setUp();
}

public void testSeriously() {
StringBuilder b = new StringBuilder();
for (int i = 4; i < 30; i++) {
b.setLength(0);
for (int p = 0; p < i; p++) {
b.append( (char) ('a'+r.nextInt('z'-'a')));
}
b.append('$');
line = b.toString();
System.out.println("Test "+line);
_testSuffixTree();
}
}

(..)

Such extended version reports numerous failures on various strings, always with the same error message Logic error on node 1, not a leaf or internal node!. If the author is still active, it would be really nice to get some feedback on this result, or maybe even the corrected version.

P.S. The test string in the test must be terminated by the unique termination character. While cacao passes without it (as "o" is unique in this word), banana does not. However banana$ passes without issues. The examples of the failing string would be nboggh$ or vejuw$.

Luay said...

Hi,

Do you have the algorithm to extract the maximal repeats from the suffix tree.

Thanks,

Luay Alawneh

Luay said...
This comment has been removed by the author.
Dr.Phosphorus said...

Thanks Ilya, btw is your tree building in quadratic or linear time?
I really like the implementation. it is fast, and well-coded and has unit tests which make it so easy to understand and use.

Prabhakar
prabhakar.srinivasan@gmail.com