Saturday, April 18, 2009

Suffix Trees: Java Ukkonen's Algorithm

Suffix Tree is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations. Details...

Ukkonen's algorithm begins with an implicit suffix tree containing the first character of the string. Then it steps through the string adding successive characters until the tree is complete. This order addition of characters gives Ukkonen's algorithm its "on-line" property; earlier algorithms proceeded backward from the last character. The implementation of this algorithm requires O(n) (linear time).

Check Suffix Tree Java Source Code
This is a Java-port of Mark Nelson's C++ implementation of Ukkonen's algorithm.

Test run and validation is moved to SuffixTreeTest JUnit Test Case.
Could be tested with Maven or from IDE.

mvn test
...
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running com.blogspot.illyakeeplearning.suffixtree.SuffixTreeTest
Start End   Suf   First Last  String
0     5     0     1     1     a
0     3     5     0     1     ca
0     7     -1    4     4     o
3     1     -1    2     4     cao
3     4     -1    4     4     o
5     2     -1    2     4     cao
5     6     -1    4     4     o
Suffix : acao
comparing: acao to acao
Suffix : ao
comparing: ao to ao
Suffix : cacao
comparing: cacao to cacao
Suffix : cao
comparing: cao to cao
Suffix : o
comparing: o to o
All Suffixes present!
Leaf count : 5 OK
Branch count : 7 OK
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.032 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

Future plans:
  • refactor to have less global variables
  • reuse java hash
References:

14 comments:

Illya Havsiyevych said...

see also Suffix Trees: Refactored Java Code

dkirasic said...

Does it work for "banana"?

Anonymous said...
This comment has been removed by a blog administrator.
audriusa said...

Under which license this code is available (GPL, LGPL, BSD, try before you buy, non commercial) ? This matters a lot if somebody wants to try to use the code in some non-hobby and non-research project.

audriusa said...

I decided to check the algorithm more seriously by extending automated test that initially passes. For that, I replaced the single pre-defined string with the loop over multiple, randomly generated strings, the beginning of the test now looking as

import static java.lang.System.out;
import static java.lang.System.err;
import java.util.Random;

public class SuffixTreeTest extends TestCase {
String line;
Random r = new Random(System.currentTimeMillis());

@Override
protected void setUp() throws Exception {
super.setUp();
}

public void testSeriously() {
StringBuilder b = new StringBuilder();
for (int i = 4; i < 30; i++) {
b.setLength(0);
for (int p = 0; p < i; p++) {
b.append( (char) ('a'+r.nextInt('z'-'a')));
}
b.append('$');
line = b.toString();
System.out.println("Test "+line);
_testSuffixTree();
}
}

(..)

Such extended version reports numerous failures on various strings, always with the same error message Logic error on node 1, not a leaf or internal node!. If the author is still active, it would be really nice to get some feedback on this result, or maybe even the corrected version.

P.S. The test string in the test must be terminated by the unique termination character. While cacao passes without it (as "o" is unique in this word), banana does not. However banana$ passes without issues. The examples of the failing string would be nboggh$ or vejuw$.

Unknown said...

Hi,

Do you have the algorithm to extract the maximal repeats from the suffix tree.

Thanks,

Luay Alawneh

Unknown said...
This comment has been removed by the author.
Prabhakar Srinivasan said...

Thanks Ilya, btw is your tree building in quadratic or linear time?
I really like the implementation. it is fast, and well-coded and has unit tests which make it so easy to understand and use.

Prabhakar
prabhakar.srinivasan@gmail.com

Blogger said...

Is not possible download code.

http://illya.yolasite.com/resources/suffix-tree.zip

403: Forbiden.

Unknown said...

Not possible to download the code

Unknown said...

Not possible to download the code

Vlad Patryshev said...

Man, is your Java code secret now? Download is not allowed.

Illya Havsiyevych said...

Dear all,
I was a bit busy last years.
But now I was able to find original sources and upload to GitHub

Anonymous said...

Great post, keep sharing...

Derek Surabaya | Derek Sidoarjo | Derek Probolinggo | Derek Pasuruan | Derek Banyuwangi | Malang | Derek Lumajang | Derek Bali | Derek Caruban | Derek Mojokerto | Derek Nganjuk | Derek Jombang | Derek Madiun | Derek Ngawi | Derek Sragen | Derek Solo | Derek Blora | Derek Cepu | Derek Jogja | Derek Salatiga | Derek Semarang | Derek Jakarta | Derek Bekasi | Derek Bandung | Derek Bogor | Derek Cirebon