Keep Learning: Suffix Trees: Java Ukkonen's Algorithm

Saturday, April 18, 2009

Suffix Trees: Java Ukkonen's Algorithm

Suffix Tree is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations. Details...

Ukkonen's algorithm begins with an implicit suffix tree containing the first character of the string. Then it steps through the string adding successive characters until the tree is complete. This order addition of characters gives Ukkonen's algorithm its "on-line" property; earlier algorithms proceeded backward from the last character. The implementation of this algorithm requires O(n) (linear time).

Check Suffix Tree Java Source Code
This is a Java-port of Mark Nelson's C++ implementation of Ukkonen's algorithm.

Test run and validation is moved to SuffixTreeTest JUnit Test Case.
Could be tested with Maven or from IDE.

mvn test
...
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running com.blogspot.illyakeeplearning.suffixtree.SuffixTreeTest
Start End   Suf   First Last  String
0     5     0     1     1     a
0     3     5     0     1     ca
0     7     -1    4     4     o
3     1     -1    2     4     cao
3     4     -1    4     4     o
5     2     -1    2     4     cao
5     6     -1    4     4     o
Suffix : acao
comparing: acao to acao
Suffix : ao
comparing: ao to ao
Suffix : cacao
comparing: cacao to cacao
Suffix : cao
comparing: cao to cao
Suffix : o
comparing: o to o
All Suffixes present!
Leaf count : 5 OK
Branch count : 7 OK
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.032 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

Future plans:

refactor to have less global variables
reuse java hash

References:

14 comments:

Illya Havsiyevych said...: see also Suffix Trees: Refactored Java Code; May 27, 2009 at 11:44 AM
dkirasic said...: Does it work for "banana"?; November 4, 2009 at 11:11 AM
Anonymous said...: This comment has been removed by a blog administrator.; November 23, 2009 at 7:50 AM
audriusa said...: Under which license this code is available (GPL, LGPL, BSD, try before you buy, non commercial) ? This matters a lot if somebody wants to try to use the code in some non-hobby and non-research project.; March 7, 2011 at 11:50 AM
audriusa said...: I decided to check the algorithm more seriously by extending automated test that initially passes. For that, I replaced the single pre-defined string with the loop over multiple, randomly generated strings, the beginning of the test now looking as

import static java.lang.System.out;
import static java.lang.System.err;
import java.util.Random;

public class SuffixTreeTest extends TestCase {
String line;
Random r = new Random(System.currentTimeMillis());

@Override
protected void setUp() throws Exception {
super.setUp();
}

public void testSeriously() {
StringBuilder b = new StringBuilder();
for (int i = 4; i < 30; i++) {
b.setLength(0);
for (int p = 0; p < i; p++) {
b.append( (char) ('a'+r.nextInt('z'-'a')));
}
b.append('$');
line = b.toString();
System.out.println("Test "+line);
_testSuffixTree();
}
}

(..)

Such extended version reports numerous failures on various strings, always with the same error message Logic error on node 1, not a leaf or internal node!. If the author is still active, it would be really nice to get some feedback on this result, or maybe even the corrected version.

P.S. The test string in the test must be terminated by the unique termination character. While cacao passes without it (as "o" is unique in this word), banana does not. However banana$ passes without issues. The examples of the failing string would be nboggh$ or vejuw$.; March 7, 2011 at 2:28 PM
Unknown said...: Hi,

Do you have the algorithm to extract the maximal repeats from the suffix tree.

Thanks,

Luay Alawneh; April 20, 2011 at 5:53 PM
Unknown said...: This comment has been removed by the author.; April 20, 2011 at 5:55 PM
Prabhakar Srinivasan said...: Thanks Ilya, btw is your tree building in quadratic or linear time?
I really like the implementation. it is fast, and well-coded and has unit tests which make it so easy to understand and use.

Prabhakar
prabhakar.srinivasan@gmail.com; October 10, 2011 at 12:48 PM
Blogger said...: Is not possible download code.

http://illya.yolasite.com/resources/suffix-tree.zip

403: Forbiden.; October 28, 2016 at 8:28 AM
Unknown said...: Not possible to download the code; January 10, 2017 at 6:09 AM
Unknown said...: Not possible to download the code; January 10, 2017 at 6:10 AM
Vlad Patryshev said...: Man, is your Java code secret now? Download is not allowed.; January 30, 2017 at 7:38 AM
Illya Havsiyevych said...: Dear all,
I was a bit busy last years.
But now I was able to find original sources and upload to GitHub; September 7, 2017 at 3:06 PM
Anonymous said...: Great post, keep sharing...

Derek Surabaya | Derek Sidoarjo | Derek Probolinggo | Derek Pasuruan | Derek Banyuwangi | Malang | Derek Lumajang | Derek Bali | Derek Caruban | Derek Mojokerto | Derek Nganjuk | Derek Jombang | Derek Madiun | Derek Ngawi | Derek Sragen | Derek Solo | Derek Blora | Derek Cepu | Derek Jogja | Derek Salatiga | Derek Semarang | Derek Jakarta | Derek Bekasi | Derek Bandung | Derek Bogor | Derek Cirebon; October 18, 2019 at 11:58 AM

Keep Learning

Saturday, April 18, 2009

Suffix Trees: Java Ukkonen's Algorithm

14 comments:

About myself

Labels

ToDo

History

Formatters

University

Followers

Readers

See also

Podcasts

Keep Learning

Saturday, April 18, 2009

Suffix Trees: Java Ukkonen's Algorithm

14 comments:

About myself

Labels

ToDo

History

Formatters

University

Followers

Subscribe To

Readers

See also

Podcasts