Sunday, May 24, 2009

Suffix Trees: Refactored Java Code

In this post I will try to describe refactored java code.
Please refer previous post about Suffix Trees - Suffix Trees: Java Ukkonen's Algorithm

Tree structure:

Tree consists of:
  • nodes;
  • edges.
Edge has next data:
  • reference to start node;
  • reference to end node;
  • sub string begin index;
  • sub string end index.
Main methods:
  • split edge by a Suffix and return a new Node

Node
has next data:
  • reference to suffix node;
  • hash map of edge references;
  • name.
Main methods:
  • add Edge into hash map
  • remove Edge from hash map
  • find Edge in hash map by character

Suffix
is important for building tree. It has next data:
  • reference to origin node;
  • sub string begin index;
  • sub string end index.
Main methods:
  • canonize

All classes are wrapped by SuffixTree class that creates a tree. It has:
  • input text;
  • root Node.

Class diagram:



Future plans:
  • suffix trees use cases
  • further refactoring

Download Suffix Trees Refactored Java Code.
Several test runs are listed below:

book
Edge  Start   End   Suf   First Last  String
  1     0     1     -1    0     3     book
  3     0     3     0     1     1     o
  5     0     5     -1    3     3     k
  2     3     2     -1    2     3     ok
  4     3     4     -1    3     3     k

bookke
Edge  Start   End   Suf   First Last  String
  8     0     8     -1    5     5     e
  1     0     1     -1    0     5     bookke
  3     0     3     0     1     1     o
  6     0     6     0     3     3     k
  2     3     2     -1    2     5     okke
  4     3     4     -1    3     5     kke
  7     6     7     -1    5     5     e
  5     6     5     -1    4     5     ke

cacao
Edge  Start   End   Suf   First Last  String
  3     0     3     5     0     1     ca
  5     0     5     0     1     1     a
  7     0     7     -1    4     4     o
  1     3     1     -1    2     4     cao
  4     3     4     -1    4     4     o
  2     5     2     -1    2     4     cao
  6     5     6     -1    4     4     o

googol
Edge  Start   End   Suf   First Last  String
  5     0     5     3     0     1     go
  3     0     3     0     1     1     o
  8     0     8     -1    5     5     l
  1     5     1     -1    2     5     ogol
  6     5     6     -1    5     5     l
  4     3     4     -1    3     5     gol
  2     3     2     -1    2     5     ogol
  7     3     7     -1    5     5     l

abababc
Edge  Start   End   Suf   First Last  String
  9     0     9     0     1     1     b
  11    0     11    -1    6     6     c
  7     0     7     9     0     1     ab
  10    9     10    -1    6     6     c
  5     9     5     7     2     3     ab
  8     7     8     -1    6     6     c
  3     7     3     5     2     3     ab
  6     5     6     -1    6     6     c
  2     5     2     -1    4     6     abc
  4     3     4     -1    6     6     c
  1     3     1     -1    4     6     abc

5 comments:

Illya Havsiyevych said...

see also more updated Suffix Trees: Java Code as Separate Classes

Andrés said...

The link to the code is broken. Can you upload it again, please? Thanks!

Illya Havsiyevych said...

Fixed, Sorry

Anonymous said...

The code is broken... or has some bug

Start End Suf First Last String
0 1 -1 0 10 ATATATATATA
0 2 -1 1 10 TATATATATA
Suffix : ATATATATATA
comparing: ATATATATATA to ATATATATATA
Suffix : TATATATATA
comparing: TATATATATA to TATATATATA
Suffix 0 count wrong!
Suffix 1 count wrong!
Suffix 2 count wrong!
Suffix 3 count wrong!
Suffix 4 count wrong!
Suffix 5 count wrong!
Suffix 6 count wrong!
Suffix 7 count wrong!
Suffix 8 count wrong!
Leaf count : 2 Error!
Branch count : 2 OK

Illya Havsiyevych said...

Christina,
1) try to add $ to the end of your string;
2) try applet
http://illya-keeplearning.blogspot.com/2009/06/suffix-trees-java-applet.html