Keep Learning: June 2009

Monday, June 29, 2009

Suffix Trees: Longest Common Substring and Diff Implementation

The Longest Common Substring (LCS) is the longest string that is a substring of two or more strings.
LCS could be found with using of Generalized Suffix Tree (GST).

Steps to find LCS:

build generalized suffix tree:
1. concatenate original strings;
2. build suffix tree.
calculate special status for each node of the generalized suffix tree:
- original strings labels;
- tree height or suffix length.
find longest common substring - traverse the tree to find a node with maximal tree height and labeled with all original strings.

Download source code of Longest Common Substring and Diff Implementation.

Class diagram:

Helper Inner classes:

LCSNodeStatus - Suffix Tree Node Status for Longest Common Substring (LCS)
CommonSubstr - Result of searching Longest Common Substring. Also represents partial diff result.

Nodes statuses:

Node statuses could be calculated:

Map<Node, LCSNodeStatus> statuses = new HashMap<Node, LCSNodeStatus>();
getLCSNodeStatus(suffixTree.getRootNode(), 0, statuses);

where

private LCSNodeStatus getLCSNodeStatus(Node node, int height, Map<Node, LCSNodeStatus> statuses) {
    LCSNodeStatus nodeStatus = new LCSNodeStatus(node, height);
    if (node.getEdges().size() == 0) {
        return nodeStatus;
    }
  
    for (Edge edge : node.getEdges()) {
        LCSNodeStatus status = getLCSNodeStatus(edge.getEndNode(),
                height + getEdgeHeightForNodeStatus(edge), statuses);
  
        status.addString(getEdgeStringNumber(edge));
        nodeStatus.mergeStatus(status);
    }
    statuses.put(node, nodeStatus);
    return nodeStatus;
}

Find LCS:

private CommonSubstr getLcs(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
    int max = 0;
    int foundBeginIndexes[] = null;
  
    for (LCSNodeStatus status : statuses.values()) {
        if ((status.getHeight() > 0) && (status.isAllStrings()) && (max <= status.getHeight())) {
            Node node = status.getNode();
            int workingBeginIndexes[] = initFoundBeginIndexes();
  
            updateFoundBeginIndexes(beginIndexes, endIndexes, node, status.getHeight(), 
                    statuses, workingBeginIndexes);
  
            if (verifyFoundBeginIndexes(workingBeginIndexes)) {
                foundBeginIndexes = workingBeginIndexes;
                max = status.getHeight();
            }
        }
    }
  
    if (foundBeginIndexes == null)
        return null;
  
    return new CommonSubstr(foundBeginIndexes, max);
}
  
private int[] initFoundBeginIndexes() {
    int beginIndexes[] = new int[texts.length];
    for(int i=0; i<texts.length; i++)
        beginIndexes[i] = Integer.MAX_VALUE;
    return beginIndexes;
}
  
private void updateFoundBeginIndexes(int beginIndexes[], int endIndexes[], Node node, int height,
                                        Map<Node, LCSNodeStatus> statuses, int[] foundBeginIndexes) {
    for (Edge edge : node.getEdges()) {
        LCSNodeStatus nodeStatus = statuses.get(edge.getEndNode());
        if ((nodeStatus != null) && nodeStatus.isAllStrings()) {
            updateFoundBeginIndexes(beginIndexes, endIndexes, edge.getEndNode(),
                        height + getEdgeHeight(edge), statuses, foundBeginIndexes);
        } else {
            int stringNumber = getEdgeStringNumber(edge);
            int beginIndex = edge.getBeginIndex() - height;
  
            if ((beginIndex < endIndexes[stringNumber]) &&
                    (beginIndex >= beginIndexes[stringNumber]) &&
                    (foundBeginIndexes[stringNumber] > beginIndex)) {
  
                foundBeginIndexes[stringNumber] = beginIndex;
            }
        }
    }
}

private boolean verifyFoundBeginIndexes(int[] beginIndexes) {
    for(int i=0; i<texts.length; i++)
        if (beginIndexes[i] == Integer.MAX_VALUE)
            return false;
    return true;
}

Diff Implementation:

find LCS;
call diff for left part of strings (before LCS);
call diff for right part of string (after LCS);
join results.

To define strings search parts 2 arrays are used: beginIndexes[], endIndexes[].

private List<CommonSubstr> diff(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
    CommonSubstr commonSubstr = getLcs(beginIndexes, endIndexes, statuses);
    if (commonSubstr == null)
        return new LinkedList<CommonSubstr>();
  
    List<CommonSubstr> prev = diff(beginIndexes, commonSubstr.beginIndexes, statuses);
    List<CommonSubstr> next = diff(incIndexes(commonSubstr.endIndexes), endIndexes, statuses);
  
    prev.add(commonSubstr);
    if (next != null)
        prev.addAll(next);
    return prev;
}

Monday, June 8, 2009

Generalized Suffix Trees: Java Applet

A Generalized Suffix Tree is a Suffix Tree for a set of strings.
Usage of Generalized Suffix Trees Java Applet is similar to Suffix Trees Java Applet:

enter 2 strings;
# and $ are used as strings terminators;
press "Build GST" button;
it will build suffix tree and create visual presentation in some auto calculated layout;
drag and drop nodes to find best layout;
you need Java installed and enabled in your browser;
applet uses Java implementation of Ukkonen's Algorithm to build Suffix Tree.

Here are most used examples of Generalized Suffix Trees:

Source code of GeneralizedSuffixTree class that is based on SuffixTree class.

public class GeneralizedSuffixTree {
    // SOH - START OF HEADING
    public static final char TERMINATOR1 = '\u0001';
    // STX - START OF TEXT
    public static final char TERMINATOR2 = '\u0002';
   
    private String texts[];
    private char terminators[];
   
    private String generalized;
    private SuffixTree suffixTree;
   
    public GeneralizedSuffixTree(String text1, String text2) {
        this(text1, text2, TERMINATOR1, TERMINATOR2);
    }
   
    public GeneralizedSuffixTree(String text1, String text2, char terminator1, char terminator2) {
        this(new String[]{text1, text2}, new char[]{terminator1, terminator2});
    }
   
    public GeneralizedSuffixTree(String texts[], char terminators[]) {
        this.texts = texts;
        this.terminators = terminators;
   
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.length; i++) {
            sb.append(texts[i]);
            sb.append(terminators[i]);
        }
        generalized = sb.toString();
        suffixTree = new SuffixTree(generalized);
        fixSpanSuffixes(suffixTree.getRootNode());
    }
   
    private void fixSpanSuffixes(Node node) {
        for (Edge edge : node.getEdges()) {
            for (int i = 0; i < texts.length; i++) {
                if ((edge.getBeginIndex() <= getGeneralizedSubstringLength(i)) &&
                        (edge.getEndIndex() > getGeneralizedSubstringLength(i))) {
   
                    edge.setEndIndex(getGeneralizedSubstringLength(i));
                    continue;
                }
            }
            fixSpanSuffixes(edge.getEndNode());
        }
    }
   
    private int getGeneralizedSubstringLength(int n) {
        int length = 0;
        for (int i = 0; i <= n; i++) {
            length += texts[i].length() + 1;
        }
        return length - 1;
    }
}

Suffix Trees: Java Applet Sources

Suffix Trees Java Applet is based on JFC/Swing and Applets.

Applet main classes:

STApplet - main applet class;
STControlPanel - control panel ;
STDrawPanel - draw panel.

Applet main classes diagram:

GUI classes:

GUIObject - base class;
GUIObserver - observer interface;
GUINode - suffix tree node;
GUIEdge - suffix tree edge;
GUILabel - suffix label, linked to edge;
GUILink - suffix link;
ArrowHead - helper arrow shape.

GUI classes diagram:

Download Suffix Trees Java Applet Sources

See also: my other Suffix Trees posts

Suffix Trees: sample trees

Here are most used samples of suffix trees.
They are built online with Suffix Trees Java Applet

Suffix Trees: Java Applet

How to use Suffix Trees Java Applet:

enter string;
press "Build Suffix Tree" button;
it will build suffix tree and create visual presentation in some auto calculated layout;
drag and drop nodes to find best layout;
you need Java installed and enabled in your browser;
applet uses Java implementation of Ukkonen's Algorithm.

Some samples are available here.

Applet sources are discussed and could be download here.

Please refer previous posts about Suffix Trees:

Friday, June 5, 2009

Recent Google News

Some recent Google news:

Page Speed
- Introducing Page Speed
- Page Speed
Google Chrome
- Google Chrome Release Channels
GWT
- Google Wave: Powered by GWT

Keep Learning

Monday, June 29, 2009

Suffix Trees: Longest Common Substring and Diff Implementation

Monday, June 8, 2009

Generalized Suffix Trees: Java Applet

Suffix Trees: Java Applet Sources

Suffix Trees: sample trees

Suffix Trees: Java Applet

Friday, June 5, 2009

Recent Google News

About myself

Labels

ToDo

History

Formatters

University

Followers

Readers

See also

Podcasts

Keep Learning

Monday, June 29, 2009

Suffix Trees: Longest Common Substring and Diff Implementation

Monday, June 8, 2009

Generalized Suffix Trees: Java Applet

Suffix Trees: Java Applet Sources

Suffix Trees: sample trees

Suffix Trees: Java Applet

Friday, June 5, 2009

Recent Google News

About myself

Labels

ToDo

History

Formatters

University

Followers

Subscribe To

Readers

See also

Podcasts