Sunday, October 25, 2009

XML API Internals

XML gurus define 5 styles of XML APIs:
  • event-based API or streaming API:
    • push model: SAX, Xerces Native Interface (XNI);
    • pull model: XPP3, StAX;
  • tree-based API: DOM, JDOM, dom4j;
  • data-binding API;
  • query API: XSLT, XPath;
More details could be found here.

Analysis illustrating how XML APIs relate to each other could be found here.

Estimated timeline of Java XML APIs could be found here.

Sunday, October 11, 2009

Optimize Queries in MySQL

List of resources about MySQL query optimization:

Interview Questions

Below is the list of Interview Questions by Jesse Farmer:

Saturday, July 4, 2009

Suffix Trees based Diff Java Applet

Suffix Trees based Diff Java Applet:




See also: my other Suffix Trees posts

Monday, June 29, 2009

Suffix Trees: Longest Common Substring and Diff Implementation

The Longest Common Substring (LCS) is the longest string that is a substring of two or more strings.
LCS could be found with using of Generalized Suffix Tree (GST).

Steps to find LCS:
  1. build generalized suffix tree:
    1. concatenate original strings;
    2. build suffix tree.
  2. calculate special status for each node of the generalized suffix tree:
    • original strings labels;
    • tree height or suffix length.
  3. find longest common substring - traverse the tree to find a node with maximal tree height and labeled with all original strings.

Download source code of Longest Common Substring and Diff Implementation.

Class diagram:

Helper Inner classes:
  • LCSNodeStatus - Suffix Tree Node Status for Longest Common Substring (LCS)
  • CommonSubstr - Result of searching Longest Common Substring. Also represents partial diff result.
Nodes statuses:

Node statuses could be calculated:
Map<Node, LCSNodeStatus> statuses = new HashMap<Node, LCSNodeStatus>();
getLCSNodeStatus(suffixTree.getRootNode(), 0, statuses);

where
private LCSNodeStatus getLCSNodeStatus(Node node, int height, Map<Node, LCSNodeStatus> statuses) {
LCSNodeStatus nodeStatus = new LCSNodeStatus(node, height);
if (node.getEdges().size() == 0) {
return nodeStatus;
}

for (Edge edge : node.getEdges()) {
LCSNodeStatus status = getLCSNodeStatus(edge.getEndNode(),
height + getEdgeHeightForNodeStatus(edge), statuses);

status.addString(getEdgeStringNumber(edge));
nodeStatus.mergeStatus(status);
}
statuses.put(node, nodeStatus);
return nodeStatus;
}

Find LCS:
private CommonSubstr getLcs(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
int max = 0;
int foundBeginIndexes[] = null;

for (LCSNodeStatus status : statuses.values()) {
if ((status.getHeight() > 0) && (status.isAllStrings()) && (max <= status.getHeight())) {
Node node = status.getNode();
int workingBeginIndexes[] = initFoundBeginIndexes();

updateFoundBeginIndexes(beginIndexes, endIndexes, node, status.getHeight(),
statuses, workingBeginIndexes);

if (verifyFoundBeginIndexes(workingBeginIndexes)) {
foundBeginIndexes = workingBeginIndexes;
max = status.getHeight();
}
}
}

if (foundBeginIndexes == null)
return null;

return new CommonSubstr(foundBeginIndexes, max);
}

private int[] initFoundBeginIndexes() {
int beginIndexes[] = new int[texts.length];
for(int i=0; i<texts.length; i++)
beginIndexes[i] = Integer.MAX_VALUE;
return beginIndexes;
}

private void updateFoundBeginIndexes(int beginIndexes[], int endIndexes[], Node node, int height,
Map<Node, LCSNodeStatus> statuses, int[] foundBeginIndexes) {
for (Edge edge : node.getEdges()) {
LCSNodeStatus nodeStatus = statuses.get(edge.getEndNode());
if ((nodeStatus != null) && nodeStatus.isAllStrings()) {
updateFoundBeginIndexes(beginIndexes, endIndexes, edge.getEndNode(),
height + getEdgeHeight(edge), statuses, foundBeginIndexes);
} else {
int stringNumber = getEdgeStringNumber(edge);
int beginIndex = edge.getBeginIndex() - height;

if ((beginIndex < endIndexes[stringNumber]) &&
(beginIndex >= beginIndexes[stringNumber]) &&
(foundBeginIndexes[stringNumber] > beginIndex)) {

foundBeginIndexes[stringNumber] = beginIndex;
}
}
}
}

private boolean verifyFoundBeginIndexes(int[] beginIndexes) {
for(int i=0; i<texts.length; i++)
if (beginIndexes[i] == Integer.MAX_VALUE)
return false;
return true;
}

Diff Implementation:
  • find LCS;
  • call diff for left part of strings (before LCS);
  • call diff for right part of string (after LCS);
  • join results.

To define strings search parts 2 arrays are used: beginIndexes[], endIndexes[].
private List<CommonSubstr> diff(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
CommonSubstr commonSubstr = getLcs(beginIndexes, endIndexes, statuses);
if (commonSubstr == null)
return new LinkedList<CommonSubstr>();

List<CommonSubstr> prev = diff(beginIndexes, commonSubstr.beginIndexes, statuses);
List<CommonSubstr> next = diff(incIndexes(commonSubstr.endIndexes), endIndexes, statuses);

prev.add(commonSubstr);
if (next != null)
prev.addAll(next);
return prev;
}

See also: my other Suffix Trees posts

Monday, June 8, 2009

Generalized Suffix Trees: Java Applet

A Generalized Suffix Tree is a Suffix Tree for a set of strings.
Usage of Generalized Suffix Trees Java Applet is similar to Suffix Trees Java Applet:
  • enter 2 strings;
  • # and $ are used as strings terminators;
  • press "Build GST" button;
  • it will build suffix tree and create visual presentation in some auto calculated layout;
  • drag and drop nodes to find best layout;
  • you need Java installed and enabled in your browser;
  • applet uses Java implementation of Ukkonen's Algorithm to build Suffix Tree.




Here are most used examples of Generalized Suffix Trees:



Source code of GeneralizedSuffixTree class that is based on SuffixTree class.
public class GeneralizedSuffixTree {
// SOH - START OF HEADING
public static final char TERMINATOR1 = '\u0001';
// STX - START OF TEXT
public static final char TERMINATOR2 = '\u0002';

private String texts[];
private char terminators[];

private String generalized;
private SuffixTree suffixTree;

public GeneralizedSuffixTree(String text1, String text2) {
this(text1, text2, TERMINATOR1, TERMINATOR2);
}

public GeneralizedSuffixTree(String text1, String text2, char terminator1, char terminator2) {
this(new String[]{text1, text2}, new char[]{terminator1, terminator2});
}

public GeneralizedSuffixTree(String texts[], char terminators[]) {
this.texts = texts;
this.terminators = terminators;

StringBuilder sb = new StringBuilder();
for (int i = 0; i < texts.length; i++) {
sb.append(texts[i]);
sb.append(terminators[i]);
}
generalized = sb.toString();
suffixTree = new SuffixTree(generalized);
fixSpanSuffixes(suffixTree.getRootNode());
}

private void fixSpanSuffixes(Node node) {
for (Edge edge : node.getEdges()) {
for (int i = 0; i < texts.length; i++) {
if ((edge.getBeginIndex() <= getGeneralizedSubstringLength(i)) &&
(edge.getEndIndex() > getGeneralizedSubstringLength(i))) {

edge.setEndIndex(getGeneralizedSubstringLength(i));
continue;
}
}
fixSpanSuffixes(edge.getEndNode());
}
}

private int getGeneralizedSubstringLength(int n) {
int length = 0;
for (int i = 0; i <= n; i++) {
length += texts[i].length() + 1;
}
return length - 1;
}
}

See also: my other Suffix Trees posts

Suffix Trees: Java Applet Sources

Suffix Trees Java Applet is based on JFC/Swing and Applets.

Applet main classes:
  • STApplet - main applet class;
  • STControlPanel - control panel ;
  • STDrawPanel - draw panel.
Applet main classes diagram:

GUI classes:
  • GUIObject - base class;
  • GUIObserver - observer interface;
  • GUINode - suffix tree node;
  • GUIEdge - suffix tree edge;
  • GUILabel - suffix label, linked to edge;
  • GUILink - suffix link;
  • ArrowHead - helper arrow shape.
GUI classes diagram:


Download Suffix Trees Java Applet Sources

See also: my other Suffix Trees posts