Sunday, October 25, 2009

XML API Internals

XML gurus define 5 styles of XML APIs:
  • event-based API or streaming API:
    • push model: SAX, Xerces Native Interface (XNI);
    • pull model: XPP3, StAX;
  • tree-based API: DOM, JDOM, dom4j;
  • data-binding API;
  • query API: XSLT, XPath;
More details could be found here.

Analysis illustrating how XML APIs relate to each other could be found here.

Estimated timeline of Java XML APIs could be found here.

Saturday, July 4, 2009

Suffix Trees based Diff Java Applet

Suffix Trees based Diff Java Applet:




See also: my other Suffix Trees posts

Monday, June 29, 2009

Suffix Trees: Longest Common Substring and Diff Implementation

The Longest Common Substring (LCS) is the longest string that is a substring of two or more strings.
LCS could be found with using of Generalized Suffix Tree (GST).

Steps to find LCS:
  1. build generalized suffix tree:
    1. concatenate original strings;
    2. build suffix tree.
  2. calculate special status for each node of the generalized suffix tree:
    • original strings labels;
    • tree height or suffix length.
  3. find longest common substring - traverse the tree to find a node with maximal tree height and labeled with all original strings.

Download source code of Longest Common Substring and Diff Implementation.

Class diagram:

Helper Inner classes:
  • LCSNodeStatus - Suffix Tree Node Status for Longest Common Substring (LCS)
  • CommonSubstr - Result of searching Longest Common Substring. Also represents partial diff result.
Nodes statuses:

Node statuses could be calculated:
Map<Node, LCSNodeStatus> statuses = new HashMap<Node, LCSNodeStatus>();
getLCSNodeStatus(suffixTree.getRootNode(), 0, statuses);

where
private LCSNodeStatus getLCSNodeStatus(Node node, int height, Map<Node, LCSNodeStatus> statuses) {
LCSNodeStatus nodeStatus = new LCSNodeStatus(node, height);
if (node.getEdges().size() == 0) {
return nodeStatus;
}

for (Edge edge : node.getEdges()) {
LCSNodeStatus status = getLCSNodeStatus(edge.getEndNode(),
height + getEdgeHeightForNodeStatus(edge), statuses);

status.addString(getEdgeStringNumber(edge));
nodeStatus.mergeStatus(status);
}
statuses.put(node, nodeStatus);
return nodeStatus;
}

Find LCS:
private CommonSubstr getLcs(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
int max = 0;
int foundBeginIndexes[] = null;

for (LCSNodeStatus status : statuses.values()) {
if ((status.getHeight() > 0) && (status.isAllStrings()) && (max <= status.getHeight())) {
Node node = status.getNode();
int workingBeginIndexes[] = initFoundBeginIndexes();

updateFoundBeginIndexes(beginIndexes, endIndexes, node, status.getHeight(),
statuses, workingBeginIndexes);

if (verifyFoundBeginIndexes(workingBeginIndexes)) {
foundBeginIndexes = workingBeginIndexes;
max = status.getHeight();
}
}
}

if (foundBeginIndexes == null)
return null;

return new CommonSubstr(foundBeginIndexes, max);
}

private int[] initFoundBeginIndexes() {
int beginIndexes[] = new int[texts.length];
for(int i=0; i<texts.length; i++)
beginIndexes[i] = Integer.MAX_VALUE;
return beginIndexes;
}

private void updateFoundBeginIndexes(int beginIndexes[], int endIndexes[], Node node, int height,
Map<Node, LCSNodeStatus> statuses, int[] foundBeginIndexes) {
for (Edge edge : node.getEdges()) {
LCSNodeStatus nodeStatus = statuses.get(edge.getEndNode());
if ((nodeStatus != null) && nodeStatus.isAllStrings()) {
updateFoundBeginIndexes(beginIndexes, endIndexes, edge.getEndNode(),
height + getEdgeHeight(edge), statuses, foundBeginIndexes);
} else {
int stringNumber = getEdgeStringNumber(edge);
int beginIndex = edge.getBeginIndex() - height;

if ((beginIndex < endIndexes[stringNumber]) &&
(beginIndex >= beginIndexes[stringNumber]) &&
(foundBeginIndexes[stringNumber] > beginIndex)) {

foundBeginIndexes[stringNumber] = beginIndex;
}
}
}
}

private boolean verifyFoundBeginIndexes(int[] beginIndexes) {
for(int i=0; i<texts.length; i++)
if (beginIndexes[i] == Integer.MAX_VALUE)
return false;
return true;
}

Diff Implementation:
  • find LCS;
  • call diff for left part of strings (before LCS);
  • call diff for right part of string (after LCS);
  • join results.

To define strings search parts 2 arrays are used: beginIndexes[], endIndexes[].
private List<CommonSubstr> diff(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
CommonSubstr commonSubstr = getLcs(beginIndexes, endIndexes, statuses);
if (commonSubstr == null)
return new LinkedList<CommonSubstr>();

List<CommonSubstr> prev = diff(beginIndexes, commonSubstr.beginIndexes, statuses);
List<CommonSubstr> next = diff(incIndexes(commonSubstr.endIndexes), endIndexes, statuses);

prev.add(commonSubstr);
if (next != null)
prev.addAll(next);
return prev;
}

See also: my other Suffix Trees posts

Monday, June 8, 2009

Generalized Suffix Trees: Java Applet

A Generalized Suffix Tree is a Suffix Tree for a set of strings.
Usage of Generalized Suffix Trees Java Applet is similar to Suffix Trees Java Applet:
  • enter 2 strings;
  • # and $ are used as strings terminators;
  • press "Build GST" button;
  • it will build suffix tree and create visual presentation in some auto calculated layout;
  • drag and drop nodes to find best layout;
  • you need Java installed and enabled in your browser;
  • applet uses Java implementation of Ukkonen's Algorithm to build Suffix Tree.




Here are most used examples of Generalized Suffix Trees:



Source code of GeneralizedSuffixTree class that is based on SuffixTree class.
public class GeneralizedSuffixTree {
// SOH - START OF HEADING
public static final char TERMINATOR1 = '\u0001';
// STX - START OF TEXT
public static final char TERMINATOR2 = '\u0002';

private String texts[];
private char terminators[];

private String generalized;
private SuffixTree suffixTree;

public GeneralizedSuffixTree(String text1, String text2) {
this(text1, text2, TERMINATOR1, TERMINATOR2);
}

public GeneralizedSuffixTree(String text1, String text2, char terminator1, char terminator2) {
this(new String[]{text1, text2}, new char[]{terminator1, terminator2});
}

public GeneralizedSuffixTree(String texts[], char terminators[]) {
this.texts = texts;
this.terminators = terminators;

StringBuilder sb = new StringBuilder();
for (int i = 0; i < texts.length; i++) {
sb.append(texts[i]);
sb.append(terminators[i]);
}
generalized = sb.toString();
suffixTree = new SuffixTree(generalized);
fixSpanSuffixes(suffixTree.getRootNode());
}

private void fixSpanSuffixes(Node node) {
for (Edge edge : node.getEdges()) {
for (int i = 0; i < texts.length; i++) {
if ((edge.getBeginIndex() <= getGeneralizedSubstringLength(i)) &&
(edge.getEndIndex() > getGeneralizedSubstringLength(i))) {

edge.setEndIndex(getGeneralizedSubstringLength(i));
continue;
}
}
fixSpanSuffixes(edge.getEndNode());
}
}

private int getGeneralizedSubstringLength(int n) {
int length = 0;
for (int i = 0; i <= n; i++) {
length += texts[i].length() + 1;
}
return length - 1;
}
}

See also: my other Suffix Trees posts

Suffix Trees: Java Applet Sources

Suffix Trees Java Applet is based on JFC/Swing and Applets.

Applet main classes:
  • STApplet - main applet class;
  • STControlPanel - control panel ;
  • STDrawPanel - draw panel.
Applet main classes diagram:

GUI classes:
  • GUIObject - base class;
  • GUIObserver - observer interface;
  • GUINode - suffix tree node;
  • GUIEdge - suffix tree edge;
  • GUILabel - suffix label, linked to edge;
  • GUILink - suffix link;
  • ArrowHead - helper arrow shape.
GUI classes diagram:


Download Suffix Trees Java Applet Sources

See also: my other Suffix Trees posts

Suffix Trees: sample trees

Here are most used samples of suffix trees.
They are built online with Suffix Trees Java Applet









See also: my other Suffix Trees posts

Suffix Trees: Java Applet

How to use Suffix Trees Java Applet:
  • enter string;
  • press "Build Suffix Tree" button;
  • it will build suffix tree and create visual presentation in some auto calculated layout;
  • drag and drop nodes to find best layout;
  • you need Java installed and enabled in your browser;
  • applet uses Java implementation of Ukkonen's Algorithm.
Some samples are available here.

Applet sources are discussed and could be download here.




Please refer previous posts about Suffix Trees:

Sunday, May 31, 2009

Suffix Trees: Java Code as Separate Classes

In this post - Suffix Trees: Java Code as Separate Classes.
Download java source code.

Please refer previous posts about Suffix Trees:
Class diagram:

contains and indexOf functions:
public boolean contains(String str) {
    return indexOf(str) >= 0;
}
   public int indexOf(String str) {
    if (str.length() == 0)
        return -1;
   
    int index = -1;
    Node node = root;
   
    int i = 0;
    while (i<str.length()) {
        if ((node == null) || (i == text.length()))
            return -1;
   
        Edge edge = node.findEdge(str.charAt(i));
        if (edge == null)
            return -1;
   
        index = edge.getBeginIndex()-i;
        i++;
   
        for(int j=edge.getBeginIndex()+1; j<=edge.getEndIndex(); j++) {
            if (i == str.length())
                break;
            if (text.charAt(j) != str.charAt(i))
                return -1;
            i++;
        }
        node = edge.getEndNode();
    }
    return index;
}

Future plans:
  • Generalized Suffix Tree;
  • Longest Common SubString;

More information about Suffix Trees Use Cases you can find in:

Friday, May 29, 2009

Stored Procedures in MySQL 5.x: log / trace messages

There is no standard way to log / trace message in MySQL.
But there is a trick - execute SQL queries like next:
SELECT 'log message' AS 'title';

So to have Log Message feature we need next stored procedures:
  • procedure to execute dynamic SQL statements
  • procedure to create SELECT SQL query and simulate log output

See the code below:

DELIMITER $$


-- PROCEDURE pLog
-- outputs log message
-- Params:
-- sTitle - title
-- sMsg - log message
DROP PROCEDURE IF EXISTS `pLog` $$
CREATE PROCEDURE `pLog`(IN sTitle VARCHAR(255), IN sMsg VARCHAR(255))
BEGIN
DECLARE strSQL VARCHAR(512);

SET strSQL = CONCAT('SELECT ''', sMsg, ''' AS ''', sTitle, '''');
CALL pExecuteImmediate(strSQL);
END $$


-- PROCEDURE pExecuteImmediate
-- executes dynamic SQL statement
-- Params:
-- tSQLStmt - SQL statement to be executed
DROP PROCEDURE IF EXISTS `pExecuteImmediate` $$
CREATE PROCEDURE `pExecuteImmediate`(IN tSQLStmt TEXT)
BEGIN
SET @executeImmediateSQL = tSQLStmt;
PREPARE executeImmediateSTML FROM @executeImmediateSQL;
EXECUTE executeImmediateSTML;
DEALLOCATE PREPARE executeImmediateSTML;
END $$


DELIMITER ;

Function pCheckCurrentUserRoot contains examples of using pLog

DELIMITER $$


-- FUNCTION pCheckCurrentUserRoot
-- checks if we are connected as root user
DROP PROCEDURE IF EXISTS `pCheckCurrentUserRoot` $$
CREATE PROCEDURE `pCheckCurrentUserRoot`()
BEGIN
DECLARE strUser VARCHAR(50);

CALL pLog('INFO', 'pCheckCurrentUserRoot...');

SELECT USER() INTO strUser;
CALL pLog('INFO', CONCAT('user: ', strUser));

IF (strUser LIKE 'root%') THEN
CALL pLog('INFO', 'current user is root');
ELSE
CALL pLog('INFO', 'current user is not root');
END IF;
END $$


DELIMITER ;

Additional information about Stored Procedures in MySQL 5.x could be found here:

Sunday, May 24, 2009

Suffix Trees: Refactored Java Code

In this post I will try to describe refactored java code.
Please refer previous post about Suffix Trees - Suffix Trees: Java Ukkonen's Algorithm

Tree structure:

Tree consists of:
  • nodes;
  • edges.
Edge has next data:
  • reference to start node;
  • reference to end node;
  • sub string begin index;
  • sub string end index.
Main methods:
  • split edge by a Suffix and return a new Node

Node
has next data:
  • reference to suffix node;
  • hash map of edge references;
  • name.
Main methods:
  • add Edge into hash map
  • remove Edge from hash map
  • find Edge in hash map by character

Suffix
is important for building tree. It has next data:
  • reference to origin node;
  • sub string begin index;
  • sub string end index.
Main methods:
  • canonize

All classes are wrapped by SuffixTree class that creates a tree. It has:
  • input text;
  • root Node.

Class diagram:



Future plans:
  • suffix trees use cases
  • further refactoring

Download Suffix Trees Refactored Java Code.
Several test runs are listed below:

book
Edge  Start   End   Suf   First Last  String
  1     0     1     -1    0     3     book
  3     0     3     0     1     1     o
  5     0     5     -1    3     3     k
  2     3     2     -1    2     3     ok
  4     3     4     -1    3     3     k

bookke
Edge  Start   End   Suf   First Last  String
  8     0     8     -1    5     5     e
  1     0     1     -1    0     5     bookke
  3     0     3     0     1     1     o
  6     0     6     0     3     3     k
  2     3     2     -1    2     5     okke
  4     3     4     -1    3     5     kke
  7     6     7     -1    5     5     e
  5     6     5     -1    4     5     ke

cacao
Edge  Start   End   Suf   First Last  String
  3     0     3     5     0     1     ca
  5     0     5     0     1     1     a
  7     0     7     -1    4     4     o
  1     3     1     -1    2     4     cao
  4     3     4     -1    4     4     o
  2     5     2     -1    2     4     cao
  6     5     6     -1    4     4     o

googol
Edge  Start   End   Suf   First Last  String
  5     0     5     3     0     1     go
  3     0     3     0     1     1     o
  8     0     8     -1    5     5     l
  1     5     1     -1    2     5     ogol
  6     5     6     -1    5     5     l
  4     3     4     -1    3     5     gol
  2     3     2     -1    2     5     ogol
  7     3     7     -1    5     5     l

abababc
Edge  Start   End   Suf   First Last  String
  9     0     9     0     1     1     b
  11    0     11    -1    6     6     c
  7     0     7     9     0     1     ab
  10    9     10    -1    6     6     c
  5     9     5     7     2     3     ab
  8     7     8     -1    6     6     c
  3     7     3     5     2     3     ab
  6     5     6     -1    6     6     c
  2     5     2     -1    4     6     abc
  4     3     4     -1    6     6     c
  1     3     1     -1    4     6     abc

Saturday, April 18, 2009

Suffix Trees: Java Ukkonen's Algorithm

Suffix Tree is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations. Details...

Ukkonen's algorithm begins with an implicit suffix tree containing the first character of the string. Then it steps through the string adding successive characters until the tree is complete. This order addition of characters gives Ukkonen's algorithm its "on-line" property; earlier algorithms proceeded backward from the last character. The implementation of this algorithm requires O(n) (linear time).

Check Suffix Tree Java Source Code
This is a Java-port of Mark Nelson's C++ implementation of Ukkonen's algorithm.

Test run and validation is moved to SuffixTreeTest JUnit Test Case.
Could be tested with Maven or from IDE.

mvn test
...
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running com.blogspot.illyakeeplearning.suffixtree.SuffixTreeTest
Start End   Suf   First Last  String
0     5     0     1     1     a
0     3     5     0     1     ca
0     7     -1    4     4     o
3     1     -1    2     4     cao
3     4     -1    4     4     o
5     2     -1    2     4     cao
5     6     -1    4     4     o
Suffix : acao
comparing: acao to acao
Suffix : ao
comparing: ao to ao
Suffix : cacao
comparing: cacao to cacao
Suffix : cao
comparing: cao to cao
Suffix : o
comparing: o to o
All Suffixes present!
Leaf count : 5 OK
Branch count : 7 OK
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.032 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

Future plans:
  • refactor to have less global variables
  • reuse java hash
References:

Sunday, March 29, 2009

Formal Grammars and Tools for Java

A grammar is formally defined as the ordered quad-tuple (N,Σ,P,S),
where:
  • a finite set N of nonterminal symbols;
  • a finite set Σ of terminal symbols that is disjoint from N;
  • a finite set P of production rules;
  • a distinguished symbol S from set N that is the start symbol.
Chomsky hierarchy of classes of formal grammars:
  • type 0 - unrestricted grammars - include all formal grammars;
  • type 1 - context-sensitive grammars - generate the context-sensitive languages;
  • type 2 - context-free grammars - generate the context-free languages. Context free languages are the theoretical basis for the syntax of most programming languages;
  • type 3 - regular grammars - generate the regular languages. Regular languages are commonly used to define search patterns and the lexical structure of programming languages.


In computer science, Extended Backus–Naur Form (EBNF) is a metasyntax notation used to express context-free grammars: that is, a formal way to describe computer programming languages and other formal languages. It is an extension of the basic Backus–Naur Form (BNF) metasyntax notation.

Compiler of computer programming languages consists of:
Types of parsers:
Sample:
  • grammar:
    S ::= Ax
    A ::= a
    A ::= b

  • input sequence:
    ax

  • Top-down parsing:
    S → Ax → ax

  • Bottom-up parsing:
    ax → Ax → S
Top-down parsers:
  • LL parser (Left-to-right, Leftmost derivation)
Bottom-up parsers:
  • LR parser (Left-to-right, Rightmost derivation)
  • SLR parser (Simple LR parser)
  • LALR parser (lookahead LR parser)
In good review of Lex and Yacc for Java next tools were selected:
I think that ANTLR is one of the best tools because:

Screens of ANTLR GUI: