How can you arrange the words of an input text lexicographically? A brute-force approach works, but is there a better way? Yes, there is. One such way is to use a Trie, or prefix tree, data structure. This document shows how to do it with working code.
Arranging the words of a text lexicographically trie
SOM-ITSOLUTIONS
ALGORITHM
Arranging the words of an Input Text Lexicographically - Trie
SOMENATH MUKHOPADHYAY
som-itsolutions
#A2 1/13 South Purbachal Hospital Road Kolkata 700078 Mob: +91 9748185282
Email: som.mukhopadhyay@som-itsolutions.com / som.mukhopadhyay@gmail.com
Website: http://www.som-itsolutions.com/
Blog: www.som-itsolutions.blogspot.com
As I recapitulate my knowledge of data structures and algorithms, I am
trying to solve various practical, real-life problems. The real aim is to prepare my students for
programming olympiads. While browsing through the Indian Association for Research in
Computing Science, I came across this problem:
“
Problem : Word List
In this problem the input will consist of a number of lines of English text consisting of the letters of
the English alphabet, the punctuation marks ' (apostrophe), . (full stop), , (comma), ; (semicolon),
: (colon) and white space characters (blank, newline).
Your task is to print the words in the text in lexicographic order (that is, dictionary order). Each word
should appear exactly once in your list. You can ignore the case (for instance, "The" and "the"
are to be treated as the same word.) There should be no uppercase letters in the output.
For example, consider the following candidate for the input text:
This is a sample piece of text to illustrate this
problem.
The corresponding output would read as:
a
illustrate
is
of
piece
problem
sample
text
this
to
”
So I decided to use a Trie, or prefix tree, to come up with a solution. A Trie is
a special kind of tree data structure that is very useful for dictionary-like
applications. It has built-in support for prefix matching, neighbor search, etc. The
Trie data structure is as shown below:
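As a rough illustration of that structure and of prefix matching, here is a minimal sketch. The class name PrefixTrieSketch and its methods contains and startsWith are hypothetical names chosen for this example, and the sketch assumes that words consist only of the lowercase letters 'a' to 'z'.

```java
// Hypothetical illustration class; not part of the solution in this document.
class PrefixTrieSketch {
    static final int ALPHABET_SIZE = 26;

    // Each node has one child slot per letter; a path from the root spells a prefix.
    static final class Node {
        final Node[] children = new Node[ALPHABET_SIZE];
        boolean isEndOfWord;
    }

    private final Node root = new Node();

    // Adds a word; assumes it contains only 'a'..'z'.
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.children[index] == null) {
                node.children[index] = new Node();
            }
            node = node.children[index];
        }
        node.isEndOfWord = true;
    }

    // Follows the characters of s from the root; returns null if the path is missing.
    private Node walk(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children[s.charAt(i) - 'a'];
        }
        return node;
    }

    // True only if the exact word was inserted.
    boolean contains(String word) {
        Node node = walk(word);
        return node != null && node.isEndOfWord;
    }

    // True if at least one inserted word begins with the given prefix.
    boolean startsWith(String prefix) {
        return walk(prefix) != null;
    }

    public static void main(String[] args) {
        PrefixTrieSketch trie = new PrefixTrieSketch();
        trie.insert("this");
        trie.insert("the");
        trie.insert("text");
        System.out.println(trie.startsWith("th")); // true
        System.out.println(trie.contains("th"));   // false
        System.out.println(trie.contains("the"));  // true
    }
}
```

Words that share a prefix, such as "this" and "the", share the nodes for "t" and "h", which is why a prefix query only needs to walk as many nodes as the prefix has letters.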
For the given problem, we first remove the punctuation from the input text and convert
it to lowercase. We then tokenize the string into individual words and insert these
words into a Trie. After that, a preorder traversal of the Trie, visiting the children of
each node in alphabetical order, gives the desired output.
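The first two steps, cleaning and tokenizing, can be sketched on their own as follows. The class WordSplitterSketch and its method words are hypothetical names used only for this illustration. The sketch assumes that punctuation is simply deleted, so an apostrophe inside a word (as in "don't") is dropped rather than splitting the word, and that any run of whitespace, including newlines, separates words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the solution in this document.
class WordSplitterSketch {

    // Returns the lowercase words of the text, in input order, duplicates included.
    static List<String> words(String text) {
        // Delete everything that is neither a letter nor whitespace, then lowercase.
        String cleaned = text.replaceAll("[^a-zA-Z\\s]", "").toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        // Split on runs of whitespace (blanks and newlines alike).
        for (String token : cleaned.split("\\s+")) {
            if (!token.isEmpty()) { // skip the empty token produced by leading whitespace
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "This is a sample piece of text to illustrate this\nproblem.";
        System.out.println(words(input));
        // prints [this, is, a, sample, piece, of, text, to, illustrate, this, problem]
    }
}
```

Duplicates such as "this" are still present at this stage; they disappear once the words are inserted into the Trie, because inserting the same word twice ends at the same node.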
The whole solution looks like the following.
Class Trie
package com.somitsolutions.java.training.wordsindictionaryform;
public class Trie {
    // Alphabet size (number of symbols): the lowercase letters 'a' to 'z'
    static final int ALPHABET_SIZE = 26;

    // Trie node
    static class TrieNode
    {
        TrieNode[] children = new TrieNode[ALPHABET_SIZE];
        // isEndOfWord is true if the node represents
        // the end of a word
        boolean isEndOfWord;
        // The complete word, stored at the node where the word ends; null otherwise
        String key;

        TrieNode(){
            key = null;
            isEndOfWord = false;
            for (int i = 0; i < ALPHABET_SIZE; i++)
                children[i] = null;
        }
    }

    static TrieNode root;

    // If not present, inserts key into the trie.
    // If the key is a prefix of an existing word,
    // just marks the node where it ends as the end of a word.
    // Assumes key contains only the lowercase letters 'a' to 'z'.
    static void insert(String key)
    {
        int level;
        int length = key.length();
        int index;
        TrieNode pCrawl = root;
        for (level = 0; level < length; level++)
        {
            index = key.charAt(level) - 'a';
            if (pCrawl.children[index] == null)
                pCrawl.children[index] = new TrieNode();
            pCrawl = pCrawl.children[index];
        }
        // Mark the last node as the end of a word
        pCrawl.isEndOfWord = true;
        // Store the key at the node where the word ends
        pCrawl.key = key;
    }

    // Prints the stored words in lexicographically sorted order.
    // To begin with, pass root as pCrawl.
    static void preorder(TrieNode pCrawl ){
        // Exit condition
        if(pCrawl == null){
            return;
        }
        // Visit the children in alphabetical order
        for(int index = 0; index < ALPHABET_SIZE; index++){
            if(pCrawl.children[index] != null){
                // Print a word before descending, so "a" comes before "an"
                if(pCrawl.children[index].key != null){
                    System.out.println(pCrawl.children[index].key);
                }
                preorder(pCrawl.children[index]);
            }
        }
    }
}
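For testing, it can be handy to collect the words in a list instead of printing them. The following is a minimal, self-contained variant under that assumption. The class SortedWordsSketch and its methods collect and sortedUnique are hypothetical names for this illustration, not part of the solution above, and the sketch again assumes words made only of the letters 'a' to 'z'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant for illustration; it returns the words instead of printing them.
class SortedWordsSketch {
    static final int ALPHABET_SIZE = 26;

    static final class Node {
        final Node[] children = new Node[ALPHABET_SIZE];
        String key; // the complete word if one ends here, otherwise null
    }

    // Inserts a word; assumes it contains only 'a'..'z'.
    static void insert(Node root, String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.children[index] == null) {
                node.children[index] = new Node();
            }
            node = node.children[index];
        }
        node.key = word;
    }

    // Preorder traversal: record the word at this node, then visit children 'a'..'z'.
    static void collect(Node node, List<String> out) {
        if (node == null) {
            return;
        }
        if (node.key != null) {
            out.add(node.key);
        }
        for (Node child : node.children) {
            collect(child, out);
        }
    }

    // Builds a trie from the words and returns them sorted, each exactly once.
    static List<String> sortedUnique(String... words) {
        Node root = new Node();
        for (String word : words) {
            insert(root, word);
        }
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortedUnique("this", "is", "a", "sample", "piece", "of",
                "text", "to", "illustrate", "this", "problem"));
        // prints [a, illustrate, is, of, piece, problem, sample, text, this, to]
    }
}
```

The printed list matches the expected output given in the problem statement, with the repeated word "this" appearing once.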
//Class Main
package com.somitsolutions.java.training.wordsindictionaryform;
import java.util.StringTokenizer;
public class Main {
    static String keys[] = new String[1000];

    public static void main(String[] args) {
        String inputStr = "This is a lamb. This is white in color.";
        // Keep only letters and whitespace, collapse every run of whitespace
        // (including newlines) into a single space, and convert to lowercase
        String refinedStr = inputStr.replaceAll("[^a-zA-Z\\s]", "")
                .replaceAll("\\s+", " ").toLowerCase();
        dictionary(refinedStr);
    }

    private static void dictionary(String in){
        StringTokenizer st = new StringTokenizer(in, " ");
        int i = 0;
        Trie.root = new Trie.TrieNode();
        while (st.hasMoreTokens()){
            String strTemp = st.nextToken();
            Trie.insert(strTemp);
            i++;