International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5



A Dynamic Filtering Algorithm to Search Approximate String

¹A. Dasaradha, ²P. K. Sahu
¹Dept. of CSE, Final M.Tech Student, AITAM Engineering College, India
²Dept. of CSE, Associate Professor, AITAM Engineering College, India


Abstract:
String data management in databases has recently gained a lot of interest in applications such as data cleaning, query relaxation, and spellchecking. In this paper we provide a solution for finding the strings in a given collection that are similar to a query string. The proposed solution has two phases. In the first phase, we present three algorithms, ScanCount, MergeSkip, and DivideSkip, for answering approximate string search queries. In the second phase, we study how to integrate various filtering techniques with the proposed merging algorithms. Several experiments have been conducted on various data sets to evaluate the performance of the proposed techniques.

Introduction:
String data management in databases has recently gained a lot of interest in text mining. In this paper we study the problem of finding the strings in a given collection that are similar to a query string, known as “approximate string search”. This problem arises in various applications such as data cleaning, query relaxation, and spellchecking.

Spell Checking: Given an input document, a spellchecker has to find all possibly mistyped words by searching the dictionary for words similar to them. For words that are not in the dictionary, it has to find matching candidates to recommend.

Data Cleaning: A data collection may contain various inconsistencies that have to be resolved before the data can be used for accurate data analysis. The process of detecting and correcting such inconsistencies is known as data cleaning. A common form of inconsistency arises when a real-world entity has more than one representation in the data collection; for example, the same address could be encoded using different strings in different records in the collection. Multiple representations arise for a variety of reasons, such as misspellings caused by typographic errors and different formatting conventions used by data sources.

These applications require high real-time performance for each query to be answered. Hence it is necessary to design algorithms for answering such queries as efficiently as possible. Many techniques have been designed, such as [1], [2], [3], [4], [5], [6], [7]. These methods assume a given similarity function to quantify the closeness between two strings. Various string-similarity functions have been studied, such as edit distance, cosine similarity, and the Jaccard coefficient. All these methods use the concept of a gram, which is a substring of a string used as a signature of the string. These algorithms rely on inverted lists of grams to find candidate strings, and utilize the fact that similar strings should share enough common grams. Many algorithms [9], [2], [15] mainly focus on “join queries”, i.e., finding similar pairs from two collections of strings. Approximate string search could be treated as a special case of join queries. It is well understood that the behavior of an algorithm for answering selection queries could be very different from that for answering join queries. We believe approximate string search is important enough to deserve a separate investigation.

In this paper the proposed solution has two phases. In the first phase, we propose three efficient algorithms for answering approximate string search queries, called ScanCount, MergeSkip, and DivideSkip. The ScanCount algorithm adopts the simple idea of scanning the inverted lists and counting candidate strings. Although it is very naive, when combined with various filtering techniques this algorithm can still achieve high performance. The MergeSkip algorithm exploits the value differences among the inverted lists and the threshold on the number of common grams of similar strings to skip many irrelevant candidates on the lists. The DivideSkip algorithm combines the MergeSkip algorithm with the idea of the MergeOpt algorithm proposed in [9], which divides the lists into two groups: one group for the long lists and the other for the remaining lists. We run the MergeSkip algorithm to merge the short lists with a different threshold, and use the long lists to verify the candidates. Our experiments on three real data sets showed that the proposed algorithms could significantly improve the performance of existing algorithms.
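The gram-based inverted lists mentioned in the introduction can be illustrated with a minimal Python sketch. The function names, the choice of unpadded overlapping grams, and the sample strings are assumptions made for illustration only; they are not taken from the implementation evaluated in this paper.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the overlapping q-grams of s (no padding; an assumption)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q):
    """Map each gram to the list of ids of the strings that contain it."""
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        for gram in set(qgrams(s, q)):  # record each id once per gram
            index[gram].append(string_id)
    # Ids are visited in increasing order, so every list is already sorted.
    return index


# Hypothetical collection; a string's id is its position in the list.
strings = ["stick", "stich", "such", "stuck"]
index = build_inverted_index(strings, q=2)
```

With this collection, the list for the gram "st" holds the ids of "stick", "stich", and "stuck", so a query containing "st" only needs to examine those three strings for that gram.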



Issn 2250-3005(online)                                          September| 2012                                Page 1215



In the second phase we study how to integrate various filtering techniques with the proposed merging algorithms. Various filters have been proposed to eliminate strings that cannot be similar enough to a given string. Surprisingly, our experiments and analysis show that a naive solution of adopting all available filtering techniques might not achieve the best performance when merging inverted lists. Intuitively, filters can segment inverted lists into relatively shorter lists, while merging algorithms need to merge these lists. In addition, the more filters we apply, the more groups of inverted lists we need to merge, and the more overhead we need to spend on processing these groups before merging their lists. Thus filters and merging algorithms need to be integrated judiciously by considering this tradeoff. Based on this analysis, we classify filters into two categories: single-signature filters and multi-signature filters. We propose a strategy to selectively choose proper filters to build an index structure and integrate them with merging algorithms. Experiments show that our strategy reduces the running time by as much as one to two orders of magnitude over approaches without filtering techniques or strategies that naively use all the filtering techniques.

The remainder of this paper is organized as follows. Section 2 discusses related work, Section 3 describes the proposed solution, Section 4 explains the experimental setup, and Section 5 concludes the paper.

Related Work:
Several existing algorithms assume an index of inverted lists for the grams of the strings in the collection S to answer approximate string queries on S. In the index, for each gram g of the strings in S, we have a list l_g of the ids of the strings that include this gram, possibly with the corresponding positional information of the gram in the strings [12] [13] [14].

Heap algorithm [11]: When merging the lists, maintain the frontiers of the lists as a heap. In each step, pop the top element from the heap and increment the count of the record id corresponding to the popped frontier record. Remove this record id from its list, and insert the next record id on the list into the heap. Report a record id whenever its count value is at least the threshold T. The time complexity of this algorithm is O(M log N) and its space complexity is O(N), where N is the number of lists and M is the total number of record ids on them.

MergeOpt algorithm [9]: It treats the T − 1 longest inverted lists of G(Q, q), the q-grams of the query Q, separately. For the remaining N − (T − 1) relatively short inverted lists, it uses the Heap algorithm to merge them with a lower threshold, i.e., 1. For each candidate string, it applies binary search on each of the T − 1 long lists to verify whether the string appears at least T times among all the lists. This algorithm is based on the observation that a record in the answer must appear on at least one of the short lists. This algorithm is more efficient than the Heap algorithm.

Proposed Solution:
Here we present our three proposed merging algorithms.

ScanCount:
In this algorithm [8], we maintain an array of counts for all the string ids in the collection S. We then scan the inverted lists. For each string id on each list, we increment the count corresponding to the string by 1. We report the string ids that appear at least T times on the lists. The time complexity of the algorithm is O(M), compared with O(M log N) for the Heap algorithm, and the space complexity is O(|S|), where |S| is the size of the string collection. The ScanCount algorithm improves on the Heap algorithm by eliminating the heap data structure and the corresponding operations on the heap. The algorithm is formally described in Figure 1.

Figure 1: Flowchart for ScanCount Algorithm

MergeSkip:
The main principle of this algorithm is to skip on the lists those record ids that cannot be in the answer to the query, by utilizing the threshold T. Similar to the Heap algorithm, we also maintain a heap for the frontiers of these lists. The key difference is that, in each iteration, we pop from the heap those records that have the same value as the top record t on the heap, as described in Figure 2.
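The ScanCount procedure described above is simple enough to sketch directly. The following Python fragment is only an illustration under stated assumptions: string ids are taken to be the integers 0 to |S| − 1, and the function name and sample lists are hypothetical.

```python
def scan_count(inverted_lists, num_strings, threshold):
    """Return the ids that occur on at least `threshold` of the lists."""
    counts = [0] * num_strings            # one counter per string id
    for id_list in inverted_lists:        # every list is scanned exactly once
        for string_id in id_list:
            counts[string_id] += 1
    return [sid for sid, c in enumerate(counts) if c >= threshold]


# Hypothetical inverted lists for the grams of one query (ids are made up).
lists = [[0, 1, 3], [0, 1], [1, 2], [0, 3]]
result = scan_count(lists, num_strings=4, threshold=3)
```

Here ids 0 and 1 each occur on three lists, so they are the only ids reported for T = 3. No heap is involved; the lists do not even need to be sorted for this variant.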
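The skipping idea behind MergeSkip can be sketched as follows. The text above states only that all frontier records equal to the top record are popped together; the remaining steps in this sketch (popping further frontiers until T − 1 lists are out of the heap, then jumping each of them by binary search to the first record not smaller than the new top) are one possible way to realize the skipping and should be read as assumptions, as should the function names.

```python
import heapq
from bisect import bisect_left


def merge_skip(inverted_lists, threshold):
    """Return ids appearing on at least `threshold` of the sorted lists."""
    positions = [0] * len(inverted_lists)      # frontier index of each list
    heap = [(lst[0], i) for i, lst in enumerate(inverted_lists) if lst]
    heapq.heapify(heap)
    result = []

    def advance(i, start):
        """Move list i's frontier to `start`; push it unless exhausted."""
        positions[i] = start
        if start < len(inverted_lists[i]):
            heapq.heappush(heap, (inverted_lists[i][start], i))

    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:      # pop all frontiers equal to top
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= threshold:
            result.append(top)                 # top occurs often enough
            for i in popped:
                advance(i, positions[i] + 1)
            continue
        # Too few occurrences: pop more frontiers until threshold - 1 are out.
        while heap and len(popped) < threshold - 1:
            popped.append(heapq.heappop(heap)[1])
        if not heap:
            break                              # fewer than `threshold` lists left
        new_top = heap[0][0]
        for i in popped:                       # skip every id below new_top
            lst = inverted_lists[i]
            advance(i, bisect_left(lst, new_top, positions[i]))
    return result


lists = [[1, 5, 9], [2, 5, 9, 12], [5, 7, 9], [9, 12, 20]]
answer = merge_skip(lists, threshold=3)
```

The skip is safe because any id smaller than the new top can occur only on the T − 1 popped lists, which is one occurrence short of the threshold.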



Figure 2: Flowchart for MergeSkip Algorithm

DivideSkip:
The key idea of the DivideSkip algorithm is to combine the MergeSkip and MergeOpt algorithms. Both algorithms try to skip irrelevant records on the lists, but using different intuitions. MergeSkip exploits the value differences among the records on the lists, while MergeOpt exploits the size differences among the lists. The DivideSkip algorithm uses both differences to improve the search performance.

Figure 3: Flowchart for DivideSkip Algorithm

Experimental Setup:
The performance of five merging algorithms, Heap, MergeOpt, ScanCount, MergeSkip, and DivideSkip, was evaluated using the DBLP data set.

DBLP dataset: It includes paper titles downloaded from the DBLP Bibliography site. The raw data was in XML format, and we extracted 274,788 paper titles with a total size of 17.8 MB. The average size of the gram inverted lists for a query was about 67, and the total number of distinct grams was 59,940.

The gram length q was 4 for the data set. All the algorithms were implemented using GNU C++ and run on a system with 2 GB of main memory, a 2.13 GHz dual-core CPU, and the Ubuntu operating system.

Figure 4: Average query time versus data set size.

Figure 5: Number of string ids visited by the algorithms.

Classification Of Filters:
A filter generates a set of signatures for a string, such that similar strings share similar signatures, and these signatures can be used easily to build an index structure. Filters are classified into two categories. Single-signature filters generate a single signature (typically an integer or a hash code) for a string, and multi-signature filters generate multiple signatures for a string.
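To make the two categories concrete, the sketch below uses the string length as an example of a single signature and positional q-grams as an example of multiple signatures; these are the signatures of the length and position filters described next. The function names, the sample strings, and the grouping helper are assumptions introduced only for illustration.

```python
from collections import defaultdict


def length_signature(s):
    """Single-signature example: one integer per string."""
    return len(s)


def positional_gram_signatures(s, q=2):
    """Multi-signature example: one (position, gram) pair per q-gram."""
    return {(i, s[i:i + q]) for i in range(len(s) - q + 1)}


def build_groups(strings, signatures_of):
    """Group string ids by signature; `signatures_of` returns an iterable."""
    groups = defaultdict(list)
    for string_id, s in enumerate(strings):
        for sig in signatures_of(s):
            groups[sig].append(string_id)
    return dict(groups)


strings = ["stick", "stich", "such"]
by_length = build_groups(strings, lambda s: [length_signature(s)])
by_gram = build_groups(strings, positional_gram_signatures)
```

With a single signature every string id is stored exactly once (three entries in `by_length`), whereas with positional grams each string is stored once per gram (eleven entries in `by_gram`). This difference in index size is the practical consequence of the classification.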



Length Filtering: If two strings s1 and s2 are within edit distance k, the difference between their lengths cannot exceed k. Thus, given a query string s1, we only need to consider strings s2 in the data collection such that the difference between |s1| and |s2| is not greater than k. This is a single-signature filter, as it generates a single signature for a string.

Position Filtering: If two strings s1 and s2 are within edit distance k, then a q-gram in s1 cannot correspond to a q-gram in the other string that differs by more than k positions. Thus, given a positional gram (i1, g1) in the query string, we only need to consider the other corresponding gram (i2, g2) in the data set such that |i1 − i2| ≤ k. This is a multi-signature filter because it produces a set of positional grams as signatures for a string.

Prefix Filtering [10]: Given two q-gram sets G(s1) and G(s2) for strings s1 and s2, we can fix an ordering O of the universe from which all set elements are drawn. Let p(n, s) denote the n-th prefix element in G(s) as per the ordering O. For simplicity, p(1, s) is abbreviated as p_s. An important property is that, if |G(s1) ∩ G(s2)| ≥ T, then p_s2 ≤ p(n, s1), where n = |s1| − T + 1.

Applying Filters Before Merging Lists:
All the existing filters can be grouped to improve the search performance of merging algorithms. One way to group them is to build a tree structure called a FilterTree, in which each level corresponds to a filter, as shown in Figure 6.

Figure 6: A FilterTree

It is very important to decide which filters should be used on which levels. To improve performance, use single-signature filters, such as the length filter and the prefix filter, at level 1 (close to the root), because each string in the data set will then be inserted into a single path instead of appearing in multiple paths. During a search, for these filters we only need to traverse those paths on which the candidate strings can appear. From level 2, we can add the multi-signature filters, such as the gram filter and the position filter. Figures 7 and 8 show the improved performance of the algorithms using filters on the DBLP data set.

Figure 7: DBLP data set for Merge

Figure 8: DBLP data set for Total

Conclusion:
In this paper we have proposed a solution for efficiently finding the strings in a collection that are similar to a given string. We designed the solution in two phases. In the first phase, we presented three algorithms, ScanCount, MergeSkip, and DivideSkip, for answering approximate string search queries. In the second phase, we studied how to integrate various filtering techniques with the proposed merging algorithms. Several experiments have been conducted on various data sets to evaluate the performance of the proposed techniques.




References:
[1]     A. Arasu, V. Ganti, and R. Kaushik, “Efficient Exact Set-Similarity Joins,” in VLDB, 2006, pp. 918–929.
[2]     R. Bayardo, Y. Ma, and R. Srikant, “Scaling up all-pairs similarity search,” in WWW Conference, 2007.
[3]     S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning,” in SIGMOD, 2003, pp. 313–324.
[4]     L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, “Approximate string joins in a database (almost) for free,” in VLDB, 2001, pp. 491–500.
[5]     C. Li, B. Wang, and X. Yang, “VGRAM: Improving performance of approximate queries on string collections using variable-length grams,” in VLDB, 2007.
[6]     E. Sutinen and J. Tarhio, “On Using q-Grams Locations in Approximate String Matching,” in ESA, 1995, pp. 327–340.
[7]     E. Ukkonen, “Approximate String Matching with q-Grams and Maximal Matching,” Theor. Comput. Sci., vol. 1, pp. 191–211, 1992.
[8]     V. Levenshtein, “Binary Codes Capable of Correcting Spurious Insertions and Deletions of Ones,” Probl. Inf. Transmission, vol. 1, pp. 8–17, 1965.
[9]     S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicate,” in ACM SIGMOD, 2004.
[10]    S. Chaudhuri, V. Ganti, and R. Kaushik, “A primitive operator for similarity joins in data cleaning,” in ICDE, 2006, pp. 5–16.
[11]    G. Navarro, “A guided tour to approximate string matching,” ACM Computing Surveys, vol. 33, no. 1, pp. 31–88, 2001.
[12]    K. Ramasamy, J. M. Patel, R. Kaushik, and J. F. Naughton, “Set containment joins: The good, the bad and the ugly,” in VLDB, 2000.
[13]    N. Koudas, S. Sarawagi, and D. Srivastava, “Record linkage: similarity measures and algorithms,” in SIGMOD Tutorial, 2005, pp. 802–803.
[14]    M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee, “n-Gram/2L: A space and time efficient two-level n-gram inverted index structure,” in VLDB, 2005, pp. 325–336.
[15]    C. Li, J. Lu, and Y. Lu, “Efficient Merging and Filtering Algorithms for Approximate String Searches,” in ICDE, 2008.


The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

IJCER (www.ijceronline.com) International Journal of computational Engineering research

International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5

A Dynamic Filtering Algorithm to Search Approximate String
1 A.Dasaradha, 2 P.K.Sahu
1 Dept. of CSE, Final M.Tech Student, AITAM Engineering College, INDIA
2 Dept. of CSE, Associative Professor, AITAM Engineering College, INDIA

Abstract:
String data management in databases has recently attracted considerable interest in applications such as data cleaning, query relaxation and spellchecking. In this paper we address the problem of finding, in a given collection of strings, those that are similar to a query string. The proposed solution has two phases. In the first phase we present three algorithms, ScanCount, MergeSkip, and DivideSkip, for answering approximate string search queries. In the second phase we study how to integrate various filtering techniques with the proposed merging algorithms. Several experiments have been conducted on various data sets to measure the performance of the proposed techniques.

Introduction:
String data management in databases has recently attracted considerable interest in text mining. In this paper we study the problem of "approximate string search": finding, in a given collection of strings, those that are similar to a query string. This problem arises in various applications such as data cleaning, query relaxation, and spellchecking.

Spell Checking: Given an input document, a spellchecker has to find all possibly mistyped words by searching the dictionary for words similar to them. For each word that is not in the dictionary, it has to find matching candidates to recommend.

Data Cleaning: A data collection contains various inconsistencies that have to be resolved before the data can be used for accurate analysis. The process of detecting and correcting such inconsistencies is known as data cleaning. A common form of inconsistency arises when a real-world entity has more than one representation in the data collection; for example, the same address could be encoded using different strings in different records in the collection. Multiple representations arise for a variety of reasons, such as misspellings caused by typographic errors and different formatting conventions used by data sources.

These applications require high real-time performance for each query. Hence it is necessary to design algorithms that answer such queries as efficiently as possible. Many techniques have been designed, such as [1], [2], [3], [4], [5], [6], [7]. These methods assume a given similarity function to quantify the closeness between two strings. Various string-similarity functions have been studied, such as edit distance, cosine similarity and the Jaccard coefficient. All these methods use the concept of a gram, a substring of a string that serves as a signature of the string. These algorithms rely on inverted lists of grams to find candidate strings, and use the fact that similar strings should share enough common grams. Many algorithms [9], [2], [15] have mainly focused on "join queries", i.e., finding similar pairs from two collections of strings. Approximate string search could be treated as a special case of join queries. It is well understood, however, that the behavior of an algorithm for answering selection queries can be very different from that for answering join queries. We believe approximate string search is important enough to deserve a separate investigation.

The solution proposed in this paper has two phases. In the first phase, we propose three efficient algorithms for answering approximate string search queries, called ScanCount, MergeSkip, and DivideSkip. The ScanCount algorithm adopts the simple idea of scanning the inverted lists and counting candidate strings. Although it is very naive, this algorithm can still achieve high performance when combined with various filtering techniques. The MergeSkip algorithm exploits the value differences among the inverted lists and the threshold on the number of common grams of similar strings to skip many irrelevant candidates on the lists. The DivideSkip algorithm combines the MergeSkip algorithm with the idea in the MergeOpt algorithm proposed in [9], which divides the lists into two groups: one for the long lists and the other for the remaining lists. We run the MergeSkip algorithm to merge the short lists with a different threshold, and use the long lists to verify the candidates. Our experiments on three real data sets showed that the proposed algorithms could significantly improve on the performance of existing algorithms.

Issn 2250-3005(online) September| 2012 Page 1215
In the second phase we study how to integrate various filtering techniques with the proposed merging algorithms. Various filters have been proposed to eliminate strings that cannot be similar enough to a given string. Surprisingly, our experiments and analysis show that the naive solution of adopting all available filtering techniques might not achieve the best performance when merging inverted lists. Intuitively, filters segment inverted lists into relatively shorter lists, while merging algorithms need to merge these lists. In addition, the more filters we apply, the more groups of inverted lists we need to merge, and the more overhead we spend processing these groups before merging their lists. Thus filters and merging algorithms need to be integrated judiciously by considering this tradeoff. Based on this analysis, we classify filters into two categories: single-signature filters and multi-signature filters. We propose a strategy to selectively choose proper filters to build an index structure and integrate them with merging algorithms. Experiments show that our strategy reduces the running time by as much as one to two orders of magnitude over approaches without filtering techniques or strategies that naively use all the filtering techniques.

The remainder of this paper is organized as follows. Section 2 discusses related work, Section 3 describes the proposed solution, Section 4 explains the experimental setup, and Section 5 concludes the paper.

Related Work:
Several existing algorithms assume an index of inverted lists for the grams of the strings in the collection S to answer approximate string queries on S. In the index, for each gram g of the strings in S, we have a list l_g of the ids of the strings that include this gram, possibly with the corresponding positional information of the gram in the strings [12] [13] [14].

Heap algorithm [11]: When merging the lists, maintain the frontiers of the lists as a heap. In each step, pop the top element from the heap and increment the count of the record id corresponding to the popped frontier record. Remove this record id from its list, and insert the next record id on that list into the heap. Report a record id whenever its count is at least the threshold T. The time complexity of this algorithm is O(M log N) and its space complexity is O(N), where N is the number of inverted lists being merged and M is the total number of ids on them.

MergeOpt algorithm [10]: It treats the T − 1 longest inverted lists of G(Q, q) separately. For the remaining N − (T − 1) relatively short inverted lists, use the Heap algorithm to merge them with a lower threshold, i.e., 1. For each candidate string, apply binary search on each of the T − 1 long lists to verify whether the string appears at least T times among all the lists. This algorithm is based on the observation that a record in the answer must appear on at least one of the short lists. It is more efficient than the Heap algorithm.

Proposed Solution:
Here we present our three proposed merging algorithms.

ScanCount:
In this algorithm [8], we maintain an array of counts for all the string ids in the collection S. We then scan the inverted lists. For each string id on each list, we increment the count corresponding to that string by 1. We report the string ids that appear at least T times on the lists. The time complexity of the algorithm is O(M), compared with O(M log N) for the Heap algorithm, and the space complexity is O(|S|), where |S| is the size of the string collection. The ScanCount algorithm improves on the Heap algorithm by eliminating the heap data structure and the corresponding operations on the heap. The algorithm is formally described in Figure 1.

Figure 1: Flowchart for ScanCount Algorithm

MergeSkip:
The main principle of this algorithm is to skip on the lists those record ids that cannot be in the answer to the query, by utilizing the threshold T. As in the Heap algorithm, we maintain a heap for the frontiers of these lists. The key difference is that, in each iteration, we pop from the heap those records that have the same value as the top record t on the heap, as described in Figure 2.
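The ScanCount description above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the helper names `grams`, `build_index` and `scan_count` are hypothetical, the toy gram length q = 2 is chosen only to keep the example small, and counting each distinct gram once per string is a simplifying assumption.

```python
from collections import defaultdict


def grams(s, q=2):
    """Return the overlapping q-grams of s (no padding; an assumption)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Build the inverted lists: gram -> sorted list of string ids."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # Each distinct gram contributes the id once (simplification).
        for g in set(grams(s, q)):
            index[g].append(sid)
    return index


def scan_count(index, num_strings, query, threshold, q=2):
    """Report ids that appear on at least `threshold` of the query's lists."""
    counts = [0] * num_strings          # one counter per string id
    for g in set(grams(query, q)):
        for sid in index.get(g, ()):    # scan the inverted list of gram g
            counts[sid] += 1
    return [sid for sid, c in enumerate(counts) if c >= threshold]
```

With the collection ["merge", "marge", "skip", "merger"], the query "merge" and T = 3, this sketch returns the ids [0, 3]; lowering T to 2 also admits "marge" (id 1), which shares the grams "rg" and "ge" with the query.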
Figure 2: Flowchart for MergeSkip Algorithm

DivideSkip:
The key idea of the DivideSkip algorithm is to combine the MergeSkip and MergeOpt algorithms. Both algorithms try to skip irrelevant records on the lists, but they rely on different intuitions. MergeSkip exploits the value differences among the records on the lists, while MergeOpt exploits the size differences among the lists. The DivideSkip algorithm uses both differences to improve the search performance.

Figure 3: Flowchart for DivideSkip Algorithm

Experimental Setup:
The performance of five merging algorithms, Heap, MergeOpt, ScanCount, MergeSkip, and DivideSkip, has been evaluated using the DBLP dataset.

DBLP dataset: It includes paper titles downloaded from the DBLP Bibliography site. The raw data was in XML format, and we extracted 274,788 paper titles with a total size of 17.8MB. The average size of the gram inverted lists for a query was about 67, and the total number of distinct grams was 59,940. The gram length q was 4 for the data sets.

All the algorithms were implemented using GNU C++ and run on a system with 2GB of main memory, a 2.13GHz Dual Core CPU and the Ubuntu operating system.

Figure 4: Average query time versus data set size.

Figure 5: Number of string ids visited by the algorithms.

Classification Of Filters:
A filter generates a set of signatures for a string, such that similar strings share similar signatures, and these signatures can be used easily to build an index structure. Filters are classified into two categories. Single-signature filters generate a single signature (typically an integer or a hash code) for a string, and multi-signature filters generate multiple signatures for a string.
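The MergeSkip and DivideSkip descriptions can be sketched together in Python as follows. This is a hedged illustration under stated assumptions rather than the implementation evaluated in the paper: every inverted list is assumed to be sorted and free of duplicate ids, the function names are hypothetical, and the number of long lists is passed in as a hypothetical parameter `num_long` (it must be smaller than T) because the text does not say how it is chosen.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, threshold):
    """Return (id, count) for ids on at least `threshold` sorted lists."""
    pos = [0] * len(lists)                      # frontier of each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:       # pop all frontiers equal to top
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= threshold:
            result.append((top, len(popped)))
            for i in popped:                    # advance these lists by one id
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
            continue
        # Too few occurrences: pop more frontiers until threshold - 1
        # lists are popped, then skip them forward to the new heap top.
        for _ in range(threshold - 1 - len(popped)):
            if not heap:
                break
            popped.append(heapq.heappop(heap)[1])
        if not heap:
            break                               # fewer than `threshold` lists left
        target = heap[0][0]
        for i in popped:                        # binary search for first id >= target
            pos[i] = bisect_left(lists[i], target, pos[i])
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


def divide_skip(lists, threshold, num_long):
    """Merge the short lists with MergeSkip, then verify on the long lists."""
    if not 0 <= num_long < threshold:
        raise ValueError("num_long must satisfy 0 <= num_long < threshold")
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:num_long], by_len[num_long:]
    result = []
    for rid, count in merge_skip(short_lists, threshold - num_long):
        for lst in long_lists:                  # binary search on each long list
            j = bisect_left(lst, rid)
            if j < len(lst) and lst[j] == rid:
                count += 1
        if count >= threshold:
            result.append(rid)
    return result
```

An id that occurs on at least T lists must occur on at least T − `num_long` of the short lists, which is why the short lists can be merged with the reduced threshold before the long lists are probed.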
Length Filtering: If two strings s1 and s2 are within edit distance k, the difference between their lengths cannot exceed k. Thus, given a query string s1, we only need to consider strings s2 in the data collection such that the difference between |s1| and |s2| is not greater than k. This is a single-signature filter, as it generates a single signature for a string.

Position Filtering: If two strings s1 and s2 are within edit distance k, then a q-gram in s1 cannot correspond to a q-gram in the other string that differs by more than k positions. Thus, given a positional gram (i1, g1) in the query string, we only need to consider the other corresponding gram (i2, g2) in the data set such that |i1 − i2| ≤ k. This is a multi-signature filter because it produces a set of positional grams as signatures for a string.

Prefix Filtering [10]: Given two q-gram sets G(s1) and G(s2) for strings s1 and s2, we can fix an ordering O of the universe from which all set elements are drawn. Let p(n, s) denote the n-th prefix element in G(s) as per the ordering O. For simplicity, p(1, s) is abbreviated as p_s. An important property is that, if |G(s1) ∩ G(s2)| ≥ T, then p_s2 ≤ p(n, s1), where n = |s1| − T + 1.

Applying Filters Before Merging Lists:
All the existing filters can be grouped to improve the search performance of merging algorithms. One way to group them is to build a tree structure called a FilterTree, in which each level corresponds to a filter, as described in Figure 6.

Figure 6: A FilterTree

It is very important to decide which filters should be used on which levels. To improve the performance, use single-signature filters, such as the length filter and the prefix filter, at level 1 (close to the root), because each string in the data set will then be inserted into a single path instead of appearing in multiple paths. During a search, for these filters we only need to traverse those paths on which the candidate strings can appear. From level 2, we can add multi-signature filters, such as the gram filter and the position filter. Figures 7 and 8 show the improved performance of the algorithm using filters on the DBLP dataset.

Figure 7: DBLP data set for Merge

Figure 8: DBLP data set for Total

Conclusion:
In this paper we have proposed a solution for efficiently finding, in a collection of strings, those that are similar to a given string. We designed the solution in two phases. In the first phase we presented three algorithms, ScanCount, MergeSkip and DivideSkip, for answering approximate string search queries. In the second phase, we studied how to integrate various filtering techniques with the proposed merging algorithms. Several experiments have been conducted on various data sets to measure the performance of the proposed techniques.
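The length filter and the position filter described earlier can be sketched in a few lines of Python. This is an illustrative assumption about how a length-based first level of a FilterTree might be organized, not the authors' index structure; the names `build_length_index`, `length_filter` and `position_filter` are hypothetical.

```python
from collections import defaultdict


def build_length_index(strings):
    """Group string ids by string length (a single signature per string)."""
    buckets = defaultdict(list)
    for sid, s in enumerate(strings):
        buckets[len(s)].append(sid)
    return buckets


def length_filter(buckets, query, k):
    """Return ids whose length differs from the query's by at most k."""
    candidates = []
    low = max(0, len(query) - k)
    for length in range(low, len(query) + k + 1):
        candidates.extend(buckets.get(length, ()))
    return sorted(candidates)


def position_filter(i1, i2, k):
    """Keep a pair of positional grams only if their positions are within k."""
    return abs(i1 - i2) <= k
```

Because every string falls into exactly one length bucket, a search only has to visit the buckets for lengths between |s1| − k and |s1| + k, which matches the single-path argument made for single-signature filters at level 1.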
References:
[1] A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-Similarity Joins," in VLDB, 2006, pp. 918–929.
[2] R. Bayardo, Y. Ma, and R. Srikant, "Scaling up all-pairs similarity search," in WWW Conference, 2007.
[3] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," in SIGMOD, 2003, pp. 313–324.
[4] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate string joins in a database (almost) for free," in VLDB, 2001, pp. 491–500.
[5] C. Li, B. Wang, and X. Yang, "VGRAM: Improving performance of approximate queries on string collections using variable-length grams," in Very Large Data Bases, 2007.
[6] E. Sutinen and J. Tarhio, "On Using q-Grams Locations in Approximate String Matching," in ESA, 1995, pp. 327–340.
[7] E. Ukkonen, "Approximate String Matching with q-Grams and Maximal Matching," Theor. Comput. Sci., vol. 1, pp. 191–211, 1992.
[8] V. Levenshtein, "Binary Codes Capable of Correcting Spurious Insertions and Deletions of Ones," Probl. Inf. Transmission, vol. 1, pp. 8–17, 1965.
[9] S. Sarawagi and A. Kirpal, "Efficient set joins on similarity predicate," in ACM SIGMOD, 2004.
[10] S. Chaudhuri, V. Ganti, and R. Kaushik, "A primitive operator for similarity joins in data cleaning," in ICDE, 2006, pp. 5–16.
[11] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys, vol. 33, no. 1, pp. 31–88, 2001.
[12] K. Ramasamy, J. M. Patel, R. Kaushik, and J. F. Naughton, "Set containment joins: The good, the bad and the ugly," in VLDB, 2000.
[13] N. Koudas, S. Sarawagi, and D. Srivastava, "Record linkage: similarity measures and algorithms," in SIGMOD Tutorial, 2005, pp. 802–803.
[14] M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee, "n-Gram/2L: A space and time efficient two-level n-gram inverted index structure," in VLDB, 2005, pp. 325–336.
[15] C. Li, J. Lu, and Y. Lu, "Efficient Merging and Filtering Algorithms for Approximate String Searches," in ICDE, 2008.