Mining sequential patterns matching over high utility data sets
Base Paper Title:
Record Matching over Query Results
From Multiple Web Databases
Modified Title:
Mining sequential patterns matching over high utility data sets.
Abstract:
Record matching, which identifies the records that represent the same
real-world entity, is an important step for data integration. Most state-of-
the-art record matching methods are supervised, which requires the user to
provide training data. These methods are not applicable for the Web
database scenario, where the records to match are query results
dynamically generated on the fly. Such records are query-dependent, and a
prelearned method using training examples from previous query results
may fail on the results of a new query. To address the problem of record
matching in the Web database scenario, we present an unsupervised, online
record matching method, UDD, which, for a given query, can effectively
identify duplicates from the query result records of multiple Web
databases. After removal of the same-source duplicates, the “presumed”
nonduplicate records from the same source can be used as training
examples, alleviating the burden of users having to manually label training
examples. Starting from the nonduplicate set, we use two cooperating
classifiers, a weighted component similarity summing classifier and an SVM
classifier, to iteratively identify duplicates in the query results from
multiple Web databases. Experimental results show that UDD works well
for the Web database scenario where existing supervised methods do not
apply.
Ambit lick Solutions
Mail Id : Ambitlick@gmail.com , Ambitlicksolutions@gmail.Com
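The weighted component similarity summing classifier named in the abstract can be illustrated with a minimal Java sketch. It scores a record pair as a weighted sum of per-field similarities and flags the pair as a duplicate when the score reaches a threshold. Everything concrete here is an assumption made for illustration and is not taken from the base paper: the class and method names, the token-level Jaccard measure used for each field, the fixed weights (UDD adjusts its weights iteratively), and the threshold value.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: names, field measure, and threshold are assumptions.
final class WeightedSimilaritySketch {

    /** Jaccard similarity over lower-cased whitespace tokens (an assumed field measure). */
    static double fieldSimilarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two empty fields are treated as identical
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    private static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>();
        for (String t : s.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Weighted sum of per-field similarities; the weights are expected to sum to 1. */
    static double recordSimilarity(String[] r1, String[] r2, double[] weights) {
        if (r1.length != r2.length || r1.length != weights.length) {
            throw new IllegalArgumentException("records and weights must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < r1.length; i++) {
            sum += weights[i] * fieldSimilarity(r1[i], r2[i]);
        }
        return sum;
    }

    /** A pair is classified as duplicate when its similarity reaches the threshold. */
    static boolean isDuplicate(String[] r1, String[] r2, double[] weights, double threshold) {
        return recordSimilarity(r1, r2, weights) >= threshold;
    }
}
```

For example, with fields (title, author) and assumed weights 0.7 and 0.3, the records ("Data Mining Concepts", "Jiawei Han") and ("data mining concepts", "J. Han") score about 0.8, so a threshold of 0.75 would mark them as duplicates.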
Existing System:
• Relational database systems
• All Web databases (unknown users can easily damage the
database)
Proposed System:
• False-data detection, which can reveal the actions of unauthorized
users attempting to access computer systems or of authorized users
attempting to misuse their privileges.
• Association rule mining
• An algorithm based on sequential pattern mining using the same data
collected by the databases.
Our Proposed Work apart from Base Paper:
Sequential pattern mining
a. Apriori-like methods (GSP)
b. Pattern-growth methods (FreeSpan, PrefixSpan)
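As a rough illustration of the pattern-growth family listed above, the following Java sketch mines frequent sequential patterns in the PrefixSpan style: it extends a prefix by one frequent item at a time and recurses on the projected suffixes. It is a simplified sketch under stated assumptions, not the project's implementation. Each sequence element is taken to be a single item rather than an item set, support is an absolute count, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative PrefixSpan-style sketch; single-item elements are a simplifying assumption.
final class PrefixSpanSketch {

    /** Maps each frequent sequential pattern to the number of sequences that contain it. */
    static Map<List<String>, Integer> mine(List<List<String>> sequences, int minSupport) {
        Map<List<String>, Integer> result = new LinkedHashMap<>();
        // A projected entry is {sequence index, start position of the remaining suffix}.
        List<int[]> projected = new ArrayList<>();
        for (int i = 0; i < sequences.size(); i++) {
            projected.add(new int[] {i, 0});
        }
        grow(sequences, new ArrayList<String>(), projected, minSupport, result);
        return result;
    }

    private static void grow(List<List<String>> db, List<String> prefix, List<int[]> projected,
                             int minSupport, Map<List<String>, Integer> result) {
        // For every item, collect the suffix that follows its first occurrence
        // in each projected sequence (at most one entry per sequence).
        Map<String, List<int[]>> extensions = new TreeMap<>();
        for (int[] p : projected) {
            List<String> seq = db.get(p[0]);
            Set<String> seen = new HashSet<>();
            for (int pos = p[1]; pos < seq.size(); pos++) {
                String item = seq.get(pos);
                if (seen.add(item)) {
                    List<int[]> list = extensions.get(item);
                    if (list == null) {
                        list = new ArrayList<>();
                        extensions.put(item, list);
                    }
                    list.add(new int[] {p[0], pos + 1});
                }
            }
        }
        for (Map.Entry<String, List<int[]>> e : extensions.entrySet()) {
            int support = e.getValue().size();
            if (support < minSupport) {
                continue; // infrequent extension: prune this branch
            }
            List<String> pattern = new ArrayList<>(prefix);
            pattern.add(e.getKey());
            result.put(pattern, support);
            grow(db, pattern, e.getValue(), minSupport, result);
        }
    }
}
```

Calling `PrefixSpanSketch.mine(sequences, 2)` on a small list of item sequences returns every pattern that appears, in order, in at least two of them. An Apriori-like method such as GSP would instead generate and count candidate sequences level by level.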
Hardware Specification
Processor Type : Pentium III
Speed : 1.6 GHz
RAM : 128 MB
Hard disk : 8 GB
Software Specification
Operating System : Linux / Windows
Programming Package : Java
Tools : Eclipse, Weka data mining tools
Database : MySQL
SDK : JDK 1.5.0
Algorithm:
• Association rule mining
o Find large item sets for a given minsup, and
o Compute rules for a given minconf based on the item sets
obtained before.
• Sequential pattern mining
• UDD algorithm
• Component weight assignment algorithm
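The two association rule mining steps listed above (find large item sets for a given minsup, then compute rules for a given minconf) can be sketched in Java as follows. This is a minimal, assumption-laden illustration rather than the project's code. It uses a level-wise, Apriori-style search without the subset-pruning optimization, expresses minsup as an absolute transaction count, and emits only rules with a single-item consequent. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; minsup is an absolute count and consequents are single items.
final class AssociationRuleSketch {

    /** Step 1: level-wise search for every item set with support count >= minSupportCount. */
    static Map<Set<String>, Integer> frequentItemSets(List<Set<String>> transactions,
                                                      int minSupportCount) {
        Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
        Set<Set<String>> candidates = new LinkedHashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                candidates.add(new TreeSet<>(Collections.singleton(item)));
            }
        }
        while (!candidates.isEmpty()) {
            Map<Set<String>, Integer> level = new LinkedHashMap<>();
            for (Set<String> c : candidates) {
                int count = 0;
                for (Set<String> t : transactions) {
                    if (t.containsAll(c)) {
                        count++;
                    }
                }
                if (count >= minSupportCount) {
                    level.put(c, count);
                }
            }
            frequent.putAll(level);
            // Join frequent k-item sets that differ in one item into (k+1)-item candidates.
            candidates = new LinkedHashSet<>();
            List<Set<String>> keys = new ArrayList<>(level.keySet());
            for (int i = 0; i < keys.size(); i++) {
                for (int j = i + 1; j < keys.size(); j++) {
                    Set<String> union = new TreeSet<>(keys.get(i));
                    union.addAll(keys.get(j));
                    if (union.size() == keys.get(i).size() + 1) {
                        candidates.add(union);
                    }
                }
            }
        }
        return frequent;
    }

    /** Step 2: rules "X -> y" with confidence = support(X with y) / support(X) >= minConfidence. */
    static Map<String, Double> rules(Map<Set<String>, Integer> frequent, double minConfidence) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<Set<String>, Integer> e : frequent.entrySet()) {
            if (e.getKey().size() < 2) {
                continue;
            }
            for (String consequent : e.getKey()) {
                Set<String> antecedent = new TreeSet<>(e.getKey());
                antecedent.remove(consequent);
                // Every subset of a frequent item set is frequent, so the lookup succeeds.
                double confidence = e.getValue().doubleValue() / frequent.get(antecedent);
                if (confidence >= minConfidence) {
                    out.put(antecedent + " -> " + consequent, confidence);
                }
            }
        }
        return out;
    }
}
```

A typical call chains the two steps: `rules(frequentItemSets(transactions, 2), 0.7)`. Weka, listed under Tools, ships its own association rule learners, which a real implementation would more likely use than hand-written code like this.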
Modules:
1. Analysis and design of data sets/items
2. Data preprocessing
3. Sequential pattern mining
4. Record matching with Web databases
5. Performance analysis