Introduction to Text Analytics algorithms and Support Vector Machines (SVM) for modelling Text Analytics applications. Includes: Who is Treparel / Introduction to Text Mining / What is automated Classification and Clustering / Support Vector Machines (SVM)
Support Vector Machines (SVM) - Text Analytics algorithm introduction 2012
1. Introduction to Text Mining & Support Vector Machines (SVM)
Dr. Anton Heijs
CEO
Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com
July 2012
2. KMX enables information and knowledge professionals
to gain faster, more reliable and more precise insights into large,
complex unstructured data sets, allowing them to make
better-informed decisions.
Treparel is a leading technology solution provider in
Big Data Text Analytics & Visualization
Treparel KMX – All rights reserved 2012 www.treparel.com 2
3. Topics covered in this presentation
• Who is Treparel?
• Introduction to Text Mining
• What is Automated Classification & Clustering?
• Introducing Support Vector Machines
4. Nexus of Forces: Social, Cloud, Mobile, Information
IT Market shift driving Big Data challenges
Copyright: Gartner, 2011
80% of data is Unstructured (Documents, Text, Images, Graphs)
5. About Treparel
• Delft, The Netherlands, 2006.
• Treparel is an innovative technology solution provider in Big Data
Analytics, Text Mining and Visualization.
• KMX is an integrated data analysis toolset which provides faster,
more reliable insights into large, complex unstructured data sets,
allowing companies to make better-informed decisions.
• Clients: Philips, Bayer, Abbott, European Patent Office, European
Commission
• Part of Research Centers and University ecosystem; TU Delft,
Universities of Paris and Sao Paulo
• More info: www.treparel.com
6. Positioning of Treparel’s KMX technology
[Figure: text analytics pipeline in three stages]
• Text Acquisition & Preparation ('Seek'): external sources (patents, legal, media and publishing, research) and other sources (documents, websites, blogs, newsfeeds, email, application notes, search results, social networks)
• Analysis and processing ('Model'): text preprocessing, indexing, clustering, classification, semantic analysis, information extraction (entities, facts, relationships, concepts, patents)
• Output and display ('Adapt'): reporting & presentation, databases, content management systems, line-of-business applications, research applications, search engines, visualization
• Across all stages: Management, Development and Configuration
Copyright: Gartner, J. Popkin 2010
7. Getting to know the basics
PART A: Intro to Text Mining
• The Data (text & image) Mining evolution
• What is Data Mining: inside or outside the database
• The Data Mining process
• Two types of Data Mining tasks: Predictive and Descriptive
• Two modes of Data Mining tasks: Supervised and Unsupervised
• The most important algorithms per category
PART B: SVM
• Machine Learning & Support Vector Machines (SVM)
• What makes SVM unique
• When and How to deploy SVM
• Case Studies & Examples
8. The Data/Text/Image mining evolution
The Road ahead
[Figure: application value (low to high) plotted against usability (hard to use to easy to use).
Labels shown: Query and Reporting; OLAP; Traditional Data Mining Tools; "Easy-to-Use" Data Mining; SVM Predictive Modeling; Analytical Modeling; Enterprise Text Analytics.
Periods shown: 1980's, 1990's, 1995 - 2000, Today, Future.]
9. Knowledge Mining
Different levels of depth in knowledge discovery
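[Figure: levels of knowledge discovery over time, from Data Collection (Seek) at the base to Visualization (Adapt) at the top.
Levels shown: meta data, filtered data, models of meta data, models of data, models of semantic data.
Techniques shown under Knowledge Discovery: Data Mining, Text Mining, Graph Mining.]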
10. What is Data Mining?
Getting to know the basics
• Most businesses have an enormous amount of data, with a great deal of
information hidden within it. The data is also growing faster than the knowledge
extracted from it, which leads to a growing gap between
data and knowledge.
• Data mining provides a way to automatically extract information buried in the
data.
• Data Mining creates mathematical models which describe patterns in large,
complex collections of data.
• These patterns often elude traditional statistical analysis because of the large
number of attributes, the complexity of the patterns, or the difficulty of performing
the analysis.
• Mining the data directly in the database has advantages:
less data movement, more data security, and a single source of the
data.
• Basically two types of data exist:
– Structured (tables & numbers): 20% of data volume
– Unstructured (text, images): 80% of data volume
11. The Data & Text Mining process
Automating the mining steps; adding new features
Understanding the knowledge mining value chain
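[Figure: the knowledge mining value chain as a sequence of steps, running from data understanding, data collection, preparation and cleansing, through algorithm selection, model building and testing, and model deployment, to model generation (all models) and visualization coordination.
Two spans are marked along the chain: "Treparel's Focus & Core competence" and "Traditional Players".]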
12. 2 types of Data Mining Functions
Predictive Data Mining (supervised):
• Are used to predict a value; they require the specification of a
target (known outcome)
• Targets are either binary attributes (indicating yes/no decisions) or
multi-class targets indicating a preferred alternative (color of
sweater, salary range).
• Constructs one or more models; these models are used to predict
outcomes for data sets
Descriptive Data Mining (Unsupervised):
• Are used to find the intrinsic structure, relations, or affinities in
data.
• Describes a data set in a concise way and presents interesting
characteristics of the data
• The functions are: clustering, association models, and feature
extraction
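As an illustration of a descriptive (unsupervised) function, the following is a minimal k-means clustering sketch in plain Python. It is not taken from the presentation: the choice of k-means, the function name `kmeans`, the deterministic start from the first k points, and the toy term-count vectors are all assumptions made for the example.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters; no target labels are needed."""
    # Assumption: start from the first k points so the run is deterministic.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, point in enumerate(points):
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical documents as counts of two terms, e.g. ("drug", "antenna").
documents = [[5, 0], [0, 4], [4, 1], [1, 5], [6, 1], [0, 6]]
assignments, centroids = kmeans(documents, k=2)
print(assignments)
```

A predictive function would instead be given a known target for each document, as in the classification slides that follow.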
13. How does Automated Classification & Clustering
work?
• Classification consists of dividing the items that make up a collection into
categories or classes.
• The goal is to accurately predict the target class for each
record in new data.
• Algorithms for classification: different algorithms for
different problems
Naïve Bayes
Adaptive Bayes Network
Support Vector Machine
Decision Tree
Classification is used in: customer segmentation, sentiment
analysis, competitive analysis, business modeling, credit
analysis, Smart content, Fraud and terrorist detection,
Diagnosis support, Patent & Drug discovery
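To make the idea of predicting a target class for new records concrete, here is a small sketch of the first algorithm in the list, Naïve Bayes, applied to text. It is an illustrative toy, not the KMX implementation: the whitespace tokenizer, the Laplace smoothing, the function names and the four training sentences are assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_naive_bayes(labelled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][token] + 1  # Laplace smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: (document text, known target class).
training = [
    ("patent claims a novel drug compound", "pharma"),
    ("clinical trial of the drug shows results", "pharma"),
    ("patent covers a wireless antenna design", "electronics"),
    ("circuit and antenna for mobile devices", "electronics"),
]
model = train_naive_bayes(training)
print(predict(model, "new drug compound trial"))
print(predict(model, "antenna circuit design"))
```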
14. Text Mining algorithms and features
Feature | Naive Bayes | Adaptive Bayes Network | Support Vector Machine | Decision Tree
Speed | Very fast | Fast | Fast with active learning | Fast
Accuracy | Good in many domains | Good in many domains | Significant | Good in many domains
Transparency | No rules (black box) | Rules for | No rules (black box) | Rules
Missing value interpretation | Missing value | Missing value | Sparse data | Missing value
15. What is Support Vector Machine Learning?
State of the Art algorithm
• SVM is a state of the art classification and regression algorithm
• The SVM optimization procedure maximizes predictive accuracy
while automatically avoiding over-fitting the training data
• SVM projects the input data into a kernel space. Then it builds a
linear model in this kernel space
• SVM performs well with real world applications such as
classifying text, recognizing hand-written characters, classifying
images, as well as bioinformatics and bio sequence analysis.
• SVMs are standard tools for machine learning and data mining
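The kernel-space bullet above can be illustrated with a short sketch. It shows that a degree-2 polynomial kernel gives the same number as an ordinary dot product in an explicitly constructed feature space, which is why a linear model in kernel space can behave non-linearly in the input space. The two kernels, the function names and the example vectors are assumptions chosen for illustration; the presentation does not say which kernels KMX uses.

```python
import math


def polynomial_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x . z) ** 2."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def explicit_feature_map(x):
    """The feature space that the degree-2 kernel implies for 2-D input."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel; its feature space is never built explicitly."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


x, z = [1.0, 2.0], [3.0, 0.5]

# Kernel evaluated directly on the inputs.
kernel_value = polynomial_kernel(x, z)

# The same quantity as a plain dot product after mapping both inputs.
explicit_value = sum(
    a * b for a, b in zip(explicit_feature_map(x), explicit_feature_map(z))
)

print(kernel_value, explicit_value)
```

The kernel computes the similarity without ever materialising the mapped vectors, which matters when the implied feature space is very large.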
16. What is Support Vector Machine Learning?
Classical Data Mining vs SVM
Classical Statistics | SVM - Support Vector Machines
Hypothesis on data distribution | Study of the model family: the VC dimension
Large number of dimensions implies large number of model parameters, which leads to generalization problems | Number of dimensions can be very high because generalization is controlled
Modeling seeks to get the best fit | Modeling seeks to get the best compromise between fit and robustness
Manual iterations and time are necessary | Automation is possible
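The compromise between fit and robustness can be sketched as a linear SVM trained by sub-gradient descent on a regularised hinge loss. This is a teaching sketch under stated assumptions, not the solver used in KMX: the bias term is omitted, and the function names, the constant learning rate `rate`, the regularisation weight `reg` and the toy data are all hypothetical.

```python
def decision_value(weights, x):
    """Signed score of x under the linear model (no bias term in this sketch)."""
    return sum(w * xi for w, xi in zip(weights, x))


def train_linear_svm(samples, labels, reg=0.01, rate=0.1, epochs=100):
    """Reduce reg/2 * ||w||^2 plus the hinge loss, one example at a time."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * decision_value(weights, x)
            # Robustness term: shrink the weights a little on every step.
            weights = [w * (1.0 - rate * reg) for w in weights]
            # Fit term: only examples inside the margin change the model.
            if margin < 1.0:
                weights = [w + rate * y * xi for w, xi in zip(weights, x)]
    return weights


def predict(weights, x):
    """Class +1 or -1 depending on the side of the decision boundary."""
    return 1 if decision_value(weights, x) >= 0 else -1


# Hypothetical term-count vectors with labels +1 / -1.
samples = [[2, 0], [3, 1], [0, 2], [1, 3]]
labels = [1, 1, -1, -1]
weights = train_linear_svm(samples, labels)
print(predict(weights, [4, 1]), predict(weights, [1, 4]))
```

Raising `reg` favours robustness (smaller weights, wider margin); lowering it favours fit to the training examples.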
17. What makes SVM such a unique technology?
• Strong theoretical foundation (Vapnik-Chervonenkis theory)
• There is no upper limit on the number of attributes; the only constraint is
the hardware
• Good generalization to novel data
• SVM is the preferred algorithm for sparse data
• Algorithm of choice for challenging high-dimensional data
• SVM supports active learning.
– SVM models grow as the size of the training set increases, so big data
sets would otherwise be difficult to handle.
– Active learning forces the SVM algorithm to restrict learning to the
most informative training examples.
• SVM automatically selects a kernel
• You can control both the model quality (accuracy) and the performance
(build time)
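One common way to pick "the most informative training examples" is uncertainty sampling: choose the examples that lie closest to the current decision boundary. The sketch below assumes that strategy and a linear model; the presentation does not state which selection rule KMX applies, and the function names, weights and pool of vectors are hypothetical.

```python
def decision_value(weights, bias, x):
    """Signed score of x; values near zero are close to the boundary."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def select_most_informative(weights, bias, unlabelled, batch_size=2):
    """Return the batch_size examples the current model is least sure about."""
    ranked = sorted(
        unlabelled, key=lambda x: abs(decision_value(weights, bias, x))
    )
    return ranked[:batch_size]


# Hypothetical current model and pool of unlabelled term-count vectors.
weights, bias = [0.5, -0.5], 0.0
pool = [[4, 0], [2, 2], [0, 5], [3, 2], [1, 1]]

queries = select_most_informative(weights, bias, pool)
print(queries)
```

Only the selected examples would be sent for labelling and added to the training set, which keeps the model and the build time small.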
18. What makes SVM unique?
SVM gives you control over the models
[Figure: robustness (low to high) plotted against quality of fit (low to high accuracy)]
• Under Fit Model: high robustness; training error = test error
• Robust Model: high robustness; low training error, low test error
• Over Fit Model: low robustness; no training error, high test error
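The three regions of this chart can be told apart by comparing training error with test error. The helper below is a rough sketch of that check; the function name and both thresholds (`max_error`, `max_gap`) are assumptions and would need tuning for any real data set.

```python
def diagnose_fit(train_error, test_error, max_error=0.1, max_gap=0.05):
    """Classify a model as under fit, over fit or robust from its error rates."""
    gap = test_error - train_error
    if gap > max_gap:
        # Much worse on unseen data than on training data: low robustness.
        return "over fit"
    if train_error > max_error:
        # Errors agree but both are high: robust, yet a poor fit.
        return "under fit"
    return "robust"


print(diagnose_fit(0.30, 0.31))  # similar, high errors
print(diagnose_fit(0.00, 0.35))  # no training error, high test error
print(diagnose_fit(0.04, 0.06))  # low errors on both sets
```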
19. What makes SVM unique?
SVM gives you control over the models
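[Figure: quadrant chart of robustness (low to high) against quality (low to high).
Quadrant labels: Safe to Deploy (high robustness, high quality); Need more training data (rows); Need more variables (columns) or different model type; Need more data (rows/columns) or different model type.]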
20. Treparel is a leading technology solution provider
in Big Data Text Analytics & Visualization
Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com