BigInsights and Text Analytics.
As enterprises seek operational efficiencies and competitive advantage through greater use of analytics, much of the new information they need to analyze is found in text documents and, increasingly, across a wide variety of social media sites and portals. A critical step in gaining insight from this information is extracting core data from huge volumes of text, so that the data is available to downstream analytics, mining, and machine learning tools. AQL (Annotation Query Language) is a declarative, rule-based language for extracting information from text documents.
2. Scenario
Source and analyze blogs and news articles about a popular brand or service across various social media sites
− “IBM Watson”
Analytics include
− Watson applications by industry and within an industry
− Watson association with Jeopardy!
− Simple sentiment/tone scoring
3. Scenario
Process
− Collect data
− Transform and subset
− Develop and test a Text Analytics extractor using Eclipse
− Publish and deploy the extractor to a BigInsights cluster
− Apply the Text Analytics extractor from BigSheets
− Analyze and chart the results
4. Text Analytics
Identify and extract structured information from unstructured and semi-structured text
To enable analytics
− chart, report, join, aggregate, slice, dice and drill, model, mine…
5. Text Analytics
80% of the world’s data is unstructured or semi-structured text
Social media is rife with information about products and services
− Discussions, blogs, tweets…
Applications often lock up useful information in blobs, description fields and semi-structured records that are difficult or impossible to open up for analysis
− Call center records, log files…
How do you get a metrics-based understanding of facts from unstructured text?
I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry now !!!
@rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle
I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR
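One way to picture the step from raw posts to countable facts is the small Python sketch below. It is not AQL and not the BigInsights engine; the product list and the helper name `extract_facts` are assumptions made up for illustration. It pulls product mentions and user handles out of the three sample posts and counts the products.

```python
import re
from collections import Counter

# The three sample posts quoted on the slide, typos included.
POSTS = [
    "I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry now !!!",
    "@rakonturmiami im moving to miami in 3 months. i look foward to the new lifestyle",
    "I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR",
]

# Hypothetical product dictionary (an assumption, not from the deck).
PRODUCTS = ("iphone", "blackberry")
PRODUCT_RE = re.compile(r"\b(" + "|".join(PRODUCTS) + r")\b", re.IGNORECASE)
HANDLE_RE = re.compile(r"@\w+")


def extract_facts(post):
    """Return the product mentions and user handles found in one post."""
    return {
        "products": [m.group(0).lower() for m in PRODUCT_RE.finditer(post)],
        "handles": HANDLE_RE.findall(post),
    }


facts = [extract_facts(p) for p in POSTS]

# A simple metric: how often each product is mentioned across all posts.
product_counts = Counter(name for f in facts for name in f["products"])
```

Once the mentions are rows with named fields, they can be charted, joined and aggregated like any other structured data.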
6. BigInsights & Streams Text Analytics
High-performance, rule-based information extraction engine
Highly scalable solution for at-rest and in-motion analytics
Pre-built extractors and a toolkit for building custom extractors
Declarative information extraction (IE) system based on an algebraic framework
Sophisticated tooling to help build, test, and refine rules
Developed at IBM Research since 2004
Embedded in several IBM products
7. Applications of Text analytics
Broad range of applications in many industries
− CRM Analytics - Voice of customer, Product and Services gap analysis, Customer churn
− Social Media Analytics - Purchase intent, Customer churn prediction, Reputational Risk
− Digital Piracy - illegal broadcast of streaming and video content
− Log Analytics - Failure analysis and root cause identification, Availability assurance
− Regulatory Compliance - Data Redaction to Identify and protect sensitive information
8. Deploy to Streams and BigInsights
[Diagram: an extractor written in the AQL language passes through the optimizer to produce a compiled plan, the Text Analytics Module. The module runs in Streams and on a BigInsights cluster, turning input documents into extracted information for downstream integration and processing.]
9. Developing an Extractor
Top down
− Select documents to work with
− Label examples of interesting text
− Label clues or elements within or around the examples
Bottom up
− Create or refine AQL to extract basic features
− Create or refine AQL to generate candidate concepts
− Create or refine AQL to filter and consolidate
10. AQL
Annotation Query Language
− SQL-like
  Familiar syntax and concepts make it easier to learn and understand
− Declarative
  Describes what computation should be performed and not how to compute it
  Separates semantics from implementation
− Compiled and optimized for execution
  Text Analytics Module (TAM) is deployed to the cluster for execution by the Text Analytics run time
11. AQL
Fundamental concepts
− Views
  Created with Select or Extract expressions
  Are not materialized unless explicitly requested using ‘output view <name>’ or ‘select into’
  The ‘Document’ view identifies the set of input documents
  − select… from Document d
12. AQL
Fundamental concepts
− Extract expressions
  Typically used to extract basic features
  Extract from columns in other views including the text column in the Document view
  Basic capabilities include extraction using regex, dictionary and sequence
  Other operations include splits, blocks and parts of speech
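To make regex and dictionary extraction concrete, here is a minimal Python analog. It is a sketch of the idea only, not AQL syntax and not the Text Analytics runtime; the `Span` type, both helper functions, the sample sentence and the industry list are assumptions introduced for illustration. Each match is returned as a span with begin and end offsets, which is the kind of value later steps combine and filter.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    begin: int  # offset of the first matched character
    end: int    # offset just past the last matched character
    text: str   # the matched text itself


def extract_regex(pattern: str, text: str) -> List[Span]:
    """Return one span per regex match, in document order."""
    return [Span(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]


def extract_dictionary(entries: List[str], text: str) -> List[Span]:
    """Return one span per case-insensitive, whole-word dictionary hit."""
    # Longer entries first so that multi-word entries win over their prefixes.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(e) for e in ordered)
    return extract_regex(r"(?i)\b(?:" + alternation + r")\b", text)


# Hypothetical input document.
doc = "IBM Watson won Jeopardy! in 2011. Watson now supports healthcare."

years = extract_regex(r"\b(?:19|20)\d{2}\b", doc)               # basic feature: a year
industries = extract_dictionary(["healthcare", "finance"], doc)  # basic feature: an industry
```

Keeping the offsets alongside the text is what allows later rules to ask positional questions, such as whether one match follows another.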
13. AQL
Fundamental concepts
− Select expressions
  Typically used to combine, aggregate and filter extracted fields to create candidate concepts and final values
  Select existing columns and extract from columns
  − Specified using <from list>
  Rich set of operators and clauses
  − where, consolidate, group by, order by, and limit clauses are optional
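The combine-and-filter role of a select expression can be sketched in plain Python as follows. This is an illustration only: `follows` is a simplified stand-in loosely modelled on AQL's span predicates, `consolidate_contained_within` approximates the idea behind the 'ContainedWithin' consolidation policy, and the offsets refer to a hypothetical document "IBM Watson in healthcare".

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    begin: int
    end: int


def follows(a: Span, b: Span, max_gap: int) -> bool:
    """True when b starts at or after the end of a, at most max_gap characters later."""
    return 0 <= b.begin - a.end <= max_gap


def join_follows(left: List[Span], right: List[Span], max_gap: int) -> List[Tuple[Span, Span]]:
    """Join two views: keep every (a, b) pair that satisfies the predicate."""
    return [(a, b) for a in left for b in right if follows(a, b, max_gap)]


def consolidate_contained_within(spans: List[Span]) -> List[Span]:
    """Drop exact duplicates and any span lying entirely inside another span."""
    unique = sorted(set(spans))
    return [
        s for s in unique
        if not any(o != s and o.begin <= s.begin and s.end <= o.end for o in unique)
    ]


# Offsets in the hypothetical text "IBM Watson in healthcare".
brand_candidates = [Span(0, 10), Span(4, 10)]  # "IBM Watson" and the nested "Watson"
industries = [Span(14, 24)]                    # "healthcare"

brands = consolidate_contained_within(brand_candidates)
pairs = join_follows(brands, industries, max_gap=5)
```

Consolidating before joining avoids reporting the same brand-industry association twice, once for "IBM Watson" and once for the nested "Watson".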
14. Select vs Extract
Which do I use when?
− Both have a <select list>
− But you can only specify an <extraction specification> in an extract expression
− Both have a <from list>
− You can apply simple predicate-based filters in the <having clause> of an extract expression or in the <where clause> of a select expression
− But you can only use predicates to combine rows from views (join) using the <where clause> of a select expression
− You can apply a <consolidation policy> or a <limit> in either an extract or a select expression
− But you can only <group> and <order> using a select expression

extract
  <select list>,
  <extraction specification>
from <from list>
[having <having clause>]
[consolidate on <column> [using '<policy>' [with priority from <column> [priority order]]]]
[limit <maximum number of output tuples for each document>];

select
  <select list>
from <from list>
[where <where clause>]
[consolidate on <column> [using '<policy>' [with priority from <column> [priority order]]]]
[group by <group by list>]
[order by <order by list>]
[limit <maximum number of output tuples for each document>];
15. Select vs Extract
If you need to extract – use an extract expression
If you need to group, order or join – use a select expression
17. Acquire the Data
Source social media data from BoardReader, an IBM business partner with a commercial offering that provides a searchable archive of various web-based data sources
19. Transform and Export using BigSheets
Extract a subset of social media data from a BigSheets workbook populated with data from IBM’s sample BoardReader application.
Inside a BigSheets workbook, press the 'Export As' button and export the workbook using the aspects specified to DFS
Download this file to the local FS of the Eclipse development environment to use as sample input data for text analytics development
20. Building a Text Analytics Extractor
Working in the Eclipse environment, you will build an Extraction Plan and use the Extraction Tasks workflow to develop and test a simple extractor
27. Additional Analytics
Develop and deploy additional extractors
− Understand Watson applications in Healthcare
− Understand the link with Jeopardy!
− Understand the tone/sentiment
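A simple tone score of the kind mentioned above can be approximated by counting hits against two word lists. The Python sketch below is an assumption-laden illustration, not the extractor built in this deck: the word lists, the tokenizer and the function names are all hypothetical, and a real extractor would be written in AQL with dictionaries tuned to the data.

```python
import re

# Hypothetical tone dictionaries (assumptions for illustration only).
POSITIVE = {"great", "impressive", "love", "win", "helpful"}
NEGATIVE = {"bad", "dead", "fail", "slow", "hate"}


def tone_score(text: str) -> int:
    """Positive-word hits minus negative-word hits, case-insensitive."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def tone_label(score: int) -> str:
    """Map a numeric score onto a coarse label suitable for charting."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `tone_label(tone_score("I had an iphone, but it's dead"))` yields "negative" with these word lists. Word counting ignores negation and sarcasm, so it is only a starting point for scoring tone.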
28. Additional Resources
Big Data Hub
http://www.ibmbigdatahub.com/
DeveloperWorks
http://www.ibm.com/developerworks/bigdata/
Big Data and Analytics on YouTube
http://www.youtube.com/ibmbigdata
Big Data University
http://www.bigdatauniversity.com/