Presented by: Dr. Bruce Aldridge, Sr. Industry Consultant Hi-Tech Manufacturing, Teradata
TIBCO Spotfire and Teradata: First to Insight, First to Action; Warehousing, Analytics and Visualizations for the High Tech Industry Conference
July 22, 2013 The Four Seasons Hotel Palo Alto, CA
1. Teradata Proprietary and Confidential
BIG DATA ANALYTICS MEANS
“IN-DATABASE” ANALYTICS
Dr. Bruce Aldridge
Sr. Industry Consultant
Hi-Tech Manufacturing
Teradata
760.458.1376
bruce.aldridge@teradata.com
2. 2 7/30/2013 Teradata Confidential
Overview of Topics
• “Big Data” Analytics
> The problems of extreme data
> Key principles for analytic engines
• Analytic Technologies
> Changing from sequential to parallel
> Design for analytics
• Operationalizing Analytics
> Analytic life cycle management
> Visualization / interacting
3. 3 7/30/2013 Teradata Confidential
What is “Big Data”?
Big Data: any information that’s too fast,
too large or doesn’t fit what you are using
Data Explosion
> Automation of equipment
and business processes
> Sensor integration
> Communication
(networks / web)
> Compliance
4. 4 7/30/2013 Teradata Confidential
Using Big Data
• Collecting data and using
data are different things
> Data Lakes serve as high-volume,
low-cost repositories for collection
> Data may be semi-structured
or structured, frequently with the
conversion happening within the repository
> Large amounts of data may be stored for
reporting, compliance or investigations
• Unusual or new events provide learning
(Most “big data” will not provide new
information or knowledge)
5. 5 7/30/2013 Teradata Confidential
Guidelines for Big Data
• Collecting ≠ learning ≠ Using data
> Data stored on appropriate system for use
> Data mining and statistical tools for learning
> Model publication (PMML) & monitor for deployment
> Visualization tools critical for all
6. 6 7/30/2013 Teradata Confidential
Extreme data brings new challenges
• New techniques to limit variables for
analysis / modeling
• Emergence of columnar analytics
• Wealth of data results in
more variables than
responses
$y_m = f(x_1, x_2, x_3, x_4, \ldots, x_n)$
where $n > m$
• Data organization struggles
with wide data
(>100,000 columns)
Wide table 1:
Id  V1   V2   V3  V4  V5   V6  V7   V8   V9
AA  1.2  3.1  41  56  ‘a’  9   0.2  ?    ?
AB  0.9  2.7  41  62  ‘a’  8   0.2  1.1  7
BA  1.0  2.9  42  57  ‘b’  9   0.1  1.1  ?

“pivot” of table 1 (rows shown are a sample):
Id  Col ID  Val
AA  V1      1.2
AA  V3      41
AB  V1      0.9
AB  V8      1.1
AB  V2      2.7

Wide table 2:
Id  V1   V2   V3  V4  V5   V6  V7   V8   V9
CA  1.2  3.1  41  56  ‘a’  9   0.2  ?    ?
CB  0.9  2.7  41  62  ‘a’  8   0.2  1.1  7
BB  1.0  2.9  42  57  ‘b’  9   0.1  1.1  ?

“pivot” of tables 1 and 2 (multiple tables add more rows):
Id  Col ID  Val
AA  V1      1.2
AA  V3      41
AB  V1      0.9
AB  V8      1.1
AB  V2      2.7
CA  V4      56
CB  V3      41
BB  V1      1.0
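Illustrative sketch (not from the original slides): the "pivot" from wide to tall layout can be expressed in a few lines of Python. The helper name `wide_to_tall` and the use of "?" as the missing-value marker are assumptions taken from the example tables; a production system would do this inside the database.

```python
# Hypothetical helper: convert wide rows into tall (Id, Col ID, Val) rows.
MISSING = "?"  # the example tables mark missing values with "?"

def wide_to_tall(header, rows, missing=MISSING):
    """Yield (id, column_name, value) for every non-missing cell."""
    _id_name, *columns = header
    for row in rows:
        row_id, *values = row
        for column, value in zip(columns, values):
            if value != missing:
                yield (row_id, column, value)

header = ["Id", "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9"]
table_1 = [
    ["AA", 1.2, 3.1, 41, 56, "a", 9, 0.2, "?", "?"],
    ["AB", 0.9, 2.7, 41, 62, "a", 8, 0.2, 1.1, 7],
    ["BA", 1.0, 2.9, 42, 57, "b", 9, 0.1, 1.1, "?"],
]
table_2 = [
    ["CA", 1.2, 3.1, 41, 56, "a", 9, 0.2, "?", "?"],
    ["CB", 0.9, 2.7, 41, 62, "a", 8, 0.2, 1.1, 7],
    ["BB", 1.0, 2.9, 42, 57, "b", 9, 0.1, 1.1, "?"],
]

# Multiple wide tables simply append more rows to the same tall table.
tall = list(wide_to_tall(header, table_1))
tall += list(wide_to_tall(header, table_2))
```

Missing cells produce no row at all, so the tall layout stays compact when most of the columns are empty. The value column mixes numbers and text here; a real table would need one storage type for it.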
7. 7 7/30/2013 Teradata Confidential
Technology Requirements for
“Big Data” Analytics
• Need for large amounts of data storage
• Ability to get at the data (SQL)
• Availability of tools for
> Visualization
> Characterizing, organizing
and cleaning data
> Summarizing (descriptive statistics)
> Analyzing (predictive models,
data discovery)
> Monitoring & reporting
• Analytic Fault Tolerance (massive
systems imply more failures)
• Dynamic growth – ability to add more
capability without “starting over” –
mixing technologies
• ROI
8. 8 7/30/2013 Teradata Confidential
Analytic Tools
• Faster analytics require a different approach – Parallel
> Sequential processing will be limited
> Parallel analytics distributes calculations across multiple nodes
with each node having the data necessary
> Management of calculation (distribution) and collection
• Because data is generally stored on multiple nodes, there is no choice but to bring the analytics to the data.
[Diagram labels: Data; Analytic Modeling Tools; Business Results; Local Data repository; Parallel Analytic Procedures; Simple reporting / management tools]
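Illustrative sketch (not from the original slides): one way to picture "bringing the analytics to the data" is to let each node reduce its own data to a small summary and have a coordinator combine the summaries. The threads below only stand in for nodes, and the names `local_summary` and `combine` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def local_summary(values):
    """Runs 'on the node': reduce local data to (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(summaries):
    """Runs on the coordinator: merge the small per-node summaries."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return n, mean, variance

# Hypothetical data that is already distributed across three nodes.
node_data = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0],
    [6.0, 7.0, 8.0, 9.0],
]

# Distribution of the calculation, then collection of the results.
with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
    summaries = list(pool.map(local_summary, node_data))

count, mean, variance = combine(summaries)
```

Only three numbers per node travel to the coordinator, regardless of how many rows each node holds. The sum-of-squares formula is kept for brevity; it can lose precision on large or badly scaled data.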
9. 9 7/30/2013 Teradata Confidential
Putting it all together: Analytic Architecture
[Architecture diagram labels]
Data sources: AUDIO & VIDEO | IMAGES | TEXT | WEB & SOCIAL | MACHINE LOGS | CRM | SCM | ERP
LOW COST – HIGH CAPACITY PARALLEL DATA LAKE: CAPTURE | STORE | REFINE
FLEXIBLE ANALYTIC / DISCOVERY PLATFORM
REPORTING / MONITOR SYSTEM OF RECORD - DATA WAREHOUSE
Tools: LANGUAGES | MATH & STATS | DATA MINING | BUSINESS INTELLIGENCE | APPLICATIONS
Environment for:
• Low Cost high capacity storage
• High power analytics
• Fault tolerant high performance reporting
• Exploration / visualization across all areas
Visualization
exploration
10. 10 7/30/2013 Teradata Confidential
Data Preparation
Transform, clean and
aggregate data to form data
set suitable for analysis
Monitor / Model
Deployment
Deploy statistical model to run
routinely, automatically
monitoring for control
Data Exploration
Explore all data with statistical
profiling and visualization
Understand / Model
the data
Apply mathematical /
relational models to test
hypotheses about the data
Modeling ADS
Sample
Data
Build
ADS
Production ADS
Automated process
Analytics Process
SQL In-dbs
Function
PMML or
UDF Models
11. 11 7/30/2013 Teradata Confidential
• Business / Data understanding
> Defining objectives and requirements of
the business
> Data collection and data profiling /
characterization
• Data preparation – joins between
tables, attribute selection, cleaning,
building new values
• Modeling: Analytic algorithms
applied and parameters adjusted
• Evaluation: results scored according
to objectives and requirements
• Deployment: Models and
parameters put into on demand or
automatic execution on new data
Analytic discovery process
CRISP-DM: Cross-Industry
Standard Process for
Data Mining
12. 12 7/30/2013 Teradata Confidential
Analysis – The generation of knowledge
Generation of knowledge is iterative and interactive
• An idea related to a problem or observation is formulated
• Data is collected to support or refute the idea (deduction – what kind of
data is necessary?)
• Analysis is made on the data to validate or refute (induction)
• Results either support / reject idea or suggest modifications
Monitoring
• Known analytic models used for prediction / verification
• Adjust / control based on prediction vs. observation
• Business scoring used to prioritize
Data (facts, phenomena)
Idea (model, hypothesis, theory, conjecture)
Monitor / control | Validate | Revise
13. 13 7/30/2013 Teradata Confidential
Establishing a Robust Environment
Quality Information
Master Data Management
Data Profiling / visualization
Logical/Physical Model
Data “correction”
Data Steward/Cleansing processes
Discovery
Statistical / Data Mining Tools
Secure access
Robust Analysis capabilities
Visualization / understanding
Clear and significant results
Flexibility in data and models
Automation and Alerts
Simple publication of discovery
knowledge
Automated pattern/anomaly detection
Business scoring for notification and
escalation
Clear communication of results
Visualization and Reports
Choice of the tools to match needs
(e.g. Dashboard vs. Engineering views)
Timing and need for data refresh
Reporting on Core or staging
Consistent use of metrics/results (e.g.
analytics in database vs. at the
reporting layer)
14. 14 7/30/2013 Teradata Confidential
Analytics – Key Requirements
• Performance:
> Parallel processing - true shared nothing architecture
> Data structure influences analytics (order of magnitude)
> Management of analytics and data critical
• Fault tolerance
> More nodes WILL result in more failures
> Analytic Fault Tolerance is more than database fault
tolerance – the ability to avoid restarting the analytics
• Different node performance
> Execution in parallel will never be identical – adjust for node
differences
> System expansions must be compatible
• Flexible analytics
> Big data analytics combine queries with analytic functions
> Analytic languages not parallel (in general) – need ability to
add / customize new functions
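Illustrative sketch (not from the original slides): "analytic fault tolerance" can be pictured as checkpointing per-partition results so that a failure does not force the whole analysis to restart. The function `run_with_checkpoints`, the `fail_on` hook and the partition names are hypothetical and only simulate a node failure.

```python
def run_with_checkpoints(partitions, analyze, completed, fail_on=None):
    """Analyze each partition once, recording results in `completed`.

    `completed` maps partition id -> result and plays the role of a
    checkpoint store; partitions already present are skipped on a rerun.
    `fail_on` is a hypothetical hook used only to simulate a node failure.
    """
    for part_id, rows in partitions.items():
        if part_id in completed:
            continue  # finished before the failure; do not recompute
        if part_id == fail_on:
            raise RuntimeError(f"simulated failure on partition {part_id}")
        completed[part_id] = analyze(rows)
    return completed

partitions = {"node1": [1, 2, 3], "node2": [4, 5], "node3": [6]}
calls = []

def analyze(rows):
    calls.append(tuple(rows))  # track how much work was actually done
    return sum(rows)

checkpoint = {}
try:
    run_with_checkpoints(partitions, analyze, checkpoint, fail_on="node2")
except RuntimeError:
    pass  # node1 is already checkpointed at this point

# The rerun resumes with node2 and node3; node1 is not analyzed again.
results = run_with_checkpoints(partitions, analyze, checkpoint)
```

Each partition is analyzed exactly once across both runs, which is the property that distinguishes analytic fault tolerance from simply restarting the job after the database recovers.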
15. 15 7/30/2013 Teradata Confidential
Analytic Applications
• Existing parallel analytics
> In-database proprietary
> In-database add-ons (Fuzzy Logix, SAS, Partial R)
> Hybrid (Aster) – Database architecture
supporting MapReduce functions
• Many existing applications are moving to parallel execution
> SAS: Partnered with Teradata for seamless in-database
execution of more analytics
> R: Partnered with Revolution R for rapid data
extraction and execution of some analytics
in-database AND in parallel
> Spotfire: Execution of aggregation analytics and
ability to define in-database analytic functions.
Embedded TERR (TIBCO Enterprise Runtime for R)
• Write your own
> MapReduce framework
> User defined functions
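Illustrative sketch (not from the original slides): the "write your own" option can be shown with a user-defined aggregate function. Python's built-in `sqlite3` module is used here purely as a stand-in for an in-database UDF mechanism; the aggregate name `spread` and the `readings` table are hypothetical, and a Teradata UDF would be written and registered differently.

```python
import sqlite3

class Spread:
    """User-defined aggregate: max - min of the values in a group."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        # Called once per row in the group; NULLs are ignored.
        if value is None:
            return
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        # Called once per group to produce the aggregate result.
        return None if self.low is None else self.high - self.low

conn = sqlite3.connect(":memory:")
conn.create_aggregate("spread", 1, Spread)  # register the UDF with the engine
conn.execute("CREATE TABLE readings (lot TEXT, val REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("A", 1.0), ("A", 4.0), ("B", 2.0), ("B", 2.5), ("B", 7.0)],
)

# Once registered, the custom function is called from ordinary SQL.
rows = conn.execute(
    "SELECT lot, spread(val) FROM readings GROUP BY lot ORDER BY lot"
).fetchall()
```

After registration the function is invoked like any built-in aggregate, so any tool that can issue SQL can use it.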
16. 16 7/30/2013 Teradata Confidential
Analytic Libraries and Enhancements
Database built in:
• Descriptive Statistics
• Basic data mining models
(regression, cluster, trees, PCA)
• User defined functions
Partners
• Revolution R, SAS, Fuzzy
Logix, Spotfire, …
Enhancements
• High Speed connections
• “Native” data storage
17. 17 7/30/2013 Teradata Confidential
[Dashboard figure labels: Device; Lot; Raw Data; Wafer]
Dashboard as an Analytic Tool
• The Dashboard becomes a 2-way interface
• User interaction parameterizes and launches new
analytics
18. 18 7/30/2013 Teradata Confidential
Integration of “Dashboard”
• Reporting / visualization tool with ability to execute custom
functions in-database
> Empower all users - ability to publish in-database analytics to users
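Illustrative sketch (not from the original slides): a two-way dashboard can be thought of as a catalogue of published, parameterized analytics that the user's selections fill in and launch. `sqlite3` stands in for the database; the `PUBLISHED` catalogue, the `launch` function and the `measurements` table are hypothetical names.

```python
import sqlite3

# Hypothetical catalogue of published analytics: name -> parameterized SQL.
PUBLISHED = {
    "mean_by_wafer": (
        "SELECT wafer, AVG(val) FROM measurements "
        "WHERE lot = :lot GROUP BY wafer ORDER BY wafer"
    ),
    "count_above": (
        "SELECT COUNT(*) FROM measurements "
        "WHERE lot = :lot AND val > :threshold"
    ),
}

def launch(conn, name, **params):
    """Run a published analytic with parameters chosen in the dashboard."""
    return conn.execute(PUBLISHED[name], params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (lot TEXT, wafer INTEGER, val REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("L1", 1, 10.0), ("L1", 1, 12.0), ("L1", 2, 20.0), ("L2", 1, 5.0)],
)

# A user clicks lot "L1" in the dashboard; the selection becomes a parameter.
by_wafer = launch(conn, "mean_by_wafer", lot="L1")
n_high = launch(conn, "count_above", lot="L1", threshold=11.0)
```

Users pick from named analytics and supply values only, so the SQL itself stays under the control of whoever published it, and the values are bound as parameters instead of being pasted into the query text.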
19. 19 7/30/2013 Teradata Confidential
Monitor Analytics
• Analytic models generally are
published into SQL compatible
queries
• Applying models to data involves:
> Gather and format data for analytic
> Group data into consistent sets
> Screen data
> Apply algorithms
> Evaluate results
• Complete Sequence applied to
> Massive amounts of analyses
> Repetitive / automated analyses
> Scoring / Triage to identify most
significant results
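Illustrative sketch (not from the original slides): publishing a model "into SQL" can be as simple as rendering its formula as a SQL expression, after which screening, scoring and triage all run inside the database. The helper `linear_model_to_sql`, the coefficients and the `units` table are hypothetical, and `sqlite3` stands in for the warehouse.

```python
import sqlite3

def linear_model_to_sql(intercept, coefficients):
    """Render a linear model as a SQL expression over column names.

    Column names are assumed to be trusted identifiers, not user input.
    """
    terms = [repr(float(intercept))]
    terms += [f"{float(w)!r} * {col}" for col, w in coefficients.items()]
    return " + ".join(terms)

# Hypothetical model: score = 1.0 + 0.5 * temp + 2.0 * pressure
score_sql = linear_model_to_sql(1.0, {"temp": 0.5, "pressure": 2.0})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE units (unit TEXT, temp REAL, pressure REAL)")
conn.executemany(
    "INSERT INTO units VALUES (?, ?, ?)",
    [("u1", 10.0, 1.0), ("u2", 30.0, 2.0), ("u3", 20.0, 0.5), ("u4", None, 1.0)],
)

# Screen the data, apply the model, then triage: highest scores first.
ranked = conn.execute(
    f"SELECT unit, {score_sql} AS score FROM units "
    "WHERE temp IS NOT NULL AND pressure IS NOT NULL "
    "ORDER BY score DESC LIMIT 2"
).fetchall()
```

The same query can be scheduled for repetitive runs, and the `ORDER BY ... LIMIT` clause is one simple form of the scoring and triage step.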
20. 20 7/30/2013 Teradata Confidential
An Analytic Monitor approach
[Workflow diagram labels]
Components: Stage Data; Standard ETL; Core EDW Data Model; Group Description Table; Group Instance Table; View for Instances; Views for Data (Group Instance); Work Tables (optional); Result Table; Workflow Status / Control Table; Alert settings; Reporting, BI & Alert management tools
Notes:
• User direct edit of Group Description Table (very infrequent)
• Creation of Views to evaluate load data (installation / dba level user)
• ETL starts Analytic Stored Procedure and verifies completion
Analytic procedures:
1) Update Group Instance Table with new / changed data
2) Identify new & core data for required calculations
3) Perform calculations with standard or custom libraries
4) Compare results to business rules or statistical tests
5) Update alert and report flags
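Illustrative sketch (not from the original slides): the five analytic-procedure steps can be traced in a small in-memory example. All names (`run_monitor`, the group names, the limits) are hypothetical, and plain dictionaries stand in for the Group Instance, core data and result tables that a stored procedure would update.

```python
from statistics import mean

def run_monitor(stage_rows, group_instances, core_data, limits):
    """One pass of the five-step analytic procedure (hypothetical names).

    stage_rows:      list of (group, value) newly loaded by ETL
    group_instances: dict group -> number of rows seen so far (updated)
    core_data:       dict group -> list of historical values (updated)
    limits:          dict group -> (low, high) business rule on the mean
    """
    # 1) Update the group instance table with new / changed data.
    new_groups = set()
    for group, _ in stage_rows:
        group_instances[group] = group_instances.get(group, 0) + 1
        new_groups.add(group)

    # 2) Identify new and core data needed for the calculations.
    for group, value in stage_rows:
        core_data.setdefault(group, []).append(value)

    results = {}
    for group in sorted(new_groups):
        # 3) Perform the calculation (here: a plain mean).
        group_mean = mean(core_data[group])
        # 4) Compare the result to the business rule.
        low, high = limits[group]
        alert = not (low <= group_mean <= high)
        # 5) Update alert and report flags.
        results[group] = {"mean": group_mean, "alert": alert}
    return results

instances = {"pump_a": 2}
core = {"pump_a": [10.0, 11.0]}
limits = {"pump_a": (9.0, 12.0), "pump_b": (0.0, 5.0)}
stage = [("pump_a", 12.0), ("pump_b", 7.0)]

result = run_monitor(stage, instances, core, limits)
```

Only groups touched by the new load are recalculated, which keeps repeated runs cheap when most groups are unchanged.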
21. 21 7/30/2013 Teradata Confidential
Statistical summaries vs……
Powerful in-DB Analysis enables the use of Simple BI Tools
22. 22 7/30/2013 Teradata Confidential
Data Graphs
Powerful in-DB Analysis enables the use of Simple BI Tools
Analysis of Big Data: 76M rows of telemetry (16 of 159 plots/units shown) graphed and stored in-database for evaluation.
Engine hrs. vs. date
23. 23 7/30/2013 Teradata Confidential
Summary
• Analytics on “Big Data” will require one or more high
performance (parallel) systems connected to an
interactive interface
• Support combinations of high volumes (data
lakes), high performance and flexible advanced
analytics
• Tools for understanding,
cleansing, discovery
AND monitoring
necessary
• Interactive Visualization
support across all
systems
• Management of
analytics and fault
control