Ibm swg day 2012 jhb big data (white)

Big Data
Simon Jeggo
24 May 2012

© 2011 IBM Corporation IBM Confidential

Agenda
What is Big Data

Some Big Data Use Cases

IBM’s Big Data Platform


What is
Big Data


The Big Data Challenge – a Term defined
 “Big Data is a term applied to data sets that are large, complex and dynamic (or a combination thereof) and for which there is a
requirement to capture, manage and process the data set in its entirety, such that it is not possible to process the data using
traditional software tools and analytic techniques within tolerable time frames.”
 New technologies that bring cost effective approaches to explore, understand and predict better business outcomes
 MPP databases
 Streams
 In-database analytics
 Apache Hadoop
 Cloud computing platforms
 Archival storage systems
 Why something different?
 Data x Computation > typical warehouse
 Schema Flexibility
 Programming Flexibility
[Diagram labels: Automate, Integrate, Secure]

 We are engaged with over 50 clients, working with them to apply big data techniques to a class of problems -- e.g., text analytics, log analysis,
customer insights, fraud detection etc.
 We have a set of unique value-adds – JAQL, GPFS, System-T and others coming…
 And we can make Big Data fit into our clients' complex IT environments

In 2005 there were 1.3 billion RFID
tags in circulation…

…by the end of 2011, this was about
30 billion and growing even faster

An increasingly sensor-enabled and instrumented
business environment generates HUGE volumes of
data with MACHINE SPEED characteristics…

1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!

350B transactions/year

Meter reads every 15 min.
120M – meter reads/month; 3.65B – meter reads/day

 In August of 2010, Adam Savage,
of “Myth Busters,” took a photo
of his vehicle using his
smartphone. He then posted the
photo to his Twitter account
including the phrase “Off to
work.”

 Since the photo was taken by his
smartphone, the image
contained metadata revealing the
exact geographical location the
photo was taken

 By simply taking and posting a
photo, Savage revealed the exact
location of his home, the vehicle
he drives, and the time he leaves
for work

The Social Layer in an Instrumented Interconnected World
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones world wide
• 12+ TBs of tweet data every day
• 100s of millions of GPS enabled devices sold annually
• ? TBs of data every day
• 25+ TBs of log data every day
• 2+ billion people on the Web by end 2011
• 76 million smart meters in 2009… 200M by 2014

Twitter Tweets per Second Record Breakers of 2011
 Social-media analytics can be
used from healthcare to
predicting votes

 Challenges
– Volume
– Velocity
– Variety
– Language Processing: consider that
Twitter sentences are not well
formed and often use
urban talk


Extract Intent, Life Events, Micro Segmentation Attributes
[Figure: sample social media posts from users (Chloe, Tom Sit, Tina Mu, Jo Jobs) annotated with extracted attributes: Name, Birthday, Family; Location; Relocation; Monetizable Intent; Wishful Thinking; Not Relevant - Noise; SPAMbots]

Watson’s advanced analytic capabilities can sort through the equivalent of 200
MILLION pages of data to uncover an answer in 3 SECONDS.

[Figure: data volume chart with marks at 1 ZB and 1.8 ZB]
1 ZB = 1 trillion GB
4 trillion 8 GB iPods


Big Data Use Cases

Cisco turns to IBM big data for intelligent infrastructure management
• Optimize building energy consumption with centralized monitoring
• Automate preventive and corrective maintenance

Capabilities Utilized:
• Streaming Analytics
• Hadoop System
• Business Intelligence

Applications:
• Log Analytics
• Energy Bill Forecasting
• Energy consumption optimization
• Detection of anomalous usage
• Presence-aware energy mgt.

Applications for Big Data Analytics
• Smarter Healthcare
• Multi-channel sales
• Finance
• Log Analysis
• Homeland Security
• Traffic Control
• Telecom
• Search Quality
• Manufacturing
• Trading Analytics
• Fraud and Risk
• Retail: Churn, NBO


Retail Industry

 Issues for the Retail Industry
 Deliver value to empowered customers
 Move from market analysis to understanding individuals
 Take charge of growing volume, velocity and variety of data
 Foster lasting connections
 Focus on relationships, not just transactions
 Invest in expanding the corporate brand
 Capture value, measure results
 Developing complete understanding of the point of sale
 Build new skills and solutions


Use Case: Social Media Analytics
Problem
 As consumers continue to adopt social media technologies, businesses must be able to track customer sentiment and brand perception, finding
new opportunities and avoiding business problems from negative perceptions

Structured/Unstructured data
Solution
 Social Media Analytics
 What consumers and the industry are saying

 Optimizing Internal Operations
 Better utilization of tools for web analytics
 Decreased latency for analysis

 Predictive Analytics
 Promotion targeting for offers
 Prospect harvesting
 POS analytics, predictive and discovery

 Competitive Intelligence
 Unlock information across the web
What is our next best offer?


Warehouse Off-load Use Case: Transactional Analytics

Problem
 Retailers have massive amounts of transaction data that offers a wealth of information about customer purchasing behavior in stores
 This data isn't being used effectively because of its volume, the cost to store it, and the barriers to analyzing massive data

Solution
 Store POS transactions in BigInsights, reducing the cost from
traditional data warehousing
 BigInsights enables ad-hoc query for historical reporting, trend
analysis, and analyst needs
 Data mining feeds for store and customer segmentation, market
basket analysis, promotion targeting and other analytics
based solutions
 Historical POS made available for analysis of new product
introductions, new store openings, and other disruptive business
events
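
As a rough illustration of the market basket analysis mentioned above, the sketch below counts how often pairs of items appear on the same POS receipt. It is plain Python, not a BigInsights API; the sample baskets and the function name `pair_counts` are assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so ("bread", "milk") and ("milk", "bread") share one key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical POS transactions, one list of items per receipt.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

At retail scale the same counting step would be distributed across a cluster; the per-basket logic stays the same.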


FSS - Customer Correspondence Analytics
Problem
 Current approaches limit insight and predictive analytics to structured data, losing the “state”
of the customer
 Human-based review of correspondence is limited to small scale sampling
 Results of sampling are too dependent on the skills of the reviewer and cannot learn from information sets outside of that
reviewer's knowledge
 Detecting and acting on rapidly changing customer sentiment and understanding why a service touch is occurring
from the customer POV
 The need to take cost out of service touch points while improving effectiveness/intimacy

Solution
 Use of un-instrumented or under-instrumented information sources to identify and head-off issues
• Extends risk modeling to underutilized sources such as email, chat, social media, call center, and CSR interactions and notes
 Move from small scale sampling to 100% coverage using BigInsights and cross correlation of information sources
– Natural language analytics combined with machine learning to identify opportunities and issues that are not apparent in small sample sizes
and human awareness.
 Use of sophisticated natural language analytics to develop a predictive understanding of customer actions based on
customer state
– Topic and sentiment extraction from email, chat, social media, call center, and CSR interactions and notes to predict call reasons and next
best action


FSS - Risk Platforms and Analytics
Problem
 Real-time analytics and need to meet SLA windows are outstripping existing infrastructure capabilities
 Burst-oriented trading close volumes and resulting position analytics are expanding faster than traditional technologies can
cost effectively meet
 Standard policies of flushing the data after hours or days are not meeting risk modeling needs
 Web, unstructured and machine generated data do not fit existing relational analytics tools
 SQL is not the natural tool to manipulate untapped information sources that can improve the dimension of risk modeling
 The changing nature of risk requires flexibility in sizing, speed and methods that are not easy to respond to with existing
SQL based platforms

Solution
 Predict, identify and triage risk anomalies in real-time
– Use of SystemT and SystemML analytics engines to identify problems based on historical data and then push those
models to Streams
 Use of BigInsights to ingest and analyze hundreds of TB an hour to meet SLA requirements for high
volume and complex trading operations
 Use of un-instrumented or under-instrumented information sources to identify and head-off issues
• Extends risk modeling to underutilized sources such as email, chat, social media, call center, and CSR interactions
and notes


FSS - Social Media Analytics
Problem
 Important source of information, but requires new approaches to collecting, storing, understanding and utilizing the value to
be found.
 Fuzzy and messy data are the norm
 Little if any of the information is easily structured
 Reconciling external and internal sources
 Identifying individuals among the fog of external data is not easily done but is often necessary
 Linking to known individuals requires Entity analytics concepts and capabilities

Solution
 Ability to acquire, parse, analyze, link and persist external information sources to a variety of analytics
platforms
– Use of SystemT and SystemML analytics engines to digest and make sense of external sources

 Sophisticated text/language analytics to allow powerful and accurate understanding of the external
sources
– Entity resolution capabilities to match external sources to known customers and groups
– Graphical interfaces to quickly explore data sets, test hypotheses, create production jobs and synthesize data sources
from multiple disparate internal and external sources
– Ability to push normalized data to Netezza for analytics with existing methods and tools
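
To make the entity resolution idea above more concrete, here is a minimal sketch that matches a name found in an external source against a list of known customers using fuzzy string similarity. It uses Python's standard `difflib` rather than SystemT or any IBM entity analytics product; the customer names, the `best_match` helper and the 0.8 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def best_match(external_name, known_customers, threshold=0.8):
    """Return (customer, score) for the closest known customer, or None below threshold."""
    best, best_score = None, 0.0
    for customer in known_customers:
        score = SequenceMatcher(
            None, normalize(external_name), normalize(customer)
        ).ratio()
        if score > best_score:
            best, best_score = customer, score
    return (best, best_score) if best_score >= threshold else None


# Hypothetical customer master list and two names seen in external data.
known = ["Jonathan Smith", "Maria Garcia", "Wei Chen"]
match = best_match("jonathon  smith", known)   # misspelled, extra space
miss = best_match("Zed Quux", known)           # no plausible customer

print(match)
print(miss)
```

Real entity resolution usually combines several attributes (location, handles, relationships) rather than names alone; a single similarity score is only the simplest starting point.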


Explosion of data in Telecom
From 500PB per month 2011
To 5,000PB per month 2016
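
The two figures above imply a compound annual growth rate that can be worked out directly. The short calculation below uses only the slide's numbers (500 PB and 5,000 PB per month, 2011 to 2016); the variable names are illustrative.

```python
start_pb, end_pb = 500, 5_000      # PB per month in 2011 and 2016, from the slide
years = 2016 - 2011

growth_factor = end_pb / start_pb              # overall growth over the period
cagr = growth_factor ** (1 / years) - 1        # implied compound annual growth rate

print(f"Overall growth: {growth_factor:.0f}x over {years} years")
print(f"Implied annual growth: {cagr:.1%}")    # about 58.5% per year
```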


Explosion of Data for Telecom
• > 2 Billion Internet users (2011)
• AT&T Global Network carries 24 Petabytes of data PER DAY
• 5 Billion Mobile Phones WW
– 550K Android phones activated every day
• Twitter processes 7 terabytes of data every day
• Facebook processes 10 terabytes of data every day
• Skype: 300 Million Min of Video Calls Per Month
• YouTube – Massive bits through Networks
– 48 Hours of Video uploaded per min
– 3 Billion views per day

How to lower network costs ($/GB)?
How to improve data revenue ($/GB)?

[Chart: traffic volume, network cost ($/bit), revenues (value/GB) and the resulting profitability gap plotted over time, moving from voice dominant to data dominant]

Telecoms need to be smarter….. smarter networks and smarter business models
All Telecom Enterprises have BIG DATA CHALLENGES

Churn Prediction and Targeted Offers
with Social Media Text Analytics

Problem
 Lost revenue and increased customer acquisition cost are directly related to churn
 Customers churn not only because of pricing, but also because of service level, new tech offerings, service offerings, and
customer perception
 Significant challenge increasing ARPU
 Revenue per customer is much harder to increase as competition increases
 Current churn prediction systems are not up to the challenge
 Too slow and not using social media data

Solution

 Improve churn prediction using social media
– Analyze social media on its own or with current warehouse/BI analytics to predict churn quicker (real-
time) and more accurately
– BigInsights Text Analytics is the key to finding new analytics and Streams for RT alerts
 Discover ARPU opportunities directly from social media
– New source of customer intent and sentiment will drive new revenue opportunities
– Real time feedback to marketing systems or warehouse/BI to place offers quickly
– Finding ready-to-buy customers


Real Time CDR Analytics and Ingest

Problem
 Gathering CDRs, mediating them into relevant data, and moving them to analytical systems is slow and
costly
 By the time CDR data is mediated and ingested by data warehouses, the ability to respond to problems is significantly
reduced.
 Systems tend to be old and require extensive application maintenance and hardware
 Real-time billing cannot be achieved: it requires handling billions of CDRs per day and de-duplication against 15
days' worth of CDR data
Solution

 Big Data Streams Telecommunications Mediation and Analytics (TMA) offering
– Real-time CDR processing
– Real-time analytics and dashboard
– Unparalleled price/performance benefits
– Connectors to Warehouse and BigInsights
 Real-Time dashboards include:
– Dropped calls by high priority customers, location, providers, etc
– Terminated calls by location and customer type
– Revenue monitoring by voice and SMS
 The solution will enable novel Business Intelligence applications

CDR Analytics with Extended Data

Problem
 Telecom is experiencing an explosion of data from 3G and LTE (4G) network traffic. CDRs are used almost
exclusively for billing systems because storing and analyzing them was too expensive with EDW and BI
alone.
 Competition driving the need for focus on:
 customer retention
 customer profitability
 No connection between CDR, Web, and other data, making everything from fraud detection to targeted
marketing to ad optimization difficult and expensive
Solution
 BigInsights for cost-effective storage of original data and large-scale text analytics
– Stores unstructured, untyped data ingested with no data model
– Discovery and Analytics tools are built into BigInsights – Machine Learning extensions
– Integration to Netezza and DB2. JDBC to other databases
 Big Data Streams Telecommunications Mediation and Analytics (TMA) offering
– Real-time CDR processing can be extended to other data sources – fast and low cost
 Netezza integration opens Big Data solutions to warehouse and BI
– Deep analytics and model development
– Can act as a high performance operational data store

Ad Effectiveness Analysis with Social Media

Problem
 Telecom and Media spend large sums of money on advertising. Measuring the effectiveness of the ads is
difficult, and almost impossible online without costly services
 Service providers are slow with responses and expensive
 Current ad analysis is mostly guesswork and intuition – not lending itself to timely decisions
 Enterprises are demanding better ROI from ad budgets and proof of effectiveness of each ad campaign
 To increase effectiveness, enterprises have to react in near-real-time

Solution

 BigInsights used for social media ingest and fast analysis
– Answers questions like what was the awareness, who did we reach, and what was the reaction to an
ad in a few hours vs weeks
– Lets ad departments react: modify, localize, and focus
 Streams for real-time ad analysis extending predictive models for fast reaction
 React very quickly to ad effectiveness
1. Adjust ad budgets
2. Tailor ads to geography
3. Alter messaging
4. Adjust targeted and direct marketing initiatives

Why IBM
for Big Data
The Solution Side


The IBM Big Data Platform

Hadoop
• InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume

Information Integration
• InfoSphere Information Server: High volume data integration and transformation

Stream Computing
• InfoSphere Streams: Low Latency Analytics for streaming data

MPP Data Warehouse
• IBM InfoSphere Warehouse: Large volume structured data analytics
• IBM Netezza High Capacity Appliance: Queryable Archive Structured Data
• IBM Netezza 1000: BI+Ad Hoc Analytics on Structured Data
• IBM Smart Analytics System: Operational Analytics on Structured Data


A Big Data Platform

Analytics Excellence
• Text Analytics Toolkit
• Machine Learning Toolkit
• Industry Accelerators
• Development Tooling
• Visualization Tooling
• Deployment Tooling (“App Store”)
• $14B in 5 yrs. on Analytics

In-Motion Operational Excellence
• Unrivalled….

At-Rest Operational Excellence
• Harden Hadoop - GPFS
• Embrace and Extend
• Surface Area Lock Down
• Policy Driven Retention & Immutability
• Role-Based Security
• Adaptive MapReduce
• Workload Manager
• Fast Splittable CMX Compression
• REST-exposed Administration

IBM Big Data Platform

In-Motion
• Analyze extreme amounts of data in milliseconds
• Uses same analytics as BigInsights
• Data can be analyzed on the way into the enterprise for earlier pattern detection

At-Rest (Open Source Hadoop)
• Beyond traditional structured data
• BigInsights uses same analytics as Streams
• Not forked, not ported: Hadoop Extended with operational excellence and security

MPP Data Warehouses
• Netezza for in-database MapReduce

Stream Computing: A new paradigm for ultra low latency
and high throughput in-motion analytics
Continuous Ingestion Continuous Queries /Analytics on data in motion
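
A continuous query over data in motion can be pictured as a computation that updates its answer each time a new reading arrives, without first storing the full data set. The sketch below shows a sliding-window average in plain Python; it is not Streams Processing Language, and the function name and sample readings are assumptions for illustration.

```python
from collections import deque


def sliding_average(readings, window_size):
    """Yield the mean of the most recent `window_size` readings as each one arrives."""
    window = deque(maxlen=window_size)
    total = 0.0
    for value in readings:
        if len(window) == window_size:
            total -= window[0]      # this value is about to be evicted by append()
        window.append(value)
        total += value
        yield total / len(window)


# Any iterable works as the source, including an unbounded generator.
averages = list(sliding_average([10, 20, 30, 40], window_size=2))
print(averages)  # [10.0, 15.0, 25.0, 35.0]
```

Because the function is a generator, each result is available as soon as its input arrives, which is the property that distinguishes in-motion analytics from batch processing.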


Data In Motion
 Information used to be aggregated and analyzed every 30-60
minutes and discarded after 72 hours
 Analyzing 1000 pieces of unique medical diagnostic information
per second, stored in a dynamic model
 Perspective: 20% drop in mortality of control group in trials
(extend approach to daily activities)
- 120 children monitored:120K messages/sec…billions/day
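
The message volumes quoted above can be checked with a quick calculation. The inputs are the slide's own figures; the variable names are illustrative.

```python
messages_per_second = 120_000      # from the slide
monitored_children = 120           # from the slide
seconds_per_day = 24 * 60 * 60

per_child_per_second = messages_per_second / monitored_children
messages_per_day = messages_per_second * seconds_per_day

print(per_child_per_second)        # 1000.0, matching "1000 pieces ... per second"
print(f"{messages_per_day:,}")     # 10,368,000,000, i.e. "billions/day"
```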


Data In Motion

 Hear what’s going on miles away to optimize
perimeter displacements

 Perspective: Try to find the word “Zero” in a
1000 MP3 song library in a fraction of a second
– Figure out the difference between the sound of a
human whisper and the wind


Data In Motion – Improving What They
Already Have
 Old Microsoft-based solution not able to keep up with
new 3G demands for their real-time xDR analysis
business requirements

 Streams and Netezza solution proposed
– Time to merge and load data reduced 90%+
– Time to market for new products from 4 hours to minutes

Internal Use Only Reference

How Text Analytics Works

World Cup 2010 Highlights

In the Football World Cup 2010, one team distinguished themselves well, losing to the eventual
champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a
breakaway, but the keeper for Spain, Iker Casillas, made the save. Winger Andres Iniesta scored
for Spain for the win.

Extracted entities:
Name             Position   Country
Arjen Robben     Striker    Netherlands
Iker Casillas    Keeper     Spain
Andres Iniesta   Winger     Spain
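
The step from the passage to the table of players can be sketched with a few hand-written extraction rules. The example below uses Python regular expressions, not the IBM Text Analytics Toolkit; the three patterns, the `extract_players` helper and the fixed list of positions are assumptions written only to fit this one passage.

```python
import re

TEXT = (
    "Netherlands' striker, Arjen Robben, had a breakaway, but the keeper for Spain, "
    "Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win."
)

# Building blocks shared by the rules (all hypothetical and deliberately narrow).
NAME = r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
ROLE = r"(?P<role>[Ss]triker|[Kk]eeper|[Ww]inger)"
COUNTRY = r"(?P<country>[A-Z][a-z]+)"

RULES = [
    re.compile(rf"{COUNTRY}' {ROLE}, {NAME}"),          # "Netherlands' striker, Arjen Robben"
    re.compile(rf"the {ROLE} for {COUNTRY}, {NAME}"),   # "the keeper for Spain, Iker Casillas"
    re.compile(rf"{ROLE} {NAME} scored for {COUNTRY}"), # "Winger Andres Iniesta scored for Spain"
]


def extract_players(text):
    """Apply each rule and return (name, position, country) in order of appearance."""
    found = []
    for rule in RULES:
        for m in rule.finditer(text):
            found.append(
                (m.start(), m.group("name"), m.group("role").capitalize(), m.group("country"))
            )
    return [row[1:] for row in sorted(found)]


players = extract_players(TEXT)
for name, position, country in players:
    print(f"{name:<16}{position:<10}{country}")
```

Rules this specific break on unseen phrasing, which is why production text analytics relies on dictionaries, grammars and learned models rather than a handful of patterns.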


IBM Text Analytics Toolkit Lets You…
 Build out world-class text analysis applications 50% faster than with manual methods
 Run faster text analysis (10x or more vs. some marketplace alternatives)
 Get more precise and correct answers (2x vs. some marketplace alternatives)


What is BigSheets?
Browser-based Big Data analytics tool for business users

Big Data Challenges…
 Business users need a no programming approach for analyzing Big Data
 Extremely difficult to find actionable business insights in data from multiple sources with
different formats
 Translating untapped data into actionable business insights is a common requirement that
requires visualization

How can BigSheets help?
 Spreadsheet-like discovery interface lets business users easily analyze Big Data with ZERO
PROGRAMMING
 BUILT-IN “readers” can work with data in several common formats
– JSON arrays, CSV, TSV, Web crawler output, . . .
 Users can VISUALLY combine and explore various types of data to identify “hidden” insights


Big Data Made Easy for the Little Guy

 USC’s Film Forecaster correctly predicted a
clamor for "Hangover 2” that resulted in $100
million opening over Memorial Day weekend
– Looked at 250K-500K Tweets and broke down
positive and negative messages using a lexicon
of 1700 words

The Film Forecaster sounds like a big
undertaking for USC, but it really came
down to one communications master's
student who learned BigSheets in
a day, then pulled in the tweets and
analyzed them - Ryan Kim
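
The lexicon approach described above, splitting tweets into positive and negative using a word list, can be sketched in a few lines. This is plain Python rather than BigSheets, and the tiny word lists, sample tweets and `classify` function are assumptions for illustration; the Film Forecaster's actual 1700-word lexicon is not reproduced here.

```python
import re

# Tiny stand-in lexicon (hypothetical); the real one reportedly had about 1,700 words.
POSITIVE = {"love", "great", "hilarious", "awesome"}
NEGATIVE = {"boring", "awful", "terrible", "skip"}


def classify(tweet):
    """Label a tweet by counting lexicon hits among its lower-cased words."""
    words = re.findall(r"[a-z']+", tweet.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tweets = [
    "Can't wait, the trailer was hilarious!",
    "Sequels are always boring. Skip it.",
    "Going to the movies on Monday.",
]
labels = [classify(t) for t in tweets]
print(labels)  # ['positive', 'negative', 'neutral']
```

Word counting of this kind ignores negation and sarcasm, so it works best when aggregated over many messages rather than trusted for any single tweet.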


Why IBM for Big Data?
 Only IBM is showing data-in-motion and data-at-rest analytics: a bigger more
opportunistic view of Big Data

 Development and research sit side by side
 Virtualization tooling, development, file system, analytics
 Not just same company: same org, same people, same leadership

 BigInsights being used in
IBM products today such
as Cognos Consumer Insight


Without a Big Data Platform You Code…
• Event Handling
• Custom SQL and Scripts
• Multithreading
• Check Pointing
• Application Management
• HA
• Performance Optimization
• Debug
• Security

IBM Big Data Platform
• Accelerators, Toolkits, Connectors
• Over 100 sample applications and toolkits with industry focused toolkits with 300+ functions and operators!
• Streams provides development, deployment, runtime, and infrastructure services

“TerraEchos developers can deliver applications 45% faster due to the agility
of Streams Processing Language…”
– Alex Philip, CEO and President

THINK

https://w3-connections.ibm.com/wikis/home?lang=en_US#/wiki/Info%20Mgmt%20Client%20Technical%20Professional%20Resources%20Wiki/page/Understanding%20Big%20Data