3. Consider the following:
• New model for data
• Accessible over TCP/IP and from a variety of languages
• Initially difficult to understand
• Capable of processing thousands of ops/sec
• Very different from old model
• Threatening as much was invested in old model
• Changing course seems ridiculous
Source: Eben Hewitt
5. IBM IMS
“IMS is IBM's premier transaction and hierarchical database
management system, virtually unsurpassed in database and
transaction processing availability and speed” – IBM 2013
“Mission-critical processing that requires unparalleled
performance is best served by a hierarchical model. Analytics
and business intelligence are best served by a relational
model. Most Fortune 100 companies use both.”
Source: IBM
6. Data evolution
A New Model Is Invented
A Disruptive Model
A Threatening Model
A Competitive Model
Source: Eben Hewitt
10. innovation complexity
confusion
a new model
disruption
fierce competition
Sound familiar?
11. Big data – a growing torrent
• $600 to buy a disk drive that can store all of the world’s music
• 5 billion mobile phones in use in 2010
• 30 billion pieces of content shared on Facebook every month
• 40% projected growth in global data generated per year vs. 5% growth in global IT spending
• 235 terabytes of data collected by the U.S. Library of Congress by April 2011
• 15 out of 17 sectors in the United States have more data stored per company than the U.S. Library of Congress
Source: McKinsey
14. Big data confusion?
What do business executives
think “big data” is?
A greater scope of information 18%
New kinds of data and analysis 16%
Real-time information 15%
Data influx from new technologies 13%
Non-traditional forms of media 13%
Large volumes of data 10%
The latest buzzword 8%
Social media data 7%
Source: IBM
15. Big data is…
Large pools of data
that can be captured,
communicated,
aggregated, stored,
and analyzed
Source: McKinsey
23. Big data innovation incubated
A search engine project at Yahoo
Doug Cutting = Nutch
Google = GFS (Google File System) and GMR (Google MapReduce)
24. eBay erected a Hadoop cluster
spanning 530 servers –
now five times the size!
“Hadoop is an amazing
technology stack. We now
depend on it to run eBay.”
Bob Page,
Vice President of Analytics, eBay
Source: http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop/
25. It can get complex
and confusing
“It replaced our need
for ETL”
“It is great for batch
processing in parallel”
“A beautiful platform
for all of problems”
26. What it’s not good for
• High volume transactional data
• Structured data with low latency
“Note that Hadoop is not an Extract-Transform-
Load (ETL) tool. It is a platform that supports
running ETL processes in parallel. The data
integration vendors do not compete with
Hadoop; rather, Hadoop is another channel
for use of their data transformation modules.”
Teradata/Cloudera Presentation
27. What it’s really good for
• Index building
• Pattern recognition
• Sentiment analysis
• Machine generated data
• Log processing
• Web scale = Google, Twitter,
YouTube
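To make the log processing item concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps that Hadoop runs in parallel across a cluster. It is plain Python, not the Hadoop API, and the log format, sample lines, and function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample log lines; the format is an assumption for illustration.
LOG_LINES = [
    "2013-01-01 10:00:01 GET /home 200",
    "2013-01-01 10:00:02 GET /cart 500",
    "2013-01-01 10:00:03 POST /cart 200",
    "2013-01-01 10:00:04 GET /home 404",
    "2013-01-01 10:00:05 GET /home 200",
]


def map_phase(line):
    """Map: emit one (status_code, 1) pair per log line."""
    status = line.split()[-1]
    yield status, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def count_statuses(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_statuses(LOG_LINES))
```

On a real cluster the map and reduce functions would run on many nodes at once, each working on its own slice of the logs; the sketch only shows the shape of the computation.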
28. Use Cases
• Fraud Detection: Spot fraud anomalies
• Mobile Data: Process mobile data
• Online Travel Reservations: Travel booking
• IT Security: Analyze machine generated data
• Image Processing: Detecting patterns in satellite imagery
• E-Commerce: Large marketplaces
• HealthCare: Semantic analysis for relevance
• Energy Discovery: Sort and process seismic data
• Energy Savings: Suggest ways customers save money
• Infrastructure Management: Collecting device logs
32. Relational is still in play
Some innovations worth a look
Dynamically Scaling OLTP = “No Need To Shard”
33. The NoSQL generation
• Document Storage Model
• Allows MTV to store hierarchical data
• Flexible schema to model structure/data by brand
• Needed to have ability to query nested content
• No need for a shared disk storage

• Apache Accumulo
• Released by NSA to open source
• Based on Google Big Table
• Built on top of Hadoop
• Fine-grained access control
• Cell level security
• Server side programming
34. Why NoSQL?
• Schemaless model = Easy to add fields
• Document oriented = JSON format (think objects)
• Built from the ground up to be distributed
• Auto sharding
• Distributed querying capabilities
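The points above can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not any real NoSQL product's API: `TinyDocStore`, `shard_for`, the keys, and the sample documents are all hypothetical. It shows documents with different fields living side by side (schemaless), JSON as the stored format, keys spread over shards by hashing, and a query that asks every shard and merges the answers.

```python
import hashlib
import json


def shard_for(key, num_shards):
    """Pick a shard by hashing the document key (stable across runs)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class TinyDocStore:
    """Hypothetical toy store; each shard is just a dict of JSON strings."""

    def __init__(self, num_shards=3):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, doc):
        # No schema check: any JSON-serializable dict is accepted.
        shard = self.shards[shard_for(key, len(self.shards))]
        shard[key] = json.dumps(doc)

    def get(self, key):
        shard = self.shards[shard_for(key, len(self.shards))]
        raw = shard.get(key)
        return json.loads(raw) if raw is not None else None

    def find(self, predicate):
        """Distributed query sketch: ask every shard, then merge results."""
        results = []
        for shard in self.shards:
            for raw in shard.values():
                doc = json.loads(raw)
                if predicate(doc):
                    results.append(doc)
        return results


store = TinyDocStore()
store.put("show:1", {"title": "Show A", "brand": "X"})
# A later document adds a nested field; no migration of "show:1" is needed.
store.put("show:2", {"title": "Show B", "brand": "X",
                     "seasons": [{"number": 1, "episodes": 10}]})
```

A call such as `store.find(lambda d: d.get("brand") == "X")` returns both documents regardless of which shard holds them.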
35. NoSQL Use Case
1. Click/Event into Hadoop
2. Data analyzed via MapReduce jobs;
generates 100M profiles based on
campaigns running
3. Selected profiles loaded into Couch
4. Ad targeting logic queries Couch with
sub-second latency to optimize
decision and real-time ad placement
Source: Couchbase
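The four steps above can be sketched end to end in a few lines. This is an illustrative single-process stand-in only: a plain dict plays the role of the low-latency store, the batch step is ordinary Python rather than a MapReduce job, and the event fields, function names, and selection rule are assumptions, not the Couchbase or Hadoop APIs.

```python
from collections import defaultdict

# Step 1 (assumed shape): click/event records as they might land in Hadoop.
EVENTS = [
    {"user": "u1", "campaign": "shoes", "action": "click"},
    {"user": "u1", "campaign": "shoes", "action": "click"},
    {"user": "u2", "campaign": "travel", "action": "view"},
    {"user": "u2", "campaign": "shoes", "action": "click"},
    {"user": "u3", "campaign": "travel", "action": "click"},
]


def build_profiles(events):
    """Step 2: batch analysis; count clicks per user per campaign."""
    profiles = defaultdict(lambda: defaultdict(int))
    for event in events:
        if event["action"] == "click":
            profiles[event["user"]][event["campaign"]] += 1
    return {user: dict(counts) for user, counts in profiles.items()}


def select_profiles(profiles, min_clicks=1):
    """Step 3: keep only profiles worth loading into the serving store."""
    return {user: counts for user, counts in profiles.items()
            if sum(counts.values()) >= min_clicks}


def choose_ad(store, user, default="generic"):
    """Step 4: one key lookup at request time, then pick the top campaign."""
    profile = store.get(user)
    if not profile:
        return default
    return max(profile, key=profile.get)


profiles = build_profiles(EVENTS)
serving_store = select_profiles(profiles)  # stand-in for the document store
```

The design point is the split: the expensive aggregation happens offline in batch, so the request path is a single key lookup, which is what makes sub-second ad decisions feasible.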
36. Hadoop Augmentation
• Side-by-Side will be commonplace
• ETL solutions support Hadoop
• Relational Databases
• Provide ETL interfaces to Hadoop
• Execute map/reduce jobs inside DBMS
• NoSQL supports ETL
37. Example Hybrid DBMS Systems
Oracle Endeca Server
• Hybrid Search/Analytic Database
• Supports structured, unstructured, semi-structured
• No schema required. Records stacked.
• Columnar
38. Trends
• SQL On Hadoop – Hadapt, Cloudera Impala, EMC
• Unified Support of Structured, Unstructured, Semi-structured
• Embedding Search
• Expanded ETL/ELT Support
• Big Data In Motion Takes Hold
• Added Data Mining and Analytic Functions In NoSQL
• Embedding R Language = gain in popularity
• Data Scientists instrumental in business success