There has been an explosion in database technology designed to handle big data and deep analytics from both established vendors and startups. This session will provide a quick tour of the primary technology innovations and systems powering the analytic database landscape—from data warehousing appliances and columnar databases to massively parallel processing and in-memory technology. The goal is to help you understand the strengths and limitations of these alternatives and how they are evolving so you can select technology that is best suited to your organization and needs.
Presentation from the O'Reilly Strata conference, February 2011
1. Determine the Right Analytic Database: A Survey of New Data Technologies
O’Reilly Strata Conference
February 1, 2011
Mark R. Madsen
http://ThirdNature.net
Twitter: @markmadsen
Atomic Avenue #1 by Glen Orbik
8. Technology Has Changed (a lot) But We Haven’t
[Chart: calculations per second per $1,000, plotted on a log scale (10^-6 to 10^10) across 1900–2000, spanning the mechanical, relay, vacuum-tube, transistor, and integrated-circuit eras. Current DW architecture and methods start in the mid-1980s; roughly a 10,000x improvement since then. Data: Ray Kurzweil, 2001.]
9. Moore’s Law via the Lens of the Industry Analyst
[Chart: CPU speed increasing steadily over time.]
14. Problem: linear extrapolation
“If the automobile had followed the same development as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year killing everyone inside.”
Robert Cringely
[Chart: a linear extrapolation of “anything” over time diverges from reality.]
24. In‐Memory Processing
1. Maybe not as fast as you think. Depends entirely on
the database (e.g. VectorWise)
2. So far, applied mainly to shared‐nothing models
3. Very large memories are more applicable to
shared‐nothing than shared‐memory systems:
shared‐memory is box‐limited (e.g. 2 TB max), while
shared‐nothing is limited by node scaling (e.g. 16
nodes at 512 GB each = 8 TB)
4. Still an expensive way to get performance
25. Columnar Databases
In a row-store model, these three rows would be stored in sequential order, packed into a block:

ID | Name          | Salary
1  | Marge Inovera | $50,000
2  | Anita Bath    | $120,000
3  | Nadia Geddit  | $36,000

In a column-store model, the same rows would be divided by column and stored in different blocks (one block of IDs, one of names, one of salaries).
This is not just a change to the storage layout; it also involves changes to the execution engine and query optimizer.
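The row/column split in the table above can be sketched in Python. This is an illustrative layout only, using the example rows from the slide; real column stores add compression, encoding, and block management on top of it.

```python
# The three example rows from the slide.
rows = [
    (1, "Marge Inovera", 50_000),
    (2, "Anita Bath", 120_000),
    (3, "Nadia Geddit", 36_000),
]

# Row store: each record's fields are packed together in one block.
row_blocks = [record for record in rows]

# Column store: each column's values are packed together in their own block.
col_blocks = {
    "id":     [r[0] for r in rows],
    "name":   [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}

# An analytic query like SUM(salary) only has to read the salary block,
# instead of scanning every full row.
total_salary = sum(col_blocks["salary"])
```

This is why columnar layouts favor scan-heavy analytic queries: a query touching one column of a wide table reads a fraction of the data a row store would.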
28. Explosion of Analytic Techniques
Advanced analytic methods now span:
• Machine learning
• Statistics
• Visualization
• GIS
• Information theory & IR
• Numerical methods
• Rules engines & constraint programming
• Text mining & text analytics
32. How Hadoop fits into a traditional BI environment
[Diagram: source environments (databases, documents, flat files, XML, queues, ERP applications) feed the data warehouse via ETL and Hadoop via file loads; developers work with development tools and IDEs, analysts with analysis tools and BI, and end users with BI and applications.]
33. NoSQL theoretically = “not only sql”, in reality…
Data stores that augment or replace relational access
and storage models with other methods.
Different storage models:
• Key‐value stores
• Column families
• Object / document stores
• Graphs
Different access models:
• SQL (rarely)
• programming API
• get/put
Reality: mostly suck for BI & analytics
Analytic DB vendors are coming from the other direction:
• Aster Data – SQL wrapped around MR
• EMC (Greenplum) – MR on top of the database
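The get/put access model named above can be sketched minimally in Python. The class and key names here are illustrative, not any specific product's API:

```python
# A minimal key-value store exposing the get/put access model.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store a value under a key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Fetch a value by key; no query language, no joins."""
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42", {"name": "Anita Bath", "segment": "premium"})
user = store.get("user:42")
```

The sketch also shows why the "mostly suck for BI & analytics" verdict holds: with only get/put, any ad hoc aggregate or join has to be written as application code that scans keys, which is exactly the work a SQL engine does for you.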
34. Some realities to consider
Cheap performance?
▪ Do you have 20 blades
lying around unused?
▪ How much concurrency?
▪ How much effort to write
queries? Debug them?
▪ Performance comparisons:
10x slower on the same
hardware?
The key is the workload type
and the scale of it.
35. Do you really need a rack of blades for computing?
Graphics co‐processors have
been used for certain problems
for years.
Offer single‐system solution to
offload very large compute‐
intensive problems.
With current technology they offer an order-of-magnitude cost reduction and an order-of-magnitude performance increase (for compute-intensive problems).
We’ve barely started with this.
36. Other Options for analytic software deployment
The basic models.
1. Separate tools and systems
(MapReduce and NoSQL are a
simple variation on this theme)
2. Integrated with a database
3. Embedded in a database
The primary arguments about
deployment models center on
whether to take data to the
code or code to the data.
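The data-to-code versus code-to-data distinction can be sketched with SQLite (a stand-in here for any SQL database; the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# Data to code: pull every row out and compute in the application.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
app_totals = {}
for region, amount in rows:
    app_totals[region] = app_totals.get(region, 0.0) + amount

# Code to data: push the computation into the database engine,
# so only the small result set crosses the wire.
db_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

Both paths produce the same totals; the difference is where the work runs and how much data moves, which is why the choice matters more as volumes grow.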
37. Leveraging the Database
Levels of database integration:
▪ Native DB connector
▪ External integration
▪ Internal integration
▪ Embedded
+ Less data movement
+ Possible dev process support
+ Hardware / environment
savings
+ Possible “sandboxing” support
‐ Limitations on techniques
39. What are factors in the decision?
User concurrency: one job or many
Repetition is a key element:
▪ Execute once and apply (build a response
or mortality model)
▪ Many executions daily (web cross‐sells)
In‐process or Batch?
▪ Batch and use results – segment, score
▪ In‐process reacts on demand – detect
fraud, recommend
In‐process requires thinking about how it
integrates with the calling application. (SQL is
sometimes not your friend.)
45. Hardware Architectures and Deployment
Compute and data sizes are the key requirements
[Chart: compute requirements (megaflops through petaflops) vs. data volume (<10s of GB through PB). Shared-everything PC or shared-disk systems cover the low end of both axes, shared-nothing systems the middle, and MR and related computations the high end.]
48. The real question: why do you want a new platform?
Trouble doing what you already do today
▪ Poor response times
▪ Not meeting availability deadlines
Doing more of what you do today
▪ Adding users, mining more data
Doing something new with your data
▪ Data mining, recommendations, embedded real‐time
process support
What’s desired is possible but limited by the cost of
supporting or growing the existing environment.
50. The assumption of the warehouse as a database is gone
                                         | Data at rest                   | Data in motion
Non-traditional data (logs, audio, docs) | Parallel programming platforms | Message streams
Traditional tabular or structured data   | Databases                      | Streaming DBs/engines
Copyright Third Nature, Inc.
53. What’s it going to cost? A small sample at list:
Solution                   | Pricing model        | Price/unit             | 1 TB solution | Remarks
Dataupia                   | Node                 | $19,500/2TB            | $19,500       | You can’t buy a 1 TB Satori server
Kickfire (out of business) | Data volume (raw)    | $50,000/TB             | $50,000       | Includes MySQL 5.1 Enterprise
Vertica                    | Data volume (raw)    | $100,000/TB            | $200,000      | Based on 5 nodes, $20,000 each
ParAccel                   | Data volume (raw)    | $100,000/TB            | $200,000      | Based on 5 nodes, $20,000 each
EXASOL                     | Data volume (active) | $1,350/GB (€1,000/GB)  | $350,000*     | Based on 4 nodes, $20,000 each
Teradata                   | Node                 | $99,000/TB             | $99,000**     | Based on 2550 base configuration

* 1 TB raw ≈ 200 GB active; ** realistic configuration likely 2x this price
55. The Path to Performance
1. Laborware – tuning
2. Upgrade – try to solve the
problem without changing
out the database
3. Extend – add an ADB or
Hadoop cluster to the
environment to offload a
specific workload
4. Replace – out with the old,
in with the new
61. About the Presenter
Mark Madsen is president of Third
Nature, a technology research and
consulting firm focused on business
intelligence, analytics and
performance management. Mark is
an award-winning author, architect
and former CTO whose work has
been featured in numerous industry
publications. During his career Mark
received awards from the American
Productivity & Quality Center, TDWI,
Computerworld and the Smithsonian
Institution. He is an international
speaker, contributing editor at
Intelligent Enterprise, and manages
the open source channel at the
Business Intelligence Network. For
more information or to contact Mark,
visit http://ThirdNature.net.
62. About Third Nature
Third Nature is a research and consulting firm focused on new and
emerging technology and practices in business intelligence, data
integration and information management. If your question is related to BI,
open source, web 2.0 or data integration then you’re at the right place.
Our goal is to help companies take advantage of information-driven
management practices and applications. We offer education, consulting
and research services to support business and IT organizations as well as
technology vendors.
We fill the gap between what the industry analyst firms cover and what IT
needs. We specialize in product and technology analysis, so we look at
emerging technologies and markets, evaluating the products rather than
vendor market positions.