
One Size Doesn't Fit All: The New Database Revolution

Slides from a webcast for the database revolution research report (report will be available at http://www.databaserevolution.com)

Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget? Register for this Webcast to find out!

Robin Bloor, Ph.D. Chief Analyst of the Bloor Group, and Mark Madsen of Third Nature, Inc. will present the findings of their three-month research project focused on the evolution of database technology. They will offer practical advice for the best way to approach the evaluation, procurement and use of today’s database management systems. Bloor and Madsen will clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.

Webcast video and audio will be available on the report download site as well.



  1. One Size Doesn’t Fit All: The New Database Revolution Mark Madsen & Robin Bloor
  2. Your Host Eric.kavanagh@bloorgroup.com
  3. Analysts Host Bloor Madsen
  4. Introduction Significant and revolutionary changes are taking place in database technology. In order to investigate and analyze these changes and where they may lead, The Bloor Group has teamed up with Third Nature to launch an Open Research project. This is the final webinar in a series of webinars and research activities that have comprised part of the project. All published research will be made available through our web site: Databaserevolution.com
  5. Sponsors of This Research
  6. General Webinar Structure: Market Changes, Database Changes (Some Of The Findings); Workloads, Characteristics, Parameters; A General Discussion of Performance; How to Select A Database
  7. Market Changes, Database Changes
  8. Database Performance Bottlenecks CPU saturation Memory saturation Disk I/O channel saturation Locking Network saturation Parallelism – inefficient load balancing
  9. Big Data = Scale Out
  10. Cloud Hardware Architecture • It’s a scale-out model. Uniform virtual node building blocks. • This is the future of software deployments, albeit with increasing node sizes, so paying attention to early adopters today will pay off. • This implies that an MPP database architecture will be needed for scale.
  11. Multiple Database Roles Now there are more...
  12. The Origin of Big Data
  13. Let’s Stop Using the Term NoSQL As the graph indicates, it’s just not helpful. In fact it’s downright confusing.
  14. NoSQL Directions Some NDBMS do not attempt to provide all ACID properties (Atomicity, Consistency, Isolation, Durability). Some NDBMS deploy a distributed scale-out architecture with data redundancy. XML DBMS using XQuery are NDBMS. Some document stores are NDBMS (OrientDB, Terrastore, etc.) Object databases are NDBMS (Gemstone, Objectivity, ObjectStore, etc.) Key value stores = schema-less stores (Cassandra, MongoDB, Berkeley DB, etc.) Graph DBMS (DEX, OrientDB, etc.) are NDBMS Large data pools (BigTable, Hbase, Mnesia, etc.) are NDBMS
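The "schema-less" point is easier to see in code. Below is a minimal, hypothetical Python sketch contrasting key-value access (any value shape under a key) with the fixed row a relational table requires; the plain dict stands in for a real store such as Cassandra or Berkeley DB.

```python
import json

# A key-value store accepts any value shape under a key: no global schema.
# A plain dict stands in here for a real store (Cassandra, Berkeley DB, ...).
kv = {}
kv["user:42"] = json.dumps({"name": "Ada", "tags": ["admin", "beta"]})
kv["user:43"] = json.dumps({"name": "Lin", "last_login": "2011-06-01"})  # different fields

# Retrieval is by key only; the application interprets the value.
user = json.loads(kv["user:42"])
print(user["name"])  # -> Ada

# A relational row, by contrast, must match one table-wide column list:
# INSERT INTO users (id, name, last_login) VALUES (42, 'Ada', NULL);
```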
  15. The Joys of SQL? SQL: very good for set manipulation. Works for OLTP and many query environments. Not good for nested data structures (documents, web pages, etc.) Not good for ordered data sets Not good for data graphs (networks of values)
  16. The “Impedance Mismatch” The RDBMS stores data organized according to table structures. The OO programmer manipulates data organized according to complex object structures, which may have specific methods associated with them. The data does not simply map to the structure it has within the database. Consequently a mapping activity is necessary to get and put data. Basically: hierarchies, types, result sets, crappy APIs, language bindings, tools
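As an illustration of the mapping activity described above, here is a small Python sketch using the standard-library sqlite3 module; the orders/order_lines schema is invented for the example. One nested object is flattened into two tables on the way in and re-assembled from a flat result set on the way out.

```python
import sqlite3

# One nested object...
order = {"id": 1, "customer": "Acme", "lines": [("widget", 2), ("gear", 5)]}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_lines (order_id INTEGER, product TEXT, qty INTEGER);
""")

# ...must be flattened into two tables on the way in (the "mapping activity"),
conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"]))
conn.executemany("INSERT INTO order_lines VALUES (?, ?, ?)",
                 [(order["id"], p, q) for p, q in order["lines"]])

# ...and re-assembled from a flat result set on the way out.
rows = conn.execute("""
    SELECT o.id, o.customer, l.product, l.qty
    FROM orders o JOIN order_lines l ON l.order_id = o.id
    ORDER BY l.rowid
""").fetchall()
rebuilt = {"id": rows[0][0], "customer": rows[0][1],
           "lines": [(r[2], r[3]) for r in rows]}
print(rebuilt == order)  # -> True, after two translation steps
```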
  17. The SQL Barrier SQL has: DDL (for data definition) and DML (for Select, Project and Join), but it has no MML (math) or TML (time). Usually result sets are brought to the client for further analytical manipulation, but this creates problems. Alternatively, doing all analytical manipulation in the database creates problems.
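A minimal sketch of the first problem: with no math in the language, the rows are shipped to the client and a 3-point moving average is computed there. The sqlite3 table and figures are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(d, 100.0 + 10 * d) for d in range(10)])

# DML gets the rows out (select/project/join), but there is no "MML"...
rows = conn.execute("SELECT day, amount FROM sales ORDER BY day").fetchall()

# ...so the analytical manipulation happens client-side, after shipping
# every row over the wire -- exactly the problem described above.
window = 3
moving_avg = [sum(a for _, a in rows[i - window + 1:i + 1]) / window
              for i in range(window - 1, len(rows))]
print(moving_avg[:3])  # -> [110.0, 120.0, 130.0]
```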
  18. Hadoop/MapReduce Hadoop is a parallel processing environment Map/Reduce is a parallel processing framework Hbase turns Hadoop into a database of a kind Hive adds an SQL capability Pig adds analytics
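For readers new to the framework, here is the canonical word-count example as a self-contained Python sketch in the style of Hadoop Streaming (where the mapper reads raw lines from stdin and the reducer receives mapper output sorted by key); the local sort below stands in for Hadoop's shuffle phase.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce: pairs arrive sorted by key, so each group is one word.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for the shuffle/sort phase Hadoop would perform.
    pairs = sorted(mapper(sys.stdin), key=lambda kv: kv[0])
    for word, total in reducer(pairs):
        print(f"{word}\t{total}")
```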
  19. Market Forces A new set of products appears. They include some fundamental innovations. A few are sufficiently popular to last. Fashion and marketing drive greater adoption. Product defects begin to be addressed. They eventually challenge the dominant products.
  20. Market forces affecting database choice Performance: trouble doing what you already do today ▪ Poor response times ▪ Not meeting data availability requirements Scalability: doing more of what you do today ▪ Adding users, processing more data Capability: doing something new with your data ▪ Data mining, recommendations, real-time Cost or complexity: working more efficiently ▪ Consolidating / rehosting to simplify and reduce cost What’s desired is possible but limited by the cost of growing and supporting the existing environment.
  21. Relational has a good conceptual model, but a prematurely standardized implementation The relational database is the franchise technology for storing and retrieving data, but… 1. Global, static schema model 2. No rich typing system 3. No concept of ordering, creating challenges with e.g. time series 4. Many are not a good fit for network parallel computing, aka cloud 5. Limited API in atomic SQL statement syntax & simple result set return 6. Poor developer support
  22. Big data? Unstructured data isn’t really unstructured. The problem is that this data is unmodeled. The real challenge is complexity.
  23. Text, Objects and Data Don’t Always Fit Together So this is what they meant by “impedance mismatch”
  24. Many new choices, one way to look at them http://blog.nahurst.com/visual-guide-to-nosql-systems
  25. What About Analytics? (Diagram: machine learning, visualization, statistics, GIS, advanced analytic methods, numerical methods, information theory & IR, rules engines & constraint programming, text mining & text analytics)
  26. The holy grail of databases under current market hype A key problem is that we’re talking mostly about computation over data when we talk about “big data” and analytics, a potential mismatch for both relational and NoSQL.
  27. Technologies are not perfect replacements for one another. When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time.
  28. Scalability and performance are not the same thing
  29. Performance measures Throughput: the number of tasks completed in a given time period. A measure of how much work is or can be done by a system in a set amount of time, e.g. TPM or data loaded per hour. It’s easy to increase throughput without improving response time.
  30. Performance measures Response time: the speed of a single task. Response time is usually the measure of an individual's experience using a system. For a single stream of work, response time = time interval / tasks completed, the inverse of throughput.
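The two measures are easy to conflate, so here is a tiny illustrative Python sketch (the numbers are invented): it computes both from the same serial run, and the closing comment notes why raising one need not improve the other.

```python
# Throughput and response time measured over the same imaginary serial run.
task_times = [0.2, 0.3, 0.25, 0.5, 0.2]   # seconds per task, hypothetical

interval = sum(task_times)                 # total wall-clock time
throughput = len(task_times) / interval    # tasks completed per second
avg_response = interval / len(task_times)  # seconds per task

print(f"throughput   = {throughput:.2f} tasks/s")
print(f"avg response = {avg_response:.2f} s")

# Doubling the hardware and running two tasks at once can double
# throughput while leaving each task's response time unchanged,
# which is why improving one measure need not improve the other.
```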
  31. Scalability vs throughput vs response time Scalability = consistent performance for a task over an increase in a scale factor
  32. Scale: Data Volume The different ways people count make establishing rules of thumb for sizing hard. How do you measure it? ▪ Row counts ▪ Transaction counts ▪ Data size ▪ Raw data vs loaded data ▪ Schema objects People still have trouble scaling for databases as large as a single PC hard drive.
  33. Scale: Concurrency (active and passive)
  34. Scalability relationships As concurrency increases, response time (usually) gets worse. This can be addressed somewhat via workload management tools. When a system hits a bottleneck, response time and throughput will often get worse, not just level off.
  35. Scale: Computational Complexity
  36. A key point worth remembering: Performance over size <> performance over complexity Analytics performance is about the intersection of both. Database performance for BI is mostly related to size and query complexity. Size, computational complexity and concurrency are the three dimensions that constrain a product’s performance. Workloads fall somewhere along all three.
  37. Solving Your Problem Depends on the Diagnosis
  38. Three General Workloads Online Transaction Processing ▪ Read, write, update ▪ User concurrency is the common performance limiter ▪ Low data, compute complexity Business Intelligence / Data warehousing ▪ Assumed to be read-only, but really read heavy, write heavy, usually separated in time ▪ Data size is the common performance limiter ▪ High data complexity, low compute complexity Analytics ▪ Read, write ▪ Data size and complexity of algorithm are the limiters ▪ Moderate data, high compute complexity
  39. Types of workloads Write-biased: ▪ OLTP ▪ OLTP, batch ▪ OLTP, lite ▪ Object persistence ▪ Data ingest, batch ▪ Data ingest, real-time Read-biased: ▪ Query ▪ Query, simple retrieval ▪ Query, complex ▪ Query-hierarchical / object / network ▪ Analytic Mixed: Inline analytic execution, operational BI
  40. Technology choice depends on workload & need Optimizing for: ▪ Response time? ▪ Throughput? ▪ Both? Concerned about rapid growth in data? Unpredictable spikes in use? Extremely low latency (in or out) requirements? Bulk loads or incremental inserts and/or updates?
  41.–46. Important workload parameters to know • Read-intensive vs. write-intensive • Mutable vs. immutable data • Immediate vs. eventual consistency • Short vs. long data latency • Predictable vs. unpredictable data access patterns • Simple vs. complex data types
  47. You must understand your workload mix - throughput and response time requirements aren’t enough. ▪ 100 simple queries accessing month-to-date data ▪ 90 simple queries accessing month-to-date data and 10 complex queries using two years of history ▪ Hazard calculation for the entire customer master ▪ Performance problems are rarely due to a single factor.
  48. Selectivity and number of columns queried Row store or column store, indexed or not? Chart from “The Mimicking Octopus: Towards a one-size-fits-all Database Architecture”, Alekh Jindal
  49. Characteristics of query workloads
      Workload                  | Selectivity | Retrieval       | Repetition  | Complexity
      Reporting / BI            | Moderate    | Low             | Moderate    | Moderate
      Dashboards / scorecards   | Moderate    | Low             | High        | Low
      Ad-hoc query and analysis | Low to high | Moderate to low | Low         | Low to moderate
      Analytics (batch)         | Low         | High            | Low to high | Low*
      Analytics (inline)        | High        | Low             | High        | Low*
      Operational / embedded BI | High        | Low             | High        | Low
      * Low for retrieving the data, high if doing analytics in SQL
  50. Characteristics of read-write workloads
      Workload           | Selectivity     | Retrieval        | Repetition | Complexity
      Online OLTP        | High            | Low              | High       | Low
      Batch OLTP         | Moderate to low | Moderate to high | High       | Moderate to high
      Object persistence | High            | Low              | High       | Low
      Bulk ingest        | Low (write)     | n/a              | High       | Low
      Realtime ingest    | High (write)    | n/a              | High       | Low
      With ingest workloads we’re dealing with write-only, so selectivity and retrieval don’t apply in the same way; instead it’s write volume.
  51. Workload parameters and DB types at data scale (Matrix mapping database types against workload parameters. Rows: Standard RDBMS, Parallel RDBMS, NoSQL (kv, dht, obj), Hadoop*, Streaming database. Columns: write-biased, read-biased, updateable data, eventual consistency ok?, unpredictable query path, compute intensive.) You see the problem: it’s an intersection of multiple dimensions.
  52. Problem: Architecture Can Define Options
  53. A general rule for the read-write axes As workloads increase in both intensity and complexity, we move into a realm of specialized databases adapted to specific workloads. (Chart: read intensity vs. write intensity, with OldSQL, NewSQL and NoSQL occupying different regions.)
  54. In general… Relational row store databases for conventionally tooled low to mid-scale OLTP Relational databases for ACID requirements Parallel databases (row or column) for unpredictable or variable query workloads Specialized databases for complex data query workloads NoSQL (KVS, DHT) for high scale OLTP NoSQL (KVS, DHT) for low latency read-mostly data access Parallel databases (row or column) for analytic workloads over tabular data NoSQL / Hadoop for batch analytic workloads over large data volumes
  55. How To Select A Database
  56. How To Select A Database - (1) 1. What are the data management requirements and policies (if any) in respect of: - Data security (including regulatory requirements)? - Data cleansing? - Data governance? - Deployment of solutions in the cloud? - If a deployment environment is mandated, what are its technical characteristics and limitations? Best of breed, no standards for anything, “polyglot persistence” = silos on steroids, data integration challenges, shifting data movement architectures 2. What kind of data will be stored and used? - Is it structured or unstructured? - Is it likely to be one big table or many tables?
  57. How To Select A Database - (2) 3. What are the data volumes expected to be? - What is the expected daily ingest rate? - What will the data retention/archiving policy be? - How big do we expect the database to grow to? (estimate a range). 4. What are the applications that will use the database? - Estimate by user numbers and transaction numbers - Roughly classify transactions as OLTP, short query, long query, long query with analytics. - What are the expectations in respect of growth of usage (per user) and growth of user population? 5. What are the expected service levels? - Classify according to availability service levels - Classify according to response time service levels - Classify on throughput where appropriate
  58. How To Select A Database - (3) 6. What is the budget for this project and what does that cover? 7. What is the outline project plan? - Timescales - Delivery of benefits - When are costs incurred? 8. Who will make up the project team? - Internal staff - External consultants - Vendor consultants 9. What is the policy in respect of external support, possibly including vendor consultancy for the early stages of the project?
  59. How To Select A Database - (4) 10.What are the business benefits? - Which ones can be quantified financially? - Which ones can only be guessed at (financially)? - Are there opportunity costs?
  60. A random selection of databases Sybase IQ, ASE EnterpriseDB Algebraix Teradata, Aster Data LucidDB Intersystems Caché Oracle, RAC Vectorwise Streambase Microsoft SQLServer, PDW MonetDB SQLStream IBM DB2s, Netezza Exasol Coral8 Paraccel Illuminate Ingres Kognitio Vertica Postgres EMC/Greenplum InfiniDB Cassandra Oracle Exadata 1010 Data CouchDB SAP HANA SAND Mongo Infobright Endeca Hbase MySQL Xtreme Data Redis MarkLogic IMS RainStor Tokyo Cabinet Hive Scalaris And a few hundred more…
  61. Product Selection Preliminary investigation. Short-list (usually arrived at by elimination); be sure to set the goals and control the process. Evaluation by technical analysis and modeling. Evaluation by proof of concept; do not be afraid to change your mind. Negotiation.
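One way to keep the short-listing and evaluation steps controlled is an explicit weighted scoring matrix. The Python sketch below is purely illustrative; the criteria, weights and scores are placeholders, not recommendations from the report.

```python
# Hypothetical weighted scoring matrix for a database short-list.
# Criteria, weights and raw scores are illustrative only.
weights = {"workload fit": 0.35, "scalability": 0.25,
           "cost": 0.20, "operability": 0.20}

candidates = {
    "Product A": {"workload fit": 8, "scalability": 6, "cost": 7, "operability": 5},
    "Product B": {"workload fit": 6, "scalability": 9, "cost": 5, "operability": 7},
}

def weighted_score(scores):
    # Sum of (criterion weight x raw score); the weights sum to 1.0.
    return sum(weights[c] * s for c, s in scores.items())

# Rank the short-list by total score, highest first.
for name, scores in sorted(candidates.items(),
                           key=lambda kv: weighted_score(kv[1]), reverse=True):
    print(f"{name}: {weighted_score(scores):.2f}")
```

Making the weights explicit before scoring is what keeps the process under control: it forces the goals to be set up front rather than adjusted to fit a favorite product.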
  62. Conclusion Wherein all is revealed, or ignorance exposed
  63. Thank You For Your Attention
