O slideshow foi denunciado.
Seu SlideShare está sendo baixado. ×

Don't follow the followers

Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio

Confira estes a seguir

1 de 99 Anúncio

Don't follow the followers

Baixar para ler offline

The problems of scale, speed, persistence and context are the most important design problem we'll have to deal with during the next decade. Scale because we're creating and recording more data than at any time in human history – much of it of dubious value, but none of it obviously value-less. Speed because data flows now. Ceaselessly. In high volume. It has to be persisted in multiple latencies, from milliseconds to decades. And context because the context of creation is different from the context of transmission is different from the context of use.
There are a lot of red herrings, false premises and just-plain-dementia that get in the way of us seeing the problem clearly. We must work through what we mean by "structured" and "unstructured", what we mean by “big data” and why we need new technologies to solve some of our data problems. But “new technologies” doesn’t mean reinventing old technologies while ignoring the lessons of the past. There are reasons relational databases survived while hierarchical, document and object databases were market failures, technologies that may be poised to fail again, 20 years later.

What we believe about data’s structure, schema, and semantics are as important as the NoSQL and relational databases we use. The technologies impose constraints on the real problem: how we make sense of data in order to tell a computer what to do, or to inform human decisions. Most discussions of data and code lose the unconscious tradeoffs that are made when selecting these technology handcuffs.

The problems of scale, speed, persistence and context are the most important design problem we'll have to deal with during the next decade. Scale because we're creating and recording more data than at any time in human history – much of it of dubious value, but none of it obviously value-less. Speed because data flows now. Ceaselessly. In high volume. It has to be persisted in multiple latencies, from milliseconds to decades. And context because the context of creation is different from the context of transmission is different from the context of use.
There are a lot of red herrings, false premises and just-plain-dementia that get in the way of us seeing the problem clearly. We must work through what we mean by "structured" and "unstructured", what we mean by “big data” and why we need new technologies to solve some of our data problems. But “new technologies” doesn’t mean reinventing old technologies while ignoring the lessons of the past. There are reasons relational databases survived while hierarchical, document and object databases were market failures, technologies that may be poised to fail again, 20 years later.

What we believe about data’s structure, schema, and semantics are as important as the NoSQL and relational databases we use. The technologies impose constraints on the real problem: how we make sense of data in order to tell a computer what to do, or to inform human decisions. Most discussions of data and code lose the unconscious tradeoffs that are made when selecting these technology handcuffs.

Anúncio
Anúncio

Mais Conteúdo rRelacionado

Quem viu também gostou (20)

Semelhante a Don't follow the followers (20)

Anúncio

Mais de mark madsen (20)

Mais recentes (20)

Anúncio

Don't follow the followers

  1. 1. Copyright  Third  Nature,  Inc.   Following  Google   Or   Don’t  Follow  the  Followers,   Follow  the  Leaders   Or   The  problem  probably  isn’t  the   database,  the  problem  is   probably  you   May,  2014   Mark  Madsen   www.ThirdNature.net   @markmadsen  
  2. 2. Copyright  Third  Nature,  Inc.    A  Quick  Intro   I  may  or  may  not  be   qualified  to  make  the   outlandish  statements  in   this  presentaDon.  I  have,   however,  made  almost   every  mistake  menDoned,   and  one  learns  by  making   mistakes.   You  might  say  that’s  my   singular  skill.  
  3. 3. Copyright  Third  Nature,  Inc.   A  BRIEF  HISTORY  OF  DATA   STORAGE  AND  RETRIEVAL   History  isn’t  taught  in  most  university  science  curricula     (probably  because  it’s  a  rabbit  hole)    
  4. 4. Copyright  Third  Nature,  Inc.   Databases:  the  problem  statements  over  Gme   “InformaDon  has  become  a  form  of  garbage,  not  only   incapable  of  answering  the  most  fundamental  human   quesDons  but  barely  useful  in  providing  coherent   direcDon  to  the  soluDon  of  even  mundane  problems.”  –     Neil  Postman,  1985   "We  have  reason  to  fear  that  the  mulDtude  of  books   which  grows  every  day  in  a  prodigious  fashion  will  make   the  following  centuries  fall  into  a  state  as  barbarous  as   that  of  the  centuries  that  followed  the  fall  of  the  Roman   Empire.”  –  Adrien  Baillet,  1685   ”…so  many  books  that  we  do  not  even  have  Dme  to  read   the  Dtles."  –  Anton  Francesco  Doni,  1550  
  5. 5. Copyright  Third  Nature,  Inc.   The  origin  of  informaGon  management  problems   For  ~5000  years  we  used   counters  of  various  types,   eventually  developing   wriDng  to  cope  with   civilizaDon’s  needs.   WriDng  is  more  efficient   than  counters  you  can  lose.   Sumerian  bulla  envelope  with  tokens.   The  beta  period.  
  6. 6. Copyright  Third  Nature,  Inc.   InformaGon  Technology  v1.0:  Clay  Tech,  ~3000  bce   The  first  informaGon  explosion  
  7. 7. Copyright  Third  Nature,  Inc.   That  explosion  led  to  the  first  metadata   “tags” Small piles in baskets are easy to tag and search
  8. 8. Copyright  Third  Nature,  Inc.   Metadata  v1.0:  tablets  about  tablets   When  there  are  enough  of   these  lying  around  you  need  to   work  on  organizaDon  of  the   collecDon  by  categorizaDons,   aka  “taxonomy”,  “schema”   Like  working  out  what  tables   are  in  a  database,  or  what  files   are  stored  in  HDFS.   Babylonian  library  catalog  ~2000bce  
  9. 9. Copyright  Third  Nature,  Inc.   Metadata  v1.1:  tablets  about  what’s  in  tablets   When literacy rates are higher and people need to communicate more effectively, you need to invent mechanisms to cope, like dictionaries. Now we’re worried about what’s inside the documents, not where they are placed. Synonym list, Ashurbanipal, ~900 bce
  10. 10. Copyright  Third  Nature,  Inc.   Clay  Tech  has  some  familiar  limitaGons  
  11. 11. Copyright  Third  Nature,  Inc.   InformaGon  Management  v2.0  Paper  Tech*   Lighter,  denser,  faster  storage  media  
  12. 12. Copyright  Third  Nature,  Inc.   More  informaGon  =  need  for  new  metadata   techniques:  content  tagging,  author  catalogs   The  first  real  library  ~300BC  
  13. 13. Copyright  Third  Nature,  Inc.   Discovery  of  one  tradeoff  between  clay  and  paper…   Recorded  informaDon  creates  permanence  and  instability  
  14. 14. Copyright  Third  Nature,  Inc.   You  can’t  have  disconGnuous  reading*  unGl  you   have  a  random  access  technology.   *Indexing  and  encyclopedias  are  hard  in  linear  scrolls.  Hello  ISAM  
  15. 15. Copyright  Third  Nature,  Inc.   Paper  Tech  v2.1:  increased  storage  density,  smaller   form  factor,  durability,  high  res  RGB  graphics  
  16. 16. Copyright  Third  Nature,  Inc.   Paper  Tech  v2.2   The  change  in  prinDng   over  Dme  accelerates.   Block  prinDng  replaced  by   movable  metal  type.   The  job  of  producDon  is   faster  and  cheaper.   CommodiDzaDon  changes   the  landscape  over  the   next  200  years.   The  printed  becomes   more  important  than  the   printer.  
  17. 17. Copyright  Third  Nature,  Inc.   The  Elizabethan  Era   ProducDon:  prinDng  presses   Data  management  tech:   ▪  Perfect  copies   ▪  Topical  catalogs   ▪  Font  standardizaDon   ▪  Taxonomy  ascends   InformaDon  explosion:     ▪  8M  books  in  1500   ▪  200M  by  1600   ▪  CommodiDzaDon   ▪  Overload  
  18. 18. Copyright  Third  Nature,  Inc.   Be^er  embedded  metadata:  Gtle  page,  colophon,  ToC  
  19. 19. The  Georgian  Era:  The  Explosion  of  Natural  Philosophy   Sharing knowledge in a larger community required common language, structure
  20. 20. Copyright  Third  Nature,  Inc.   Linnaeus   Top  down  orientaDon   StaDc  structure   DescripDve  rather  than   explanatory   Taxonomic  classificaHon  
  21. 21. Copyright  Third  Nature,  Inc.   Bobom  up  orientaDon   Flexible  structure   Explanatory,  descripDve   Faceted  classificaHon   Buffon  
  22. 22. Copyright  Third  Nature,  Inc.   vs   vs   SQL  NoSQL   Think about why that happened
  23. 23. Copyright  Third  Nature,  Inc.   The  Victorian  Era   The  powered  prinDng   informaDon  explosion:   ▪  Card  catalogs,  cross-­‐ referencing,  random   access  metadata   ▪  Universal  classificaDon   ▪  Extended  informaDon   management  debates   ▪  Trading  effort  and   flexibility  for  storage   and  retrieval   ▪  Stereotyping  
  24. 24. Copyright  Third  Nature,  Inc.   Melvil  Dewey   Dewey  Decimal  System   Top  down  orientaDon   StaDc  structure   DescripDve  rather  than   explanatory   Taxonomic  classificaHon  
  25. 25. Copyright  Third  Nature,  Inc.   Cuber  Expansive   ClassificaDon  System   (~1882)   Bobom  up  orientaDon   More  flexible  structure   Explanatory,  descripDve   Faceted  classificaHon   Charles  Ammi  Cu^er  
  26. 26. Copyright  Third  Nature,  Inc.   vs   SQL  NoSQL  
  27. 27. Copyright  Third  Nature,  Inc.   So  why  did  Linnaeus  and  Dewey  win?   Good  enough   wins  the  day    PragmaJsm   It  wasn’t  solving   the  problem  you   thought  it  was.    
  28. 28. Copyright  Third  Nature,  Inc.   Summarizing   Thousands  of  years  of  thought  have  been  put  into   principles  of  organizaDon  and  use.  The  abstract    paberns   are  the  same,  only  the  implementaDon  changed.   ▪  Clay:  tablets  about  tablets,  tablets  about  what’s  in  tablets,   100X  increase  in  data  density  over  counDng  tech   ▪  Scrolls:  scrolls  about  scrolls,  scrolls  about  what’s  in  scrolls,   prepended/appended  navigaDon,  >100X  increase  in  density   ▪  Books:  books  about  books,  books  about  what’s  in  books,   embedded  internal  navigaDon,  >1000X  increase  in  density   ▪  DigiDzed  data:  similar,  far  denser,  and  different  because  it   isn’t  locked  into  physical  forms  
  29. 29. Copyright  Third  Nature,  Inc.   History  is  always  the  same   Every  technology  is  a  trade:   ▪  Top  down  vs.  bobom  up   ▪  Authority  vs.  anarchy   ▪  Bureaucracy  vs.  autonomy   ▪  Control  vs.  creaDvity   ▪  Hierarchy  vs.  network   ▪  Dynamic  vs.  staDc   ▪  Power  vs.  ease   ▪  Work  up  front  vs.  postponed   In every choice, something is lost and something is gained.
  30. 30. Copyright  Third  Nature,  Inc.   What  lessons  does  this  history  teach  us?   1.  InformaDon  requires  organizing  principles.   2.  Differences  in  scale  require  different  principles.   3.  There  are  mulDple  levels  of  informaDon   architecture  and  principles  of  organizaDon.     4.  At  a  key  point  in  the  adopDon  cycle,  emphasis   shins  from  collecDon  and  management  of   informaDon  to  disseminaDon  and  use.    First  we  record,  then  we  use  and  share.   Like  transacDon  processing,  query  &  analysis.  
  31. 31. Copyright  Third  Nature,  Inc.   InformaGon  management  through  human  history   always  follows  the  same  pa^ern   New  technology  development   creates   New  methods  to  cope   creates   New  informaGon  scale  and  availability   creates…  
  32. 32. Copyright  Third  Nature,  Inc.   Big  Data  
  33. 33. Copyright  Third  Nature,  Inc.   DEALING  WITH  BIG:   SOME  SCALING  HISTORY   "The  most  amazing  achievement  of  the  computer   sonware  industry  is  its  conDnuing  cancellaDon  of   the  steady  and  staggering  gains  made  by  the   computer  hardware  industry.“  -­‐  Henry  Peteroski  
  34. 34. Copyright  Third  Nature,  Inc.   Why  doesn’t  your  database  scale?  
  35. 35. Copyright  Third  Nature,  Inc.   Hipster  bullshit   I  can’t  get  MySQL  to  scale   therefore   RelaDonal  databases  don’t  scale   therefore   We  must  use  NoSQL*  for  query  too   *including  Hadoop  and  related  
  36. 36. Copyright  Third  Nature,  Inc.   GeneraGon  and  collecGon  always  come  first   UnDl  there’s  enough  informaDon,  general  problems  of   organizaDon  and  informaDon  architecture  don’t  appear.  
  37. 37. Copyright  Third  Nature,  Inc.   The  early  days:  client/server  as  the  starGng  point   We  had  transacDon  processing  against  the  DB  all  on   the  same  machine.  Then  on  two  separate  machines.   Application
  38. 38. Copyright  Third  Nature,  Inc.   Scaling  client/server   We  added  app  servers  and  pooled  connecDons.   Client App serverClient Client
  39. 39. Copyright  Third  Nature,  Inc.   Scaling  client/server   Then  threw  money  at  the  problem  in  the  form  of   hardware  (made  the  database  bigger).   Client App serverClient Client
  40. 40. Copyright  Third  Nature,  Inc.   Web  apps  were  a  huge  increase  in  concurrency   Architecture  changed  to  reflect  new  stateless  model.   We  had  scalability  and  availability  problems.   webapp webapp Loadbalancer Loadbalancer service service service
  41. 41. Copyright  Third  Nature,  Inc.   Increasing  traffic?   Keep  adding  hardware,  make  the  DB  bigger.   Limits  reached,  performance,  scalability  and   availability  problems.   webapp webapp Loadbalancer Loadbalancer service service service
  42. 42. Copyright  Third  Nature,  Inc.   Increasing  traffic?   Read-­‐only  replicas  will  save  the  day!   SDll  have  scalability  and  availability  problems.   And  now  operaDonal  overhead  and  problems.   webapp webapp Loadbalancer Loadbalancer service service service
  43. 43. Copyright  Third  Nature,  Inc.   Increasing  traffic?   Sharding  seems  a  fine  thing.  But  it’s  one  leber  from…   Scaling  and  perf  beber,  overhead  and  operaDonal   complexity  high  and  worsening.     webapp webapp Loadbalancer Loadbalancer service service service Loadbalancer
  44. 44. Copyright  Third  Nature,  Inc.   Increasing  traffic?   Let’s  cache  data  at  the  service  Der!   Performance  beber,  overhead  and  operaDonal   complexity  higher.     webapp webapp Loadbalancer Loadbalancer service service service Loadbalancer cachecachecache
  45. 45. Copyright  Third  Nature,  Inc.   What  are  the  problems  now?   1.  More  hardware,  more  things  to  break   2.  More  management  and  administraDon   3.  More  sonware  complexity   4.  Increasing  distance  for  data  to  travel  =  latency   5.  Data  administraDon  difficult  to  impossible  
  46. 46. Copyright  Third  Nature,  Inc.   Problem  solved?   Distributed  NoSQL  DB  (handles  cache,  load  balance,   data  distribuDon).  Similar  performance,  simpler   scaling,  reduced  operaDonal  problems,  simpler   applicaDon  architecture.  Finished!     webapp webapp Loadbalancer Loadbalancer service service service
  47. 47. Copyright  Third  Nature,  Inc.   TANSTAAFL   When  replacing  the  old   with  the  new  (or  ignoring   the  new  over  the  old)  you   always  make  tradeoffs,   and  usually  you  won’t  see   them  for  a  long  Dme.   Technologies  are  not   perfect  replacements  for   one  another.  Onen  not   beber,  only  different.  
  48. 48. Copyright  Third  Nature,  Inc.   Not  finished:  remember  the  cycle  of  history…   The  biggest  hole  in  the  prior  secDon  on  scaling  is   that  we  scaled  OLTP,  what  about  OLAP?   Queries  <>  transacDons.  
  49. 49. Copyright  Third  Nature,  Inc.   Solving  query  problems   Aggregate  or  low  selecDvity  queries  were  a  problem   early  on,  when  people  wanted  to  use  the  data.   Every  report  or  query  is  a  program.   OLTP Report app Report app
  50. 50. Copyright  Third  Nature,  Inc.   Increasing  data  volume   Make  it  faster  by  throwing  money  at  hardware   (sound  familiar?)   OLTP Report app Report app
  51. 51. Copyright  Third  Nature,  Inc.   Increasing  data  volume   Replicas:  split  the  workload  and  tune  the  systems   based  on  their  workload.   Report app OLTP Report app
  52. 52. Copyright  Third  Nature,  Inc.   Increasing  data  volume  breaks  the  old  model   Devise  a  new  architecture.     ReschemaDze  the  database,  eliminate  cyclic  joins,   selecDve  denormalizaDon,  query  generators.  But  it   takes  bulk  processing  to  reschemaDze  the  data.   Query tools OLTP OLTP OLTP ETL
  53. 53. Copyright  Third  Nature,  Inc.   Increasing  data  volume   Improve  response  Dme  with  caching  in  the  query   tools,  and  by  using  MOLAP  tools  that  map  into  cache   or  memory.  Like  mainframe  reporDng.  Or…   Query tools OLTP OLTP OLTP ETL cache Query tools cache
  54. 54. Copyright  Third  Nature,  Inc.   Increasing  data  volume   Parallel  processing  for  ETL.  Distributed  query   databases  for  fine  grained  high  volume  parallelism.   ETLETLETL OLTP OLTP OLTP Query tools cache Query tools cache
  55. 55. Copyright  Third  Nature,  Inc.   The  architecture  looks  familiar   Two  workloads,  two  not  dissimilar  architectures:   ▪  Load-­‐balanced  front  ends   ▪  Distributed  caching  layers   ▪  Scalable  distributed  parallel  databases   But  the  nature  of  the  OLTP  and  OLAP  workloads  is   very  different.  Forcing  them  into  one  plaworm  is   almost  impossible  for  data  architecture  reasons   and  parDcularly  at  scale*  
  56. 56. Copyright  Third  Nature,  Inc.   DATA  PERSISTENCE  AND  STORES   Why  would  digital  data  be  any  different  than  clay  or   scrolls  or  books?  
  57. 57. Copyright  Third  Nature,  Inc.   “Big  data  is  unprecedented.”   -­‐  Anyone  involved  with  big  data  in  even  the   most  barely  percepHble  way  
  58. 58. Copyright  Third  Nature,  Inc.   There’s  a  difference   between  having  no  past   and  acDvely  rejecDng  it.  
  59. 59. Copyright  Third  Nature,  Inc.   MultiValue, Hierarchical PICK IMS IDS ADABAS Relational CODASYL System R (SEQUEL) SQL/DS INGRES (QUEL) Mimer Oracle RDBMS, SQL standard DB2 Teradata Informix Sybase Postgres OODBMS, ORDBMS Versant Objectivity Gemstone Informix* Oracle* MPP Query, NoSQL Netezza Paraccel Vertica MongoDB CouchBase Riak Cassandra NewSQL SciDB MonetDB NuoDB CitusDB 1960s 1980s 2000s 1970s 1990s 2010s Spanner F1
  60. 60. Copyright  Third  Nature,  Inc.   A  history  of  databases  in  No-­‐taGon   1970:  NoSQL  =  We  have  no  SQL     1980:  NoSQL  =  Know  SQL     2000:  NoSQL  =  No  SQL!     2005:  NoSQL  =  Not  only  SQL     2013:  NoSQL  =  No,  SQL!   (R)DB(MS)  
  61. 61. Copyright  Third  Nature,  Inc.   Tradeoffs?   “Query  opHmizaHon   is  not  rocket  science.   When  you  flunk  out  of   query  opHmizaHon,   we  make  you  go  build   rockets.”  
  62. 62. Copyright  Third  Nature,  Inc.   Joins.  Wait,  there’s  more  than  one?   Wait, there’s more than one?
  63. 63. Copyright  Third  Nature,  Inc.   In  NoSQL  Land,  OpGmizer  is  You!  
  64. 64. Copyright  Third  Nature,  Inc.   Services  provided   Standard  API/query  layer*   TransacDon  /  consistency   Query  opDmizaDon   Data  navigaDon,  joins   Data  access   Storage  management   Database Database Tradeoffs:  In  NoSQL,  the  DBMS  is  You  Too   SQL  database   NoSQL  database   Application Application Anything  not  done  by  the  DB  becomes  a  developer’s  task.  
  65. 65. Copyright  Third  Nature,  Inc.   The  relaDonal  database  is  the  franchise  technology  for  storing  and   retrieving  data,  but…   1.  Global,  staDc  schema  model   2.  No  rich  typing  system   3.  No  management  of  natural  ordering  in  data   4.  Many  are  not  a  good  fit  for  network  parallel  compuDng,  aka  cloud   5.  Limited  API  in  atomic  SQL  statement  syntax    &  simple  result  set  return   6.  Poor  developer  support  (in  languages,  in  IDEs,  in  processes)   RelaGonal:  a  good  conceptual  model,  but  a   prematurely  standardized  implementaGon  
  66. 66. Copyright  Third  Nature,  Inc.   The  relaDonal  database  is  the  franchise  technology  for  storing  and   retrieving  data,  but…   1.  Global,  staDc  schema  model   2.  No  rich  typing  system   3.  No  management  of  natural  ordering  in  data   4.  Many  are  not  a  good  fit  for  network  parallel  compuDng,  aka  cloud   5.  Limited  API  in  atomic  SQL  statement  syntax    &  simple  result  set  return   6.  Poor  developer  support   RelaGonal:  a  good  conceptual  model,  but  a   prematurely  standardized  implementaGon   What  I  did  not  list:   Scalability  and  performance  
  67. 67. Copyright  Third  Nature,  Inc.   BIGNESS  AND  SCALABILITY  
  68. 68. Copyright  Third  Nature,  Inc.   Technology  Capability  and  Data  Volume   Source: Noumenal, Inc.
  69. 69. Copyright  Third  Nature,  Inc.   You  can  make  a  database  emulate  a  KVS   If  you  map  the  shared  event  fields  to  fixed  columns  and   an  event  type,  then  use  a  varchar  or  clob  payload   column,  you  can  store  arbitrary  events  in  a  database  and   query  them  via  SQL  and  views  (or  column  funcDons),  and   do  it  all  in  a  single  table.   Arbitrary data parsed at query time using native DB features like regex or UDF Data common to all events in the database Type code used to differentiate event payload formats.
  70. 70. Data  Warehouse  +   Behavioral   Singularity Data  Warehouse   Low End Enterprise-class System Contextual-Complex Analytics Deep, Seasonal, Consumable Data Sets Production Data Warehousing Large Concurrent User-base + + 150+ concurrent users 500+ concurrent users Enterprise-class System 5-10 concurrent users Structure the Unstructured Detect Patterns Commodity Hardware System 6+PB 40+PB 20+PB Analyze  &  Report   Discover  &  Explore   Data  Plagorms   Thanks to eBay for these case slides.
  71. 71. Start_dt   Guid   Sess_id   Page_id   Soj   2011-­‐10-­‐18   1234   1   15   Language=en&   source=hp&   itm=i1,i2,i3,i4,i5   SELECT start_dt, guid, sess_id, page_id, NVL(e.soj, ‘itm’) AS item_list FROM event e WHERE e.start_dt = ‘2011-10-18’ AND e.page_id = 3286 /* Search Results */ Start_dt   Guid   Sess_id   Page_id   Item_list   2011-­‐10-­‐18   1234   1   15   i1,i2,i3,i4,i5   How  to  use  a  database  for  semi-­‐structured  data   First the user-defined function (UDF) Thanks to eBay for these case slides.
  72. 72. WITH event (start_dt, item_list) AS (<previous SQL>) SELECT start_dt, item_id, /* Individual Item */ count(*) FROM TABLE ( /* Normalize comma delimited list */ normalize_list( start_dt, item_list, ‘,’) RETURNS(start_dt, idx, item_id) ) GROUP BY 1, 2 ORDER BY 3 DESC Start_dt   Item_id   Count(*)   2011-­‐10-­‐18   i1   555   2011-­‐10-­‐18   i2   444   2011-­‐10-­‐18   i3   333   2011-­‐10-­‐18   i4   222   2011-­‐10-­‐18   i5   111   *syntax simplified How  to  use  a  database  for  semi-­‐structured  data   Then the table function (standard ANSI SQL) Thanks to eBay for these case slides.
  73. 73. Copyright  Third  Nature,  Inc.   Pricing  and  performance:  Hadoop  is  a  storage   and  processing  play,  not    a  database  play*   Source: Venturebeat With  big  data  systems,   the  cost  of  storing  data  is   an  order  of  magnitude   lower  than  with   databases  today  (but  not   the  cost  or  ability  to   query  it  back  out).   Processing  data  at  scale   is  at  least  an  order  of   magnitude  cheaper  too.   Copyright  Third  Nature,  Inc.  
  74. 74. Copyright  Third  Nature,  Inc.   Bigness:  most  people  do  not  need  special  technology  Numberofpeople The distribution of data size is about normal, yet these guys set the tone of the market today. Bigness of data Copyright  Third  Nature,  Inc.  
  75. 75. Copyright  Third  Nature,  Inc.   AnalyGcs:  This  is  really  raw  data  under  storage  Numberofjobs Microsoft study of 174,000 analytic jobs in their cluster: median size ??? Bigness of data Copyright  Third  Nature,  Inc.  
  76. 76. Copyright  Third  Nature,  Inc.   Working  data  for  analyGcs  most  olen  not  big  Numberofjobs 14 GB Smallness of data Copyright  Third  Nature,  Inc.  
  77. 77. © Third Nature Inc.© Third Nature Inc. What  makes  data  “big”?   Very  large  amounts   Hierarchical  structures   Nested  structures   Encoded  values   Non-­‐standard  (for  a  database)   types   Time  series   Deep  structure   Human  authored  text   “big”  is  beMer  off  being  defined  as  “complex”  or  “hard  to  manage”   Copyright  Third  Nature,  Inc.  
  78. 78. Copyright  Third  Nature,  Inc.   Web  tracking  data  has  a  nested  structure   USER_ID 301212631165031 SESSION_ID 590387153892659 VISIT_DATE 1/10/2010  0:00 SESSION_START_DATE 1:41:44  AM PAGE_VIEW_DATE 1/10/2010  9:59 DESTINATION_URL hbps://www.phisherking.com/gins/store/LogonForm? mmc=link-­‐src-­‐email-­‐_-­‐m100109-­‐_-­‐44IOJ1-­‐_-­‐ shop&langId=-­‐1&storeId=1055&URL=BECGinListItemDisplay REFERRAL_NAME Direct REFERRAL_URL -­‐ PAGE_ID PROD_24259_CARD REL_PRODUCTS PROD_24654_CARD,  PROD_3648_FLOWERS SITE_LOCATION_NAME VALENTINE'S  DAY  MICROSITE SITE_LOCATION_ID SHOP-­‐BY-­‐HOLIDAY  VALENTINES  DAY IP_ADDRESS 67.189.110.179 BROWSER_OS_NAME MOZILLA/4.0  (COMPATIBLE;  MSIE  7.0;  AOL  9.0;  WINDOWS  NT   5.1;  TRIDENT/4.0;  GTB6;  .NET  CLR  1.1.4322) “unstructured” data embedded in the logged message: complex strings …but it’s easily unpacked into tuples
  79. 79. © Third Nature Inc.© Third Nature Inc. Common  Names   ▪  Aspirin,  Acetylsalicylic  acid,  Excedrin   Structural  formulas   ▪  Commonly  used  to  communicate  between  chemists   SystemaDc  nomenclatures:   ▪  Mass  formula:  C9H8O4   ▪  SMILES:  OC(=O)C1=C(C=CC=C1)OC(=O)C   ▪  InChI:  1/C9H8O4/c1-­‐6(10)13-­‐8-­‐5-­‐3-­‐2-­‐4-­‐7(8)9(11)12/h2-­‐5H,1H3,(H, 11,12)   ▪  IUPAC:  pyrido[1",2":1',2']imidazo[4',5':5,6]pyrazino[2,3-­‐b]phenazine   They  all  refer  to  the  same  thing.   If  you  process  them  in  a  database,  how  do  you  store  them?   All  of  these  things  are  “unstructured”  data  
  80. 80. Copyright  Third  Nature,  Inc.   Unstructured  data  isn’t   really  unstructured.   The  problem  is  that  this   data  is  unmodeled.   The  real  challenge  is   complexity.   We  usually  mean  text,  not  data  structures…  
  81. 81. Copyright  Third  Nature,  Inc.   Pa^erns  emerge  from  lots  of  event  data   Paberns  emerge  from   the  underlying  structure   of  the  enHre  dataset.   The  paberns  are  more   interesDng  than  sums   and  counts  of  the  events.   Web  paths:  clicks  in  a   session  as  network  node   traversal.   Email:  traffic  analysis   producing  a  network   The event stream is a source for analysis, generating another set of data that is the source for different analysis.
  82. 82. Copyright  Third  Nature,  Inc.   BIGNESS  AND  DATA   COMPUTATIONAL  WORKLOADS  
  83. 83. Copyright  Third  Nature,  Inc.   Not  finished:  remember  the  cycle  of  history…   The  biggest  hole  in  the  prior  secDons  is  that  we   scaled  OLTP  and  OLAP  but  what  about  analyDcs?   Queries  <>  transacDons  <>  computaDons  
  84. 84. Copyright  Third  Nature,  Inc.   The  three  way  workload  break   1.  OperaDonal:  OLTP  systems   2.  AnalyDc:  OLAP  systems   3.  ScienDfic:  ComputaDonal  systems   Unit  of  focus:   1.  TransacDon   2.  Query   3.  ComputaDon   Different  problems  require  different  plaworms  
  85. 85. Copyright  Third  Nature,  Inc.   The  holy  grail  of  databases  under  current  market  hype   We’re  talking  mostly  about   computaGon  over  data  when   we  talk  about  “big  data”  and   analyGcs.   The  goal  is  combining  data   storage,  retrieval  and  analysis   into  one  system,  a  potenJal   mismatch  for  both  relaJonal   and  nosql. Copyright  Third  Nature,  Inc.  
  86. 86. Copyright  Third  Nature,  Inc.   A  Simple  Division  of  the  AnalyGc  Problem  Space  Computation LittleLots Data volume Little Lots Big analytics, little data Specialized computing, modeling problems: supercomputing, GPUs Big analytics, big data Complex math over large data volumes requires non- relational shared nothing architectures Little analytics, little data The entry point; SAS, SMP databases, even OLAP cubes can work Little analytics, big data The BI/DW space, for the most part, done in databases mostly
  87. 87. Copyright  Third  Nature,  Inc.   No  technical  soluGon  fits  all  three  axes   Big   ComputaGon   Parallel  DSLs,   HPC,  GPU,   MapReduce   Concurrency   Parallel  DBs,   nosql  stores*   Volume   Parallel  DBs,   MapReduce,   BSP     Some work along two axes, but most have a primary focus on one core component
  88. 88. Copyright  Third  Nature,  Inc.   Data  infrastructure  is  a  plagorm   ▪  Any  data  –  structures,  forms   ▪  Any  latency  –in  moDon,  at  rest   ▪  Any  process  –  query,  algorithm,  engine,   transformaDon   ▪  Any  access  –  SQL,  API,  queue,  file  movement  
  89. 89. Copyright  Third  Nature,  Inc.   Hadoop  &  NoSQL  AdopGon   Some  people  can’t  resist   ge„ng  the  next  new  thing   because  it’s  new.   Many  organizaDons  are  like   this,  promoDng  a  soluDon  and   hunDng  for  the  problem  that   matches  it.   Beber  to  ask  “What  is  the   problem  for  which  this   technology  is  the  answer?”  
  90. 90. Copyright  Third  Nature,  Inc.   NoSQL  Will  “Fail”   Unless  some  mathemaDcally  derived  data  model  is   developed,  and  a  query  language  using  it  is  created.   Otherwise  there’s  no  standard  interfacing  model,   no  interoperability,  no  chance  for  a  tool  ecosystem   to  evolve  over  top  of  the  plaworm  –  because  there   are  no  uniform  plaworm  boundaries.   “One  logical  interface,  many  physical   implementaDons”  is  a  key  reason  why  SQL  won  the   database  wars.  This  creates  an  ecosystem.  
  91. 91. Copyright  Third  Nature,  Inc.   Maybe…   These aren’t the databases we’re looking for. We need one that speaks pig latin.
  92. 92. Copyright  Third  Nature,  Inc.   What’s  wrong  with  BASE?  
  93. 93. Copyright  Third  Nature,  Inc.   Google  F1:  Another  EvoluGon   Distributed  SQL  database   ACID  compliance,  2PC  and  row-­‐ level  locking  (!)   Transparent  data  distribuDon   Synchronous  replicaDon  across   data  centers   Table  interleaving  (hierarchies)   Queryable  protobufs   MapReduce  access  to  underlying   data   Average  user-­‐facing  latency  of   ~200ms  with  small  deviaDon  
  94. 94. Copyright  Third  Nature,  Inc.   The  big  data  revoluGon,  more  of  an  evoluGon   Be  pragmaGc,  not  dogmaGc  
  95. 95. Copyright  Third  Nature,  Inc.   Conclusion  
  96. 96. Copyright  Third  Nature,  Inc.   References  (things  worth  reading  on  the  way  home)  A  relaDonal  model  for  large  shared  data  banks,  CommunicaDons  of  the  ACM,  June,  1970,   hbp://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf   Column-­‐Oriented  Database  Systems,  Stavros  Harizopoulos,  Daniel  Abadi,  Peter  Boncz,  VLDB  2009  Tutorial   hbp://cs-­‐www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf   Nobody  ever  got  fired  for  using  Hadoop  on  a  cluster,  1st  InternaDonal  Workshop  on  Hot  Topics  in  Cloud  Data   ProcessingApril  10,  2012,  Bern,  Switzerland.   A  co-­‐RelaDonal  Model  of  Data  for  Large  Shared  Data  Banks,  ACM  Queue,  2012,   hbp://queue.acm.org/detail.cfm?id=1961297   A  query  language  for  mulDdimensional  arrays:  design,  implementaDon  and  opDmizaDon  techniques,  SIGMOD,   1996   ProbabilisDcally  Bounded  Staleness  for  PracDcal  ParDal  Quorums,  Proceedings  of  the  VLDB  Endowment,  Vol.  5,   No.  8,  hbp://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf   “Amorphous  Data-­‐parallelism  in  Irregular  Algorithms”,  Keshav  Pingali  et  al   MapReduce:  Simplified  Data  Processing  on  Large  Clusters,   hbp://staDc.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/ mapreduce-­‐osdi04.pdf   Dremel:  InteracDve  Analysis  of  Web-­‐Scale  Datasets,  Proceedings  of  the  VLDB  Endowment,  Vol.  3,  No.  1,  2010   hbp://staDc.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/ 36632.pdf   Spanner:  Google’s  Globally-­‐Distributed  Database,  SIGMOD,  May,  2012,   hbp://staDc.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/es//archive/ spanner-­‐osdi2012.pdf   F1:  A  Distributed  SQL  Database  That  Scales,  Proceedings  of  the  VLDB  Endowment,  Vol.  6,  No.  11,  2013,   hbp://staDc.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/ archive/41344.pdf  
  97. 97. Copyright  Third  Nature,  Inc.   CC  Image  A^ribuGons   Thanks  to  the  people  who  supplied  the  creaDve  commons  licensed  images  used  in  this  presentaDon:   shady_puppy_sales.jpg  -­‐  hbp://www.flickr.com/photos/brizzlebornandbred/5001120150   cuneiform_proto_3000bc.jpg  -­‐  hbp://www.flickr.com/photos/takomabibelot/3124619443/   cuneiform_undo.jpg  -­‐  hbp://www.flickr.com/photos/charlesDlford/2552654321/   scroll_kerouac.jpg  -­‐  hbp://www.flickr.com/photos/ari/93966538/   House  on  fire  -­‐  hbp://flickr.com/photos/oldonliner/1485881035/   Manuscripts  on  shelf  -­‐  hbp://flickr.com/photos/peterkaminski/1688635/   manuscript_illum.jpg  -­‐  hbp://www.flickr.com/photos/diorama_sky/2975796332/   manuscript_page.jpg  -­‐  hbp://www.flickr.com/photos/calliope/306564541/   subway  dc  metro    -­‐  hbp://flickr.com/photos/musaeum/509899161/  
  98. 98. Copyright  Third  Nature,  Inc.   About  the  Presenter   Mark  Madsen  is  president  of  Third   Nature,  a  research  and  consulDng  firm   focused  on  building  the  infrastructure  for   analyDcs,  evidence-­‐based  management,   business  intelligence  and  data   management.  Mark  is  an  award-­‐winning   author,  architect  and  CTO  whose  work   has  been  featured  in  numerous  industry   publicaDons.  Over  the  past  ten  years   Mark  received  awards  for  his  work  from   the  American  ProducDvity  &  Quality   Center,  TDWI,  and  the  Smithsonian   InsDtute.  He  is  an  internaDonal  speaker,  a   contributor  at  Forbes  Online  and   InformaDon  Management.  For  more   informaDon  or  to  contact  Mark,  follow   @markmadsen  on  Twiber  or  visit    hbp:// ThirdNature.net    
  99. 99. Copyright  Third  Nature,  Inc.   About  Third  Nature   Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, analytics and performance management. If your question is related to BI, analytics, information strategy and data then you‘re at the right place. Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and hw it is applied rather than vendor market positions.

×