The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases


Data comes in all forms and shapes. Data also evolves as life and people adapt to new situations, and so should your database.

When working with data, traditional relational database systems come to mind because that is how most of us have been trained. However, data is rarely homogeneous, and your database should not force you into a certain schema if your data is not relational.

During this talk we analyse the composition of "documents" in the context of a document-based database, and cover the basic principles of Map-Reduce and its potential use in the context of computational statistics.

What happens when your data no longer fits on one server? How easily can your favourite database expand and adapt to your growing requirements? What is your contingency plan if your server goes down?

We then go over some of the features that CouchDB, Riak and MongoDB provide you with, alongside some of David's personal opinions.

This is an intermediate talk. Listeners should have a working knowledge of Bayesian statistics and of standard internet protocols such as HTTP, and a minimal understanding of programming languages such as JavaScript and Erlang, since some of the examples for these databases use those languages.


  1. The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases (Wednesday 27 March 13)
  2. David Coallier, @davidcoallier
  3. Data Scientist at Engine Yard (.com)
  4. RDBMS
  5. Structure, Restrictions, Safety
  6. id    name    age    address
     1     david   1      315
     2     divad   3      51
     3     foo     41     31
     4     bar     42     98
     5     john    3315   85
     6     jack    4      11
     7     jill    8      66
     ...   ...     ...    ...
  11. What If?
  12. id    name    age    address   phone
      1     david   26     IE        353
      2     divad   27     US        1
      3     foo     42     IE        353
      4     bar     31     CA        1
      5     john    17     NZ        131
      6     jack    128    DK        311
      7     jill    21     IE        353
      ...   ...     ...    ...       ...
  13. Before Moving on
  14. JSON
  15. What is JSON?
  16. {
        "firstName": "David",
        "lastName": "Coallier",
        "age": 26,
        "address": {
          "streetAddress": "Mansfield House",
          "city": "Crosshaven"
        },
        "phoneNumbers": [
          { "type": "mobile", "number": "0863299999" }
        ]
      }
  17. What is HTTP?
  18. What is a Schema?
  19. Alternative
  20. Schema-less
  21. Does NOT Mean Structure-less
  22. Documents and K-V Buckets
  23. CouchDB: Cluster of unreliable commodity hardware
  24. Replication; Attachments; Generated “random” ids; Dictionary; Revisions?; JSON Objects; HTTP CRUD
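The "HTTP CRUD" item on slide 24 can be made concrete with a small sketch. It only builds request descriptions and sends nothing. The helper `couchRequest`, the database name `people`, and the document values are hypothetical; the verb and path shapes are meant to follow CouchDB's document API (PUT to create or update, GET to read, DELETE with the current revision), so check them against the CouchDB documentation before relying on them.

```javascript
// Hypothetical helper: describes the HTTP request for a CouchDB document operation.
// Nothing is sent over the network; this only shows how CRUD maps onto HTTP verbs.
function couchRequest(op, db, doc) {
  const path = "/" + encodeURIComponent(db) + "/" + encodeURIComponent(doc._id);
  switch (op) {
    case "create": // PUT /db/id with a body that has no _rev yet
    case "update": // PUT /db/id with the current _rev inside the body
      return { method: "PUT", path: path, body: JSON.stringify(doc) };
    case "read":
      return { method: "GET", path: path };
    case "delete": // the current revision travels as a query parameter
      return { method: "DELETE", path: path + "?rev=" + encodeURIComponent(doc._rev) };
    default:
      throw new Error("unknown operation: " + op);
  }
}

// Illustrative document; the id and revision are placeholders.
const person = { _id: "131dafsd1vasd", _rev: "12-fva32asdf", firstName: "David" };
const requests = ["read", "update", "delete"].map(function (op) {
  return couchRequest(op, "people", person);
});
console.log(requests);
```

Because every update must carry the revision it is based on, a stale `_rev` is how the server can detect a conflicting write without locking.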
  25. Documents
  26. (no text on this slide)
  27. {
        "_id": "131dafsd1vasd",
        "_rev": "12-fva32asdf",
        "firstName": "David",
        "lastName": "Coallier",
        "age": 26,
        "address": {
          "streetAddress": "Mansfield House",
          "city": "Crosshaven"
        },
        "phoneNumbers": [
          { "type": "mobile", "number": "0863299999" }
        ]
      }
  28. How do you find Anything?
  29. Map/Reduce
  30. ...
  31. Riak
  32. Dynamo Paper
  33. CAP Theorem
  34. Key-Value Buckets
  35. Differences?
  36.                   CouchDB                 Riak
      Storage Model     append-only             bitcask
      Access            HTTP                    HTTP, PB
      Retrieval         Views (M/R)             M/R, Indexes, Search
      Versioning        Eventual Consistency    Vector Clocks
      Concurrency       No Locking              Client Resolution
      Replication       master/master/slave     replication, clustering
      Scaling In/Out    Big Couch               Built-in
      Management        Futon/Fauxton           Riak Control
      Sources: http://guide.couchdb.org and http://downloads.basho.com/papers/bitcask-intro.pdf
  37. Map/Reduce
  38. Mapper: executed on each document. Reducer: receives the output from the mappers.
  39. Four documents, side by side:
      { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
      { "_id": "...", "_rev": "...", "age": "26" }
      { "_id": "...", "_rev": "...", "age": "42" }
      { "_id": "...", "_rev": "...", "age": "17" }
  40. (same four documents as slide 39)
  41. { "age": "32", "heads": "3" }
  42. Map: find-ages (over the four documents from slide 39)
  43. Map: find-ages
      function find_ages(doc) {
        // typeof returns a string, so compare against "undefined"
        if (typeof doc.age !== "undefined") {
          emit(doc._id, doc.age);
        }
      }
  44. Map: find-ages (over the four documents from slide 39)
  45. Map: find-ages (over the four documents from slide 39), giving 26, 32, 42, 17
  46. Map: find-ages gives 26, 32, 42, 17. Reduce: sum
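  47. Reduce: sum
      function sum_ages(keys, values, rereduce) {
        // Renamed so it no longer calls itself; sum() here is the helper
        // that CouchDB's JavaScript view server provides.
        return sum(values);
      }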
  48. Map: find-ages gives 26, 32, 42, 17. Reduce: sum gives 117
  49. Mapper: executed on each document. Reducer: receives the output from the mappers.
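The find-ages and sum walkthrough can be reproduced outside a database with a few lines of JavaScript. This is a minimal in-memory sketch, not how CouchDB or Riak execute views: `runView` is a hypothetical stand-in for the view engine, `emit` is passed in as a parameter instead of being a global, and the document ids are made up. The ages are stored as strings, as on the slides, so the mapper converts them before the reducer adds them.

```javascript
// Hypothetical stand-in for a view engine: map every document, then reduce once.
function runView(docs, mapFn, reduceFn) {
  const rows = [];
  // emit() collects key/value rows produced by the map function.
  const emit = function (key, value) { rows.push({ key: key, value: value }); };
  docs.forEach(function (doc) { mapFn(doc, emit); });
  const keys = rows.map(function (r) { return r.key; });
  const values = rows.map(function (r) { return r.value; });
  return reduceFn(keys, values, false);
}

// Made-up ids; ages are strings, as on the slides.
const docs = [
  { _id: "a", age: "32", heads: "3" },
  { _id: "b", age: "26" },
  { _id: "c", age: "42" },
  { _id: "d", age: "17" },
  { _id: "e" } // no age: skipped by the mapper
];

function findAges(doc, emit) {
  if (typeof doc.age !== "undefined") {
    // Convert to a number so the reducer adds instead of concatenating strings.
    emit(doc._id, Number(doc.age));
  }
}

function sumReduce(keys, values, rereduce) {
  return values.reduce(function (total, v) { return total + v; }, 0);
}

const totalAge = runView(docs, findAges, sumReduce);
console.log(totalAge); // 117
```

Without the `Number(...)` conversion, `+` would concatenate the string ages rather than add them.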
  50. So What?
  51. The Machines They Lurn.
  52. The Problem
  53. Statistics Example
  54. Mean, Std. Deviation: Age
  55. $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
  56. $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$
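Expanding the square in the definition from slide 56 shows why one pass that accumulates $\sum x_i$ and $\sum x_i^2$ is enough, which is what a mapper emitting the pair $(x_i, x_i^2)$ makes possible. This is the standard identity for the population variance, written with the symbols of slides 55 and 56:

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2 \\
         &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\mu\cdot\frac{1}{n}\sum_{i=1}^{n}x_i \;+\; \mu^2 \\
         &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \mu^2
\end{aligned}
```

The last step uses $\frac{1}{n}\sum_{i=1}^{n} x_i = \mu$. One caveat: this form can lose precision in floating point when $\mu^2$ is large relative to $\sigma^2$.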
  57. Mapper: executed on each document. Reducer: receives the output from the mappers.
  58. Mapper: retrieve values, pre-process. Reducer: receive, process further.
  59. (same four documents as slide 39)
  60. [ [26, 676], [32, 1024], [42, 1764], [17, 289] ]
  61. /**
       * Our mapper function.
       */
      map: function(doc) {
        // Ages are stored as strings, so convert before emitting.
        var age = Number(doc.age);
        emit(null, [age, age * age]);
      },

      /**
       * Our reducer...
       * Note: rereduce is ignored, so the result is only correct
       * when every mapped value reaches a single reduce call.
       */
      reduce: function(keys, values, rereduce) {
        var N = 0;
        var summed = 0;
        var summedSquare = 0;
        for (var i = 0; i < values.length; i++) {
          N += 1;
          summed += values[i][0];
          summedSquare += values[i][1];
        }
        var mean = summed / N;
        var standard_deviation = Math.sqrt((summedSquare / N) - (mean * mean));
        return [mean, standard_deviation];
      }
  62. /**
       * Our mapper function.
       */
      map: function(doc) {
        emit(null, [doc.age, doc.age * doc.age]);
      },

      /**
       * Our reducer...
       */
      reduce: function(keys, values, rereduce) {
        var N = values.length;
        var summed = sum(values.map(function(v) { return v[0]; }));
        var summedSquares = sum(values.map(function(v) { return v[1]; }));
        var mean = summed / N;
        var standard_deviation = Math.sqrt((summedSquares / N) - (mean * mean));
        return [mean, standard_deviation];
      }
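Both reducers on slides 61 and 62 accept a `rereduce` argument but never use it. In CouchDB, a reduce function can be called again with `rereduce` set to true, and `values` then holds earlier reduce results instead of mapper output, so returning `[mean, standard_deviation]` directly would feed the wrong shape back in. One common workaround, sketched here as an assumption and not as the author's method, is to carry `[count, sum, sum of squares]` through every stage and compute the statistics only at the end. The names `mapAge`, `reduceStats`, `finalize` and `mapAll` are hypothetical, and `emit` is passed as a parameter so the sketch runs outside a database.

```javascript
// Mapper output per document: [count, sum, sum of squares].
function mapAge(doc, emit) {
  const age = Number(doc.age);
  if (!isNaN(age)) {
    emit(null, [1, age, age * age]);
  }
}

// Adding partial aggregates component-wise is associative, so the same code is
// correct for mapper output (rereduce = false) and reducer output (rereduce = true).
function reduceStats(keys, values, rereduce) {
  return values.reduce(function (acc, v) {
    return [acc[0] + v[0], acc[1] + v[1], acc[2] + v[2]];
  }, [0, 0, 0]);
}

// Final step, done once on the fully reduced value.
function finalize(agg) {
  const mean = agg[1] / agg[0];
  return { mean: mean, std: Math.sqrt(agg[2] / agg[0] - mean * mean) };
}

// Helper for the simulation: run the mapper over a batch and collect the values.
function mapAll(batch) {
  const out = [];
  batch.forEach(function (d) { mapAge(d, function (key, value) { out.push(value); }); });
  return out;
}

// Simulate two nodes reducing separately, then a rereduce over their outputs.
const partA = reduceStats(null, mapAll([{ age: "32" }, { age: "26" }]), false);
const partB = reduceStats(null, mapAll([{ age: "42" }, { age: "17" }]), false);
const stats = finalize(reduceStats(null, [partA, partB], true));
console.log(stats);
```

For the four ages on the slides the combined aggregate is `[4, 117, 3753]`, which gives a mean of 29.25 however the documents are split across reduce calls.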
  63. Naive Bayes
  64. Real Life Fraud
  65. $P(x_j = k \mid y = \text{fraudulent})$, $P(x_j = k \mid y = \text{normal})$, $P(y)$
  66. We need to: sum $x_j = k$, for each $y$, to calculate $P(x \mid y)$
  67. We need: more than 1 mapper.
  68. We need 4 mappers
  69. Mapper #1: $\sum_{i=1} P(x_j = k \mid y = \text{fraudulent})$
  70. Mapper #2: $\sum_{i=1} P(x_j = k \mid y = \text{normal})$
  71. Mapper #3: $\sum_{i=1} P(y = \text{fraudulent})$
  72. Mapper #4: $\sum_{i=1} P(y = \text{normal})$
  73. Reducer: sums up results for parameters
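The counting behind slides 65 to 73 can be sketched in plain JavaScript. As a simplification, this sketch folds the four mappers into one map function with composite keys instead of running four separate mappers; that is a swapped-in arrangement, not the one on the slides. The transactions, the feature values, and the names `mapCounts`, `countByKey`, `prior` and `likelihood` are all hypothetical, and the estimates are plain relative frequencies with no smoothing.

```javascript
// Hypothetical labelled transactions: x holds categorical features, y is the class.
const transactions = [
  { x: ["IE", "card"], y: "normal" },
  { x: ["IE", "card"], y: "normal" },
  { x: ["US", "wire"], y: "fraudulent" },
  { x: ["IE", "wire"], y: "fraudulent" },
  { x: ["US", "card"], y: "normal" }
];

// Map: emit a 1 for every (feature j = k, class y) pair and a 1 for every class y.
function mapCounts(doc, emit) {
  doc.x.forEach(function (k, j) {
    emit(JSON.stringify(["x", j, k, doc.y]), 1);
  });
  emit(JSON.stringify(["y", doc.y]), 1);
}

// Reduce: group the emitted rows by key and sum the 1s.
function countByKey(docs) {
  const counts = {};
  docs.forEach(function (doc) {
    mapCounts(doc, function (key, value) {
      counts[key] = (counts[key] || 0) + value;
    });
  });
  return counts;
}

const counts = countByKey(transactions);

// P(y): share of documents in class y.
function prior(y) {
  return (counts[JSON.stringify(["y", y])] || 0) / transactions.length;
}

// P(x_j = k | y): share of class-y documents whose feature j equals k.
function likelihood(j, k, y) {
  const joint = counts[JSON.stringify(["x", j, k, y])] || 0;
  return joint / counts[JSON.stringify(["y", y])];
}

console.log(prior("fraudulent"), likelihood(1, "wire", "fraudulent"));
```

Because the reducer only adds integers, the counts can be combined in any order across nodes, which is the property that makes this step distribute well.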
  74. Cluster Analysis
  75. k-means
  76. Mapper: divide vectors into subgroups, calculate d(p,q) between vectors, find centroids, sum them up. Reducer: sum up the sums, get new centroids.
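One iteration of the k-means split described on slide 76 might look like the following sketch. It assumes squared Euclidean distance for d(p,q), treats each array in `chunks` as the subgroup handled by one mapper, and keeps an empty cluster's previous centroid. The data, the starting centroids, and the names `dist2`, `mapChunk` and `reduceCentroids` are hypothetical.

```javascript
// Squared Euclidean distance; enough for picking the nearest centroid.
function dist2(p, q) {
  return p.reduce(function (s, pi, i) { return s + (pi - q[i]) * (pi - q[i]); }, 0);
}

// Mapper: for one subgroup of vectors, assign each vector to its nearest
// centroid and return the partial sum and count per centroid.
function mapChunk(vectors, centroids) {
  const partial = centroids.map(function (c) {
    return { sum: c.map(function () { return 0; }), count: 0 };
  });
  vectors.forEach(function (v) {
    let best = 0;
    centroids.forEach(function (c, idx) {
      if (dist2(v, c) < dist2(v, centroids[best])) { best = idx; }
    });
    partial[best].count += 1;
    v.forEach(function (x, i) { partial[best].sum[i] += x; });
  });
  return partial;
}

// Reducer: sum the partial sums, then divide by the counts to get new centroids.
function reduceCentroids(partials, centroids) {
  return centroids.map(function (old, idx) {
    const total = { sum: old.map(function () { return 0; }), count: 0 };
    partials.forEach(function (p) {
      total.count += p[idx].count;
      p[idx].sum.forEach(function (x, i) { total.sum[i] += x; });
    });
    // An empty cluster keeps its previous centroid.
    if (total.count === 0) { return old; }
    return total.sum.map(function (x) { return x / total.count; });
  });
}

// Hypothetical 2-D data split across two "nodes", with two starting centroids.
const chunks = [[[1, 1], [2, 1]], [[8, 9], [9, 8], [1, 2]]];
const start = [[0, 0], [10, 10]];
const partials = chunks.map(function (chunk) { return mapChunk(chunk, start); });
const next = reduceCentroids(partials, start);
console.log(next);
```

A full run would repeat the map and reduce steps with `next` as the new starting centroids until they stop moving.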
