SlideShare uma empresa Scribd logo
1 de 20
Baixar para ler offline
An Introduction
                             Kai Voigt, Cloudera Inc
                        Berlin Buzzwords, June 6th 2011




Montag, 6. Juni 2011
Big Data

                       •   Capture

                       •   Storage

                       •   Search

                       •   Analytics




Montag, 6. Juni 2011
hadoop.apache.org


                       Google Filesystem   Hadoop Distributed
                            (GFS)          Filesystem (HDFS)


                          MapReduce           MapReduce



Montag, 6. Juni 2011
Hadoop Distributed
                             Filesystem
                       • Easy Access
                       • Distributed
                       • Redundant
                       • Scalable

Montag, 6. Juni 2011
File Splits
                                             Large File


                                            6440M



                 Block   Block   Block   Block     Block   Block         Block   Block
                   1       2       3       4         5       6     ...    100     101


               64M 64M 64M 64M 64M 64M                                   64M 40M

Montag, 6. Juni 2011
Block Placement
                        Block     Block   Block   Block   Block
                          1         2       3       2       2


                        Block     Block           Block
                          3         1               3

                                                  Block
                                                    1



                       Node 1 Node 2 Node 3 Node 4 Node 5


Montag, 6. Juni 2011
MapReduce
                       Block    Block   Block   Block
     Input               1        2       3       4


                                                         map()
 Inter-
mediate
  Data
                                                        reduce()

 Output


Montag, 6. Juni 2011
WordCount Example
                  map (offset, line) {
                    foreach word in line {
                      emit (word, 1)
                    }
                  }




Montag, 6. Juni 2011
WordCount Example
                  reduce (word, count[]) {
                    total = 0;
                    foreach number in count[] {
                      total += number
                    }
                    emit (word, total)
                  }


Montag, 6. Juni 2011
map()



                       (   , 1)    (   , 1)   (   , 1)
                       (   , 1)    (   , 1)   (   , 1)
                       (   , 1)    (   , 1)   (   , 1)
                       (   , 1)    (   , 1)   (   , 1)



Montag, 6. Juni 2011
Sort & Shuffle
                       (   , 1)              (   , 1)             (   , 1)
                       (   , 1)              (   , 1)             (   , 1)
                       (   , 1)              (   , 1)             (   , 1)
                       (   , 1)              (   , 1)             (   , 1)


                           (      , (1,1))
                                                        (   , (1,1,1,1))
                           (      , (1,1))
                                                        (   , (1,1))
                           (      , (1))


Montag, 6. Juni 2011
reduce()
                       (       , (1,1))
                                          (   , (1,1,1,1))
                       (       , (1,1))
                                          (   , (1,1))
                       (       , (1))




                           (     , 2)         (   , 4)
                           (     , 2)         (   , 2)
                           (     , 1)

Montag, 6. Juni 2011
Use Case:
                           Recommendations
                       • "People looking as this article also looked
                         at these articles"
                       • "You might also know these people"
                       • iTunes Genius Playlist
                       • Banner Placement

Montag, 6. Juni 2011
Use Case: Text
                               Processing

                       • Document Indexing
                       • Semantic Analytics


Montag, 6. Juni 2011
Use Case: Machine
                              Learning
                                                  Data
                                                   Data
                                                    Data

                       • Spam vs No Spam
                       • Credit Card Fraud
                       • "People of Interest"
                                                Information




Montag, 6. Juni 2011
Use Case: Graphs

                       • Shortest Paths
                       • Bottleneck Nodes
                       • Flow Optimization
                       • Spanning Trees
                       • Spanning Routes

Montag, 6. Juni 2011
Hadoop Ecosystem
                       Hive & Pig   High Level Language Access
                        HBase            Real Time Access
                        Sqoop          SQL to/from Hadoop
                         Flume      Distributed Data Collection
                         Oozie            Job Workflow
                        Mahout       Machine Learning Library

                                    many more


Montag, 6. Juni 2011
Homework

                       • Cloudera's Distribution including Hadoop
                         (CDH3)
                       • Online Tutorials
                       • WordCount Example
                       • Conference Demo Cluster

Montag, 6. Juni 2011
Quick Demo!



Montag, 6. Juni 2011
Thank you!

                       • Kai Voigt
                       • kai@cloudera.com
                       • http://www.cloudera.com/
                       • http://apache.hadoop.org/

Montag, 6. Juni 2011

Mais conteúdo relacionado

Mais de iammutex

Scaling Instagram
Scaling InstagramScaling Instagram
Scaling Instagramiammutex
 
Redis深入浅出
Redis深入浅出Redis深入浅出
Redis深入浅出iammutex
 
深入了解Redis
深入了解Redis深入了解Redis
深入了解Redisiammutex
 
NoSQL误用和常见陷阱分析
NoSQL误用和常见陷阱分析NoSQL误用和常见陷阱分析
NoSQL误用和常见陷阱分析iammutex
 
MongoDB 在盛大大数据量下的应用
MongoDB 在盛大大数据量下的应用MongoDB 在盛大大数据量下的应用
MongoDB 在盛大大数据量下的应用iammutex
 
8 minute MongoDB tutorial slide
8 minute MongoDB tutorial slide8 minute MongoDB tutorial slide
8 minute MongoDB tutorial slideiammutex
 
Thoughts on Transaction and Consistency Models
Thoughts on Transaction and Consistency ModelsThoughts on Transaction and Consistency Models
Thoughts on Transaction and Consistency Modelsiammutex
 
Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告iammutex
 
redis 适用场景与实现
redis 适用场景与实现redis 适用场景与实现
redis 适用场景与实现iammutex
 
Introduction to couchdb
Introduction to couchdbIntroduction to couchdb
Introduction to couchdbiammutex
 
What every data programmer needs to know about disks
What every data programmer needs to know about disksWhat every data programmer needs to know about disks
What every data programmer needs to know about disksiammutex
 
redis运维之道
redis运维之道redis运维之道
redis运维之道iammutex
 
Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011iammutex
 
[译]No sql生态系统
[译]No sql生态系统[译]No sql生态系统
[译]No sql生态系统iammutex
 
Couchdb + Membase = Couchbase
Couchdb + Membase = CouchbaseCouchdb + Membase = Couchbase
Couchdb + Membase = Couchbaseiammutex
 
Redis cluster
Redis clusterRedis cluster
Redis clusteriammutex
 
Redis cluster
Redis clusterRedis cluster
Redis clusteriammutex
 

Mais de iammutex (20)

Scaling Instagram
Scaling InstagramScaling Instagram
Scaling Instagram
 
Redis深入浅出
Redis深入浅出Redis深入浅出
Redis深入浅出
 
深入了解Redis
深入了解Redis深入了解Redis
深入了解Redis
 
NoSQL误用和常见陷阱分析
NoSQL误用和常见陷阱分析NoSQL误用和常见陷阱分析
NoSQL误用和常见陷阱分析
 
MongoDB 在盛大大数据量下的应用
MongoDB 在盛大大数据量下的应用MongoDB 在盛大大数据量下的应用
MongoDB 在盛大大数据量下的应用
 
8 minute MongoDB tutorial slide
8 minute MongoDB tutorial slide8 minute MongoDB tutorial slide
8 minute MongoDB tutorial slide
 
skip list
skip listskip list
skip list
 
Thoughts on Transaction and Consistency Models
Thoughts on Transaction and Consistency ModelsThoughts on Transaction and Consistency Models
Thoughts on Transaction and Consistency Models
 
Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告
 
redis 适用场景与实现
redis 适用场景与实现redis 适用场景与实现
redis 适用场景与实现
 
Introduction to couchdb
Introduction to couchdbIntroduction to couchdb
Introduction to couchdb
 
What every data programmer needs to know about disks
What every data programmer needs to know about disksWhat every data programmer needs to know about disks
What every data programmer needs to know about disks
 
Ooredis
OoredisOoredis
Ooredis
 
Ooredis
OoredisOoredis
Ooredis
 
redis运维之道
redis运维之道redis运维之道
redis运维之道
 
Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011
 
[译]No sql生态系统
[译]No sql生态系统[译]No sql生态系统
[译]No sql生态系统
 
Couchdb + Membase = Couchbase
Couchdb + Membase = CouchbaseCouchdb + Membase = Couchbase
Couchdb + Membase = Couchbase
 
Redis cluster
Redis clusterRedis cluster
Redis cluster
 
Redis cluster
Redis clusterRedis cluster
Redis cluster
 

Hadoop introduction berlin buzzwords 2011

  • 1. An Introduction Kai Voigt, Cloudera Inc Berlin Buzzwords, June 6th 2011 Montag, 6. Juni 2011
  • 2. Big Data • Capture • Storage • Search • Analytics Montag, 6. Juni 2011
  • 3. hadoop.apache.org Google Filesystem Hadoop Distributed (GFS) Filesystem (HDFS) MapReduce MapReduce Montag, 6. Juni 2011
  • 4. Hadoop Distributed Filesystem • Easy Access • Distributed • Redundant • Scalable Montag, 6. Juni 2011
  • 5. File Splits Large File 6440M Block Block Block Block Block Block Block Block 1 2 3 4 5 6 ... 100 101 64M 64M 64M 64M 64M 64M 64M 40M Montag, 6. Juni 2011
  • 6. Block Placement Block Block Block Block Block 1 2 3 2 2 Block Block Block 3 1 3 Block 1 Node 1 Node 2 Node 3 Node 4 Node 5 Montag, 6. Juni 2011
  • 7. MapReduce Block Block Block Block Input 1 2 3 4 map() Inter- mediate Data reduce() Output Montag, 6. Juni 2011
  • 8. WordCount Example map (offset, line) { foreach word in line { emit (word, 1) } } Montag, 6. Juni 2011
  • 9. WordCount Example reduce (word, count[]) { total = 0; foreach number in count[] { total += number } emit (word, total) } Montag, 6. Juni 2011
  • 10. map() ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) Montag, 6. Juni 2011
  • 11. Sort & Shuffle ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , 1) ( , (1,1)) ( , (1,1,1,1)) ( , (1,1)) ( , (1,1)) ( , (1)) Montag, 6. Juni 2011
  • 12. reduce() ( , (1,1)) ( , (1,1,1,1)) ( , (1,1)) ( , (1,1)) ( , (1)) ( , 2) ( , 4) ( , 2) ( , 2) ( , 1) Montag, 6. Juni 2011
  • 13. Use Case: Recommendations • "People looking as this article also looked at these articles" • "You might also know these people" • iTunes Genius Playlist • Banner Placement Montag, 6. Juni 2011
  • 14. Use Case: Text Processing • Document Indexing • Semantic Analytics Montag, 6. Juni 2011
  • 15. Use Case: Machine Learning Data Data Data • Spam vs No Spam • Credit Card Fraud • "People of Interest" Information Montag, 6. Juni 2011
  • 16. Use Case: Graphs • Shortest Paths • Bottleneck Nodes • Flow Optimization • Spanning Trees • Spanning Routes Montag, 6. Juni 2011
  • 17. Hadoop Ecosystem Hive & Pig High Level Language Access HBase Real Time Access Sqoop SQL to/from Hadoop Flume Distributed Data Collection Oozie Job Workflow Mahout Machine Learning Library many more Montag, 6. Juni 2011
  • 18. Homework • Cloudera's Distribution including Hadoop (CDH3) • Online Tutorials • WordCount Example • Conference Demo Cluster Montag, 6. Juni 2011
  • 20. Thank you! • Kai Voigt • kai@cloudera.com • http://www.cloudera.com/ • http://apache.hadoop.org/ Montag, 6. Juni 2011