© 2013 IBM Corporation1
AVNET – Hadoop Fundamentals I
Romeo Kienzler
IBM Innovation Center Zurich
© 2013 IBM Corporation2
Agenda
1) Welcome
2) What is big data?
3) Introduction to Hadoop
4) BigInsights
5) Hadoop architecture
6) Lab 1 – Core Hadoop
7) MapReduce
8) Lab 2 – MapReduce
9) Pig, Jaql, Hive, BigSQL, SystemT/AQL
10) Lab 3 – Pig, Hive, and Jaql
11) Certification on BigDataUniversity
© 2013 IBM Corporation3
What is BIG data?
© 2013 IBM Corporation4
Traditional Business Intelligence / Data Warehousing
...60 percent, were unsatisfied with their data warehousing system.¹
¹http://www.information-management.com/issues/20010601/3494-1.html
© 2013 IBM Corporation5
What is BIG data?
© 2013 IBM Corporation6
What is BIG data?
© 2013 IBM Corporation7
What is BIG data?
Big Data
Hadoop
© 2013 IBM Corporation8
What is BIG data?
Business Intelligence
Data Warehouse
© 2013 IBM Corporation9
Map-Reduce → Hadoop → BigInsights
© 2013 IBM Corporation10
Why is Big Data important?
Data AVAILABLE to an organization
Data an organization can PROCESS
Missed opportunity
Enterprises are “more blind” to new opportunities.
Organizations are able to process less and less of the available data.
100 million tweets are posted every day, 35 hours of video are uploaded every
minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed
through the net, 80 % of them spam and viruses. => Prefiltering is becoming more and more important.
© 2013 IBM Corporation11
Why is Big Data important?
© 2013 IBM Corporation12
Why is Big Data important?
© 2013 IBM Corporation13
Why is Big Data important?
© 2013 IBM Corporation14
What is BIG data?
Volume
Terabytes, petabytes, even exabytes
Variety
All kinds of data
All kinds of analytics
Velocity
Agility
Analyze data in...
Hours instead of days
Days instead of weeks
Dynamically responsive
Rapid data exploration
Traditional / Non-traditional data sources
Store
Analyze
Explore
Volume * Variety * Velocity = Value
© 2013 IBM Corporation15
BigData Analytics
© 2013 IBM Corporation16
BigData Analytics – Predictive Analytics
© 2013 IBM Corporation17
BigData Analytics – Predictive Analytics
© 2013 IBM Corporation18
BigData Analytics – Correlation / Text / NLP
© 2013 IBM Corporation19
BigData Analytics – Feature Extraction
Feature extraction involves simplifying the amount of resources
required to describe a large set of data accurately¹
¹: Wikipedia
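The definition above can be illustrated with a minimal sketch: a long series of raw readings is reduced to a handful of descriptive values. The data, the function name `extract_features`, and the choice of summary statistics are assumptions made for illustration only; real feature extraction depends on the data and the analytic goal.

```python
import statistics


def extract_features(readings):
    """Reduce a list of raw numeric readings to a small, fixed-size feature dict."""
    return {
        "count": len(readings),
        "mean": statistics.mean(readings),
        "min": min(readings),
        "max": max(readings),
        "stdev": statistics.pstdev(readings),  # population standard deviation
    }


# Hypothetical raw data: 1,000 sensor readings cycling through 0.0 .. 9.0.
raw = [float(i % 10) for i in range(1000)]
features = extract_features(raw)
print(features)
```

Here 1,000 values are described by five numbers; whether those five describe the data "accurately" enough is a modeling decision.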
© 2013 IBM Corporation20
BigData Analytics – Predictive Analytics
Storage / Data
CPUs / Algorithm
Business Value / Insight
© 2013 IBM Corporation21
BigData Analytics – Predictive Analytics
"sometimes it's not who has the best algorithm that wins; it's who has the most data."
(C) Google Inc.
The Unreasonable Effectiveness of Data¹
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
No Sampling => Work with full dataset => Long Tail Distributions
© 2013 IBM Corporation22
Realtime / In-Memory Computing:
InfoSphere Streams / Watson
© 2013 IBM Corporation26
The Paris Hilton Problem
Watson Workshop: What is Watson?
© 2013 IBM Corporation27
Introduction to Hadoop
© 2013 IBM Corporation29
BigInsights
© 2013 IBM Corporation31
BigInsights Demonstration
© 2013 IBM Corporation32
Hadoop Architecture
© 2013 IBM Corporation35
HDFS – Hadoop Distributed File System
© 2013 IBM Corporation54
Lab 1 – Hadoop Architecture
1) Start from chapter 1.2
2) Replace /home/biadmin with /home/biadminX, where X is your user ID
3) In chapter 1.3, skip task 1.3.1._1 and go to http://10.199.20.51:8080 instead
4) Skip 1.3.5
5) In chapter 1.3.6._30, use any file you like on your desktop computer
© 2013 IBM Corporation55
Map-Reduce
© 2013 IBM Corporation97
Data Parallelism
© 2013 IBM Corporation98
Aggregated Bandwidth between CPU, Main Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
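The figures above follow from dividing the data volume by the aggregated bandwidth of all nodes. A minimal sketch of that arithmetic, assuming the data is split perfectly evenly and ignoring coordination overhead (the function and constant names are illustrative, not from the slides):

```python
def scan_time_seconds(data_bytes, per_node_bandwidth, nodes):
    """Time to read data_bytes once when it is split evenly across nodes."""
    return data_bytes / (per_node_bandwidth * nodes)


DATA_SIZE = 10**12            # 1 TB in bytes (decimal units)
NODE_BANDWIDTH = 10 * 10**9   # 10 GByte/s per node, as assumed on the slide

for nodes in (1, 10, 100, 1000):
    seconds = scan_time_seconds(DATA_SIZE, NODE_BANDWIDTH, nodes)
    print(f"{nodes:>5} nodes: {seconds:g} s")
```

In practice, skewed partitions and scheduling overhead keep real clusters below this ideal linear speed-up.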
© 2013 IBM Corporation103
Lab 2 - MapReduce
1) Skip task 1.1._1; use PuTTY to connect to biadmin@10.199.20.51 instead
2) Replace /home/biadmin with /home/biadminX, where X is your user ID
3) In 1.1._4 - 1.1._6, replace output with /home/biadminX/output, where X is your user ID
4) Skip chapter 1.2
5) Chapter 1.3 is optional (using your local virtual machine), maybe during lunch break :)
© 2013 IBM Corporation104
Pig, Jaql, Hive, BigSQL, SystemT/AQL
© 2013 IBM Corporation133
SQL for BigInsights
▪ Data warehouse augmentation is a very common use case for Hadoop
▪ While highly scalable, MapReduce is notoriously difficult to use
– Java API is tedious and requires programming expertise
– Unfamiliar languages (e.g. Pig) also require expertise
– Many different file formats, storage mechanisms, configuration options, etc.
– Joins, grouping, and sorting are tedious to orchestrate
▪ SQL support opens the data to a much wider audience
– Familiar, widely known syntax
– Common catalog for identifying data and structure
– Clear separation of defining the what (you want) vs. the how (to get it)
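The "what vs. how" contrast above can be sketched by computing the same grouped sum twice: once with hand-written map, shuffle, and reduce steps, and once with a single SQL statement. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for a SQL engine (it is not Big SQL or Hive), and the `sales` table and its rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input records: (region, amount).
rows = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]


# The "how": the programmer orchestrates every step of the grouping.
def map_phase(records):
    for region, amount in records:
        yield region, amount            # emit (key, value) pairs


def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)       # collect all values per key
    return groups


def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}


manual_totals = reduce_phase(shuffle(map_phase(rows)))

# The "what": one declarative statement; the engine decides how to execute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

print(manual_totals == sql_totals)
```

Both paths produce the same totals; the SQL version leaves grouping and aggregation strategy to the engine.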
© 2013 IBM Corporation134
Query Processing
▪ Big SQL consists of two query processing engines
– The SQL optimization engine
– Jaql as the query execution engine
[Diagram labels: Client, SQL Engine, Jaql, SQL Optimizer, Jaql Runtime]
© 2013 IBM Corporation135
Big SQL vs. Alternatives
▪ There are a number of SQL solutions; where does Big SQL fit in?
▪ Hive
– Open source
• Established Hadoop component
• Active development community
– Restrictive SQL syntax
• No subqueries (Hive 0.11 adds non-correlated subquery support)
• No windowed aggregates (Hive 0.11 adds windowed aggregate support)
• ANSI join syntax only
– Limited type support
• No varchar(n), decimal(p,s), etc.
– Poor client support
• Limited JDBC and ODBC drivers
– Poor low-latency query support (via local MapReduce)
© 2013 IBM Corporation136
Big SQL vs. Alternatives (cont.)
▪ Impala
– Recently open sourced
– Achieves low latency by bypassing MapReduce infrastructure
• Installs a completely separate execution infrastructure
• Can lead to resource scheduling conflicts
– Execution engine is C++
• Great for performance, but makes extending difficult (e.g. UDFs and UDAs)
• Support for limited set of file formats
– Currently limited to broadcast joins
• All tables must fit in memory (aggregate cluster memory)
• Scalability limitation for larger clusters
– Uses Hive 0.9 query syntax (more limitations than the current Hive)
– Uses Hive 0.9 type system (more limitations than the current Hive)
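To make the memory constraint of broadcast joins concrete, here is a minimal sketch of the general broadcast hash join technique. It is not Impala's implementation; the function name, table contents, and column names are hypothetical. One table is indexed entirely in memory and every partition of the other table is probed against that index, which is why the indexed side has to fit in memory.

```python
def broadcast_hash_join(small_table, large_partitions, key):
    """Inner-join each partition of a large table against a small in-memory table."""
    # Build phase: index the small table by join key. In a cluster, this index
    # is the part that would be copied ("broadcast") to every node.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: partitions are scanned independently, so the large table
    # never has to be shuffled between nodes.
    for partition in large_partitions:
        for row in partition:
            for match in lookup.get(row[key], []):
                yield {**match, **row}


# Hypothetical data: a small customer table and an order table in two partitions.
customers = [{"cust_id": 1, "name": "Ada"}, {"cust_id": 2, "name": "Bob"}]
order_partitions = [
    [{"order_id": 100, "cust_id": 1}, {"order_id": 101, "cust_id": 3}],
    [{"order_id": 102, "cust_id": 2}, {"order_id": 103, "cust_id": 1}],
]

joined = list(broadcast_hash_join(customers, order_partitions, "cust_id"))
for row in joined:
    print(row["order_id"], row["name"])
```

Order 101 has no matching customer and is dropped, as expected for an inner join. Joins that repartition both tables by key avoid the in-memory requirement at the cost of a shuffle.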
© 2013 IBM Corporation141
Lab 3 – Querying Data with Pig, Hive, Jaql
1) Use PuTTY to connect to biadmin@10.199.20.51
2) Skip task 1.1._2; start the Jaql shell using the command /opt/ibm/biginsights/jaql/bin/jaqlshell
3) In 1.1._5, replace biadmin with biadminX, where X is your user ID
4) Skip chapter 1.2 (optional using virtual machine)
5) In 1.3._2, replace biadmin with biadminX, where X is your user ID
6) Instead of task 1.3._2, type /opt/ibm/biginsights/pig/bin/pig
7) In 1.3._4, replace sampleData/NewsGroups.csv with /user/biadminX/sampleData/NewsGroups.csv
8) Skip chapter 1.4 (optional using virtual machine)
9) Skip 1.5._12 and _13 and type /opt/ibm/biginsights/hive/bin/hive instead
10) Type "use biadminX", where X is your user ID
11) Continue with task _14
© 2013 IBM Corporation142
NoSQL Databases
▪ Column Store
– Hadoop / HBase
– Cassandra
– Amazon SimpleDB
▪ JSON / Document Store
– MongoDB
– CouchDB
▪ Key / Value Store
– Amazon DynamoDB
– Voldemort
▪ Graph DBs
– DB2 SPARQL Extension
– Neo4j
▪ MP RDBMS
– DB2 DPF, DB2 pureScale, PureData for Operational Analytics
– Oracle RAC
– Greenplum
More than 150 NoSQL databases are listed at http://nosql-database.org/
© 2013 IBM Corporation143
CAP Theorem / Brewer's Theorem¹
▪ It is impossible for a distributed computer system to simultaneously guarantee all 3 properties
– Consistency (all nodes see the same data at the same time)
– Availability (a guarantee that every request receives a response about whether it succeeded or failed)
– Partition tolerance (the system continues to operate despite failure of part of the system)
▪ What about ACID?
– Atomicity
– Consistency
– Isolation
– Durability
▪ BASE, the new ACID
– Basically Available
– Soft state
– Eventual consistency
• Monotonic Read Consistency
• Monotonic Write Consistency
• Read Your Own Writes
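The "Read Your Own Writes" guarantee listed above can be sketched with two replicas that synchronize lazily. This is a toy model, not the protocol of any particular database: the `Replica` and `Session` classes, the version numbers, and the replication step are all assumptions made for illustration. The session remembers the version it last wrote and skips any replica that has not caught up yet.

```python
class Replica:
    """A key-value replica that stores a version number with each value."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Keep only the newest version of each key.
        if key not in self.data or self.data[key][0] < version:
            self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))


class Session:
    """Client session that enforces read-your-own-writes over lagging replicas."""

    def __init__(self, replicas):
        self.replicas = replicas   # replicas[0] accepts writes in this sketch
        self.last_written = {}     # key -> version this session wrote

    def write(self, key, value, version):
        # The write lands on one replica only; the others catch up later.
        self.replicas[0].apply(key, version, value)
        self.last_written[key] = version

    def read(self, key):
        needed = self.last_written.get(key, 0)
        # Try the followers first; skip any that have not seen our write yet.
        for replica in reversed(self.replicas):
            version, value = replica.read(key)
            if version >= needed:
                return value
        raise RuntimeError("no replica is fresh enough")


primary, follower = Replica(), Replica()
session = Session([primary, follower])

session.write("cart", ["book"], version=1)
stale = follower.read("cart")[1]       # follower has not replicated yet
fresh = session.read("cart")           # session falls back to the primary

follower.apply("cart", 1, ["book"])    # replication catches up eventually
converged = follower.read("cart")[1] == primary.read("cart")[1]
print(stale, fresh, converged)
```

A client without the session check could read the stale follower directly; the replicas still converge once replication runs, which is the "eventual" part of eventual consistency.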
© 2013 IBM Corporation144
Certification
▪ Go to www.bigdatauniversity.com
▪ Search for “hadoop fundamentals”
▪ Choose “Hadoop Fundamentals I – Version 2”
▪ Sign up
▪ Log in with an existing account or one of the following:
▪ Take the test:
© 2013 IBM Corporation145
Questions?

How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Recently uploaded (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

Hadoop Fundamentals I

  • 1. AVNET – Hadoop Fundamentals I. Romeo Kienzler, IBM Innovation Center Zurich. © 2013 IBM Corporation
  • 2. Agenda: 1) Welcome 2) What is big data? 3) Introduction to Hadoop 4) BigInsights 5) Hadoop architecture 6) Lab 1 – Core Hadoop 7) MapReduce 8) Lab 2 – MapReduce 9) Pig, Jaql, Hive, BigSQL, SystemT/AQL 10) Lab 3 – Pig, Hive, and Jaql 11) Certification on BigDataUniversity
  • 3. What is BIG data?
  • 4. Traditional Business Intelligence / Data Warehousing: "...60 percent, were unsatisfied with their data warehousing system."¹ ¹http://www.information-management.com/issues/20010601/3494-1.html
  • 5. What is BIG data?
  • 6. What is BIG data?
  • 7. What is BIG data? Big Data, Hadoop
  • 8. What is BIG data? Business Intelligence, Data Warehouse
  • 9. Map-Reduce → Hadoop → BigInsights
  • 10. Why is Big Data important? The gap between the data AVAILABLE to an organization and the data an organization can PROCESS is missed opportunity. Enterprises are "more blind" to new opportunities. Organizations are able to process less and less of the available data. 100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80% of them spam and viruses. => Prefiltering is more and more important.
  • 11. Why is Big Data important?
  • 12. Why is Big Data important?
  • 13. Why is Big Data important?
  • 14. What is BIG data?
      – Volume: terabytes, petabytes, even exabytes
      – Variety: all kinds of data, all kinds of analytics
      – Velocity: analyze data in hours instead of days, days instead of weeks
      – Agility: dynamically responsive, rapid data exploration
      – Traditional / non-traditional data sources
      – Store, Analyze, Explore
      – Volume * Variety * Velocity = Value
  • 15. BigData Analytics
  • 16. BigData Analytics – Predictive Analytics
  • 17. BigData Analytics – Predictive Analytics
  • 18. BigData Analytics – Correlation / Text / NLP
  • 19. BigData Analytics – Feature Extraction: "Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately"¹ ¹Wikipedia
  • 20. BigData Analytics – Predictive Analytics: Storage / Data, CPUs / Algorithm, Business Value / Insight
  • 21. BigData Analytics – Predictive Analytics: "sometimes it's not who has the best algorithm that wins; it's who has the most data." (C) Google Inc., The Unreasonable Effectiveness of Data¹ ¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf No sampling => work with the full dataset => long-tail distributions
  • 22. Realtime / In-Memory Computing: InfoSphere Streams / Watson
  • 26. The Paris Hilton Problem. Watson Workshop: What is Watson?
  • 27. Introduction to Hadoop
  • 29. BigInsights
  • 31. BigInsights Demonstration
  • 32. Hadoop Architecture
  • 35. HDFS – Hadoop Distributed File System
  • 54. Lab 1 – Hadoop Architecture
      1) Start from chapter 1.2
      2) Replace /home/biadmin with /home/biadminX, where X is your user ID
      3) In chapter 1.3, skip task 1.3.1._1 and go to http://10.199.20.51:8080 instead
      4) Skip 1.3.5
      5) In chapter 1.3.6._30, use any file you like on your desktop computer
  • 55. Map-Reduce
  • 97. Data Parallelism
  • 98. Aggregated bandwidth between CPU, main memory and hard drive. Time to read 1 TB at 10 GByte/s per node:
      – 1 node: 100 sec
      – 10 nodes: 10 sec
      – 100 nodes: 1 sec
      – 1000 nodes: 100 msec
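The figures on this slide follow from dividing the data size by the aggregate bandwidth of all nodes. A minimal sketch of that arithmetic, assuming the data is split evenly and every node reads its share in parallel with no coordination overhead (the function name `scan_seconds` is hypothetical, not part of the deck):

```python
TB = 10**12  # bytes, decimal terabyte
GB = 10**9   # bytes, decimal gigabyte


def scan_seconds(data_bytes: float, nodes: int, per_node_bytes_per_sec: float) -> float:
    """Ideal time to read data spread evenly over `nodes` parallel readers."""
    aggregate_bandwidth = nodes * per_node_bytes_per_sec
    return data_bytes / aggregate_bandwidth


# Reproduce the slide's table: 1 TB at 10 GByte/s per node.
for nodes in (1, 10, 100, 1000):
    seconds = scan_seconds(1 * TB, nodes, 10 * GB)
    print(f"{nodes:>4} nodes: {seconds:g} s")
```

Real clusters fall short of this ideal because of skewed data placement, network transfer and job start-up cost, so the numbers are best read as an upper bound on the speed-up.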
  • 103. Lab 2 – MapReduce
      1) Skip task 1.1._1; use PuTTY to connect to biadmin@10.199.20.51 instead
      2) Replace /home/biadmin with /home/biadminX, where X is your user ID
      3) In 1.1._4 – 1.1._6, replace output with /home/biadminX/output, where X is your user ID
      4) Skip chapter 1.2
      5) Chapter 1.3 is optional (using your local virtual machine), maybe during lunch break :)
  • 104. Pig, Jaql, Hive, BigSQL, SystemT/AQL
  • 133. SQL for BigInsights
      – Data warehouse augmentation is a very common use case for Hadoop
      – While highly scalable, MapReduce is notoriously difficult to use
          • Java API is tedious and requires programming expertise
          • Unfamiliar languages (e.g. Pig) also require expertise
          • Many different file formats, storage mechanisms, configuration options, etc.
          • Joins, grouping and sorting are tedious to orchestrate
      – SQL support opens the data to a much wider audience
          • Familiar, widely known syntax
          • Common catalog for identifying data and structure
          • Clear separation of defining the what (you want) vs. the how (to get it)
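The contrast between the "what" and the "how" can be illustrated with a single grouping query. The sketch below is not Hadoop or Big SQL code: it imitates the map, shuffle and reduce phases in plain Python and then runs the equivalent `GROUP BY` through Python's built-in `sqlite3` module. The sample rows, the table name `posts` and the function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (newsgroup, author)
rows = [("comp.lang", "ann"), ("sci.math", "bob"), ("comp.lang", "cid")]


# The "how": the programmer spells out each phase by hand.
def map_phase(records):
    """Emit one (key, 1) pair per input record."""
    for newsgroup, _author in records:
        yield newsgroup, 1


def shuffle(pairs):
    """Collect all values that share a key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


mr_counts = reduce_phase(shuffle(map_phase(rows)))

# The "what": one declarative statement, execution strategy left to the engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (newsgroup TEXT, author TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", rows)
sql_counts = dict(
    conn.execute("SELECT newsgroup, COUNT(*) FROM posts GROUP BY newsgroup").fetchall()
)
conn.close()

print(mr_counts == sql_counts)
```

Both paths compute the same counts; the SQL version leaves grouping and aggregation strategy to the engine, which is the separation the slide describes.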
  • 134. Query Processing
      – Big SQL consists of two query processing engines
          • The SQL optimization engine
          • Jaql as the query execution engine
      – Diagram labels: Client, SQL Engine (SQL Optimizer), Jaql (Jaql Runtime)
  • 135. Big SQL vs. Alternatives
      – There are a number of SQL solutions; where does Big SQL fit in?
      – Hive
          • Open source: established Hadoop component, active development community
          • Restrictive SQL syntax: no subqueries (Hive 0.11 adds non-correlated subquery support), no windowed aggregates (Hive 0.11 adds windowed aggregate support), ANSI join syntax only
          • Limited type support: no varchar(n), decimal(p,s), etc.
          • Poor client support: limited JDBC and ODBC drivers
          • Poor low-latency query support (via local MapReduce)
  • 136. Big SQL vs. Alternatives (cont.)
      – Impala
          • Recently open sourced
          • Achieves low latency by bypassing the MapReduce infrastructure: installs a completely separate execution infrastructure, which can lead to resource scheduling conflicts
          • Execution engine is C++: great for performance, but makes extending difficult (e.g. UDFs and UDAs); supports only a limited set of file formats
          • Currently limited to broadcast joins: all tables must fit in memory (aggregate cluster memory), a scalability limitation for larger clusters
          • Uses Hive 0.9 query syntax (more limitations than the current Hive)
          • Uses Hive 0.9 type system (more limitations than the current Hive)
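To make the memory constraint of a broadcast join concrete, here is a minimal sketch of the general technique, not Impala's implementation: the smaller table is turned into an in-memory hash table that every node receives, and each node joins its own partition of the larger table against it. The function and field names are hypothetical, and the "nodes" are simulated by a list of partitions.

```python
def broadcast_hash_join(big_partitions, small_table, key):
    """Join each partition of the big table against a broadcast copy of the small table."""
    # Build the lookup once. On a cluster, this structure would be shipped to
    # every node, which is why it has to fit in memory.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in big_partitions:      # each partition would live on its own node
        for row in partition:             # local scan, no data exchange between nodes
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical data: orders are partitioned across two "nodes", customers are broadcast.
orders = [
    [{"order": 1, "cust": "a"}],
    [{"order": 2, "cust": "b"}, {"order": 3, "cust": "z"}],
]
customers = [{"cust": "a", "city": "Zurich"}, {"cust": "b", "city": "Bern"}]

result = broadcast_hash_join(orders, customers, "cust")
print(result)
```

The large table never moves between nodes, which keeps latency low, but the broadcast side grows with the data rather than with the cluster, which matches the scalability limitation noted on the slide.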
  • 141. Lab 3 – Querying Data with Pig, Hive, Jaql
      1) Use PuTTY to connect to biadmin@10.199.20.51
      2) Skip task 1.1._2; start the Jaql shell using the command /opt/ibm/biginsights/jaql/bin/jaqlshell
      3) In 1.1._5, replace biadmin with biadminX, where X is your user ID
      4) Skip chapter 1.2 (optional, using the virtual machine)
      5) In 1.3._2, replace biadmin with biadminX, where X is your user ID
      6) Instead of task 1.3._2, type /opt/ibm/biginsights/pig/bin/pig
      7) In 1.3._4, replace sampleData/NewsGroups.csv with /user/biadminX/sampleData/NewsGroups.csv
      8) Skip chapter 1.4 (optional, using the virtual machine)
      9) Skip 1.5._12 and _13 and type /opt/ibm/biginsights/hive/bin/hive instead
      10) Type "use biadminX", where X is your user ID
      11) Continue with task _14
  • 142. NoSQL Databases
      – Column Store: Hadoop / HBase, Cassandra, Amazon SimpleDB
      – JSON / Document Store: MongoDB, CouchDB
      – Key / Value Store: Amazon DynamoDB, Voldemort
      – Graph DBs: DB2 SPARQL Extension, Neo4J
      – MP RDBMS: DB2 DPF, DB2 pureScale, PureData for Operational Analytics, Oracle RAC, Greenplum
      – http://nosql-database.org/ lists > 150 databases
  • 143. CAP Theorem / Brewer's Theorem¹
      – It is impossible for a distributed computer system to simultaneously guarantee all 3 properties
          • Consistency (all nodes see the same data at the same time)
          • Availability (every request receives a response about whether it succeeded or failed)
          • Partition tolerance (the system continues to operate despite failure of part of the system)
      – What about ACID?
          • Atomicity
          • Consistency
          • Isolation
          • Durability
      – BASE, the new ACID
          • Basically Available
          • Soft state
          • Eventual consistency: Monotonic Read Consistency, Monotonic Write Consistency, Read Your Own Writes
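Of the eventual-consistency guarantees listed, "Read Your Own Writes" is the easiest to show in a few lines. The sketch below is a toy model under loud assumptions: a single primary, one lagging replica, and a client session that remembers the version of its last write. The classes `Replica` and `Session` and the helper `replicate` are hypothetical and do not correspond to any particular database.

```python
class Replica:
    """A copy of one value, tagged with the version it has caught up to."""

    def __init__(self):
        self.version = 0
        self.value = None


def replicate(primary, replica):
    """Simulate asynchronous replication finally reaching the replica."""
    replica.version, replica.value = primary.version, primary.value


class Session:
    """Client session that enforces read-your-own-writes over lagging replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.last_written = 0  # version of this session's most recent write

    def write(self, value):
        self.primary.version += 1
        self.primary.value = value
        self.last_written = self.primary.version

    def read(self):
        # Use a replica only if it has caught up to our own last write;
        # otherwise fall back to the primary.
        for replica in self.replicas:
            if replica.version >= self.last_written:
                return replica.value
        return self.primary.value


primary, replica = Replica(), Replica()
session = Session(primary, [replica])
session.write("hello")
stale_value = replica.value       # replica has not received the write yet
own_read = session.read()         # served by the primary, so the write is visible
replicate(primary, replica)
caught_up_read = session.read()   # now served by the replica
```

Other sessions reading from the replica may still see the old value for a while; the guarantee only covers the session that performed the write.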
  • 144. Certification
      – Go to www.bigdatauniversity.com
      – Search for "hadoop fundamentals"
      – Choose "Hadoop Fundamentals I – Version 2"
      – Sign up
      – Log in with an existing account or one of the following:
      – Take the test:
  • 145. Questions?