Andrey Vykhodtsev
Andrey.Vykhodtsev@si.ibm.com
Agenda
• Massive Parallel Processing concepts
• Overview of Hadoop Architecture
• Processing Engines
• Map Reduce
• Spark
• Hive, Pig, BigSQL
• Hadoop distributions
• Stream processing
• Advanced analytics on Hadoop
Big Data
 An umbrella term that really means
“analytics at scale on any kind of data”
 It is about :
 Scalability
 Cost reduction (per terabyte or of
infrastructure)
 Variety of formats to analyze
 New types of analytics
Use Cases
 Telco
 Mediation
 Geolocation / fencing
 Call archival
 Lawful intercept
 …
 Banking
 Counter-fraud
 Regulatory
compliance
 Analyzing customer
behavior
Definition
 Perform a set of coordinated
computations in parallel (wiki def.)
 Grid computing
 Cluster computing
 Why? To make things faster
 Counting the buttons on the clothes of everyone in a stadium would
take 1 person 34* days, or 1000 people about 50
minutes*
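The stadium figures follow from ideal linear speedup. A minimal sketch of the arithmetic, assuming the work divides evenly and coordination costs nothing (the function name is illustrative, not from the slides):

```python
def ideal_parallel_minutes(serial_days: float, workers: int) -> float:
    """Minutes needed when the work splits evenly across all workers."""
    serial_minutes = serial_days * 24 * 60  # convert days to minutes
    return serial_minutes / workers


# 34 days of work for one person, shared by 1000 people:
minutes = ideal_parallel_minutes(serial_days=34, workers=1000)
print(f"{minutes:.2f} minutes")  # close to the 50 minutes quoted above
```

Real clusters fall short of this ideal because splitting the work and collecting the results also take time.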
Types of systems
 Shared Memory
(SMP)
 Simple to implement data processing features
 Expensive to scale
 Shared Disk clusters
 Easier to implement
storage layer
 Bottlenecked above
storage layer
 Harder to scale
 Shared nothing
clusters
Types of systems (cont.)
 Shared Nothing Clusters
Types of systems (cont.)
 Relational Database management system
 SQL Support
 ACID (Atomicity, Consistency, Isolation,
Durability)
 All interfaces lower than SQL are hidden
 Netezza
 General Processing frameworks
 Lower level interfaces exposed
 MPI
 Hadoop
Notable systems
 MPP RDBMS Teradata ~ 1980
 Netezza ~ 2000
 Hadoop ~ 2006
NoSQL
 CAP Theorem
 Consistency, Availability, Partition tolerance
 Pick 2
 BASE in contrast to ACID
 Cloudant, HBase
 Different database genres
 Graph
 Document Store
 Columnar
 Key Value
Hadoop
 Distributed platform for thousands of nodes
 Data storage and computation framework
 Open source
 Runs on commodity hardware
 Flexible – everything is loosely coupled
Hadoop benefits
 Linear scalability
 Software resilience rather than
expensive hardware
 “Schema on read”
 Parallelism
 Variety of tools
The Hadoop Filesystem (HDFS)
 Driving principles
 Files are stored across the entire cluster
 Programs are brought to the data, not the data to the program
 Distributed file system (DFS) stores blocks across the whole cluster
 Blocks of a single file are distributed across the cluster
 A given block is typically replicated as well for resiliency
 Just like a regular file system, the contents of a file are up to the application
 Unlike a regular file system, you can ask it “where does each block of my file live?”
[Figure: a FILE divided into BLOCKS]
Hadoop Distributed File System
HDFS
 Stores files in folders
 Nobody cares what’s in your files
 Chunks large files into blocks (~64MB-2GB)
 3 replicas of each block (by default)
 Blocks are scattered all over the place
[Figure: a FILE divided into BLOCKS]
HDFS – Architecture
 Master / Slave architecture
 Master: NameNode
 manages the file system
namespace and metadata
○ FsImage
○ EditLog
 regulates access to files by
clients
 Slave: DataNode
 many per cluster
 manages storage attached to
the nodes
 periodically reports status to
NameNode
[Figure: File1 is split into blocks a, b, c, d; the NameNode tracks the blocks, and three copies of each block are spread across the DataNodes]
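The block-and-replica idea can be sketched in a few lines. This is a toy model under stated assumptions: the round-robin placement, the node names, and both function names are hypothetical, and the real HDFS placement policy is more involved.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Chunk a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Toy placement: block i is stored on `replication` consecutive nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


data = b"abcdefghij"                      # stand-in for a large file
blocks = split_into_blocks(data, block_size=3)
nodes = ["dn1", "dn2", "dn3", "dn4"]      # hypothetical DataNode names
block_map = place_replicas(len(blocks), nodes)

# "Where does each block of my file live?" is a lookup in the metadata:
for block_id, holders in block_map.items():
    print(block_id, blocks[block_id], holders)
```

The dictionary plays the role of the NameNode's metadata, while the byte chunks stand for what the DataNodes store.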
 Common pattern in data processing: apply a function, then aggregate
- Identify words in each line of a document collection
- For each word, return the sum of occurrences throughout the collection
 User simply writes two pieces of code: “mapper” and “reducer”
- Mapper executes on every split of every file
- Reducer consumes/aggregates mapper outputs
• The Hadoop MR framework takes care of the rest (resource allocation,
scheduling, coordination, storage of final result on DFS, . . . )
[Figure: MapReduce. A logical file is divided into splits 1, 2, 3; the cluster runs a Map task on each split, and a Reduce task combines the map outputs into the result]
Logical MapReduce Example: Word Count
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Content of Input Documents:
Hello World Bye World
Hello IBM
Map 1 emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
Map 2 emits:
< Hello, 1>
< IBM, 1>
Reduce (final output):
< Bye, 1>
< IBM, 1>
< Hello, 2>
< World, 2>
MapReduce processing
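The word-count flow can be traced on a single machine. The sketch below is not Hadoop code: the function names and the in-memory shuffle are assumptions made for illustration, standing in for work the framework does across the cluster.

```python
from collections import defaultdict


def map_phase(doc_name: str, contents: str):
    """Mapper: emit (word, 1) for every word in one document."""
    for word in contents.split():
        yield (word, 1)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: list[int]):
    """Reducer: sum all counts emitted for one word."""
    return (word, sum(counts))


documents = {"doc1": "Hello World Bye World", "doc2": "Hello IBM"}

intermediate = [
    pair
    for name, text in documents.items()
    for pair in map_phase(name, text)
]
result = dict(reduce_phase(word, counts) for word, counts in shuffle(intermediate).items())
print(result)
```

Only the mapper and the reducer would be written by the user; the grouping step in the middle is part of "the rest" that the framework handles.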
Spark
 Spark brings two significant value-adds:
 Bring to Map Reduce the same added value that
databases (and parallel databases) brought to query
processing:
○ Let the app developer focus on the WHAT (they need to ask) and
let the system figure out HOW (it should be done).
○ Enable faster higher level application development through
higher level constructs and concepts: (RDD concept)
○ Let the system deal with performance (as part of the HOW)
 Leveraging memory (Bufferpools, Caching RDDs in memory)
 Maintaining sets of dedicated worker processes ready to go
(subagents in DBMS, Executors in Spark)
○ Enabling interactive processing (CLP, SQL*Plus, spark-shell,
etc….)
 Be one general-purpose engine for multiple types of
workloads (SQL, Streaming, Machine Learning, etc…)
Spark (cont.)
 Apache Spark is a fast, general
purpose, easy-to-use cluster
computing system for large-scale
data processing
 Fast
○ Aggressively caches data in memory across the
cluster and keeps dedicated application
Executor processes alive even when no jobs
are running
○ Faster than MapReduce
 General purpose
○ Covers a wide range of workloads
○ Provides SQL, streaming and complex
analytics
 Flexible and easier to use than Map
Reduce
○ Spark is written in Scala, an object oriented,
functional programming language
○ Scala, Python and Java APIs
○ Scala and Python interactive shells
○ Runs on Hadoop, Mesos, standalone or
cloud
Logistic regression in Hadoop and Spark
Spark Stack
WordCount
val wordCounts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
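To follow what each step of the WordCount chain does, here is a plain-Python imitation. It runs eagerly on one machine and uses no Spark API; `reduce_by_key` is a hypothetical helper written for this sketch, and the two input lines stand in for README.md.

```python
from itertools import chain


def reduce_by_key(pairs, combine):
    """Fold values that share a key using a two-argument function."""
    totals = {}
    for key, value in pairs:
        totals[key] = combine(totals[key], value) if key in totals else value
    return totals


lines = ["to be or", "not to be"]  # hypothetical stand-in for the file's lines

words = chain.from_iterable(line.split(" ") for line in lines)  # flatMap step
pairs = ((word, 1) for word in words)                           # map step
word_counts = reduce_by_key(pairs, lambda a, b: a + b)          # reduceByKey step
print(word_counts)
```

The combining function only ever sees two values at a time, which is what lets an engine apply it in parallel on partial results.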
Spark (cont.)
Pig
 Pig is a query language that runs MapReduce
jobs
 Higher-level than MapReduce: write code in
terms of GROUP BY, DISTINCT, FOREACH,
FILTER, etc.
 Custom loaders and storage functions make
this good glue
A = LOAD 'data.txt'
    AS (name:chararray, age:int, state:chararray);
B = GROUP A BY state;
C = FOREACH B GENERATE group, COUNT(A), AVG(A.age);
DUMP C;
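For readers unfamiliar with Pig, the same group-then-aggregate logic can be written in plain Python. This is a sketch only: the rows are invented sample data, and nothing here runs on a cluster.

```python
from collections import defaultdict

# Hypothetical rows in the (name, age, state) layout of data.txt
rows = [("Ana", 30, "SI"), ("Bor", 40, "SI"), ("Cleo", 25, "HR")]

# GROUP A BY state
ages_by_state = defaultdict(list)
for name, age, state in rows:
    ages_by_state[state].append(age)

# FOREACH group GENERATE state, row count, average age
summary = {
    state: (len(ages), sum(ages) / len(ages))
    for state, ages in ages_by_state.items()
}
print(summary)
```

Pig turns the equivalent script into MapReduce jobs, with the grouping key becoming the shuffle key.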
Hive
 SQL Engine on top of MapReduce
 Rapidly developed, lots of features
 Query language – HiveQL – deviates
from ANSI SQL
 Lacks a cost-based query optimizer,
statistics, and many other features
 Not responsive enough for small jobs
BigSQL
Data shared with Hadoop ecosystem
Comprehensive file format support
Superior enablement of IBM and Third Party
software
Modern MPP runtime
Powerful SQL query rewriter
Cost based optimizer
Optimized for concurrent user throughput
Results not constrained by memory
Distributed requests to multiple data sources
within a single SQL statement
Main data sources supported:
DB2 LUW, Teradata, Oracle, Netezza,
Informix, SQL Server
Advanced security/auditing
Resource and workload management
Self tuning memory management
Comprehensive monitoring
Comprehensive SQL Support
IBM SQL PL compatibility
Extensive Analytic Functions
A lot of buzzwords
 Ambari – web admin interface
 ZooKeeper – distributed object sync
 HBase – NoSQL key/value store
 Flume – buffered ingestion
 Sqoop – Database import/export
 Oozie – workflow manager
 YARN – cluster resource manager
 Nagios/Ganglia – monitoring, metrics
Hortonworks HDP
Cloudera
IBM BigInsights
IBM BigInsights for Apache Hadoop
 IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider)
 IBM BigInsights Analyst
 Industry standard SQL (Big SQL)
 Spreadsheet-style tool (BigSheets)
 IBM BigInsights Data Scientist
 Text Analytics
 Machine Learning on Big R
 Big R (R support)
 IBM BigInsights Enterprise Management
 POSIX Distributed Filesystem
 Multi-workload, multi-tenant scheduling
 Free Quick Start (non production):
• IBM Open Platform
• BigInsights Analyst, Data Scientist features
• Community support
Overview
 Analyzing data on the fly vs. storing it
 Sometimes both have to be done
 Batch vs. Stream processing
 Low latency needs special design
considerations
 Processing is done on “windows” rather
than on tables/dataframes
 Engines differ by architecture,
development tools, latency
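Processing on windows can be sketched as follows. This is a minimal tumbling-window count over a finite list, assuming integer timestamps in seconds; the function name and the events are hypothetical, and a real engine would emit each window as it closes rather than wait for all input.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds: int) -> dict[int, int]:
    """Count events per fixed, non-overlapping time window.

    Each event is a (timestamp, payload) pair; the result maps the
    start time of each window to the number of events that fell in it.
    """
    counts = defaultdict(int)
    for timestamp, _payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
print(tumbling_window_counts(events, window_seconds=10))
```

Micro-batch engines use the same bucketing idea, processing one small batch of recent events at a time.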
Apache Flume
 Agents can be installed on a variety of platforms
 Collectors buffer data and write it to HDFS
 Reliable
 Limited to micro-batch data collection
[Figure: agents on three servers feed a Collector, which writes to HDFS]
Spark Streaming
 Micro batch engine
 Reliable
 Integrated with Spark
Apache Storm
 Twitter project
now in Apache
 Development in
Java
 Bolts and Spouts
 Guaranteed
record delivery
InfoSphere Streams
 Highest-performing and most
sophisticated streaming
engine
 Easy IDE
 Declarative streaming
language
 Parallel execution
framework
 Many advanced toolkits
 Video, audio, signal
processing, finance,
geospatial, integration, etc
 Integrated with enterprise
tools
Data Science Life: Two Main Tasks
1) Exploration: We don’t have any special attribute we want to predict. Rather, we want to understand the structure present in the data. Are there clusters? Non-obvious relationships?
- Also referred to as “unsupervised learning”
- E.g., K-means clustering
Use Cases -> Understanding categories of customers, cross-selling opportunities, etc…
2) Prediction: The data contains a particular attribute (called the target attribute) and we want to learn how the target attribute depends on the other attributes.
- Also referred to as “supervised learning”
- E.g., Support vector machines
Use Cases -> Building a model to predict customer churn, fraud, etc…
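As an illustration of the exploration task, here is a minimal K-means sketch on one-dimensional data. The points, the starting centroids, and the function name are assumptions made for this example; practical work would normally use a library implementation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # hypothetical measurements
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])
print(centroids, clusters)
```

No target attribute is involved: the algorithm only discovers that the values fall into two groups.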
Data Science Life: Tools at Present
SQL (42%)
R (33%)
Python (26%)
Excel (25%)
Java, Ruby, C++ (17%)
SPSS, SAS (9%)
Data Science Life: Skillset of the Data Scientist
[Figure: the data scientist combines three roles: Statistician, Software Engineer, Business Analyst]
Process Automation
Parallel Computing
Software Development
Database Systems
Mathematics Background
Analytic Mindset
Domain Expertise
Business Focus
Effective Communication
CRISP-DM: Cross Industry Standard Process for Data Mining
The Typical Data Science Workflow
The Architect: What is Open Source R?
What is CRAN?
R is a powerful programming language and environment for statistical
computing and graphics.
R offers a rich analytics ecosystem:
 Full analytics life-cycle
○ Data exploration
○ Statistical analysis
○ Modeling, machine learning, simulations
○ Visualization
 Highly extensible via user-submitted packages
○ Tap into innovation pipeline contributed to by highly-regarded statisticians
○ Currently 4700+ statistical packages in repository
○ Easily accessible via CRAN, the Comprehensive R Archive Network
 R is the fastest growing data analysis software
○ Deeply knowledgeable and supportive analytics community
○ The most popular software used in data analysis competitions
○ Gaining speed in corporate, government, and academic settings
Big R Architecture
[Figure: the R user interface on top of three numbered components: scalable data processing, native R functions, scalable algorithms]
User Experience for Big R
Connect to BI cluster
Data frame proxy to large data file
Data transformation step
Run scalable linear regression on cluster
IBM System ML
 Collection of distributed algorithms
 Currently embedded in Big R
 Contributed to Spark on 15 June 2015
SPSS on Hadoop
Python for data analysis
 IPython notebooks
 pandas/NumPy
 Scikit
 matplotlib
 Python Spark API
Want to learn more?
 Download Quick Start offering
 Test drive the technologies
 Links all available from HadoopDev
– https://developer.ibm.com/hadoop/

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 

Big Data Essentials meetup @ IBM Ljubljana 23.06.2015

  • 2. Agenda • Massive Parallel Processing concepts • Overview of Hadoop Architecture • Processing Engines • Map Reduce • Spark • Hive, Pig, BigSQL • Hadoop distributions • Stream processing • Advanced analytics on Hadoop
  • 3.
  • 4. Big Data  An umbrella term that really means “analytics at scale on any kind of data”  It is about :  Scalability  Cost reduction (per terabyte or of infrastructure)  Variety of formats to analyze  New types of analytics
  • 5. Use Cases  Telco  Mediation  Geolocation / fencing  Call archival  Lawful intercept  …  Banking  Counter-fraud  Regulatory compliance  Analyzing customer behavior
  • 6.
  • 7. Definition  Perform a set of coordinated computations in parallel (wiki def.)  Grid computing  Cluster computing  Why? To make things faster  Counting the buttons on the clothes of everyone in a stadium would take 1 person 34* days, or 1000 people 50 minutes*
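The two asterisked figures on this slide are consistent with an ideal linear speedup. The sketch below assumes the work divides perfectly and ignores coordination overhead; the function name is made up for illustration.

```python
def ideal_parallel_minutes(serial_days: float, workers: int) -> float:
    """Wall-clock minutes if the work splits perfectly across workers.

    Assumption: no coordination or communication overhead at all.
    """
    serial_minutes = serial_days * 24 * 60
    return serial_minutes / workers


# 34 days of single-person work spread over 1000 people
print(round(ideal_parallel_minutes(34, 1000), 2))  # 48.96
```

Roughly 49 minutes, which matches the slide's "50 minutes" once rounded. Real systems fall short of this bound because workers must be coordinated and partial results combined.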
  • 8. Types of systems  Shared Memory (SMP)  Simple in implementing data processing features  Expensive to scale  Shared Disk clusters  Easier to implement storage layer  Bottlenecked above storage layer  Harder to scale  Shared nothing clusters
  • 9. Types of systems (cont.)  Shared Nothing Clusters
  • 10. Types of systems (cont.)  Relational Database management system  SQL Support  ACID (Atomicity, Consistency, Isolation, Durability)  All interfaces lower than SQL are hidden  Netezza  General Processing frameworks  Lower level interfaces exposed  MPI  Hadoop
  • 11. Notable systems  MPP RDBMS Teradata ~ 1980  Netezza ~ 2000  Hadoop ~ 2006
  • 12. NoSQL  CAP Theorem  Consistency, Availability, Partition tolerance  Pick 2  BASE in contrast to ACID  Cloudant, HBase  Different database genres  Graph  Document Store  Columnar  Key Value
  • 13.
  • 14. Hadoop  Distributed platform for thousands of nodes  Data storage and computation framework  Open source  Runs on commodity hardware  Flexible – everything is loosely coupled
  • 15. Hadoop benefits  Linear scalability  Software resilience rather than expensive hardware  “Schema on read”  Parallelism  Variety of tools
  • 16. The Hadoop Filesystem (HDFS)  Driving principles  Files are stored across the entire cluster  Programs are brought to the data, not the data to the program  Distributed file system (DFS) stores blocks across the whole cluster  Blocks of a single file are distributed across the cluster  A given block is typically replicated as well for resiliency  Just like a regular file system, the contents of a file are up to the application  Unlike a regular file system, you can ask it “where does each block of my file live?”  [Diagram: a file divided into blocks]
  • 17. Hadoop Distributed File System (HDFS)  Stores files in folders  Nobody cares what’s in your files  Chunks large files into blocks (~64MB-2GB)  3 replicas of each block (by default)  Blocks are scattered all over the place  [Diagram: a file divided into blocks]
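The chunking and replication described on this slide can be sketched in a few lines. This is an illustration only: the round-robin placement and the node names are assumptions made for the example, not the placement policy HDFS actually uses.

```python
def split_into_blocks(file_size: int, block_size: int) -> list[int]:
    """Return the size of each block; only the last block may be short."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes (toy round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [nodes[(start + i) % len(nodes)] for i in range(replication)]
    return placement


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB, 64 * MB)          # three full blocks plus one of 8 MB
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, holders in placement.items():
    print(block_id, blocks[block_id] // MB, "MB ->", holders)
```

With three replicas on distinct nodes, losing any single node still leaves two copies of every block, which is the "software resilience" the earlier slide refers to.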
  • 18. HDFS – Architecture  Master / Slave architecture  Master: NameNode  manages the file system namespace and metadata ○ FsImage ○ EditLog  regulates access to files by clients  Slave: DataNode  many per cluster  manages storage attached to the nodes  periodically reports status to NameNode  [Diagram: File1 is split into blocks a, b, c, d; the NameNode tracks them, and each block is stored three times across the DataNodes]
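A toy model may help show how the pieces on this slide relate: the FsImage is a checkpoint of the namespace, the EditLog records changes since that checkpoint, and block locations are learned from DataNode reports. The class and method names below are hypothetical and do not correspond to the Hadoop API.

```python
class ToyNameNode:
    """Illustrative only: in-memory stand-in for NameNode metadata."""

    def __init__(self):
        self.fs_image = {}         # checkpointed namespace: path -> block ids
        self.edit_log = []         # changes made since the last checkpoint
        self.block_locations = {}  # block id -> DataNodes that reported it

    def create_file(self, path, block_ids):
        self.edit_log.append(("create", path, list(block_ids)))

    def delete_file(self, path):
        self.edit_log.append(("delete", path, None))

    def namespace(self):
        """Current view = checkpoint plus replayed edits."""
        view = dict(self.fs_image)
        for op, path, blocks in self.edit_log:
            if op == "create":
                view[path] = blocks
            else:
                view.pop(path, None)
        return view

    def checkpoint(self):
        """Fold the edit log into a new image and start a fresh log."""
        self.fs_image = self.namespace()
        self.edit_log = []

    def block_report(self, datanode, block_ids):
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Answer 'where does each block of my file live?'"""
        return {b: sorted(self.block_locations.get(b, ())) for b in self.namespace()[path]}


nn = ToyNameNode()
nn.create_file("/File1", ["a", "b", "c", "d"])
nn.block_report("dn1", ["a", "b", "d"])
nn.block_report("dn2", ["a", "c", "d"])
nn.block_report("dn3", ["b", "c"])
print(nn.locate("/File1"))
```

Note that the DataNodes, not the image, are the source of truth for block locations here, which is why the toy rebuilds that map from reports rather than persisting it.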
  • 19.
  • 20. MapReduce  Common pattern in data processing: apply a function, then aggregate - Identify words in each line of a document collection - For each word, return the sum of occurrences throughout the collection  User simply writes two pieces of code: “mapper” and “reducer” - Mapper executes on every split of every file - Reducer consumes/aggregates mapper outputs  The Hadoop MR framework takes care of the rest (resource allocation, scheduling, coordination, storage of final result on DFS, ...)  [Diagram: a file is divided into logical splits 1, 2, 3; each split is processed by a Map task on the cluster, and a Reduce step produces the result]
  • 21. Logical MapReduce Example: Word Count
      map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
          EmitIntermediate(w, "1");
      reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result));
      Content of Input Documents: "Hello World Bye World" and "Hello IBM"
      Map 1 emits: < Hello, 1> < World, 1> < Bye, 1> < World, 1>
      Map 2 emits: < Hello, 1> < IBM, 1>
      Reduce (final output): < Bye, 1> < IBM, 1> < Hello, 2> < World, 2>
  • 22. MapReduce processing Hello World Bye World Hello IBM Input Documents Reduce (final output): < Bye, 1> < IBM, 1> < Hello, 2> < World, 2> Map 1 emits: < Hello, 1> < World, 1> < Bye, 1> < World, 1> Map 2 emits: < Hello, 1> < IBM, 1>
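The word count walked through on these two slides can be reproduced in a single process. The sketch below is not Hadoop code: it imitates the map, shuffle and reduce phases with plain Python, and the function names and document names are invented for the example.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    """Mapper: emit (word, 1) for every word in one document."""
    for word in contents.split():
        yield word, 1


def reduce_counts(word, counts):
    """Reducer: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(documents):
    # Shuffle phase: group mapper output by key, as the framework would.
    shuffled = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_words(name, contents):
            shuffled[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return dict(reduce_counts(k, v) for k, v in sorted(shuffled.items()))


documents = {"doc1": "Hello World Bye World", "doc2": "Hello IBM"}
result = run_mapreduce(documents)
print(result)  # {'Bye': 1, 'Hello': 2, 'IBM': 1, 'World': 2}
```

On a cluster the mapper calls run in parallel on different splits and the grouping step moves data between nodes; the user-written parts are still only the two small functions.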
  • 23. Spark  Spark brings two significant value-adds:  Bring to Map Reduce the same added value that databases (and parallel databases) brought to query processing: ○ Let the app developer focus on the WHAT (they need to ask) and let the system figure out HOW (it should be done). ○ Enable faster application development through higher-level constructs and concepts (the RDD concept) ○ Let the system deal with performance (as part of the HOW)  Leveraging memory (Bufferpools, Caching RDDs in memory)  Maintaining sets of dedicated worker processes ready to go (subagents in DBMS, Executors in Spark) ○ Enabling interactive processing (CLP, SQL*Plus, spark-shell, etc.)  Be one general purpose engine for multiple types of workloads (SQL, Streaming, Machine Learning, etc.)
  • 24. Spark (cont.)  Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing  Fast ○ Makes aggressive use of in-memory caching for distributed computing, and keeps dedicated application Executor processes alive even when no jobs are running ○ Faster than MapReduce  General purpose ○ Covers a wide range of workloads ○ Provides SQL, streaming and complex analytics  Flexible and easier to use than Map Reduce ○ Spark is written in Scala, an object oriented, functional programming language ○ Scala, Python and Java APIs ○ Scala and Python interactive shells ○ Runs on Hadoop, Mesos, standalone or cloud  [Figures: "Logistic regression in Hadoop and Spark"; "Spark Stack"]  WordCount:
      val wordCounts = sc.textFile("README.md")
        .flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey((a, b) => a + b)
  • 26. Pig  Pig is a query language that runs MapReduce jobs  Higher-level than MapReduce: write code in terms of GROUP BY, DISTINCT, FOREACH, FILTER, etc.  Custom loaders and storage functions make this good glue
      A = LOAD 'data.txt' AS (name:chararray, age:int, state:chararray);
      B = GROUP A BY state;
      C = FOREACH B GENERATE group, COUNT(A), AVG(A.age);
      DUMP C;
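For readers unfamiliar with Pig, the script on this slide groups rows by state and reports a row count and average age per group. A plain Python equivalent is sketched below; the sample rows are made up, since the slide does not show the contents of data.txt.

```python
from collections import defaultdict

# Hypothetical stand-in for data.txt: (name, age, state)
rows = [("Ann", 30, "CA"), ("Bob", 40, "CA"), ("Cid", 20, "NY")]


def group_count_avg(rows):
    """Per state, return (number of rows, average age)."""
    ages_by_state = defaultdict(list)
    for _name, age, state in rows:        # B = GROUP A BY state
        ages_by_state[state].append(age)
    return {                              # C = FOREACH B GENERATE ...
        state: (len(ages), sum(ages) / len(ages))
        for state, ages in ages_by_state.items()
    }


summary = group_count_avg(rows)
print(summary)  # {'CA': (2, 35.0), 'NY': (1, 20.0)}
```

The difference in practice is that Pig compiles the same three steps into MapReduce jobs, so the grouping runs across the cluster rather than in one process.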
  • 27. Hive  SQL Engine on top of MapReduce  Rapidly developed, lots of features  Query language (HiveQL) deviates from ANSI SQL  Lacks a cost-based query optimizer, statistics, and many other features  Not responsive enough for small jobs
  • 28. BigSQL Data shared with Hadoop ecosystem Comprehensive file format support Superior enablement of IBM and Third Party software Modern MPP runtime Powerful SQL query rewriter Cost based optimizer Optimized for concurrent user throughput Results not constrained by memory Distributed requests to multiple data sources within a single SQL statement Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server Advanced security/auditing Resource and workload management Self tuning memory management Comprehensive monitoring Comprehensive SQL Support IBM SQL PL compatibility Extensive Analytic Functions
  • 29.
  • 30. A lot of buzzwords  Ambari – web admin interface  ZooKeeper – distributed object sync  HBase – NoSQL key/value store  Flume – buffered ingestion  Sqoop – Database import/export  Oozie – workflow manager  YARN – cluster resource manager  Nagios/Ganglia – monitoring, metrics
  • 31.
  • 32.
  • 35. IBM BigInsights Text Analytics POSIX Distributed Filesystem Multi-workload, multi-tenant scheduling IBM BigInsights Enterprise Management Machine Learning on Big R Big R (R support) IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, Hbase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, Zookeeper, Open JDK, Knox, Slider) IBM BigInsights Data Scientist IBM BigInsights Analyst Big SQL BigSheets Industry standard SQL (Big SQL) Spreadsheet-style tool (BigSheets) Free Quick Start (non production): • IBM Open Platform • BigInsights Analyst, Data Scientist features • Community support . . . IBM BigInsights for Apache Hadoop
  • 36.
  • 37. Overview  Analyzing data on the fly vs. storing it  Sometimes both have to be done  Batch vs. Stream processing  Low latency needs special design considerations  Processing is done on “windows” rather than on tables/dataframes  Engines differ by architecture, development tools, latency
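To make the idea of processing on "windows" concrete, here is a minimal sketch of a tumbling (fixed-size, non-overlapping) window count. The event tuples and the function name are assumptions for illustration, and a real streaming engine would emit each window as it closes rather than scanning a finished list.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size window, keyed by window start time."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Integer division maps a timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# (timestamp in seconds, payload)
events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
print(tumbling_window_counts(events, 10))  # {0: 3, 10: 1, 20: 1}
```

Each result describes one slice of time instead of a whole table, which is the main shift in thinking between batch and stream processing.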
  • 38. Apache Flume  Agents can be installed on a variety of platforms  Collectors buffer data and write it to HDFS  Reliable  Limited to micro batch data collection  [Diagram: agents on several servers send data to a collector, which writes to HDFS]
  • 39. Spark Streaming  Micro batch engine  Reliable  Integrated with Spark
  • 40. Apache Storm  Twitter project now in Apache  Development in Java  Bolts and Spouts  Guaranteed record delivery
  • 41. InfoSphere Streams  The highest-performing and most sophisticated streaming engine  Easy IDE  Declarative streaming language  Parallel execution framework  Many advanced toolkits  Video, audio, signal processing, finance, geospatial, integration, etc.  Integrated with enterprise tools
  • 42.
  • 43.
  • 44. Data Science Life: Two Main Tasks 1) Exploration: We don’t have any special attribute we want to predict. Rather we want to understand the structure present in the data. Are there clusters? Non-obvious relationships? - Also referred to as “unsupervised learning” - E.g., K-means clustering Use Cases -> Understanding categories of customers, cross-selling opportunities, etc. 2) Prediction: The data contains a particular attribute (called the target attribute) and we want to learn how the target attribute depends on the other attributes. - Also referred to as “supervised learning” - E.g., Support vector machines Use Cases -> Building a model to predict customer churn, fraud, etc.
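As an illustration of the exploration task, the sketch below runs K-means on one-dimensional data in plain Python. The data points and starting centroids are invented, and a real analysis would normally use a library implementation on multi-dimensional data.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

No target attribute is supplied anywhere: the two groups emerge from the data alone, which is what distinguishes this from the prediction task.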
  • 45. Data Science Life: Tools at Present SQL (42%) R (33%) Python (26%) Excel (25%) Java Ruby C++ (17%) SPSS SAS (9%)
  • 46. Data Science Life: Skillset of the Data Scientist Statistician Software Engineer Business Analyst Process Automation Parallel Computing Software Development Database Systems Mathematics Background Analytic Mindset Domain Expertise Business Focus Effective Communication
  • 47. CRISP-DM: Cross Industry Standard Process for Data Mining The Typical Data Science Workflow
  • 48.
  • 49. The Architect: What is Open Source R? What is CRAN? R is a powerful programming language and environment for statistical computing and graphics. R offers a rich analytics ecosystem:  Full analytics life-cycle ○ Data exploration ○ Statistical analysis ○ Modeling, machine learning, simulations ○ Visualization  Highly extensible via user-submitted packages ○ Tap into innovation pipeline contributed to by highly-regarded statisticians ○ Currently 4700+ statistical packages in repository ○ Easily accessible via CRAN, the Comprehensive R Archive Network  R is the fastest growing data analysis software ○ Deeply knowledgeable and supportive analytics community ○ The most popular software used in data analysis competitions ○ Gaining speed in corporate, government, and academic settings
  • 50.
  • 51. Big R Architecture 1 Scalable Algorithms Scalable Data Processing Native R functions R User Interface 2 3
  • 52. User Experience for Big R Connect to BI cluster Data frame proxy to large data file Data transformation step Run scalable linear regression on cluster
  • 53. IBM System ML  Collection of distributed algorithms  Currently embedded in BigR  Contributed to Spark on 15.06.15
  • 54.
  • 56.
  • 57. Python for data analysis  IPython notebooks  pandas/NumPy  scikit-learn  matplotlib  Python Spark API
  • 58.
  • 60.
  • 61.
  • 62. Want to learn more?  Download Quick Start offering  Test drive the technologies  Links all available from HadoopDev – https://developer.ibm.com/hadoop/