Hadoop Session
Contribute Summer of Technologies
Dieter De Witte (Archimiddle)
15/04/2015Contribute:SummerOfTechnologies
1
Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
15/04/2015Contribute:SummerOfTechnologies
2
I.A Big Data Introduction
History. What’s Big Data? What problems does it solve? Use cases.
Comparison with RDBMs.
15/04/2015Contribute:SummerOfTechnologies
3
How did it all start?
• Google: Indexing the web?
 Google File System (2003)
 MapReduce (2004)
• 2006: Doug Cutting joins Yahoo and
gets a dedicated team to work on his
Hadoop project
• 2008: Hadoop becomes a top level
Apache Project
• 2008: Hadoop breaks Terasort record:
 1TB, 910 nodes, 5 minutes
 2009: 59s (1400 nodes)
 2009: 3h (3400 nodes) 100TB sort
15/04/2015Contribute:SummerOfTechnologies
4
How does it all evolve? (http://blog.mikiobraun.de/2013/02/big-data-beyond-map-reduce-googles-papers.html)
Did Google sit back and relax?
• 2006: BigTable
• 2010: Percolator:
 BigTable +individual updates & transactions
• 2010: Pregel:
 scalable graph computing
• 2010: Dremel: interactive db (real-time)
• 2011: MegaStore (BigTable + schema)
 focus on distributed consistency
• 2012: Spanner (MegaStore + SQL)
Did the Open Source community?
• HBase (facebook messaging service)
• Apache Giraph, Neo4J
• Cloudera’s Impala
15/04/2015Contribute:SummerOfTechnologies
5
Doug Cutting: “Google is living a
few years in the future and sending
the rest of us messages”
Today’s data challenges
• Data momentum
= Volume * Velocity
• CAP Theorem
Parallel/Cloud computing
=> not only BIG data
=> also complex data analysis!
• DWH solutions are:
 Expensive!
 Not horizontally scalable
 Inflexible schemas
15/04/2015Contribute:SummerOfTechnologies
6
• Data Variety
 Data creation: humans <> machines
Use cases?
• Internet of Things:
Everything has an IP
• Customer behaviour analysis:
Clickstreams,...
• Social media analysis:
Twitter, Facebook, ...
• Fraud Detection:
Sample -> full datasets, realtime
• Cognitive computing: IBM Watson:
Large scale text mining
• Stack traces of complex systems:
Discovering system failure patterns
• Energy:
Centralized production -> distributed Smart Grids
• Keyword: Personalized ...
 Medicine
 Janssen + Intel + Univ. = Exascience project
 Drug prescription
 Insurance
 Advertising
 Travelling
• A different approach:
 Pattern Discovery
<> Pattern Matching
15/04/2015Contribute:SummerOfTechnologies
7
Pattern matching versus Discovery
• Hadoop is a Data Scientist’s playground:
Explore your (big) data, discover new patterns
• An RDBMS works with a data committee:
which patterns do we want to store? (the schema)
• Example from my research background:
Where do certain DNA patterns occur?
How to model DNA patterns?
• Cooking analogy:
Follow a recipe OR be the cook
15/04/2015Contribute:SummerOfTechnologies
8
Big Data hype or reality?
JOB AD
University of Leuven, ESAT-STADIUS
In the framework of a collaboration with Janssen Pharmaceuticals, we are looking
for a talented postdoctoral researcher to develop kernel methods that link drug
targets, disease phenotypes, and pharmaceutical compounds. Leveraging
large-scale public and in-house data sets, you will develop kernel methods and/or
network-based methods to predict potential links between targets, diseases, or
candidate drugs. This research builds upon the expertise of Janssen Pharma and
previous work of our team on genomic data fusion. The research will be also
carried out with a team of the University of Linz, Austria (Prof. Sepp Hochreiter)
specialized in kernel learning and chemoinformatics.
Project Details:
 Exascience project: Janssen, Imec, Intel, Universities,..
 NGS data (from $1 billion to $2,000) -> mapping -> SNP dataset -> disease matching
 Trend: Replace wet lab experiments by computer simulations
15/04/2015Contribute:SummerOfTechnologies
9
Limitations of classical RDBMS (http://youtu.be/d2xeNpfzsYI?t=3m24s to 5m00s)
Lecturer: Amr Awadallah (CTO & co-founder, Cloudera)
15/04/2015Contribute:SummerOfTechnologies
10
• Data streams:
• Data source -> storage only (raw)
• Storage layer => ETL => RDBMS => BI
• 3 problems:
• STORAGE TO ETL is problematic: moving data to compute doesn’t scale
• ETL typically runs overnight => not enough time to process all the data!
• Too much network overhead moving data from storage to the compute grid
• Solution? Move the code to where the data is!
• STORAGE TO ARCHIVE: archiving data too early = premature data death
• Data is archived too early because storage cost is too high (balance storage cost vs. economic value)
• Archiving is cheap, but retrieval is extremely expensive!
• Solution? Storage has to become cheaper! (return on byte)
• STORAGE TO BI: no ability to explore the original raw data
• You cannot ask NEW questions! Very inflexible!
15/04/2015Contribute:SummerOfTechnologies
11
The left hand and the right hand
Hadoop
Schema on read
Load is fast
Schemas can change
Only batch processing & no indexes
CAP: no transactions, no atomic updates!
Commodity hardware

Classical RDBMS
Schema on write
Load is slow (ETL first)
Adapting the schema is very difficult
Reads are fast (schema => indexing)
Very good at transactions (CRUD)
Expensive (purpose-built) Data Warehouse
15/04/2015Contribute:SummerOfTechnologies
12
The end of my presentation? (http://www.businessweek.com/articles/2014-06-27/google-just-made-big-data-expertise-much-tougher-to-fake)
15/04/2015Contribute:SummerOfTechnologies
13
For the last five years or so, it’s been pretty easy to pretend you knew something about Big Data. You went to the cocktail party—the one with all the dudes—grabbed a drink and
then said “Hadoop” over and over and over again. People nodded. Absurdly lucrative job offers rolled in the next day. Simple.
Well, Google (GOOG) officially put an end to the good times this week. During some talks at the company’s annual developer conference, Google executives declared that
they’re over Hadoop. It’s yesterday’s buzzword. Anyone who wants to be a true Big Data jockey will now need to be conversant in Flume, MillWheel, Google Cloud Dataflow, and
Spurch. (Okay, I made the last one up.)
Here’s the deal. About a decade ago, Google’s engineers wrote some papers detailing a new way to analyze huge stores of data. They described the method as MapReduce: Data
was spread in smallish chunks across thousands of servers; people asked questions of the information; and they received answers a few minutes or hours later. Yahoo! (YHOO) led
the charge to turn this underlying technology into an open-source product called Hadoop. Hundreds of companies have since helped establish Hadoop as more or less the standard
of modern data analysis work. (Much has been written on this topic.) Such startups as Cloudera, Hortonworks, and MapR have their own versions of Hadoop that companies can use, and just about every company that needs to analyze lots of information has its own Hadoop team.
Google probably processes more information than any company on the planet and tends to have to invent tools to cope with the data. As a result, its technology runs a good five to
10 years ahead of the competition. This week, it is revealing that it abandoned the MapReduce/Hadoop approach some time ago in favor of some more flexible data analysis
systems.
One of the big limitations around Hadoop was that you tended to have to do “batch” operations, which means ordering a computer to perform an operation in bulk and then waiting
for the result. You might ask a mainframe to process a company’s payroll as a batch job, or in a more contemporary example, analyze all the search terms that people in Texas
typed into Google last Tuesday.
According to Google, its Cloud Dataflow service can do all this while also running data analysis jobs on information right as it pours into a database. One example Google
demonstrated at its conference was an instantaneous analysis of tweets about World Cup matches. You know, life-and-death stuff.
Google has taken internal tools—those funky-named ones such as Flume and MillWheel—and bundled them into the Cloud Dataflow service, which it plans to start offering to
developers and customers as a cloud service. The promise is that other companies will be able to deal with more information easier and faster than ever before.
While Google has historically been a very secretive company, it is opening up its internal technology as a competitive maneuver. Google is proving more willing than,
say, Amazon.com (AMZN) to hand over the clever things built by its engineers to others. It’s an understandable move, given Amazon’s significant lead in the cloud computing arena.
As for the Hadoop clan? You would think that Google flat-out calling it passé would make it hard to keep hawking Hadoop as the hot, hot thing your company can’t live without. And
there’s some truth to this being an issue.
That said, even the biggest Hadoop fans such as Cloudera have been moving past the technology for some time. Cloudera leans on a handful of super-fast data analysis engines
like Spark and Impala, which can grab data from Hadoop-based storage systems and torture it in ways similar to Google’s.
The painful upshot, however, is that faking your way through the Big Data realm will be much harder from now on. Try keeping your Flume and Impala straight after a couple of gin
and tonics.
Sidenotes:
15/04/2015Contribute:SummerOfTechnologies
14
• Hadoop = Distributed storage (HDFS)
+ Compute layer (MapReduce)
• HDFS is NOT QUESTIONED!
• Cloudera is already providing
Spark training
Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
15/04/2015Contribute:SummerOfTechnologies
15
Cloudera, Hortonworks, MapR (and Intel’s Hadoop Distribution)
15/04/2015Contribute:SummerOfTechnologies
16
I.B Hadoop Distributions
Forrester Wave: Enterprise Hadoop Solutions (2012)
15/04/2015Contribute:SummerOfTechnologies
17
Performance of vendors on common
queries
15/04/2015Contribute:SummerOfTechnologies
18
Intel Makes Significant Equity
Investment in Cloudera
15/04/2015Contribute:SummerOfTechnologies
19
$740M Cloudera
Investment
PALO ALTO, Calif., and SANTA CLARA, Calif., March 27, 2014 – Intel
Corporation and Cloudera today announced a broad strategic technology
and business collaboration, as well as a significant equity investment
from Intel making it Cloudera’s largest strategic shareholder and a
member of its board of directors. This is Intel’s single largest data center
technology investment in its history. The deal will join Cloudera’s leading
enterprise analytic data management software powered by Apache
Hadoop™ with the leading data center architecture based on Intel®
Xeon® technology. The goal is acceleration of customer adoption of big
data solutions, making it easier for companies of all sizes to obtain
increased business value from data by deploying open source Apache
Hadoop solutions. Both the strategic collaboration and the equity
investment are subject to standard closing conditions, including
customary regulatory approvals.
Cloudera will develop and optimize Cloudera’s Distribution including
Apache Hadoop (CDH) for Intel architecture as its preferred platform and
support a range of next-generation technologies including Intel fabrics,
flash memory and security. In turn, Intel will market and promote CDH
and Cloudera Enterprise to its customers as its preferred Hadoop
platform. Intel will focus its engineering and marketing resources on the
joint roadmap. The optimizations from Intel’s Distribution for Apache
Hadoop/Intel Data Platform (IDH/IDP) will be integrated into CDH and
IDH/IDP and will be transitioned after v3.1 release at the end of March.
To ensure a seamless customer transition to CDH, Intel and Cloudera will
work together on a migration path from IDH/IDP. Cloudera will also
ensure that all enhancements will be contributed to their respective open
source projects and CDH.
...
Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
15/04/2015Contribute:SummerOfTechnologies
20
II. Hadoop Architecture
HDFS & MR architecture
Figures: Hadoop in practice, Data-intensive Text Processing, Hadoop: the definitive guide
15/04/2015Contribute:SummerOfTechnologies
21
Hadoop: The operating system for data
clusters
• Hadoop is a scalable fault-tolerant distributed system for data storage and
processing: Failure is the rule not the exception!
15/04/2015Contribute:SummerOfTechnologies
22
Hadoop Distributed Filesystem
• HDFS is optimized for streaming reads and writes:
 HDFS uses large block sizes (64 or 128 MB) => HD seek time is negligible (compared to read/write time)
 HDFS replicates its data blocks (usually 3 times) to improve availability and fault tolerance (resilient against node failure!)
• HDFS: No updates!
• Master-Slave: NameNode - DataNode
15/04/2015Contribute:SummerOfTechnologies
23
Hadoop reads
15/04/2015Contribute:SummerOfTechnologies
24
• The NameNode is a single point of failure
• DataNodes store the blocks & report block health to the NameNode
• The NameNode stores:
• the File – Block mapping
• the Block – DataNode mapping
Hadoop writes
15/04/2015Contribute:SummerOfTechnologies
25
• The DataNode reports write success
• Replication is taken care of by a pipeline of DataNodes to avoid a NameNode bottleneck! (see the client-side sketch below)
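To make the read and write paths above concrete, a minimal client-side sketch against the standard org.apache.hadoop.fs.FileSystem API (the path /demo/hello.txt is just an illustration); the NameNode lookups and the DataNode streaming described above happen transparently behind create() and open():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Write: the client asks the NameNode for target DataNodes and
    // streams the blocks through the replication pipeline shown above.
    try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
      out.writeUTF("hello HDFS");
    }

    // Read: the NameNode returns block locations; the client then
    // reads each block directly from a (preferably nearby) DataNode.
    try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"))) {
      System.out.println(in.readUTF());
    }
  }
}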
MapReduce execution engine
15/04/2015Contribute:SummerOfTechnologies
26
• Master-Slave: JobTracker - TaskTracker
• The JT schedules map and reduce tasks on the TaskTrackers
• The JT tries to schedule the work near the data => move the algorithm, NOT the data
• The TTs send heartbeats to the JT so it can check their health
• If a task fails 4 times => JOB failure
• If a TT fails 4 times => removed from the pool
• A TT has Map & Reduce slots to run M & R tasks
• Anything can be configured! (a small example follows below)
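As a small illustration of that configurability, a hedged sketch of tuning the task retry limits from the driver; the property names shown (mapreduce.map.maxattempts, mapreduce.reduce.maxattempts) are the Hadoop 2 names, older releases use the mapred.* equivalents:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.maxattempts", 4);     // a map task attempt may be retried up to 4 times
    conf.setInt("mapreduce.reduce.maxattempts", 4);  // likewise for reduce tasks
    Job job = Job.getInstance(conf, "retry-demo");   // the rest of the job setup goes here
    System.out.println(job.getConfiguration().get("mapreduce.map.maxattempts"));
  }
}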
Data flow
15/04/2015Contribute:SummerOfTechnologies
27
3 Key Value transformations:
(k1,v1) (line nr, text)
(k2,v2) (word, frequency)
(k3,v3) (word, list<freq>)
Ser/De in Hadoop
• A Mapper processes an InputSplit = { (k1,v1), (k2,v2), (k3,v3), ... }
• How is an InputSplit defined?
• InputFormat class splits input and RecordReader generates KV pairs from split
(TextInputFormat: split = slice of file, record = line of text)
• InputFormat tries to make splits = file blocks in HDFS
=> DATA LOCALITY
• Custom InputFormats possible : JSON, XML, SequenceFiles, ...
• Hadoop has its own serialization types: WritableComparables (Text,
IntWritable, FloatWritable, BytesWritable,...)
15/04/2015Contribute:SummerOfTechnologies
28
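The built-in types above usually suffice, but a custom key is just another WritableComparable. A minimal sketch of a hypothetical word-pair key (not from the slides): write()/readFields() handle serialization, compareTo() defines the key ordering in the shuffle:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class WordPairWritable implements WritableComparable<WordPairWritable> {
  private String left = "";
  private String right = "";

  public WordPairWritable() {}                                  // Hadoop needs the no-arg constructor
  public WordPairWritable(String l, String r) { left = l; right = r; }

  @Override public void write(DataOutput out) throws IOException {
    out.writeUTF(left);
    out.writeUTF(right);
  }
  @Override public void readFields(DataInput in) throws IOException {
    left = in.readUTF();
    right = in.readUTF();
  }
  @Override public int compareTo(WordPairWritable o) {          // sort by left word, then right word
    int c = left.compareTo(o.left);
    return c != 0 ? c : right.compareTo(o.right);
  }
  @Override public int hashCode() { return left.hashCode() * 31 + right.hashCode(); }
  @Override public boolean equals(Object o) {
    if (!(o instanceof WordPairWritable)) return false;
    WordPairWritable p = (WordPairWritable) o;
    return left.equals(p.left) && right.equals(p.right);
  }
  @Override public String toString() { return left + "," + right; }
}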
15/04/2015Contribute:SummerOfTechnologies
29
InputFormats
Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
15/04/2015Contribute:SummerOfTechnologies
30
III. Hadoop Ecosystem
& Design patterns
Intro to MapReduce programming; Using the ecosystem to stimulate code
reuse;
15/04/2015Contribute:SummerOfTechnologies
31
Hadoop
Word Count
Problem Description
• Given a set of documents, calculate
the number of times each word
occurs.
• A MapReduce program consists of:
 A Driver
 A Mapper (optional)
 A Reducer (optional)
DRIVER CODE
15/04/2015Contribute:SummerOfTechnologies
32
Hello World Hadoop: Word Count
• WordCountMapper:
 The InputFormat partitions the data into a set of splits.
 Each split is fed to a Mapper.
 A RecordReader generates the records (here: one line of Text per record).
 The Mapper takes a record and splits the text into words.
 The Mapper emits every word with frequency one.
• WordCountReducer:
 Hadoop is responsible for getting all key-value pairs with the same key to one reducer (parallel sort).
 A reducer gets the collection of values which go with a single key (the frequencies).
 This reducer adds up the frequencies and emits the sum to a file specified in setOutputPath(...).
• A compact sketch of these classes follows after this slide.
15/04/2015Contribute:SummerOfTechnologies
33
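A compact sketch of the Driver, Mapper and Reducer described above (the original workshop code lives on the slide images; this is the standard WordCount shape on the Hadoop 2 mapreduce API):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer st = new StringTokenizer(line.toString());
      while (st.hasMoreTokens()) {               // emit (word, 1) for every token in the line
        word.set(st.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();   // all counts for one word arrive together
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {    // the Driver
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}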
Optimization patterns in MR (Data-Intensive Text Processing, Lin & Dyer)
Use of Combiners
• A combiner is a mini-reducer which can
be run an arbitrary number of times on
the output of a mapper before it is
streamed to disk
In-Mapper Combiner design pattern
• Keep a HashMap in your Mapper class in which you accumulate all (word, frequency) pairs, and emit them only when the mapper has finished its split (sketched below)
• Drawback: the HashMap must fit in memory!
15/04/2015Contribute:SummerOfTechnologies
34
• Each Context.write() streams data to the local filesystem! (BOTTLENECK!)
• HINT: Is it necessary to emit every single (word, 1) pair from the mapper?
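A minimal sketch of the in-mapper combining pattern just described (word count again), assuming the per-split HashMap fits in memory:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final Map<String, Integer> counts = new HashMap<>();   // must fit in memory!

  @Override protected void map(LongWritable offset, Text line, Context ctx) {
    StringTokenizer st = new StringTokenizer(line.toString());
    while (st.hasMoreTokens()) {
      counts.merge(st.nextToken(), 1, Integer::sum);   // aggregate locally, no ctx.write() yet
    }
  }

  @Override protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // emit the per-split totals once, when the mapper has finished its split
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      ctx.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}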
NLP: Word co-occurrence matrices (Data-Intensive Text Processing, Lin & Dyer)
Problem Description
• For each word we want to calculate the
relative frequencies of words co-
occurring with this word (in the same
sentence).
• Example:
 (dog,cat) occurs 2 times, (dog,walking)
occurs 3 times then the results should be:
=> (dog,cat) = 40%, (dog,walking) = 60%
• Requirements?
 We must count all word co-occurrences &
sum all the (dog,*) combinations to calculate
the relative frequencies
MAPPERv1
15/04/2015Contribute:SummerOfTechnologies
35
Relative frequencies!
Problem
• We need to know how many times dog occurs together with any other word in order to calculate the relative frequencies!
• Solution: emit (dog,*) pairs as well!
MAPPERv2
15/04/2015Contribute:SummerOfTechnologies
36
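A hedged sketch of MAPPERv2 as described above, assuming co-occurrence is counted within one line of text and that the composite key is encoded as a Text value "left,right" (with "*" marking the marginal):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CooccurrenceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text pair = new Text();

  @Override protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] words = line.toString().split("\\s+");
    for (int i = 0; i < words.length; i++) {
      for (int j = 0; j < words.length; j++) {
        if (i == j || words[i].isEmpty() || words[j].isEmpty()) continue;
        pair.set(words[i] + "," + words[j]);   // e.g. (dog,cat) += 1
        ctx.write(pair, ONE);
        pair.set(words[i] + ",*");             // the (dog,*) marginal needed for relative frequencies
        ctx.write(pair, ONE);
      }
    }
  }
}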
How to get the data to the reducer?
Problem
• (dog,*), (dog,cat), (dog,walking) are different keys, so they might end up in different reducers!!!
• MapReduce possibilities
 MapReduce has a Partitioner which decides to which partition (reducer) each (K,V) must go
 MapReduce has a GroupingComparator to decide which (K,V) pairs end up in one reducer group
 MapReduce has a SortComparator to decide how to sort the (K,V) pairs within a reducer group
15/04/2015Contribute:SummerOfTechnologies
37
Partitioner and Reducer
15/04/2015Contribute:SummerOfTechnologies
38
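One way to solve the routing problem from the previous slide, sketched under the same "left,right" key encoding: a custom Partitioner that partitions on the left word only, so (dog,*), (dog,cat) and (dog,walking) all reach the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
  @Override public int getPartition(Text key, IntWritable value, int numPartitions) {
    String leftWord = key.toString().split(",")[0];        // (dog,*) and (dog,cat) share "dog"
    return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// registered in the driver with job.setPartitionerClass(LeftWordPartitioner.class)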
Limitations of MapReduce
Problem Description
• MapReduce does NOT stimulate code
reuse!
 Suppose we have a table and we want to
calculate the average, the minimum and
the maximum
 This can be done with 1 job but to make your
code reusable you need 3!
 A max job, a min job and an avg job
• MapReduce requires a lot of coding!
• Relational operators such as Joins,
Orders,... should only be written once!
Solution?
• On top of MapReduce, 2 scripting languages have been built which allow one to use relational logic that is then translated into a sequence of MapReduce jobs:
 PIG and HIVE
 Pig is a dataflow language: one data transformation per statement
 Hive is an SQL-like language
 Bottom line: you can use Hadoop without knowledge of MapReduce!
15/04/2015Contribute:SummerOfTechnologies
39
Hadoop Ecosystem
15/04/2015Contribute:SummerOfTechnologies
40
Pig’s philosophy (http://pig.apache.org/philosophy.html)
• Pigs eat anything
 relational, nested, unstructured,... data
• Pigs live anywhere
 Hadoop is not strictly required
• Pigs are domestic animals
 integration with other languages (Python)
 extendible with UDFs
• Pigs fly
 optimizes its translation to MR jobs
15/04/2015Contribute:SummerOfTechnologies
41
Pig Latin (http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html)
• Data types: same as SQL + {tuples, bags, maps}
• LOAD ‘path’ USING PigStorage(delim) AS ... (schema)
• STORE users INTO ‘path’ USING PigStorage(delim)
• DISTINCT operator: only keep unique records
• FILTER operator:
 FILTER users BY age == 30;
• SPLIT
 SPLIT users INTO adults IF age >= 18,
 children OTHERWISE;
• ORDER
 ORDER users BY age DESC/ASC;
15/04/2015Contribute:SummerOfTechnologies
42
Pig Latin (cont’d)
• FOREACH users GENERATE (projection + operations between columns)
 name,
 age;
• Nesting data with GROUP BY operator:
 gr = GROUP users BY age;
 age_counts = FOREACH gr GENERATE
 group as age,
 COUNT(users) as people_same_age;
• Unnesting data with FLATTEN operator:
 FLATTEN(tuple) => each tuple field to separate column
 FLATTEN(bag/map) => each bag/map item to separate row
• INNER JOIN: (LEFT/RIGHT/FULL OUTER also possible)
 JOIN age_counts by age, users by age;
• UNION users1, users2;
15/04/2015Contribute:SummerOfTechnologies
43
Pig Latin (cont’d)
• String Operators:
 SUBSTRING
 INDEXOF
 SIZE
 CONCAT
• Mathematical operators:
 MIN
 MAX
 AVG
 COUNT
 SUM
• Conditional logic with ternary operator: (age > 18 ? ‘adult’ : ‘child’);
15/04/2015Contribute:SummerOfTechnologies
44
Example script: min/max/avg
rain/month
• NOTE: this script will be translated into a single MR job => CODE REUSE!
15/04/2015Contribute:SummerOfTechnologies
45
Hadoop 4 you: environment?
• Possibilities to run your own Hadoop POC:
1. Develop locally using the open-source JARs:
 Eclipse IDE, IntelliJ
 Preferably a Linux environment, or Windows + Cygwin
2. Testing/Demo:
 Setting up your own one-node cluster:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
 Use a preconfigured virtual machine by one of the vendors:
Cloudera VM, HortonWorks VM, MapR VM + VMware or VirtualBox
3. Real cluster
 Setting up your own cluster
 Elastic MapReduce service of Amazon
15/04/2015Contribute:SummerOfTechnologies
46
Elastic MapReduce in the cloud
• Hive
• Pig
• Impala
• Hadoop streaming
• Hadoop custom jar
15/04/2015Contribute:SummerOfTechnologies
47
Conclusion
• Pig is a high-level scripting language on top of MapReduce
• Pig stimulates code reuse
 => It creates a logical plan to run a script in the lowest number of MR jobs
• Pig is very easy to use
• Most people limit themselves to Pig/Hive (ex.: Yahoo!)
• MapReduce gives you full control and allows you to optimize complex jobs:
Word Co-Occurrence matrices
• Some relational operators can be hard to implement:
 How would you implement a JOIN?
15/04/2015Contribute:SummerOfTechnologies
48
Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
15/04/2015Contribute:SummerOfTechnologies
49
IV. Soccer Dataset:
Introduction & Metadata
Parsing
What does the data look like? Why use a Big Data approach? Parsing the game metadata. Pig exercise.
15/04/2015Contribute:SummerOfTechnologies
50
2 Types of XML event streams
F7_*.xml x 90
• <SoccerDocument>
• <Competition> ... </Competition>
 <MatchData> goals, lineups, substitutions ...
</MatchData>
 <Team> home, players,... </Team>
 <Team> away... </Team>
 <Venue> ... </Venue>
• </SoccerDocument>
F24_*.xml x 90 (1500 events/game)
• <Game>
 <Event type, x,y, time, side, player,...>
 <Q id=... value=... />
 <Q id=... value=... />
 </Event>
 <Event ....
 ....
• </Game>
15/04/2015Contribute:SummerOfTechnologies
52
Why choose a Big Data approach
• The current Opta Sports dataset contains data from 2010-2014 (http://fivethirtyeight.com/features/lionel-messi-is-impossible/) with:
 16,574 players
 24,904 games (both league and international)
• Our sample dataset contains:
 90 games from Bundesliga 2 in 2008-2009
• Arguments?
 The real dataset IS big!
=>Implement scalable solution to start with!
 Processing in parallel is preferable
 Schema evolves over time
 Data is not relational
 Exploratory analysis: not sure what to look for?
 Fig.: result of Batch query
15/04/2015Contribute:SummerOfTechnologies
53
Our approach?
15/04/2015Contribute:SummerOfTechnologies
54
Parsing XML with StAX (http://www.developerfusion.com/article/84523/stax-the-odds-with-woodstox/)
15/04/2015Contribute:SummerOfTechnologies
55
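A minimal StAX (cursor API) skeleton along these lines; the element and attribute names ("Event", "type", "player") follow the F24 layout sketched two slides back and are assumptions about the actual feed:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxEventSketch {
  public static void main(String[] args) throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader r = factory.createXMLStreamReader(new FileInputStream(args[0]));
    while (r.hasNext()) {
      // cursor-based parsing: pull one XML event at a time, no DOM in memory
      if (r.next() == XMLStreamConstants.START_ELEMENT && "Event".equals(r.getLocalName())) {
        String type = r.getAttributeValue(null, "type");       // attribute names are assumptions
        String player = r.getAttributeValue(null, "player");
        System.out.println("event " + type + " by player " + player);
      }
    }
    r.close();
  }
}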
Processing F7 files: Code Walkthrough
15/04/2015Contribute:SummerOfTechnologies
56
Processing F7 files
• MR Job generates 4 types of records:
 games: “homeTeamID-awayTeamID 3-2”
 goals “teamID_playerID 1”
 teams “teamID* Freiburg”
 players “playerID Dieter De Witte”
• Assignment Reporting with Pig:
 Generate topscorer’s list (with playerID, teamID)
 Generate the team ranking (with teamID)
15/04/2015Contribute:SummerOfTechnologies
57
Pig Script F7 Walkthrough
15/04/2015Contribute:SummerOfTechnologies
58
Results
15/04/2015Contribute:SummerOfTechnologies
59
Top scoring players:

PlayerID  TeamID  Player               Goals
5052      165     Antonio Di Salvo     7
28323     387     Milivoje Novakovic   7
20856     810     Daniel Gunkel        6
39274     683     Rob Friend           6
4124      683     Oliver Neuville      5
33057     810     Felix Borja          5
38100     2111    Maximilian Nicu      5
12846     680     Marvin Braun         4
11827     810     Markus Feulner       4
10112     683     Sascha Rösler        4
Team ranking:

Rank  TeamID  Team                       W  D  L  GF  GA  Pts
1     683     Borussia Mönchengladbach   6  3  1  23  13  21
2     160     SC Freiburg                6  2  2  15   9  20
3     812     SpVgg Greuther Fürth       5  4  1  15   8  19
4     810     1. FSV Mainz 05            6  1  3  21  12  19
5     165     TSV München 1860           5  3  2  19  13  18
6     2111    SV Wehen Wiesbaden         5  2  3  21  19  17
7     387     1. FC Köln                 4  3  3  19  16  15
8     818     Alemannia Aachen           4  2  3  13  10  14
9     1744    OFC Kickers 1901           4  2  4  11  16  14
10    1902    1899 Hoffenheim            4  2  3  13  12  14
11    680     FC St. Pauli               4  0  6  10  16  12
12    2012    TuS Koblenz                3  3  4  14  19  12
13    1755    VfL Osnabrück              3  2  4  10  14  11
14    1741    FC Erzgebirge Aue          3  1  5  14  15  10
15    1772    FC Augsburg                2  3  5  16  21   9
16    163     1. FC Kaiserslautern       1  4  5   7  11   7
17    1757    FC Carl Zeiss Jena         1  3  6  14  22   6
18    1743    SC Paderborn               0  4  6   5  14   4
Session Overview
I. Introduction
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design patterns
IV. Soccer dataset: Introduction & metadata parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data analysis:
VI.A Soccer game classification & prediction
VI.B Individual player analysis
VII. Wrap up
15/04/2015Contribute:SummerOfTechnologies
60
V. Introduction to Data
Science: Decision trees
What’s data science? Classification with decision trees and random
forests
15/04/2015Contribute:SummerOfTechnologies
61
The sexiest job of the 21st century! (Harvard
business review)
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
• LinkedIn: focus on engineering => keep the
social network up and running!
• Jonathan Goldman: What would happen if you
presented users with names of people they
hadn’t yet connected with but seemed likely
to know? (e.g. where you went to school, same company, ...)
• Result: Very high click-through rate on ‘People you may know’ ads
• DS = high-ranking professional with the training
and curiosity to make discoveries in the world of
big data
• LinkedIn from ‘empty box’ to 300 Million users
Skill set of a data scientist (Google Images): the traditional Venn diagram, a V2.0 version (data science requires a team), and other representations (NSA/thesis advisor, visualization!). There is no clear definition:
“A person who is better at statistics than any software engineer and better at software engineering than any statistician”
Josh Wills, Sr. Director of Data Science, Cloudera
NLP study reveals: “More data beats better algorithms”
Algorithms for data science (ML)
• Machine learning algorithms are usually categorized as:
 supervised versus unsupervised (is a labelled set containing the ‘truth’ available?)
 input/output categorical or continuous
• Supervised examples: Classification (categorical) & Regression (continuous)
 Classification: soccer data -> win/draw/loss
 Regression: housing prices versus their size
• Unsupervised examples: Clustering & Collaborative filtering
 Can I divide my customers into certain segments based on their behaviour?
 Which books does Amazon recommend to me based on my previous purchases or based on what similar customers bought?
Classification
with
Decision Trees
15/04/2015Contribute:SummerOfTechnologies
68
Decision trees
• “A decision tree is a flowchart-like structure
in which an internal node represents a test
on an attribute, each branch represents the
outcome of the test and each leaf node
represents a class label”. (Wikipedia)
• Toy example: how could we build an optimal tree splitting people into Male/Female based on their length, hip perimeter and nose size?
• => Put the question (rule) which makes the best split first!
ID Length Hip perimeter Nose length Gender
1 155 70 2 F
2 160 80 2 F
3 165 80 3 F
4 170 75 3 F
5 180 90 2 F
6 190 65 3 M
7 200 60 2 M
8 195 55 3 M
9 185 50 3 M
10 175 50 10 M
Shannon entropy: the best split?
• To measure the best split we need an impurity measure: Shannon’s
information entropy (high entropy = high impurity)
• We have two classes: Male (M) & Female (F), 5 each (total T = 10)
• Entropy formula: S = - (M/T)·log2(M/T) - (F/T)·log2(F/T)
• min(-X·log X) = 0 when X = 0 or X = 1 => a perfect split has S = 0
• Initial set: 5/10 males, 5/10 females: S = 0,34
• Suppose we split on nose length:
• S_after = S_left + S_right = 0 + 0,31 = 0,31
• Information Gain = S_before - S_after = +0,03
• Note: the "Pinocchio" branch has S = 0 (completely pure)
Can we do better?
Split on length?
• S_after = S_left + S_right = 0 + 0,21 = 0,21
• Information Gain = S_before - S_after = +0,13
• A split on hip perimeter would be perfect for this training set
• NOTE: the training set is only a sample of the universe!
Scatterplot
15/04/2015Contribute:SummerOfTechnologies
71
Democracy versus Totalitarianism
• Decision trees are rather sensitive to the sample (overfitting!)
• An alternative to a single decision tree is a Random Forest
• A random forest classifier is an ensemble of decision trees, BUT:
 Each tree receives only a subset of the training data
 Each tree receives only a subset of the features
• The classification is done by adding the class probabilities together
• How can this work?
 Indecisive trees with bad feature sets have probabilities close to 0,5 => they have no influence
 Example: a tree with only nose length (nose length < 0,3): 44% male, 56% female
 Example: a tree with length (length > 1m70): 83% male, 17% female
• Side effect of Random Forest training: the weights help select the dominant features (feature selection = which features sit near the top of the best trees?)
15/04/2015Contribute:SummerOfTechnologies
72
Python: SciKit library
• The scikit-learn (SciKit) library contains machine learning algorithms
• Accuracy of the forest on the test set is 100%
15/04/2015Contribute:SummerOfTechnologies
73
Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
15/04/2015Contribute:SummerOfTechnologies
74
VI.A Soccer Data Analysis:
Classification & Prediction
15/04/2015Contribute:SummerOfTechnologies
75
Classifying soccer games
• Outcome = Win/Draw/Loss => Classification
• Use 80 games as a training set, 10 games to evaluate classifier
• Feature vectors:
 1 FV = 1 game as (Home vs Away)
 Content of FV are soccer statistics: #shots on goal, #passes, #successful passes,
#offensive passes, ...
 NOTE: eliminate features which have perfect correlation with result: #goals, #assists
 Every feature has 2 values: #home passes, #away passes
which are combined as (#home - #away) / (#home + #away) -> a value in [-1,+1]
• NOTE: classification can only be done AFTER the game!
MR Job to extract feature vectors
• 55 features per game
• Mapper parses F24_*.xml and creates feature vector, No Reducer required
• Events are regular: contain a set of attributes & qualifiers
 Create an Event class with an attributes map and a qualifier map
 Create Filter classes to filter events:
AreaFilter, OutcomeFilter, EventIDFilter, QualifierFilter, DirectionFilter
 A function that splits a set of events into Home and Away
• Live demo
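A hedged reconstruction of the Event/Filter design described above (class names, attribute keys and the "side" convention are assumptions based on the bullet points, not the actual workshop code):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// One parsed <Event> with its attributes and its <Q id=... value=.../> qualifiers
class Event {
  final Map<String, String> attributes = new HashMap<>();   // type, x, y, time, side, player, ...
  final Map<String, String> qualifiers = new HashMap<>();
}

interface EventFilter {
  boolean accept(Event e);
}

// Example filter: keep only events of a given type id
class EventIDFilter implements EventFilter {
  private final String typeId;
  EventIDFilter(String typeId) { this.typeId = typeId; }
  @Override public boolean accept(Event e) { return typeId.equals(e.attributes.get("type")); }
}

class Events {
  // split a game's events into Home (true) and Away (false) based on the "side" attribute
  static Map<Boolean, List<Event>> splitHomeAway(List<Event> events) {
    return events.stream().collect(
        Collectors.partitioningBy(e -> "Home".equals(e.attributes.get("side"))));
  }
}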
Accuracy 5/10 + 2 close calls
15/04/2015Contribute:SummerOfTechnologies
78
15/04/2015Contribute:SummerOfTechnologies
79
Results: feature selection
• Slightly modified MR job: emit feature vectors for both teams
• Can we predict the outcome of a game based on the teams’ history?
• Average the previous games + calculate the feature vector values ((x-y)/(x+y))
Using an RF classifier for prognosis?
• Use the history (9 games) to generate an average feature vector
• Use a weighted history to generate the average feature vector (see the sketch below):
• F = (1*F1 + 2*F2 + 3*F3 + ...) / (1+2+3+...)
15/04/2015Contribute:SummerOfTechnologies
80
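A small illustrative sketch of that weighted-history average, assuming one double[] feature vector per previous game, oldest game first (the method name and layout are mine, not the workshop code):

public class WeightedHistory {
  // F = (1*F1 + 2*F2 + ... + n*Fn) / (1 + 2 + ... + n), more recent games weigh more
  static double[] weightedAverage(double[][] history) {   // history[0] = oldest game
    double[] avg = new double[history[0].length];
    double weightSum = 0;
    for (int g = 0; g < history.length; g++) {
      double w = g + 1;                                    // weight 1 for the oldest, n for the newest
      weightSum += w;
      for (int f = 0; f < avg.length; f++) avg[f] += w * history[g][f];
    }
    for (int f = 0; f < avg.length; f++) avg[f] /= weightSum;
    return avg;
  }
}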
Prognosis: results
15/04/2015Contribute:SummerOfTechnologies
81
Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
15/04/2015Contribute:SummerOfTechnologies
82
VI.B Soccer Data Analysis:
Individual Player Analysis
Extract and visualize stats, rank players; does our ranking reveal talented players?
15/04/2015Contribute:SummerOfTechnologies
83
Extracting player stats with MR
• Select a number of features of interest: shots on target, passes, ...
• Mapper: extract these stats per game and emit (PlayerID, stats)
• Reducer: aggregate stats per player and emit (PlayerID, (agg_stats, #games))
• Pig: create a player score & player ranking
• Python: visualize player stats in scatter plots
• Live Demo
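A hedged sketch of the aggregation step described above, assuming the mapper emits the per-game stats as a comma-separated Text value keyed by player ID (the value format is an assumption):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PlayerStatsReducer extends Reducer<Text, Text, Text, Text> {
  @Override protected void reduce(Text playerId, Iterable<Text> perGameStats, Context ctx)
      throws IOException, InterruptedException {
    double[] totals = null;
    int games = 0;
    for (Text stats : perGameStats) {                       // one comma-separated line per game
      String[] fields = stats.toString().split(",");
      if (totals == null) totals = new double[fields.length];
      for (int i = 0; i < fields.length; i++) totals[i] += Double.parseDouble(fields[i]);
      games++;
    }
    StringBuilder out = new StringBuilder();
    for (double t : totals) out.append(t).append(",");
    out.append(games);                                      // emit (PlayerID, (agg_stats, #games))
    ctx.write(playerId, new Text(out.toString()));
  }
}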
Generating scatterplots in python
15/04/2015Contribute:SummerOfTechnologies
85
Player scatterplots in python (1)
Player scatterplots in python (2)
Player scatterplots in python (3)
Player scatterplots in python (4)
Outliers in scatterplots
• What about the players excelling in the scatterplots?
• Two categories: Players > 23 and Players <= 23
• Players > 23: No remarkable career: Bundesliga 2 is their level
• Players <= 23: Currently all have a number of caps!
15/04/2015Contribute:SummerOfTechnologies
90
Carlos E. Marquez
Brazil
Kazan
Marko Marin
Germany
Chelsea->Sevilla
Chinedu Obasi
Nigeria
Schalke 04
Patrick Helmes
Germany
Wolfsburg
Nando Rafael
Angola
Düsseldorf
Player ranking in Pig
• Opta Sports has the Castrol Index to rank player performance
• Demo Pig: create player ranking based on 2 scores:
 Attacker_Score : shots_on_target / avg_sot
+ successful_dribbles / avg_sd
+ touches_in_square / avg_tis
 Allround_Score: Attacker_Score
+ successful_offensive_passes / avg_sop
+ successful_passes / avg_sp
 Suggestions?
Results: Attackers
Results: Allround
15/04/2015Contribute:SummerOfTechnologies
93
Session Overview
I. Introduction:
I.A Big Data Introduction
I.B Hadoop Distributions
II. Hadoop Architecture
III. Hadoop Ecosystem & Design Patterns
IV. Soccer dataset: Introduction & Metadata Parsing
V. Introduction to Data Science: Decision Trees
VI. Soccer Data Analysis:
VI.A Soccer Game Classification & Prediction
VI.B Individual Player Analysis
VII. Wrap Up
15/04/2015Contribute:SummerOfTechnologies
94
Wrap up
15/04/2015Contribute:SummerOfTechnologies
95
Conclusion Soccer analysis
• A random forest classifier can be used both for classification and prediction
• Feature selection tells you which features determine the result
 => improve your classifier by removing features, or by using them in a different classification algorithm
• Classification accuracy is 50%, while 33% is expected from random guessing
• The classification probabilities are very interesting:
 Removing the close calls the classification accuracy is 5/8 (62,5%)
 Removing the close calls improves the prognosis to 4/8 (50%) and 4/7(57%)
• Scatterplots are an easy tool to select promising players
• Scoring functions based on domain knowledge allow you to rank the players
15/04/2015Contribute:SummerOfTechnologies
96
General Conclusion
• Big Data is for real!
• The Hadoop ecosystem (PIG) makes Big Data accessible for a broader audience
• Big Data goes hand in hand with Data Science
• A data scientist requires a very broad skillset
• Number crunching is Hadoop’s task, while postprocessing is Python’s
• We introduced Decision trees and Random Forests
• Soccer games are hard to predict but promising players are easy to find
• The speaker likes Ents and Pinocchio!?
15/04/2015Contribute:SummerOfTechnologies
98
Any Questions?

Mais conteúdo relacionado

Mais procurados

Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
Alexandru Iosup
 
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
Alexandru Iosup
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil Jadhav
Swapnil (Neil) Jadhav
 

Mais procurados (20)

Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec...
 
Using hadoop for big data
Using hadoop for big dataUsing hadoop for big data
Using hadoop for big data
 
Big Data Analytics for Non-Programmers
Big Data Analytics for Non-ProgrammersBig Data Analytics for Non-Programmers
Big Data Analytics for Non-Programmers
 
Büyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi GörmekBüyük Veriyle Büyük Resmi Görmek
Büyük Veriyle Büyük Resmi Görmek
 
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
 
Webinar : Talend : The Non-Programmer's Swiss Knife for Big Data
Webinar  : Talend : The Non-Programmer's Swiss Knife for Big DataWebinar  : Talend : The Non-Programmer's Swiss Knife for Big Data
Webinar : Talend : The Non-Programmer's Swiss Knife for Big Data
 
My Spark Journey
My Spark JourneyMy Spark Journey
My Spark Journey
 
10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions10 Popular Hadoop Technical Interview Questions
10 Popular Hadoop Technical Interview Questions
 
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2ODeep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
Deep Water - Bringing Tensorflow, Caffe, Mxnet to H2O
 
Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Big Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil JadhavBig Data & Open Source - Neil Jadhav
Big Data & Open Source - Neil Jadhav
 
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | EdurekaHadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
Hadoop Tutorial | What is Hadoop | Hadoop Project on Reddit | Edureka
 
BigData primer
BigData primerBigData primer
BigData primer
 
GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.
 
Big Data Course - BigData HUB
Big Data Course - BigData HUBBig Data Course - BigData HUB
Big Data Course - BigData HUB
 

Semelhante a HadoopWorkshopJuly2014

Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
Evert Lammerts
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
Praveen Sripati
 

Semelhante a HadoopWorkshopJuly2014 (20)

Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Big Data
Big DataBig Data
Big Data
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Gail Zhou on "Big Data Technology, Strategy, and Applications"
Gail Zhou on "Big Data Technology, Strategy, and Applications"Gail Zhou on "Big Data Technology, Strategy, and Applications"
Gail Zhou on "Big Data Technology, Strategy, and Applications"
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Big data
Big dataBig data
Big data
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 
Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016
Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016
Google Cloud Platform & rockPlace Big Data Event-Mar.31.2016
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Combining hadoop with big data analytics
Combining hadoop with big data analyticsCombining hadoop with big data analytics
Combining hadoop with big data analytics
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Hadoop Overview
Hadoop OverviewHadoop Overview
Hadoop Overview
 
Big Data 2.0
Big Data 2.0Big Data 2.0
Big Data 2.0
 
BigData Meets the Federal Data Center
BigData Meets the Federal Data CenterBigData Meets the Federal Data Center
BigData Meets the Federal Data Center
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 

Último

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 

Último (20)

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 

HadoopWorkshopJuly2014

  • 9. Big Data hype or reality? JOB AD University of Leuven, ESAT-STADIUS In the framework of a collaboration with Janssen Pharmaceuticals, we are looking for a talented postdoctoral researcher to develop kernel methods that link drug targets, disease phenotypes, and pharmaceutical compounds. Leveraging large-scale public and in-house data sets, you will develop kernel methods and/or network-based methods to predict potential links between targets, diseases, or candidate drugs. This research builds upon the expertise of Janssen Pharma and previous work of our team on genomic data fusion. The research will be also carried out with a team of the University of Linz, Austria (Prof. Sepp Hochreiter) specialized in kernel learning and chemoinformatics. Project Details:  Exascience project: Janssen, Imec, Intel, Universities,..  NGS Data ( 1 billion $ -> 2000$) -> mapping -> SNP dataset -> disease matching  Trend: Replace wet lab experiments by computer simulations 15/04/2015Contribute:SummerOfTechnologies 9
  • 10. Limitations of classical RDBMSs (http://youtu.be/d2xeNpfzsYI?t=3m24s to 5m00s, lecturer: Amr Awadallah, CTO and co-founder of Cloudera) 15/04/2015Contribute:SummerOfTechnologies 10 • Data streams: • Data source -> Storage only (raw) • Storage layer => ETL => RDBMS => BI • 3 problems: • STORAGE TO ETL is problematic: moving data to compute doesn’t scale • ETL typically runs overnight => not enough time to process all data! • Too much network overhead moving data from storage to the compute grid • Solution? Move the code to where the data is! • STORAGE TO archiving: archiving data too early = premature data death • Archiving too early because storage cost is too high (balance storage cost vs economic value) • Archiving is cheap but retrieval is extremely expensive! • Solution? Storage has to become cheaper! (return on byte) • STORAGE TO BI: no ability to explore the original raw data • You cannot ask NEW questions! Very inflexible!
  • 12. The left hand and the right hand 15/04/2015Contribute:SummerOfTechnologies 12
  Hadoop: schema on read; load is fast; schemas can change; only batch processing & no indexes; CAP: no transactions, no atomic updates; commodity hardware
  Classical RDBMS / Data Warehouse: schema on write; load is slow (ETL first); adapting the schema is very difficult; read is fast (schema => indexing); very good at transactions (CRUD); expensive (purpose-built)
  • 13. The end of my presentation? http://www.businessweek.com/articles/2014-06-27/google-just-made-big-data-expertise-much-tougher-to-fake 15/04/2015Contribute:SummerOfTechnologies 13 For the last five years or so, it’s been pretty easy to pretend you knew something about Big Data. You went to the cocktail party—the one with all the dudes—grabbed a drink and then said “Hadoop” over and over and over again. People nodded. Absurdly lucrative job offers rolled in the next day. Simple. Well, Google (GOOG) officially put an end to the good times this week. During some talks at the company’s annual developer conference, Google executives declared that they’re over Hadoop. It’s yesterday’s buzzword. Anyone who wants to be a true Big Data jockey will now need to be conversant in Flume, MillWheel, Google Cloud Dataflow, and Spurch. (Okay, I made the last one up.) Here’s the deal. About a decade ago, Google’s engineers wrote some papers detailing a new way to analyze huge stores of data. They described the method as MapReduce: Data was spread in smallish chunks across thousands of servers; people asked questions of the information; and they received answers a few minutes or hours later. Yahoo! (YHOO) led the charge to turn this underlying technology into an open-source product called Hadoop. Hundreds of companies have since helped establish Hadoop as more or less the standard of modern data analysis work. Much has been written on this topic. Such startups as Cloudera, Hortonworks, and MapR have their own versions of Hadoop that companies can use, and just about every company that needs to analyze lots of information has its own Hadoop team. Google probably processes more information than any company on the planet and tends to have to invent tools to cope with the data. As a result, its technology runs a good five to 10 years ahead of the competition. This week, it is revealing that it abandoned the MapReduce/Hadoop approach some time ago in favor of some more flexible data analysis systems. One of the big limitations around Hadoop was that you tended to have to do “batch” operations, which means ordering a computer to perform an operation in bulk and then waiting for the result. You might ask a mainframe to process a company’s payroll as a batch job, or in a more contemporary example, analyze all the search terms that people in Texas typed into Google last Tuesday. According to Google, its Cloud Dataflow service can do all this while also running data analysis jobs on information right as it pours into a database. One example Google demonstrated at its conference was an instantaneous analysis of tweets about World Cup matches. You know, life-and-death stuff. Google has taken internal tools—those funky-named ones such as Flume and MillWheel—and bundled them into the Cloud Dataflow service, which it plans to start offering to developers and customers as a cloud service. The promise is that other companies will be able to deal with more information easier and faster than ever before. While Google has historically been a very secretive company, it is opening up its internal technology as a competitive maneuver. Google is proving more willing than, say, Amazon.com (AMZN) to hand over the clever things built by its engineers to others. It’s an understandable move, given Amazon’s significant lead in the cloud computing arena. As for the Hadoop clan? You would think that Google flat-out calling it passé would make it hard to keep hawking Hadoop as the hot, hot thing your company can’t live without.
And there’s some truth to this being an issue. That said, even the biggest Hadoop fans such as Cloudera have been moving past the technology for some time. Cloudera leans on a handful of super-fast data analysis engines like Spark and Impala, which can grab data from Hadoop-based storage systems and torture it in ways similar to Google’s. The painful upshot, however, is that faking your way through the Big Data realm will be much harder from now on. Try keeping your Flume and Impala straight after a couple of gin and tonics.
  • 14. Sidenotes: 15/04/2015Contribute:SummerOfTechnologies 14 • Hadoop = Distributed storage (HDFS) + Compute layer (MapReduce) • HDFS is NOT QUESTIONED! • Cloudera is already providing Spark training
  • 15. Session Overview I. Introduction: I.A Big Data Introduction I.B Hadoop Distributions II. Hadoop Architecture III. Hadoop Ecosystem & Design Patterns IV. Soccer dataset: Introduction & Metadata Parsing V. Introduction to Data Science: Decision Trees VI. Soccer Data Analysis: VI.A Soccer Game Classification & Prediction VI.B Individual Player Analysis VII. Wrap Up 15/04/2015Contribute:SummerOfTechnologies 15
  • 16. Cloudera, Hortonworks, MapR (and Intel’s Hadoop Distribution) 15/04/2015Contribute:SummerOfTechnologies 16 I.B Hadoop Distributions
  • 17. Forrester Wave: Enterprise Hadoop solutions (2012) 15/04/2015Contribute:SummerOfTechnologies 17
  • 18. Performance of vendors on common queries 15/04/2015Contribute:SummerOfTechnologies 18
  • 19. Intel Makes Significant Equity Investment in Cloudera 15/04/2015Contribute:SummerOfTechnologies 19 $740M Cloudera Investment PALO ALTO, Calif., and SANTA CLARA, Calif., March 27, 2014 – Intel Corporation and Cloudera today announced a broad strategic technology and business collaboration, as well as a significant equity investment from Intel making it Cloudera’s largest strategic shareholder and a member of its board of directors. This is Intel’s single largest data center technology investment in its history. The deal will join Cloudera’s leading enterprise analytic data management software powered by Apache Hadoop™ with the leading data center architecture based on Intel® Xeon® technology. The goal is acceleration of customer adoption of big data solutions, making it easier for companies of all sizes to obtain increased business value from data by deploying open source Apache Hadoop solutions. Both the strategic collaboration and the equity investment are subject to standard closing conditions, including customary regulatory approvals. Cloudera will develop and optimize Cloudera’s Distribution including Apache Hadoop (CDH) for Intel architecture as its preferred platform and support a range of next-generation technologies including Intel fabrics, flash memory and security. In turn, Intel will market and promote CDH and Cloudera Enterprise to its customers as its preferred Hadoop platform. Intel will focus its engineering and marketing resources on the joint roadmap. The optimizations from Intel’s Distribution for Apache Hadoop/Intel Data Platform (IDH/IDP) will be integrated into CDH and IDH/IDP and will be transitioned after v3.1 release at the end of March. To ensure a seamless customer transition to CDH, Intel and Cloudera will work together on a migration path from IDH/IDP. Cloudera will also ensure that all enhancements will be contributed to their respective open source projects and CDH. ...
  • 20. Session Overview I. Introduction: I.A Big Data Introduction I.B Hadoop Distributions II. Hadoop Architecture III. Hadoop Ecosystem & Design Patterns IV. Soccer dataset: Introduction & Metadata Parsing V. Introduction to Data Science: Decision Trees VI. Soccer Data Analysis: VI.A Soccer Game Classification & Prediction VI.B Individual Player Analysis VII. Wrap Up 15/04/2015Contribute:SummerOfTechnologies 20
  • 21. II. Hadoop Architecture HDFS & MR architecture Figures: Hadoop in practice, Data-intensive Text Processing, Hadoop: the definitive guide 15/04/2015Contribute:SummerOfTechnologies 21
  • 22. Hadoop: The operating system for data clusters • Hadoop is a scalable fault-tolerant distributed system for data storage and processing: Failure is the rule not the exception! 15/04/2015Contribute:SummerOfTechnologies 22
  • 23. Hadoop Distributed Filesystem • HDFS is optimized for streaming reads and writes:  HDFS uses large block sizes (64 or 128 MB) => HD seek time is negligible (compared to read/write time)  HDFS replicates its data blocks (usually 3 times) to improve availability and fault tolerance (resilient against node failure!) • HDFS: No Updates! • Master-slave: NameNode-DataNode 15/04/2015Contribute:SummerOfTechnologies 23
  • 24. Hadoop reads 15/04/2015Contribute:SummerOfTechnologies 24 • NameNode is single point of failure • DataNodes store the blocks & report block health to NameNode • NameNode stores: • File – Block mapping • Block – Node mapping
  • 25. Hadoop writes 15/04/2015Contribute:SummerOfTechnologies 25 • DataNode reports write success • Replication is taken care of by the pipeline pattern to avoid a NameNode bottleneck!
  • 26. MapReduce execution engine 15/04/2015Contribute:SummerOfTechnologies 26 • Master-slave: JobTracker-TaskTracker • The JT schedules map and reduce tasks on TaskTrackers • The JT tries to schedule the work near the data => move the algorithm, NOT the data • TTs send heartbeats to the JT so it can check their health • If a task fails 4 times => JOB failure • If a TT fails 4 times => removed from the pool • A TT has map & reduce slots to run M & R tasks • Anything can be configured!
  • 27. Data flow 15/04/2015Contribute:SummerOfTechnologies 27 3 Key Value transformations: (k1,v1) (line nr, text) (k2,v2) (word, frequency) (k3,v3) (word, list<freq>)
  • 28. Ser/De in Hadoop • A Mapper processes an InputSplit = { (k1,v1), (k2,v2), (k3,v3), ... } • How is an InputSplit defined? • InputFormat class splits input and RecordReader generates KV pairs from split (TextInputFormat: split = slice of file, record = line of text) • InputFormat tries to make splits = file blocks in HDFS => DATA LOCALITY • Custom InputFormats possible : JSON, XML, SequenceFiles, ... • Hadoop has its own serialization types: WritableComparables (Text, IntWritable, FloatWritable, BytesWritable,...) 15/04/2015Contribute:SummerOfTechnologies 28
  • 30. Session Overview I. Introduction: I.A Big Data Introduction I.B Hadoop Distributions II. Hadoop Architecture III. Hadoop Ecosystem & Design Patterns IV. Soccer dataset: Introduction & Metadata Parsing V. Introduction to Data Science: Decision Trees VI. Soccer Data Analysis: VI.A Soccer Game Classification & Prediction VI.B Individual Player Analysis VII. Wrap Up 15/04/2015Contribute:SummerOfTechnologies 30
  • 31. III. Hadoop Ecosystem & Design patterns Intro to MapReduce programming; Using the ecosystem to stimulate code reuse; 15/04/2015Contribute:SummerOfTechnologies 31
  • 32. Hadoop Word Count Problem Description • Given a set of documents, calculate the number of times each word occurs. • A MapReduce program consists of:  A Driver  A Mapper (optional)  A Reducer (optional) DRIVER CODE 15/04/2015Contribute:SummerOfTechnologies 32
  • 33. Hello World Hadoop: Word Count • WordCountMapper:  The InputFormat partitions the data into a set of splits.  Each split is fed to a Mapper  A RecordReader generates records (= a Text value)  The Mapper takes a record and splits the text into words  The Mapper emits every word with frequency one. • WordCountReducer:  Hadoop is responsible for getting all key-value pairs with the same key to one reducer (parallel sort)  A reducer gets a collection of values which go with a single key (frequencies)  This reducer adds up the frequencies and emits the sum to a file specified in setOutputPath(...) 15/04/2015Contribute:SummerOfTechnologies 33
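  To make the mapper/reducer contract concrete, here is a minimal Hadoop Streaming-style sketch in Python. The workshop's own code is Java MapReduce and is only shown on the slides as screenshots, so this is an illustration of the same logic, not the actual implementation; the local "shuffle" is simulated with a sort.
    # wordcount_streaming.py - a minimal Hadoop Streaming-style sketch (not the
    # workshop's Java code): the mapper emits (word, 1), the reducer sums per key.
    from itertools import groupby

    def mapper(lines):
        for line in lines:                      # (k1,v1) = (line nr, text)
            for word in line.strip().split():
                yield word, 1                   # (k2,v2) = (word, 1)

    def reducer(pairs):
        # On a cluster the framework sorts and groups by key; here we sort locally.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)   # (word, total)

    if __name__ == "__main__":
        text = ["the quick brown fox", "the lazy dog"]
        for word, total in reducer(mapper(text)):
            print(word, total)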
  • 34. Optimization patterns in MR (Data-Intensive Text Processing, Lin & Dyer) Use of Combiners • A combiner is a mini-reducer which can be run an arbitrary number of times on the output of a mapper before it is streamed to disk In-Mapper Combiner design pattern • Create a HashMap in your Mapper class in which you store all (word, frequency) pairs and emit them when the mapper has finished with a split • Drawback: the HashMap must fit in memory! 15/04/2015Contribute:SummerOfTechnologies 34 • Each Context.write() streams data to the local filesystem! (BOTTLENECK!) • HINT: Is it necessary to emit every single (word, 1) pair from the mapper?
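  A sketch of the in-mapper combining pattern just described, again in Hadoop Streaming-style Python rather than the workshop's Java Mapper: counts are buffered in an in-memory dict and emitted once per split, which cuts intermediate I/O at the cost of the memory noted above.
    # In-mapper combining sketch: buffer partial counts and emit once per split.
    from collections import defaultdict

    def mapper_with_inmapper_combiner(lines):
        counts = defaultdict(int)
        for line in lines:
            for word in line.strip().split():
                counts[word] += 1          # no emit per word => far fewer writes
        for word, freq in counts.items():  # emit (word, partial count) at the end
            yield word, freq

    if __name__ == "__main__":
        sample = ["to be or not to be", "to be is to do"]
        for word, freq in mapper_with_inmapper_combiner(sample):
            print(f"{word}\t{freq}")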
  • 35. NLP: Word co-occurrence matrices (Data-Intensive Text Processing, Lin & Dyer) Problem Description • For each word we want to calculate the relative frequencies of words co-occurring with this word (in the same sentence). • Example:  (dog,cat) occurs 2 times, (dog,walking) occurs 3 times; then the results should be: => (dog,cat) = 40%, (dog,walking) = 60% • Requirements?  We must count all word co-occurrences & sum all the (dog,*) combinations to calculate the relative frequencies MAPPERv1 15/04/2015Contribute:SummerOfTechnologies 35
  • 36. Relative frequencies! Problem • We need to know how many times dog occurs together with any other word in order to calculate the relative frequencies! • Solution: emit (dog,*) pairs as well! MAPPERv2 15/04/2015Contribute:SummerOfTechnologies 36
  • 37. How to get the data to the reducer? Problem • (dog,*), (dog,cat), (dog,walking) are different keys, so they might end up in different reducers!!! • MapReduce possibilities  MapReduce has a Partitioner which decides where each (K,V) must go  MapReduce has a GroupingComparator to decide which (K,V) pairs end up in one reducer group  MapReduce has a SortComparator to decide how to sort the (K,V) pairs in the reducer group 15/04/2015Contribute:SummerOfTechnologies 37
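  To show the "pairs plus (word, *)" idea end to end, here is a small local sketch in Python. It is only an illustration of the logic: on a real cluster a custom Partitioner (partition on the left word) and a sort order that puts the '*' marginal first deliver the total before the individual pairs, which is what the comparators above are for.
    # Local sketch of the pairs + (word, *) trick for relative frequencies.
    from collections import defaultdict
    from itertools import combinations

    def mapper(sentences):
        for sentence in sentences:
            words = sentence.split()
            for a, b in combinations(words, 2):     # co-occurrence within a sentence
                yield (a, b), 1
                yield (a, '*'), 1                   # marginal count for word a

    def reduce_relative_freq(pairs):
        counts = defaultdict(int)
        for key, one in pairs:
            counts[key] += one
        for (a, b), n in sorted(counts.items()):
            if b != '*':
                yield (a, b), n / counts[(a, '*')]  # relative frequency

    if __name__ == "__main__":
        data = ["dog cat walking", "dog walking", "dog walking cat"]
        for key, freq in reduce_relative_freq(mapper(data)):
            print(key, round(freq, 2))   # (dog,cat) -> 0.4, (dog,walking) -> 0.6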
  • 39. Limitations of MapReduce Problem Description • MapReduce does NOT stimulate code reuse!  Suppose we have a table and we want to calculate the average, the minimum and the maximum  This can be done with 1 job, but to make your code reusable you need 3!  A max job, a min job and an avg job • MapReduce requires a lot of coding! • Relational operators such as joins, orders,... should only be written once! Solution? • On top of MapReduce two scripting languages are built which allow one to use relational logic which is then translated into a sequence of MapReduce jobs:  PIG and HIVE  Pig is a dataflow language: each statement is a single data transformation  Hive is an SQL-like language  Bottom line: you can use Hadoop without knowledge of MapReduce! 15/04/2015Contribute:SummerOfTechnologies 39
  • 41. Pig’s philosophy (http://pig.apache.org/philosophy.html) • Pigs eat anything  relational, nested, unstructured,... data • Pigs live anywhere  Hadoop is not strictly required • Pigs are domestic animals  integration with other languages (Python)  extensible with UDFs • Pigs fly  Pig optimizes its translation to MR jobs 15/04/2015Contribute:SummerOfTechnologies 41
  • 42. Pig Latin (http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html) • Data types: same as SQL + {tuples, bags, maps} • LOAD ‘path’ USING PigStorage(delim) AS ... (schema) • STORE users INTO ‘path’ USING PigStorage(delim) • DISTINCT operator: only keep unique records • FILTER operator:  FILTER users BY age == 30; • SPLIT  SPLIT users INTO adults IF age >= 18,  children OTHERWISE; • ORDER  ORDER users BY age DESC/ASC; 15/04/2015Contribute:SummerOfTechnologies 42
  • 43. Pig Latin (cont’d) • FOREACH users GENERATE (projection + operations between columns)  name,  age; • Nesting data with GROUP BY operator:  gr = GROUP users BY age;  age_counts = FOREACH gr GENERATE  group as age,  COUNT(users) as people_same_age; • Unnesting data with FLATTEN operator:  FLATTEN(tuple) => each tuple field to separate column  FLATTEN(bag/map) => each bag/map item to separate row • INNER JOIN: (LEFT/RIGHT/FULL OUTER also possible)  JOIN age_counts by age, users by age; • UNION users1, users2; 15/04/2015Contribute:SummerOfTechnologies 43
  • 44. Pig Latin (cont’d) • String Operators:  SUBSTRING  INDEXOF  SIZE  CONCAT • Mathematical operators:  MIN  MAX  AVG  COUNT  SUM • Conditional logic with ternary operator: (age > 18 ? ‘adult’ : ‘child’); 15/04/2015Contribute:SummerOfTechnologies 44
  • 45. Example script: min/max/avg rain/month • NOTE: this script will be translated into a single MR job => CODE REUSE! 15/04/2015Contribute:SummerOfTechnologies 45
  • 46. Hadoop 4 you: environment? • Possibilities to run your own Hadoop POC: 1. Develop locally using open source Jars:  Eclipse IDE, IntelliJ  Preferably linux environment or windows + cygwin 2. Testing/Demo:  Setting up your own one-node cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/  Use a preconfigured virtual machine by one of the vendors: Cloudera VM, HortonWorks VM, MapR VM + VMware or VirtualBox 3. Real cluster  Setting up your own cluster  Elastic MapReduce service of Amazon 15/04/2015Contribute:SummerOfTechnologies 46
  • 47. Elastic MapReduce in the cloud • Hive • Pig • Impala • Hadoop streaming • Hadoop custom jar 15/04/2015Contribute:SummerOfTechnologies 47
  • 48. Conclusion • Pig is a high-level scripting language on top of MapReduce • Pig stimulates code reuse  => It creates a logical plan to run a script in the lowest number of MR jobs • Pig is very easy to use • Most people limit themselves to Pig/Hive (ex.: Yahoo!) • MapReduce gives you full control and allows you to optimize complex jobs: Word Co-Occurrence matrices • Some relational operators can be hard to implement:  How would you implement a JOIN? 15/04/2015Contribute:SummerOfTechnologies 48
  • 49. Session Overview I. Introduction: I.A Big Data Introduction I.B Hadoop Distributions II. Hadoop Architecture III. Hadoop Ecosystem & Design Patterns IV. Soccer dataset: Introduction & Metadata Parsing V. Introduction to Data Science: Decision Trees VI. Soccer Data Analysis: VI.A Soccer Game Classification & Prediction VI.B Individual Player Analysis VII. Wrap Up 15/04/2015Contribute:SummerOfTechnologies 49
  • 50. IV. Soccer Dataset: Introduction & Metadata Parsing What does the data look like? Why use a Big Data approach? Parsing the game metadata. Pig Exercise 15/04/2015Contribute:SummerOfTechnologies 50
  • 51.
  • 52. 2 Types of XML event streams F7_*.xml x 90 • <SoccerDocument> • <Competition> ... </Competition>  <MatchData> goals, lineups, substitutions ... </MatchData>  <Team> home, players,... </Team>  <Team> away... </Team>  <Venue> ... </Venue> • </SoccerDocument> F24_*.xml x 90 (1500 events/game) • <Game>  <Event type, x,y, time, side, player,...>  <Q id=... value=... />  <Q id=... value=... />  </Event>  <Event ....  .... • </Game> 15/04/2015Contribute:SummerOfTechnologies 52
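  For a feel of what pulling events and qualifiers out of an F24-style file involves, here is a hypothetical Python sketch using ElementTree. The workshop's own parser uses Java StAX, and the tag and attribute names below are assumptions based only on the structure outlined on the slide, not the real Opta schema.
    # Hypothetical sketch of extracting Event attributes + Q qualifiers.
    import xml.etree.ElementTree as ET

    SAMPLE = """
    <Game>
      <Event type="1" x="50.0" y="30.0" time="12" side="home" player="4124">
        <Q id="140" value="34.5"/>
        <Q id="141" value="12.0"/>
      </Event>
    </Game>
    """

    def parse_events(xml_text):
        root = ET.fromstring(xml_text)
        for event in root.iter("Event"):
            attrs = dict(event.attrib)                                  # type, x, y, time, side, player, ...
            qualifiers = {q.get("id"): q.get("value") for q in event.findall("Q")}
            yield attrs, qualifiers

    if __name__ == "__main__":
        for attrs, quals in parse_events(SAMPLE):
            print(attrs["player"], attrs["type"], quals)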
  • 53. Why choose a Big Data approach • The current Opta Sports dataset contains data from 2010-2014 with (http://fivethirtyeight.com/features/lionel-messi-is-impossible/)  16,574 players  24,904 games (both league and international) • Our sample dataset contains:  90 games from Bundesliga 2 in 2008-2009 • Arguments?  The real dataset IS big! =>Implement scalable solution to start with!  Processing in parallel is preferable  Schema evolves over time  Data is not relational  Exploratory analysis: not sure what to look for?  Fig.: result of Batch query 15/04/2015Contribute:SummerOfTechnologies 53
  • 55. Parsing XML with StAX (http://www.developerfusion.com/article/84523/stax-the-odds-with-woodstox/) 15/04/2015Contribute:SummerOfTechnologies 55
  • 56. Processing F7 files: Code Walkthrough 15/04/2015Contribute:SummerOfTechnologies 56
  • 57. Processing F7 files • MR Job generates 4 types of records:  games: “homeTeamID-awayTeamID 3-2”  goals “teamID_playerID 1”  teams “teamID* Freiburg”  players “playerID Dieter De Witte” • Assignment Reporting with Pig:  Generate topscorer’s list (with playerID, teamID)  Generate the team ranking (with teamID) 15/04/2015Contribute:SummerOfTechnologies 57
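  The reporting itself is done in Pig in the workshop; the Python sketch below only illustrates the aggregation logic on the record layouts listed above ("teamID_playerID 1" for goals, "homeTeamID-awayTeamID 3-2" for games). The toy records and exact parsing details are assumptions for illustration.
    # Illustration only: top scorers and a points table from the F7 job output.
    from collections import Counter, defaultdict

    goals = ["683_39274 1", "683_39274 1", "810_20856 1"]          # toy records
    games = ["683-160 2-1", "160-810 0-0"]

    # Top scorers: sum the goal records per teamID_playerID key.
    scorers = Counter()
    for rec in goals:
        key, n = rec.split()
        scorers[key] += int(n)
    print(scorers.most_common(10))

    # Team ranking: 3 points for a win, 1 for a draw.
    points = defaultdict(int)
    for rec in games:
        teams, score = rec.split()
        home, away = teams.split("-")
        hg, ag = map(int, score.split("-"))
        if hg > ag:
            points[home] += 3
        elif hg < ag:
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
    print(sorted(points.items(), key=lambda kv: -kv[1]))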
  • 58. Pig Script F7 Walkthrough 15/04/2015Contribute:SummerOfTechnologies 58
  • 59. Results 15/04/2015Contribute:SummerOfTechnologies 59
  Top scoring players (playerID, teamID, name, goals):
  5052 165 Antonio Di Salvo 7
  28323 387 Milivoje Novakovic 7
  20856 810 Daniel Gunkel 6
  39274 683 Rob Friend 6
  4124 683 Oliver Neuville 5
  33057 810 Felix Borja 5
  38100 2111 Maximilian Nicu 5
  12846 680 Marvin Braun 4
  11827 810 Markus Feulner 4
  10112 683 Sascha Rösler 4
  Team ranking (rank, teamID, team, W, D, L, goals for, goals against, Pts):
  1 683 Borussia Mönchengladbach 6 3 1 23 13 21
  2 160 SC Freiburg 6 2 2 15 9 20
  3 812 SpVgg Greuther Fürth 5 4 1 15 8 19
  4 810 1. FSV Mainz 05 6 1 3 21 12 19
  5 165 TSV München 1860 5 3 2 19 13 18
  6 2111 SV Wehen Wiesbaden 5 2 3 21 19 17
  7 387 1. FC Köln 4 3 3 19 16 15
  8 818 Alemannia Aachen 4 2 3 13 10 14
  9 1744 OFC Kickers 1901 4 2 4 11 16 14
  10 1902 1899 Hoffenheim 4 2 3 13 12 14
  11 680 FC St. Pauli 4 0 6 10 16 12
  12 2012 TuS Koblenz 3 3 4 14 19 12
  13 1755 VfL Osnabrück 3 2 4 10 14 11
  14 1741 FC Erzgebirge Aue 3 1 5 14 15 10
  15 1772 FC Augsburg 2 3 5 16 21 9
  16 163 1. FC Kaiserslautern 1 4 5 7 11 7
  17 1757 FC Carl Zeiss Jena 1 3 6 14 22 6
  18 1743 SC Paderborn 0 4 6 5 14 4
  • 60. Session Overview I. Introduction I.A Big Data Introduction I.B Hadoop Distributions II. Hadoop Architecture III. Hadoop Ecosystem & Design patterns IV. Soccer dataset: Introduction & metadata parsing V. Introduction to Data Science: Decision Trees VI. Soccer Data analysis: VI.A Soccer game classification & prediction VI.B Individual player analysis VII. Wrap up 15/04/2015Contribute:SummerOfTechnologies 60
  • 61. V. Introduction to Data Science: Decision trees What’s data science? Classification with decision trees and random forests 15/04/2015Contribute:SummerOfTechnologies 61
  • 62. The sexiest job of the 21st century! (Harvard Business Review) http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ • LinkedIn: focus on engineering => keep the social network up and running! • Jonathan Goldman: What would happen if you presented users with names of people they hadn’t yet connected with but seemed likely to know? (e.g. where you went to school, same company,...) • Result: Very high click-through rate on ‘People you may know’ ads • DS = high-ranking professional with the training and curiosity to make discoveries in the world of big data • LinkedIn grew from an ‘empty box’ to 300 million users
  • 63. Skillset of a data scientist: (Google Images) Traditional Venn Diagram V2.0 Data science requires a team
  • 65. No clear definition A Person who is better at statistics than any software engineer and better at software engineering than any statistician Josh Wills Sr. Director of Data Science Cloudera
  • 66. NLP Study reveals: “More data beats better algorithms”
  • 67. Algorithms for data science (ML) • Machine learning algorithms are usually categorized as:  supervised versus unsupervised (is a test set containing the ‘truth’ available?)  input/output are categorical or continuous • Supervised ex.: Classification (cat.) & Regression (cont.)  Classification: soccer data -> win/draw/loss  Regression: housing prices versus their size • Unsupervised ex.: Clustering & Collaborative filtering  Can I divide my customers into certain segments based on their behaviour?  Which books does Amazon recommend me based on my previous purchases or based on what similar customers bought?
  • 69. Decision trees • “A decision tree is a flowchart-like structure in which an internal node represents a test on an attribute, each branch represents the outcome of the test and each leaf node represents a class label.” (Wikipedia) • Toy example: how could we build an optimal tree splitting people into Male/Female based on their length, hip perimeter and the size of their nose? • => Put the question (rule) which makes the best split first!
  ID Length Hip perimeter Nose length Gender
  1 155 70 2 F
  2 160 80 2 F
  3 165 80 3 F
  4 170 75 3 F
  5 180 90 2 F
  6 190 65 3 M
  7 200 60 2 M
  8 195 55 3 M
  9 185 50 3 M
  10 175 50 10 M
  • 70. Shannon entropy: the best split? • To measure the best split we need an impurity measure: Shannon’s information entropy (high entropy = high impurity) • We have two classes: Male (M) & Female (F), 5 each (total T = 10) • Entropy formula: S = - M/T * log2 M/T - F/T * log2 F/T • Min(- X log X) = 0 when X=0 or X=1 => perfect split S = 0 • Initial set 5/10 males, 5/10 females: S = 0,34 • Suppose we split on nose length: • Safter = Sleft + Sright = 0 + 0,31 = 0,31 • Information Gain = Sbefore – Safter = +0,03 • Note: the Pinocchio branch (nose length = 10) has S = 0 (completely pure)
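  Here is a small base-2 Shannon entropy and information-gain helper for the toy table. Note that with log2 a 50/50 split has S = 1 bit, so the figures on the slide presumably use a different log base or scaling; the relative ranking of the splits (hip perimeter best, then length, then nose length) comes out the same either way.
    # Base-2 Shannon entropy and weighted information gain for the toy table.
    from math import log2

    def entropy(labels):
        total = len(labels)
        probs = (labels.count(c) / total for c in set(labels))
        return -sum(p * log2(p) for p in probs if p > 0)

    def information_gain(labels, groups):
        total = len(labels)
        after = sum(len(g) / total * entropy(g) for g in groups)
        return entropy(labels) - after

    genders = list("FFFFFMMMMM")
    noses   = [2, 2, 3, 3, 2, 3, 2, 3, 3, 10]
    lengths = [155, 160, 165, 170, 180, 190, 200, 195, 185, 175]

    split_nose   = [[g for g, n in zip(genders, noses) if n < 10],
                    [g for g, n in zip(genders, noses) if n >= 10]]
    split_length = [[g for g, l in zip(genders, lengths) if l <= 180],
                    [g for g, l in zip(genders, lengths) if l > 180]]
    print(information_gain(genders, split_nose))     # small gain
    print(information_gain(genders, split_length))   # larger gain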
  • 71. Can we do better? Split on length? • Safter = Sleft + Sright = 0 + 0,21 = 0,21 • Information Gain = Sbefore – Safter = +0,13 • Split on hip perimeter would be perfect for this training set • NOTE: training set is only a sample of the universe! Scatterplot 15/04/2015Contribute:SummerOfTechnologies 71
  • 72. Democracy versus Totalitarianism • Decision trees are rather sensitive to the sample (overfitting!) • An alternative to a single decision tree is a Random Forest • A random forest classifier is an ensemble of decision trees BUT:  Each tree receives only a subset of the training data  Each tree receives only a subset of the features • The classification is done by adding the class probabilities together • How can this work?  Indecisive trees with bad feature sets have probabilities close to 0,5 => they have no influence  Example: Tree with only nose length (nose length < 0,3): 44% male, 56% female  Example: Tree with length (length > 1m70): 83% male, 17% female • Side effect of Random Forest training: weights help select the dominant features (feature selection = which features sit highest in the best trees?) 15/04/2015Contribute:SummerOfTechnologies 72
  • 73. Python: SciKit library • SciKit library contains machine learning algorithms • Accuracy of forest on testset is 100% 15/04/2015Contribute:SummerOfTechnologies 73
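  The scikit-learn code on this slide is only shown as a screenshot; a minimal sketch of what such a script could look like, trained on the toy table from the decision-tree slides, is given below. It is not the workshop's actual script, and the query point is made up.
    # Minimal scikit-learn sketch on the toy table: length, hip perimeter, nose length.
    from sklearn.ensemble import RandomForestClassifier

    X = [[155, 70, 2], [160, 80, 2], [165, 80, 3], [170, 75, 3], [180, 90, 2],
         [190, 65, 3], [200, 60, 2], [195, 55, 3], [185, 50, 3], [175, 50, 10]]
    y = ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"]

    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(clf.predict([[172, 78, 2]]))          # predicted class label
    print(clf.predict_proba([[172, 78, 2]]))    # averaged class probabilities
    print(clf.feature_importances_)             # crude feature selection signal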
  • 74. Session Overview I. Introduction: I.A Big Data Introduction I.B Hadoop Distributions II. Hadoop Architecture III. Hadoop Ecosystem & Design Patterns IV. Soccer dataset: Introduction & Metadata Parsing V. Introduction to Data Science: Decision Trees VI. Soccer Data Analysis: VI.A Soccer Game Classification & Prediction VI.B Individual Player Analysis VII. Wrap Up 15/04/2015Contribute:SummerOfTechnologies 74
  • 75. VI.A Soccer Data Analysis: Classification & Prediction 15/04/2015Contribute:SummerOfTechnologies 75
  • 76. Classifying soccer games • Outcome = Win/Draw/Loss => Classification • Use 80 games as a training set, 10 games to evaluate the classifier • Feature vectors:  1 FV = 1 game as (Home vs Away)  Content of the FV are soccer statistics: #shots on goal, #passes, #successful passes, #offensive passes, ...  NOTE: eliminate features which have perfect correlation with the result: #goals, #assists  Every feature has 2 raw values (#home passes, #away passes), combined as (#home - #away) / (#home + #away) -> value in [-1,+1] • NOTE: classification can only be done AFTER the game!
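  A tiny helper showing the (#home - #away) / (#home + #away) normalization to [-1, +1]; the zero-total guard is my own assumption, since the slides do not cover that case.
    # +1 = the stat belongs entirely to the home side, -1 = entirely away, 0 = even.
    def normalized_feature(home, away):
        total = home + away
        return 0.0 if total == 0 else (home - away) / total

    print(normalized_feature(300, 200))   # 0.2 -> home side had more passes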
  • 77. MR Job to extract feature vectors • 55 features per game • Mapper parses F24_*.xml and creates feature vector, No Reducer required • Events are regular: contain a set of attributes & qualifiers  Create an Event class with an attributes map and a qualifier map  Create Filter classes to filter events: AreaFilter, OutcomeFilter, EventIDFilter, QualifierFilter, DirectionFilter  A function that splits a set of events into Home and Away • Live demo
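  The Event and Filter classes mentioned above are Java and are not shown in the deck; the following is a hypothetical Python rendering of the same design, purely to illustrate the idea of an event with attribute and qualifier maps plus small composable filters. All names and values are placeholders.
    # Hypothetical sketch of the Event + filter design (the real code is Java).
    from dataclasses import dataclass, field

    @dataclass
    class Event:
        attributes: dict = field(default_factory=dict)   # type, x, y, side, player, ...
        qualifiers: dict = field(default_factory=dict)   # qualifier id -> value

    def attribute_filter(name, value):
        return lambda e: e.attributes.get(name) == value

    def qualifier_filter(qid):
        return lambda e: qid in e.qualifiers

    def split_home_away(events):
        home = [e for e in events if e.attributes.get("side") == "home"]
        away = [e for e in events if e.attributes.get("side") == "away"]
        return home, away

    events = [Event({"type": "1", "side": "home"}, {"140": "34.5"}),
              Event({"type": "13", "side": "away"}, {})]
    type_one_home = [e for e in events if attribute_filter("type", "1")(e)]
    print(len(type_one_home), split_home_away(events))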
  • 78. Accuracy 5/10 + 2 close calls 15/04/2015Contribute:SummerOfTechnologies 78
  • 80. Using an RF classifier for prognosis? • Slightly modified MR job: emit feature vectors for both teams • Can we predict the outcome of a game based on their history? • Averaging previous games + calculate the feature vector values ((x-y)/(x+y)) • Use history (9 games) to generate an average feature vector • Use a weighted history to generate the average feature vector: • F = (1*F1 + 2*F2 + 3*F3 + ...) / (1+2+3+...) 15/04/2015Contribute:SummerOfTechnologies 80
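  The weighted-history formula above, as a short Python sketch. It assumes F1 is the oldest game in the history, so more recent games get the larger weights; the example vectors are made up.
    # F = (1*F1 + 2*F2 + 3*F3 + ...) / (1 + 2 + 3 + ...), per feature dimension.
    def weighted_history(vectors):
        dims = len(vectors[0])
        total_w = sum(range(1, len(vectors) + 1))
        return [sum(w * v[d] for w, v in enumerate(vectors, start=1)) / total_w
                for d in range(dims)]

    history = [[0.1, -0.2], [0.3, 0.0], [0.5, 0.4]]   # 3 previous games, 2 features
    print(weighted_history(history))                   # -> [0.366..., 0.166...]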
  • 82. Session Overview I. Introduction: I.A Big Data Introduction I.B Hadoop Distributions II. Hadoop Architecture III. Hadoop Ecosystem & Design Patterns IV. Soccer dataset: Introduction & Metadata Parsing V. Introduction to Data Science: Decision Trees VI. Soccer Data Analysis: VI.A Soccer Game Classification & Prediction VI.B Individual Player Analysis VII. Wrap Up 15/04/2015Contribute:SummerOfTechnologies 82
  • 83. VI.B Soccer Data Analysis: Individual Player Analysis Extract and visualize stats, rank players, does our ranking reveal talented players? 15/04/2015Contribute:SummerOfTechnologies 83
  • 84. Extracting player stats with MR • Select a number of features of interest: shots on target, passes, ... • Mapper: extract these stats per game and emit (PlayerID, stats) • Reducer: aggregate stats per player and emit (PlayerID, (agg_stats, #games)) • Pig: create a player score & player ranking • Python: visualize player stats in scatter plots • Live Demo
  • 85. Generating scatterplots in python 15/04/2015Contribute:SummerOfTechnologies 85
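  The plotting code on this slide is a screenshot; below is a generic matplotlib sketch of the kind of scatter plot being discussed: one aggregated stat against another, one point per player. The stat names and values are made up for illustration.
    # Generic matplotlib sketch (not the workshop's actual plotting script).
    import matplotlib.pyplot as plt

    players = ["A", "B", "C", "D"]
    shots_on_target_pg = [1.2, 0.4, 2.1, 0.9]        # per game, placeholder values
    successful_dribbles_pg = [0.8, 1.5, 2.4, 0.3]

    plt.scatter(shots_on_target_pg, successful_dribbles_pg)
    for name, x, y in zip(players, shots_on_target_pg, successful_dribbles_pg):
        plt.annotate(name, (x, y))                   # label points to spot outliers
    plt.xlabel("shots on target per game")
    plt.ylabel("successful dribbles per game")
    plt.show()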
  • 90. Outliers in scatterplots • What about the players excelling in the scatterplots? • Two categories: Players > 23 and Players <= 23 • Players > 23: No remarkable career: Bundesliga 2 is their level • Players <= 23: Currently all have a number of caps! 15/04/2015Contribute:SummerOfTechnologies 90
  Carlos E. Marquez (Brazil) - Kazan
  Marko Marin (Germany) - Chelsea -> Sevilla
  Chinedu Obasi (Nigeria) - Schalke 04
  Patrick Helmes (Germany) - Wolfsburg
  Nando Rafael (Angola) - Düsseldorf
  • 91. Player ranking in Pig • OptaSports has the Castrol Index to rank player performance • Demo Pig: create a player ranking based on 2 scores:  Attacker_Score: shots_on_target / avg_sot + successful_dribbles / avg_sd + touches_in_square / avg_tis  Allround_Score: Attacker_Score + successful_offensive_passes / avg_sop + successful_passes / avg_sp  Suggestions?
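  The two scores above, sketched in Python; in the workshop they are computed in Pig. The stat and average keys below are placeholders following the formula names, and the numbers are invented.
    # Attacker_Score and Allround_Score from the slide, normalized by league averages.
    def attacker_score(stats, avg):
        return (stats["shots_on_target"] / avg["shots_on_target"]
                + stats["successful_dribbles"] / avg["successful_dribbles"]
                + stats["touches_in_square"] / avg["touches_in_square"])

    def allround_score(stats, avg):
        return (attacker_score(stats, avg)
                + stats["successful_offensive_passes"] / avg["successful_offensive_passes"]
                + stats["successful_passes"] / avg["successful_passes"])

    avg = {"shots_on_target": 1.0, "successful_dribbles": 1.2, "touches_in_square": 2.0,
           "successful_offensive_passes": 10.0, "successful_passes": 30.0}
    player = {"shots_on_target": 1.8, "successful_dribbles": 2.0, "touches_in_square": 3.1,
              "successful_offensive_passes": 12.0, "successful_passes": 28.0}
    print(attacker_score(player, avg), allround_score(player, avg))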
  • 94. Session Overview I. Introduction: I.A Big Data Introduction I.B Hadoop Distributions II. Hadoop Architecture III. Hadoop Ecosystem & Design Patterns IV. Soccer dataset: Introduction & Metadata Parsing V. Introduction to Data Science: Decision Trees VI. Soccer Data Analysis: VI.A Soccer Game Classification & Prediction VI.B Individual Player Analysis VII. Wrap Up 15/04/2015Contribute:SummerOfTechnologies 94
  • 96. Conclusion Soccer analysis • A random forest classifier can be used both for classification and prediction • Feature selection tells you which features determine the result  => improve your classifier by removing features or use them in a different classification algorithm • Classification accuracy is 50%, while 33% is expected by random guessing • The classification probabilities are very interesting:  Removing the close calls, the classification accuracy is 5/8 (62,5%)  Removing the close calls improves the prognosis to 4/8 (50%) and 4/7 (57%) • Scatterplots are an easy tool to select promising players • Scoring functions based on domain knowledge allow you to rank the players 15/04/2015Contribute:SummerOfTechnologies 96
  • 97. General Conclusion • Big Data is for real! • The Hadoop ecosystem (Pig) makes Big Data accessible to a broader audience • Big Data goes hand in hand with Data Science • A data scientist requires a very broad skillset • Number crunching is Hadoop’s task, while postprocessing is Python’s • We introduced decision trees and Random Forests • Soccer games are hard to predict but promising players are easy to find • The speaker likes Ents and Pinocchio!?