The slides cover MapReduce and Hadoop as basic technologies for Big Data processing. Building on this, the Hadoop ecosystem is explained along with extensions and concepts such as the Lambda Architecture for real-time event processing. The presentation ends with an outlook on future technologies.
3 / 31
M/R: Motivation
● Process large amounts of data to produce other data
● Scale up vs. scale out
4 / 31
M/R: What is it?
● Different programming paradigm
● Based on a Google paper (2004)
● Automatic parallelization and distribution
● I/O Scheduling
● Fault tolerance
● Status and monitoring
5 / 31
M/R: The paradigm
● Input & output: sets of key/value pairs
● Large amounts of data are grouped & sorted
● Job = two phases = Mapper & Reducer
● Map(in_key, in_value) → list(interm_key, interm_value)
● Reduce(interm_key, list(interm_value)) → list(out_key, out_value)
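The two phases above can be sketched in plain Python (a word-count simulation, not Hadoop's actual Java API; the function and variable names are illustrative):

```python
from collections import defaultdict

def mapper(in_key, in_value):
    # Map(in_key, in_value) -> list(interm_key, interm_value)
    return [(word, 1) for word in in_value.split()]

def reducer(interm_key, interm_values):
    # Reduce(interm_key, list(interm_value)) -> list(out_key, out_value)
    return [(interm_key, sum(interm_values))]

def run_job(inputs):
    # Map phase: emit intermediate key/value pairs
    intermediate = []
    for key, value in inputs:
        intermediate.extend(mapper(key, value))
    # Shuffle phase: group & sort intermediate pairs by key
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one reducer call per intermediate key
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output

print(run_job([("doc1", "big data big compute")]))
# -> [('big', 2), ('compute', 1), ('data', 1)]
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; this sketch only shows the data flow of the paradigm.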
9 / 31
Hadoop: What is it?
● Framework based on Google MapReduce / GFS
● Apache project
● Developed in Java
● Multiple applications
● Used by many companies
● De-facto standard in community
11 / 31
Hadoop: HDFS concepts
● Distributed file system; a layer on top of ext3, xfs, ...
● Works best on huge files
● Redundancy (default replication factor: 3)
● Poor seeking, no append!
● Scales well at rack level, not at data-center level
● Files divided into 128 MB – 256 MB blocks
● Computation is sent to the data!
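A back-of-the-envelope sketch (plain Python arithmetic, not an HDFS API) of what the block size and replication factor above mean for storage:

```python
def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Blocks and raw cluster storage for one file under HDFS defaults."""
    # Number of blocks the file is split into (the last block may be partial)
    blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    # Raw storage consumed across the cluster, counting all replicas
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

# A 1 GB file with defaults: 8 blocks of 128 MB, 3 GB of raw storage
print(hdfs_footprint(1024))  # -> (8, 3072)
```

This is also why HDFS "works best on huge files": a 1 KB file still occupies one block entry in the NameNode's metadata, so many tiny files waste metadata capacity even though they waste little disk.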
30 / 31
New trends: Spark
● Next-generation MapReduce
● Integrated with, but not dependent on, Hadoop
● Fast memory-optimized execution engine
● Avoids many Hadoop problems:
  ● Overhead
  ● High latency
  ● Many disk writes
● In-memory cache
● Flexible execution graph
● Much faster than MapReduce (up to 100x)
● Shark (SQL)
● Supports streaming (beta)
31 / 31
BIG DATA TECHNOLOGY
● Juanjo Mostazo
● juanj.mostazo@gmail.com
● http://www.slideshare.net/juanjmostazo/mr-hadoop-cbase