4. Current state
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
www.luxoft.com
5. Limitations & Problems
Big data is difficult to work with using most relational databases, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers.
eBay.com uses two data warehouses at 7.5 petabytes
Walmart handles more than 1 million customer transactions every hour
Facebook handles 50 billion photos from its user base
In 2012, the Obama administration announced the Big Data Research and Development Initiative
7. CORE HADOOP - MapReduce
In 2004, Google published a paper on a process called MapReduce
DISTRIBUTED COMPUTING FRAMEWORK
Process large jobs in parallel across many nodes and combine the results
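The map, group, and combine steps can be illustrated with a small word count. This is only a single-process sketch of the idea, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for illustration, and on a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent, which is what lets a cluster
    # run them in parallel; here they simply run one after another.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two "splits" standing in for blocks of a large file.
splits = ["big data is big", "data is everywhere"]
counts = word_count(splits)
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread across as many nodes as there are splits or keys.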
8. Hadoop Structure
HDFS is a distributed file system designed to run on commodity hardware
HBase stores data rows in labelled tables (each row has a sortable key and an arbitrary number of columns)
Hive provides data summarization, query, and analysis (SQL-like interface)
Pig is a platform for analyzing large data sets, built around a high-level language (Pig Latin) for expressing data analysis programs
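The HBase row model (a sortable key plus an arbitrary number of columns) can be pictured with a toy in-memory table. This is a sketch under stated assumptions, not the HBase client API: the ToyTable class, its put/get/scan methods, and the sample keys and column names are all hypothetical.

```python
class ToyTable:
    """In-memory stand-in for a labelled table keyed by a sortable row key."""

    def __init__(self, name):
        self.name = name   # the table's label
        self.rows = {}     # row key -> {column name: value}

    def put(self, row_key, column, value):
        # Rows need no fixed schema: any row may hold any set of columns.
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield rows with start_key <= key < stop_key in sorted key order."""
        for key in sorted(self.rows):
            if start_key <= key < stop_key:
                yield key, dict(self.rows[key])


# Hypothetical data: rows are inserted out of order with differing columns.
users = ToyTable("users")
users.put("user#002", "info:name", "Bob")
users.put("user#001", "info:name", "Alice")
users.put("user#001", "info:email", "alice@example.com")
users.put("order#001", "info:total", "42")

# A range scan returns only the matching keys, in sorted order.
scanned = list(users.scan("user#", "user#~"))
```

Keeping rows ordered by key is what makes range scans over a key prefix cheap, which is why row key design matters in this kind of store.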
9. Hadoop vs RDBMS
RDBMS strengths:
Performance for relational data
Machine query optimization
Mature workload management
High-concurrency interactive query processing
Hadoop strengths:
Schema-less model
Human query optimization
Ability to create complex dataflows with multiple inputs and outputs
Parallelizes many analytic functions
How might this change in the future?
Query Optimization Improvements in Hive
– Statistics, better join ordering, more join types, etc
Startup Time Improvements
– Simpler query plans to pass out
Runtime Performance Improvements
13. Luxoft Big Data R&D
Hadoop as an ETL data quality tool
BENEFITS
Reduced TCO (commodity hardware usage)
Traceability of all the data quality issues
Hadoop becomes a data cleansing tool.
PROBLEM
Traditional tools perform poorly at exception handling and data cleansing.
SOLUTION
Hadoop transforms the data into a single format and processes it using data cleansing workflows.
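The two steps named in the solution (transform into a single format, then run cleansing rules) can be sketched in a few lines. This is an illustrative single-process sketch, not Luxoft's workflow: the record fields, the date formats, and the function names to_single_format and cleanse are assumptions. Rejected records are logged with the reason, which reflects the traceability benefit listed above.

```python
from datetime import datetime

# Hypothetical date formats used by the different source systems.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def to_single_format(record):
    """Transform one raw record into a common shape: tidy name, ISO date."""
    name = record.get("name", "").strip().title()
    raw_date = record.get("date", "").strip()
    iso_date = None
    for fmt in DATE_FORMATS:
        try:
            iso_date = datetime.strptime(raw_date, fmt).date().isoformat()
            break
        except ValueError:
            continue  # try the next known format
    return {"id": record.get("id"), "name": name,
            "date": iso_date, "raw_date": raw_date}


def cleanse(records):
    """Split records into clean rows and a traceable list of issues."""
    clean, issues = [], []
    for record in records:
        row = to_single_format(record)
        problems = []
        if not row["name"]:
            problems.append("missing name")
        if row["date"] is None:
            problems.append(f"unparseable date: {row['raw_date']!r}")
        if problems:
            issues.append({"id": row["id"], "problems": problems})
        else:
            clean.append({"id": row["id"], "name": row["name"],
                          "date": row["date"]})
    return clean, issues


# Hypothetical input mixing formats and one bad record.
raw = [
    {"id": 1, "name": "  alice smith ", "date": "2014-03-01"},
    {"id": 2, "name": "BOB JONES", "date": "01/03/2014"},
    {"id": 3, "name": "", "date": "not a date"},
]
clean, issues = cleanse(raw)
```

Since each record is normalized and checked independently, the same logic could be run as a map step over a large data set.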
14. Summary
Big Data:
Cutting edge of DI technologies
State-of-the-art design approaches
More than simple development: it is something of an art, the art of data management