This document proposes a big data infrastructure and analytics solution using Hadoop. It discusses (1) constructing a Hadoop cluster on two physical machines, (2) transmitting both structured and unstructured data to HDFS, and (3) performing reporting, analysis, monitoring, and prediction using Hive, HBase, and Mahout. Experimental results show the Hadoop components running and sample queries executing successfully. Future work involves validating the infrastructure with real-world data and further predictive analytics research.
Big Data Infrastructure and Analytics Solution on FITAT2013
1. BIG DATA INFRASTRUCTURE AND ANALYTICS SOLUTION
Erdenebayar Erdenebileg, Oyun-Erdene Namsrai
School of Information Technology, National University of Mongolia
erdenebayar.erdenebileg@gmail.com, oyunerdene@num.edu.mn
3. Introduction
• BIG DATA comes from structured and unstructured sources (web data, market purchases, credit card transactions, …)
• About 10% of BIG DATA is structured; about 90% is unstructured
• Today almost every organization in Mongolia faces BIG DATA problems
• They need to analyze their valuable information and make predictions from it
Why?
How?
FITAT/ISPM 2013
5. Big Data: 3V’s
We face a big data problem along three dimensions, Volume, Variety, and Velocity:
• Volume: transactional data is growing day by day
• Variety: many different types of data must be stored
• Velocity: the data needs to be processed fast
[Figure: the 3V’s of big data.
Data Velocity (fast analysis requirement): batch, periodic, near real time, real time.
Data Variety (many types of data): table, database, web, social, mobile, photo, audio, video, unstructured.
Data Volume (large amount of data): MB, GB, TB, PB.]
7. How to solve the problem?
The full solution is:
1. To construct the BIG DATA infrastructure
2. To find and develop data transmission tools
3. To implement warehousing and mining tools and techniques
4. To provide BI and analytic tools
[Figure: conceptual solution with the layers: BI and analytic tool; warehousing and mining tools and techniques; BIG DATA infrastructure; data transmission tools; data sources (structured, semi-structured, unstructured).]
9. RDBMS based infrastructure
From my experiments:
• Optimization costs more (licenses and servers), although the license cost does not apply to open source RDBMS
• RDBMS does not cope well with data beyond the gigabyte range
• It is not suited to storing unstructured data (video, audio, etc.)
10. HADOOP based infrastructure
From the experience of the biggest companies (Facebook, Yahoo, Twitter, …), the main advantages are:
• Distributed file system paradigm
• Powerful parallel computing framework (MapReduce)
• It can store any type of data: structured, semi-structured, and unstructured
• It is open source and easy to integrate with Hadoop-related products
11. Brief introduction: HDFS Architecture
[Figure: HDFS architecture. A NameNode and a BackupNode sit above four DataNodes, with the label "Balancing, Replication, Failover"; each DataNode stores its data on local disks.]
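To make the idea of blocks and replicas concrete, here is a minimal sketch in plain Python. It only imitates the bookkeeping a NameNode does and does not talk to a real cluster. The 64 MB block size, the replication factor of 2, the node names, and the round-robin placement are all assumptions for illustration; real HDFS chooses replica locations with its own placement policy.

```python
import math

def plan_blocks(file_size_bytes, datanodes, block_size=64 * 1024 * 1024, replication=2):
    """Return a list of (block_index, [datanode, ...]) placements.

    Assumptions for illustration only: fixed block size, round-robin placement.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor cannot exceed the number of DataNodes")
    n_blocks = math.ceil(file_size_bytes / block_size)
    placements = []
    for i in range(n_blocks):
        # Replica r of block i goes to node (i + r) mod N, so no block
        # has two replicas on the same DataNode.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        placements.append((i, replicas))
    return placements

# Hypothetical node names matching the four DataNodes in the figure.
nodes = ["datanode-1", "datanode-2", "datanode-3", "datanode-4"]
plan = plan_blocks(200 * 1024 * 1024, nodes)  # a 200 MB file
for block, replicas in plan:
    print(block, replicas)
```

With these assumed values a 200 MB file is cut into four blocks, and losing any single DataNode still leaves one copy of every block.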
12. Brief introduction: MapReduce framework
1. We have a big data set (shown in green)
2. The data is split across different servers
3. The data is aggregated and calculated on those servers
4. The consolidated result is returned to the client
[Figure: a Job Tracker coordinating four Task Tracker servers; the data is labeled by year, 2010 to 2013.]
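The four steps can be sketched in plain Python, without Hadoop. This is a single-process imitation of the map, shuffle, and reduce phases, not the Hadoop API. The records are made-up (year, amount) pairs, and each inner list stands for the share of data held by one Task Tracker server.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    year, amount = record
    yield year, amount

def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate the values of one key."""
    return key, sum(values)

def run_job(splits):
    # In Hadoop each split would be mapped on a different server;
    # here the splits are processed one after another.
    mapped = [pair for split in splits for record in split for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))

# Hypothetical data: four splits, one per Task Tracker server.
splits = [
    [(2010, 5), (2011, 3)],
    [(2010, 2), (2012, 7)],
    [(2013, 1), (2011, 4)],
    [(2012, 6), (2013, 9)],
]
totals = run_job(splits)
print(totals)
```

The consolidated result holds one total per year, which is what the client receives in step 4.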
13. Proposed method & solution
The proposed solution is built on Hadoop and open source technologies
14. Proposed method selection (Hadoop stacks)
The proposed method was selected for the following reasons:
• Data should be stored in a distributed system
• Aggregation and calculation should be done in a parallel computing paradigm
• The data is both structured and unstructured, namely mobile call detail records (CDR)
• Data size is about 20 TB
• The method should use open source technologies
15. Full Infrastructure (3 main methods)
[Figure: full infrastructure.
Client machine (Jasper Business Intelligence): client software (reporting tool) and JasperReports Server, connected through a Hive connector and an HBase connector.
Clustered Big Data Infrastructure and Data Processing: Machine 1 (slave Hadoop) and Machine 2 (master Hadoop), running on physical machines (resources).
Data sender and data resources: sensor data (phone, web log, camera, etc.), structured data, semi-structured data, unstructured data.]
16. Method 1: Clustered Big Data Infrastructure and Data Processing
• The first task is to configure the BIG DATA infrastructure with analytic products
• This configuration is a cluster of TWO physical machines
17. Method 2: Data transmission
• Data resources consist of RDBMS data and unstructured data (CDR files, video, …)
• If the data is structured and held in relational databases, we use Sqoop for bulk data transfer
• If the data is unstructured, such as video and files, we need a custom application developed on the HDFS client (over SSH)
• Manual data transfer
• Automatic data transfer (custom application)
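As a rough sketch of the two transfer paths, the Python below only assembles the command lines; it does not run them, so it works without a cluster. The JDBC URL, table name, user name, and paths are hypothetical placeholders. The `sqoop import` options and `hadoop fs -put` are the standard commands of those tools, but the exact invocation used in this work is not shown on the slides.

```python
import shlex

def sqoop_import_command(jdbc_url, table, username, target_dir, mappers=1):
    """Build a Sqoop bulk-import command for one relational table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,         # JDBC URL of the source database
        "--table", table,              # table to copy into HDFS
        "--username", username,
        "--target-dir", target_dir,    # HDFS directory for the imported files
        "--num-mappers", str(mappers), # number of parallel map tasks
    ]

def hdfs_put_command(local_path, hdfs_path):
    """Build an HDFS client command that copies one local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]

# Hypothetical values, for illustration only.
sqoop_cmd = sqoop_import_command(
    "jdbc:mysql://db.example.com/billing", "cdr", "etl_user", "/data/cdr", mappers=2
)
put_cmd = hdfs_put_command("/tmp/sample_cdr.csv", "/data/raw/sample_cdr.csv")

print(shlex.join(sqoop_cmd))
print(shlex.join(put_cmd))
```

A custom application for automatic transfer could pass such a list to `subprocess.run` on a machine where the Hadoop and Sqoop clients are installed.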
18. Method 3: Analytics solution over the BIG DATA
This is the main method. It tries to address the following concepts:
[Figure: business value plotted against complexity.
Business Intelligence (what almost every organization is doing now): Reporting (What happened?), Analysis (Why did it happen?), Monitoring (What is happening now?).
Predictive Analytics (what organizations are focusing on now): Prediction (What will happen?).]
19. Method 3: Analytics solution over the BIG DATA
• This describes how to report, analyze, monitor, and predict over the BIG DATA infrastructure
[Figure: analytics data flow.
Hadoop Distributed File System (resources): sensor data, Hive tables, HBase tables.
Hive warehouse: Hive table creation; data summarization and analysis (reporting, analyzing, monitoring).
HBase table management: HBase table creation; aggregated data and ad-hoc queries (reporting, analyzing, monitoring).
Mahout machine learning and data mining: sensor data in, mined data out (prediction).
Access paths: Hive Query Language (HQL), HBase query, Thrift server, direct access to HDFS.
End user (analytic tool).]
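To illustrate the kind of reporting query that the Hive warehouse answers, here is a small sketch. It does not use Hive: SQLite from the Python standard library stands in so that the example runs without a cluster. The `cdr` table, its columns, and the rows are made up. The SELECT statement uses only GROUP BY, COUNT, SUM, and ORDER BY, which HiveQL also supports, but it has not been run against the infrastructure described here.

```python
import sqlite3

# Reporting question: how many calls did each caller make, and for how long?
REPORT_QUERY = """
SELECT caller, COUNT(*) AS calls, SUM(duration_sec) AS total_sec
FROM cdr
GROUP BY caller
ORDER BY caller
"""

def build_sample_db():
    """Create an in-memory table with a few hypothetical call detail records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cdr (caller TEXT, callee TEXT, duration_sec INTEGER)")
    conn.executemany(
        "INSERT INTO cdr VALUES (?, ?, ?)",
        [("A", "B", 60), ("A", "C", 30), ("B", "A", 120), ("C", "A", 15), ("A", "B", 10)],
    )
    return conn

conn = build_sample_db()
report = conn.execute(REPORT_QUERY).fetchall()
print(report)
```

On Hive the same statement would be compiled into MapReduce jobs, with the grouping done in the shuffle step.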
21. Experimental results
The experimental work focused on the following main tasks:
1. Install and configure the BIG DATA infrastructure (a cluster of 2 physical machines)
2. Import sample unstructured data into HDFS over SSH
3. Run a sample HiveQL query, an HBase query, and a Mahout job over the MapReduce framework
22. Running and monitoring HDFS and MapReduce framework
Sample results: HDFS and MapReduce
Master machine: DataNode, JobTracker, NameNode, SecondaryNameNode (SNN), and TaskTracker are running
Slave machine: DataNode and TaskTracker are running
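One way to check this automatically is to parse the output of the JDK tool `jps`, which lists running Java processes as "pid name" lines. The sketch below assumes that format and compares it with the daemons listed for each machine. The sample output and process ids are made up, and the slides do not say which tool was used to obtain the result.

```python
# Daemons expected on each machine, as listed in the sample results.
EXPECTED = {
    "master": {"NameNode", "SecondaryNameNode", "JobTracker", "DataNode", "TaskTracker"},
    "slave": {"DataNode", "TaskTracker"},
}

def running_daemons(jps_output):
    """Collect the process names from jps-style '<pid> <name>' lines."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            names.add(parts[1])
    return names

def missing_daemons(role, jps_output):
    """Return the expected daemons for this role that are not running."""
    return EXPECTED[role] - running_daemons(jps_output)

# Hypothetical jps output from the slave machine.
sample_slave_output = """\
2201 DataNode
2345 TaskTracker
2999 Jps
"""
print(missing_daemons("slave", sample_slave_output))
print(sorted(missing_daemons("master", sample_slave_output)))
```

An empty set means every expected daemon is up; the second call shows what would be reported if the same output came from the master machine.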
23. Running and working Hive warehouse
Sample results: Hive warehouse and HiveQL
24. Running and working HBase table management
Sample results: HBase table management and RESTful web service
25. Future work and Conclusion
Continue the data mining research
26. Future work
I will continue my research on BIG DATA and analytic solutions:
1. Validate the proposed infrastructure with real-world data (mobile call logs, camera sensors)
2. Keep researching new technologies that support our architecture
3. Predict and analyze real data on the infrastructure (market basket analysis, recommendation, etc.)
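As a small illustration of what market basket analysis computes, the sketch below counts how often two items appear in the same basket. It is plain Python on made-up baskets and does not use Mahout; on the proposed infrastructure the same counting would run as a distributed job over real data.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical purchase baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "eggs", "milk"],
]
counts = pair_counts(baskets)
print(counts.most_common(3))
```

Pairs with high counts are candidates for association rules and for simple "bought together" recommendations.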
27. Conclusion
1. This is a full analytics solution for analyzing big data on the Hadoop Distributed File System:
- Reporting (What happened?) (Hive)
- Analysis (Why did it happen?) (Hive, HBase)
- Monitoring (What is happening now?) (Hive)
- Prediction (What will happen?) (Mahout)
Good afternoon, dear professors, teachers, and students. My name is Erdenebayar, and I am a master's student at the School of Information Technology, National University of Mongolia. I greatly appreciate the chance to introduce our research work. It is one of the important moments of my life. Today I will introduce my research on Big Data infrastructure and analytics solutions.
These are the main topics.
First of all, I’ll introduce why I’m researching big data and analytics. In Mongolia ….. Nowadays ….. Because I work on the Data Management team at a software development company and have discussed this with our biggest customers (government and business companies).
Currently we face a big data problem for reasons of Volume, Variety, and Velocity. The first is Volume: transactional data is growing day by day (MB, GB, TB, PB, ZB). The second is Variety: it is mainly about data types, because many different devices store different types of data. The last is Velocity: every business needs to analyze and process its data very fast to plan its future business.
We can address the big data problem and the needs of business companies in the following way. This picture shows the conceptual solution.
In this topic, I will describe some methods and compare different methodologies. We can store big data on an RDBMS or on a NoSQL database.
Hadoop consists of two main products: the Hadoop Distributed File System and the MapReduce data processing framework. I will briefly introduce these two products.
I would like to thank my professor, Oyun-Erdene. She always coaches and teaches me in every case.
Thank you for your attention. If you have any questions, I would be happy to answer them.