Real Time Analytics using Cloudera Impala in Manufacturing use case

Final Project

Rapheephan Thongkham-uan (Nancy)
CSCI E-185 Big Data Analytics

@Rapheephan Thongkham-Uan

Friday, May 10, 13

Making Big Data make money
In manufacturing, ...

• We want to improve supply chain management by tracking defective parts, finding bottlenecks, etc.
• We analyze large amounts of data with traditional tools, which takes too much time.
• People in the factory are familiar with SQL queries.
• The faster we analyze the big data, the better:
  - faster detection of defects and bottlenecks
  - near-real-time problem solving and decision-making
  - less time and money spent on defects

That's why we need Cloudera Impala


Requirements

• Cloudera Manager 4.5.2 installation guide
  - http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Free/latest/ClouderaManager-Free-Edition-Installation-Guide/Cloudera-Manager-Free-Edition-Installation-Guide.html
• My VM
  - Ubuntu 12.04 (Precise) 64-bit
  - CDH 4.2
  - Cloudera Manager 4.5.2
• I installed Impala via Cloudera Manager



After finishing the Cloudera Manager installation


We will use Hue Web UI to query Impala

From the Services menu bar, click HUE1 and choose Hue Web UI.


Create table in Hive
Create the Hive table as user impala, then load the data from the local file system into the table.
$ sudo -E -u impala hive -e "CREATE TABLE khsample (id INT, sdate STRING, seq INT, product STRING, ope STRING, resource_grp STRING, resource STRING, inflow FLOAT, proclot FLOAT, wip FLOAT, ope_rate FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
$ sudo -E -u impala hive -e "LOAD DATA LOCAL INPATH 'KH_RESULT.csv' INTO TABLE khsample;"



Sample table in Hue Web UI
We can view the table we just created from the Hive shell in the Hue Web UI.
* The input data includes Japanese characters, which cannot be displayed correctly.



Refresh Impala metadata
Before querying Impala from the Hue Web UI, we have to refresh Impala's metadata first so that it sees the new Hive table. In impala-shell, enter the following command:
$ impala-shell
[impala-server:21000] > refresh;



Query in Impala
In the Hue Web UI, click the Impala icon to open the query editor page, then enter the query and execute it.



Bottlenecks query
- To find the groups of machines that are bottlenecks, we compare WIP (work in process) by day. A group of machines whose WIP is higher than on the previous sampling date can be predicted to be a bottleneck.
- The simulation dates ran from 12/13 to 12/22. I sum the WIP values on each of the sampling dates (12/14, 12/16, 12/18, 12/20, 12/22).
- We need 5 sub-queries in the FROM clause, one per sampling date.
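The idea above can be tried without a cluster. The following is a minimal sketch, not the query used in the project: it runs the same pattern (one sub-query per sampling date, joined on resource_grp, keeping groups whose WIP never decreases) in an in-memory SQLite database from Python. The table name toy_wip, the group names GRP_A to GRP_C, and all WIP values are hypothetical toy data, only three of the five sampling dates are used, and the dates are stored without the extra quoting seen in the real table.

```python
import sqlite3

# Hypothetical toy rows (resource_grp, sdate, wip); not taken from KH_RESULT.csv.
ROWS = [
    ("GRP_A", "2012/12/18", 10.0),
    ("GRP_A", "2012/12/20", 15.0),
    ("GRP_A", "2012/12/22", 18.0),  # two machines in GRP_A on 12/22,
    ("GRP_A", "2012/12/22", 12.0),  # summed to 30.0 by the sub-query
    ("GRP_B", "2012/12/18", 20.0),
    ("GRP_B", "2012/12/20", 12.0),  # WIP drops here, so GRP_B is filtered out
    ("GRP_B", "2012/12/22", 25.0),
    ("GRP_C", "2012/12/18", 5.0),
    ("GRP_C", "2012/12/20", 5.0),
    ("GRP_C", "2012/12/22", 8.0),
]

# One sub-query per sampling date, joined on resource_grp.
# A group is kept only if its WIP never decreases between sampling dates.
QUERY = """
SELECT a.resource_grp, a.awip AS wip22, b.bwip AS wip20, c.cwip AS wip18
FROM (SELECT resource_grp, SUM(wip) AS awip FROM toy_wip
      WHERE sdate = '2012/12/22' GROUP BY resource_grp) a
JOIN (SELECT resource_grp, SUM(wip) AS bwip FROM toy_wip
      WHERE sdate = '2012/12/20' GROUP BY resource_grp) b
  ON a.resource_grp = b.resource_grp
JOIN (SELECT resource_grp, SUM(wip) AS cwip FROM toy_wip
      WHERE sdate = '2012/12/18' GROUP BY resource_grp) c
  ON a.resource_grp = c.resource_grp
WHERE a.awip >= b.bwip AND b.bwip >= c.cwip
ORDER BY a.awip DESC
"""


def find_bottlenecks(rows):
    """Load the toy rows into SQLite and return the candidate bottleneck groups."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE toy_wip (resource_grp TEXT, sdate TEXT, wip REAL)")
    conn.executemany("INSERT INTO toy_wip VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in find_bottlenecks(ROWS):
        print(row)
```

With this toy data, GRP_A and GRP_C are reported as candidate bottlenecks, while GRP_B is dropped because its WIP falls between 12/18 and 12/20.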



Bottlenecks query (2)
-- The sdate literals include double quotes, as on the original slide
-- (presumably the values are stored quoted in the table).
SELECT A.resource_grp,
       A.awip AS wip22,  -- 12/22 WIP
       B.bwip AS wip20,  -- 12/20 WIP
       C.cwip AS wip18,  -- 12/18 WIP
       D.dwip AS wip16,  -- 12/16 WIP
       E.ewip AS wip14   -- 12/14 WIP
FROM (SELECT resource_grp, SUM(wip) AS awip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/22"'
      GROUP BY resource_grp) A
JOIN (SELECT resource_grp, SUM(wip) AS bwip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/20"'
      GROUP BY resource_grp) B
  ON A.resource_grp = B.resource_grp
JOIN (SELECT resource_grp, SUM(wip) AS cwip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/18"'
      GROUP BY resource_grp) C
  ON A.resource_grp = C.resource_grp
JOIN (SELECT resource_grp, SUM(wip) AS dwip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/16"'
      GROUP BY resource_grp) D
  ON A.resource_grp = D.resource_grp
JOIN (SELECT resource_grp, SUM(wip) AS ewip
      FROM khsample
      WHERE id = 118 AND sdate = '"2012/12/14"'
      GROUP BY resource_grp) E
  ON A.resource_grp = E.resource_grp
-- Keep only groups whose WIP never decreases from 12/14 to 12/22.
WHERE A.awip >= B.bwip AND B.bwip >= C.cwip
  AND C.cwip >= D.dwip AND D.dwip >= E.ewip
ORDER BY A.awip DESC
LIMIT 20;


Comparing the result of Impala with Oracle SQL



Results
• Joining the 5 sub-queries in Oracle SQL took 50 s.
• Joining the 5 sub-queries in Impala took 6.67 s.
• Impala ran the query about 7.5x faster (50 s / 6.67 s) and returned the same results.
• In real use, we could configure Impala to work with HBase and also move the Hive Metastore to Oracle DB.

