1. Analytics using Apache Hive
with the power of Windowing
and Table functions:
Use Cases
Murtaza Doctor - murtaza@richrelevance.com
Principal Architect, RichRelevance
© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.
2. Outline
• {rr} story
• What is Clickstream Analytics
• Hive at {rr}
• Windowing & PTF Framework
• Case Study: use cases
• Current, Next & Future
• Q&A
4. RichRelevance DataMesh
[Architecture diagram; labels recovered from the slide]
• Data Ingestion: Clickstream, Catalog, Online sales, In-store sales, Ad impressions, Social profiles, Redemptions…, 3rd Party, Offline Data Feeds
• Realtime Customer Data Store
• Analytics & Optimization: 125+ models, Customer models, Product models, A/B and MVT testing, King-of-the-hill optimization
• Delivery: Real-time Decisioning (65 msec), Event Triggered (minutes), Batch Updates (hours), Reporting (ad hoc, OLAP, Excel)
• [Client] Innovation Cloud: Custom apps and APIs, Self-Serve Analytics, Personalized Category Sort, Real-time Segmentation, Network Ad Tracking
• {rr} SaaS & APIs
• Underlying Technologies: Hadoop, HBase, Hive, Kafka, Avro, Voldemort, Postgres, Pentaho OLAP, R
5. Did You Know?
• Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real time.
• Our cloud-based platform supports both real-time processes and analytical use cases, using technologies such as Crunch, Hive, HBase, Avro, Azkaban, Voldemort and Kafka.
• In the US, we serve 7,000 requests per second with an average response time of 50 ms.
• Someone clicks on a {rr} recommendation every 21 milliseconds.
7. What is Clickstream Analytics?
• Collect, Combine, Aggregate & Analyze
• Clickstream – view, click, purchase events
• It is all about the Session or Visit
• User properties – userId, location etc.
• Site Optimization, Sentiment Analysis, Buying Patterns and many more
Example: we use click-through rate (clicks/sessions) to measure how well ad placement positions are doing on pages, and can then test them based on engagement to see if other positions would work better.
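The click-through-rate example can be sketched in a few lines of Python. This is only an illustration: the placement names and the counts are made up, and the only thing taken from the slide is the definition clicks/sessions.

```python
def click_through_rate(clicks: int, sessions: int) -> float:
    """Clicks divided by sessions; 0.0 when there are no sessions."""
    return clicks / sessions if sessions else 0.0

# Hypothetical (clicks, sessions) counts per ad placement position.
placements = {"top_banner": (120, 4000), "right_rail": (45, 3000)}

ctr_by_position = {
    position: click_through_rate(clicks, sessions)
    for position, (clicks, sessions) in placements.items()
}

# The position with the highest engagement is the candidate to keep or test against.
best_position = max(ctr_by_position, key=ctr_by_position.get)
```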
8. Getting MAD on Hive
9. MAD skills from Hellerstein’s paper
From a developer's perspective, a data platform should be:
• Magnetic – attract data.
  As opposed to: having to justify loading any new data to a DBA. The quality and schema regimes have the adverse effect of repelling data.
• Agile – data comes in many shapes and forms. Enable bringing in data in its native form.
  As opposed to: forcing a complex ETL process to bring data in.
  With Hive: pluggable formats, storage handlers and indices.
• Deep – ability to operate on data directly, using existing algorithms that operate on native formats: SQL + machine learning + graph + …
  As opposed to: only SQL.
  With Hive: SQL + MapReduce scripts. But can we do better?
10. Problem for App Developer
I want to do
• Sessionization
• Clustering
• Collaborative Filtering
• Fraud detection
• Time Series Analysis
• Churn Analysis
And I want to combine these analyses with SQL.
Analytic capabilities are available in most databases as:
• User Defined Table Functions
• External Table mechanisms etc.
• Aster SQL/MR library provides functions for many of the use cases above
• Oracle Stored Procedures + Table Functions are used to provide analytic packages.
Our work: bring the same capability to Hive.
11. Hive at {rr}
• Real-time data in Hive
• Getting to 1 PB of data in Hive!
• Hive Tables: Event types, Catalog, Rollups etc.
• Custom SerDe
• Partitioning scheme: most of the tables are partitioned by event date
13. Roadblocks to the Solution
• Too many temporary tables
• Random sampling
• R for ranking & aggregate functions
• R can only handle smaller data sets
• Lots of self-joins
• Inefficient queries
14. Welcome to PTFs and Windowing
15. 3 Major SQL Concepts
1. Table Function
• Enables injecting custom logic into the query data flow
• The contract for a table function is table-in/table-out
• This opens up analysis beyond row calculations and aggregations
• Example: a Sessionize function that decides which weblog entries belong to a session
• Syntactically, a function can appear anywhere a table can in SQL
[Diagram labels, top to bottom: Project, tableOut, Table Function, tableIn, Join, Select]
2. Partitioned Table Function
• A scaling mechanism
• Instead of operating on the entire table, divide the work into partitions
• Instances operating on individual partitions don’t communicate
• Example: divide the weblog by day or week and operate on each part independently
• Intuitively like MR: PTF processing is done as MR jobs
[Diagram labels, top to bottom: Select, Project, Partitions Out, Table Function, Partitions In, Join, Select, Select]
3. Windowing
• Operate on a set of rows surrounding the current row
• Windows are defined like '5 preceding and 4 following'
• On the window, allow aggregations and also navigation: lead, lag, first, last
How PTFs and windows are related
• You do windowing after everything else: join, group by etc.
• You define windows on ordered partitions
• You then do aggregations and inter-row navigation on these windows
• If all the partitions across all window expressions are the same, then this is a special PTF.
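To make the windowing idea concrete, here is a minimal Python sketch of a frame of "N preceding and M following" rows evaluated over ordered partitions, plus a lead-style navigation helper. It only mimics the semantics and is not how Hive implements them; the function names, column names and sample rows are all assumptions made for the illustration.

```python
from itertools import groupby
from operator import itemgetter

def windowed_sum(rows, key, order, value, preceding, following):
    """For each row, sum `value` over a frame of `preceding` rows before it and
    `following` rows after it, inside its partition ordered by `order`."""
    out = []
    ordered = sorted(rows, key=itemgetter(key, order))
    for _, group in groupby(ordered, key=itemgetter(key)):
        part = list(group)  # one ordered partition
        for i, row in enumerate(part):
            frame = part[max(0, i - preceding): i + following + 1]
            out.append({**row, "window_sum": sum(r[value] for r in frame)})
    return out

def lead(part, i, column, offset=1):
    """Value of `column` `offset` rows after row i, or None past the partition end."""
    j = i + offset
    return part[j][column] if j < len(part) else None

# Hypothetical weblog rollup: hits per time slot, partitioned by day.
rows = [
    {"day": "d1", "ts": 1, "hits": 10},
    {"day": "d1", "ts": 2, "hits": 20},
    {"day": "d1", "ts": 3, "hits": 30},
    {"day": "d2", "ts": 1, "hits": 5},
]
result = windowed_sum(rows, "day", "ts", "hits", preceding=1, following=1)
```

Frames never cross a partition boundary, which is why the single row of day d2 only sums itself.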
16. Ordered Partition: the central concept
In Functions:
• Analyze a partition of rows as a unit
• Output is not a summary of rows
• Sessionization: relate events to sessions.
• Market Basket: find the most common Product/Page combinations
select *
From Sessionize(weblog….)
In Windowing:
• Ranking: Rank, Tiling
• Trending: Lead/Lag, Cumulative Sum
SELECT ViewsData.*,
  rank() over (DISTRIBUTE BY sessionid
    SORT BY timestamp DESC) as exit_rank
FROM ViewsData
[Diagram labels: Partition, Hive Translator, Output Partition]
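As an illustration of a function that analyzes an ordered partition as a unit and returns rows rather than a summary, here is a small Python sketch of sessionization. The 30-minute inactivity gap, the field names and the function signature are assumptions for this example; the slides do not say how Sessionize decides session boundaries.

```python
def sessionize(events, gap_seconds=1800):
    """Table-in/table-out: take one user's events and return the same rows,
    ordered by time, with a session_id column added.
    A new session starts after more than `gap_seconds` of inactivity
    (a hypothetical rule chosen for this sketch)."""
    out, session_id, prev_ts = [], 0, None
    for event in sorted(events, key=lambda e: e["ts"]):
        if prev_ts is not None and event["ts"] - prev_ts > gap_seconds:
            session_id += 1
        out.append({**event, "session_id": session_id})
        prev_ts = event["ts"]
    return out

# Hypothetical partition: one user's weblog entries, timestamps in seconds.
weblog = [{"ts": 600, "page": "item"}, {"ts": 0, "page": "home"},
          {"ts": 4000, "page": "search"}, {"ts": 4100, "page": "item"}]
sessions = sessionize(weblog)
```

The output has one row per input row, which is the sense in which the output "is not a summary of rows".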
17. Example: Time Series Analysis
Time Series Analysis: identify flights that have a delay problem.
• We want to look at all the times a flight happened and then make a judgment.
• To do this, one conceivable starting point is to find occurrences where a flight was late 3 or more times in a row.
• Use these as a starting point for further analysis.

Flights Table
Origin      Fl. Num  Year  Month  Day  Arr. Delay
Boston      1017     2010  10     25   59.37
Boston      1017     2010  10     26   58.14
Boston      1017     2010  10     28   30.83
Boston      1017     2010  10     29   25.67
Pittsburgh  1058     2010  12     26   82.62

Analyze rows by flight number. Look for sequences of late incidents.
Output aggregation statistics about these sequences.

Origin      FlNum  Year  Month  Day  Avg. Delay  Num Of Delays
Boston      1017   2010  10     25   59.37       8
Boston      1017   2010  11     10   41.54       7
Pittsburgh  1058   2010  12     26   82.62       8
18. Use NPath PTF
Use a PTF: NPath
• Helps you look for patterns in time
• User specifies Labels: interesting conditions, e.g. LATE: arr_delay > 15 mins
• Then specifies Patterns on Labels. Patterns are simple regexes, e.g.
  • LATE.LATE.LATE+ looks for occurrences where a flight is late 3 or more times in a row.
• On the occurrences found (an occurrence is a set of rows), specify aggregation calculations, e.g.
  • Average delay among late occurrences
  • Number of delays
1. Query on Flights Table
select origin_city_name, fl_num, year, month, day_of_month, numDelay, avgDelay
from NPATH(
  'LATE.LATE.LATE+',
  'LATE', arr_delay > 15,
  'origin_city_name, fl_num, year, month, day_of_month, size(tpath) as numDelay, arrAvg(tpath, "arrDelay") as avgDelay'
  on flights
  distribute by fl_num
  sort by year, month, day_of_month
)
2. Looking at data per flight; order within each partition by time
3. Arguments:
• Arg. 1 specifies the PATTERN
• Arg. 2 specifies conditions as LABELS
• Arg. 3 specifies the AGGREGATION EXPRESSIONS
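The label/pattern/aggregation idea behind NPath can be sketched in plain Python by turning each row of an ordered partition into a one-character symbol and running an ordinary regex over the resulting string. This is only an analogy under stated assumptions, not the NPath implementation: the function name, the single-character encoding and the sample delays are hypothetical (the first four values come from the Flights table, the rest are invented).

```python
import re

def find_late_runs(delays, threshold=15.0, min_run=3):
    """Label each delay LATE ('L') or not ('-'), find runs of at least
    `min_run` consecutive LATE rows, and aggregate each run."""
    symbols = "".join("L" if d > threshold else "-" for d in delays)
    runs = []
    # 'L{3,}' plays the role of the pattern LATE.LATE.LATE+
    for match in re.finditer("L{%d,}" % min_run, symbols):
        run = delays[match.start():match.end()]
        runs.append({
            "start": match.start(),
            "num_delays": len(run),
            "avg_delay": sum(run) / len(run),
        })
    return runs

# One flight's arrival delays, already ordered by date (one ordered partition).
delays = [59.37, 58.14, 30.83, 25.67, 5.0, 20.0, 18.0]
late_runs = find_late_runs(delays)
```

The two trailing late arrivals are not reported because the run is shorter than three, which mirrors what the pattern is meant to exclude.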
19. Runtime: PTF execution
[Diagram labels: Hive Translator; Input DataSet; MR Job; Map Splits; Map Task (Table Scan, Select); Reduce Task (Join, PTF, FileSink); Partition, Function, Partition. The shuffle is controlled by the partition and order specification.]
• A PartitionedTableFunction (PTF), given a Partition, computes an output Partition.
• An invocation of a PTF specifies how the input dataset should be partitioned and ordered.
• A PTF defines the shape of its output.
• A PTF may operate on raw data before it is partitioned and ordered.
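The partition-in/partition-out contract can be illustrated with a toy driver in Python: sort rows by the partition and order keys (standing in for the shuffle), collect each partition, invoke the function on it, and feed its rows back into the output stream. The names run_ptf and number_rows are hypothetical, and this sketch says nothing about Hive's actual operator code.

```python
from itertools import groupby

def run_ptf(rows, partition_by, order_by, ptf):
    """Stand-in for the reduce side: group rows into ordered partitions,
    call `ptf` once per partition, and emit its output rows."""
    shuffled = sorted(rows, key=lambda r: (r[partition_by], r[order_by]))
    for _, group in groupby(shuffled, key=lambda r: r[partition_by]):
        partition = list(group)      # rows collected into one partition
        yield from ptf(partition)    # output partition rejoins the row stream

def number_rows(partition):
    """A toy PTF: it defines its own output shape by adding a `pos` column."""
    return [{**row, "pos": i + 1} for i, row in enumerate(partition)]

# Hypothetical input rows arriving in arbitrary order.
rows = [{"sessionid": "b", "ts": 2}, {"sessionid": "a", "ts": 5}, {"sessionid": "b", "ts": 1}]
output = list(run_ptf(rows, "sessionid", "ts", number_rows))
```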
20. {rr} Case Study on Windowing
21. Case I: Landing/Exit Page Rate
• Landing page: the first page the user lands on within a session
• Exit page: the last page through which the user exits a session
• Landing rate: distribution of landing events by page type
• Exit rate: distribution of exit events by page type
• Usage: SEO & Advertising
22. Case I: Landing/Exit Page
SELECT eventdate, landPage, exitPage, COUNT(DISTINCT sessionid)
FROM (
  SELECT sessionid, eventdate,
    -- explicit frame so both functions see the whole ordered session
    first_value(pageType) over (PARTITION BY sessionid ORDER BY timestamp asc
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as landPage,
    last_value(pageType) over (PARTITION BY sessionid ORDER BY timestamp asc
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as exitPage
  FROM (
    SELECT pageType, eventdate, sessionid, timestamp,
      -- no ORDER BY here: c must be the total row count of the session, not a running count
      count(*) over (PARTITION BY sessionid) as c,
      rank() over (PARTITION BY sessionid ORDER BY timestamp asc) as r
    FROM views
    WHERE siteid = 1 and
      eventdate >= '2013-01-01' and eventdate < '2013-01-13'
  ) a
  WHERE r = 1 or r = c   -- keep only the first and last event of each session
) b
GROUP BY eventdate, landPage, exitPage
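For readers who want to check the landing/exit logic on a handful of rows, the same computation can be sketched in Python. This is a small in-memory equivalent under assumed field names that mirror the query's columns; the sample views are invented.

```python
from collections import Counter, defaultdict

def landing_exit_counts(views):
    """Count sessions per (eventdate, landing page type, exit page type)."""
    by_session = defaultdict(list)
    for view in views:
        by_session[view["sessionid"]].append(view)
    counts = Counter()
    for events in by_session.values():
        events.sort(key=lambda v: v["timestamp"])
        first, last = events[0], events[-1]   # landing and exit events
        counts[(first["eventdate"], first["pageType"], last["pageType"])] += 1
    return counts

# Hypothetical page views for two sessions on one day.
views = [
    {"sessionid": "s1", "eventdate": "2013-01-01", "timestamp": 2, "pageType": "item"},
    {"sessionid": "s1", "eventdate": "2013-01-01", "timestamp": 1, "pageType": "home"},
    {"sessionid": "s1", "eventdate": "2013-01-01", "timestamp": 3, "pageType": "cart"},
    {"sessionid": "s2", "eventdate": "2013-01-01", "timestamp": 5, "pageType": "search"},
]
landing_exit = landing_exit_counts(views)
```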
23. Case I: Landing Page Breakdown
24. Case I: Landing Page Time Series
25. Case I: Exit Page Time Series
26. Case II a: Bounce Rate (by Page Type)
• Single page in session
• Landing Page is equal to Exit Page
• Usage: Site engagement metrics report
27. Case II a: Bounce Rate (by Page Type)
SELECT page_type, eventdate,
  sum(case when c = 1 then 1 else 0 end) as bounce_count,
  count(1) as total_sessions
FROM (
  SELECT page_type, eventdate, sessionid, timestamp,
    -- no ORDER BY here: c must be the total page views in the session, not a running count
    count(*) over (PARTITION BY sessionid, eventdate) as c,
    rank() over (PARTITION BY sessionid, eventdate ORDER BY timestamp asc) as bounce_rank
  FROM views
  WHERE siteid = 1 and
    eventdate >= '2013-01-01' and eventdate < '2013-01-13'
) a
WHERE bounce_rank = 1   -- one row per session: its landing page
GROUP BY page_type, eventdate
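The bounce definition (a session with a single page view, attributed to its landing page) can be checked on toy data with a short Python sketch. Field names follow the query's columns; the sample rows and the function name are hypothetical.

```python
from collections import defaultdict

def bounce_rate_by_page_type(views):
    """Bounce rate per (landing page type, eventdate): sessions with exactly
    one page view divided by all sessions that landed on that page type."""
    sessions = defaultdict(list)
    for view in views:
        sessions[(view["sessionid"], view["eventdate"])].append(view)
    totals, bounces = defaultdict(int), defaultdict(int)
    for (_, eventdate), events in sessions.items():
        landing = min(events, key=lambda v: v["timestamp"])["page_type"]
        totals[(landing, eventdate)] += 1
        if len(events) == 1:
            bounces[(landing, eventdate)] += 1
    return {key: bounces[key] / totals[key] for key in totals}

# Hypothetical views: s1 and s3 bounce, s2 views two pages.
views = [
    {"sessionid": "s1", "eventdate": "2013-01-01", "timestamp": 1, "page_type": "home"},
    {"sessionid": "s2", "eventdate": "2013-01-01", "timestamp": 1, "page_type": "home"},
    {"sessionid": "s2", "eventdate": "2013-01-01", "timestamp": 2, "page_type": "item"},
    {"sessionid": "s3", "eventdate": "2013-01-01", "timestamp": 4, "page_type": "search"},
]
bounce_rates = bounce_rate_by_page_type(views)
```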
28. Case II a: Bounce Rate (by Page Type)
29. Case II a: Bounce Rate Time Series
30. Case II b: New versus Repeat Traffic
• Comparison metric between first-time visitors to the site and those who came back more than once
• Usage: Insights into audience optimization
31. Case II b: New vs Repeat Traffic
SELECT userid, siteid, eventdate,
  sum(case when c = 1 then 1 else 0 end) new_users,
  sum(case when c > 1 then 1 else 0 end) repeat_users
FROM (
  SELECT userid, siteid, eventdate,
    -- no ORDER BY here: c is the user's total row count in the date range
    count(*) over (PARTITION BY userid, siteid) as c,
    rank() over (PARTITION BY userid, siteid ORDER BY eventdate) as r
  FROM views
  WHERE siteid = 1 and
    eventdate >= '2013-01-01' and eventdate < '2013-01-14'
) page_views
WHERE r = 1
GROUP BY userid, siteid, eventdate;
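A compact Python sketch of the new-versus-repeat split follows. It makes one explicit assumption that the slide does not state: a "visit" is counted as a distinct event date, so a user seen on one date is new and a user seen on more than one date is a repeat visitor. The sample rows are invented.

```python
from collections import defaultdict

def new_vs_repeat(views):
    """Classify each (userid, siteid) by the number of distinct dates on which
    it appears: one date counts as new, more than one as repeat (assumption)."""
    visit_days = defaultdict(set)
    for view in views:
        visit_days[(view["userid"], view["siteid"])].add(view["eventdate"])
    new_users = sum(1 for days in visit_days.values() if len(days) == 1)
    return {"new_users": new_users, "repeat_users": len(visit_days) - new_users}

# Hypothetical views: u1 browses twice on one day, u2 returns two days later.
views = [
    {"userid": "u1", "siteid": 1, "eventdate": "2013-01-01"},
    {"userid": "u1", "siteid": 1, "eventdate": "2013-01-01"},
    {"userid": "u2", "siteid": 1, "eventdate": "2013-01-01"},
    {"userid": "u2", "siteid": 1, "eventdate": "2013-01-03"},
]
traffic = new_vs_repeat(views)
```

Counting distinct dates rather than rows keeps a user with many page views on a single day from being classified as a repeat visitor.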
32. Case III: Path to Purchase
• The most commonly taken path that leads to a purchase
• Example: search page → item page → add to cart → purchase
• Usage: Site Optimization, Attribution Models
33. Case III: Path to Purchase
SELECT sessionid, eventdate,
  collect_set(page_type) as path_to_purchase
FROM (
  SELECT sessionid, eventdate, page_type,
    -- explicit frame so last_value sees the whole session, not just rows up to the current one
    last_value(page_type) over (PARTITION BY sessionid, eventdate
      ORDER BY timestamp
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as last_page
  FROM (
    SELECT sessionid, eventdate, timestamp, 'purchase' as page_type
    FROM purchases
    WHERE siteid = 999 and eventdate = '2013-01-01'
    UNION ALL
    SELECT sessionid, eventdate, timestamp, page_type
    FROM views
    WHERE siteid = 1 and eventdate = '2013-01-01'
  ) a
) b
WHERE last_page = 'purchase'
GROUP BY sessionid, eventdate
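To find the most commonly taken path, the per-session paths still need to be counted. A minimal Python sketch of that follow-up step is shown below, assuming purchase events have already been merged with page views as in the query; unlike collect_set, it keeps page order and repeated pages. Names and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def paths_to_purchase(events):
    """Count the ordered page-type paths of sessions that end in a purchase."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event["sessionid"]].append(event)
    paths = Counter()
    for session_events in sessions.values():
        session_events.sort(key=lambda e: e["timestamp"])
        if session_events[-1]["page_type"] == "purchase":
            paths[tuple(e["page_type"] for e in session_events)] += 1
    return paths

# Hypothetical merged events: s1 ends in a purchase, s2 does not.
events = [
    {"sessionid": "s1", "timestamp": 1, "page_type": "search"},
    {"sessionid": "s1", "timestamp": 2, "page_type": "item"},
    {"sessionid": "s1", "timestamp": 3, "page_type": "cart"},
    {"sessionid": "s1", "timestamp": 4, "page_type": "purchase"},
    {"sessionid": "s2", "timestamp": 1, "page_type": "search"},
    {"sessionid": "s2", "timestamp": 2, "page_type": "item"},
]
purchase_paths = paths_to_purchase(events)
top_path = purchase_paths.most_common(1)[0][0]
```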
34. Case IV: Most Frequent Next Action
• The path a user takes says a lot about the user experience
• Next most common action
• Example: Search → item page
• Usage: Site Optimization
35. Case IV: Most Frequent Next Action
SELECT page_type, next_page_type, count(*) as c
FROM (
  SELECT sessionid, page_type,
    -- lead() looks one row ahead within the session; it is NULL on the last page
    lead(page_type, 1) OVER (PARTITION BY sessionid ORDER BY timestamp asc) as next_page_type,
    rank() OVER (PARTITION BY sessionid ORDER BY timestamp asc) as page_view
  FROM views
  WHERE siteid = 1 and eventdate = '2013-01-01'
) a
GROUP BY page_type, next_page_type;
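The lead-then-count pattern can be reproduced on a few rows in Python, which makes it easy to see what the grouped output looks like. This is an illustrative sketch with invented data; None stands in for the NULL that lead() returns on the last page of a session.

```python
from collections import Counter, defaultdict

def next_action_counts(views):
    """Count (page_type, next_page_type) pairs within each session."""
    sessions = defaultdict(list)
    for view in views:
        sessions[view["sessionid"]].append(view)
    pairs = Counter()
    for session_views in sessions.values():
        session_views.sort(key=lambda v: v["timestamp"])
        pages = [v["page_type"] for v in session_views]
        # Pair each page with the one that follows it; the last page pairs with None.
        for current, following in zip(pages, pages[1:] + [None]):
            pairs[(current, following)] += 1
    return pairs

# Hypothetical views for two sessions.
views = [
    {"sessionid": "s1", "timestamp": 1, "page_type": "search"},
    {"sessionid": "s1", "timestamp": 2, "page_type": "item"},
    {"sessionid": "s1", "timestamp": 3, "page_type": "cart"},
    {"sessionid": "s2", "timestamp": 1, "page_type": "search"},
    {"sessionid": "s2", "timestamp": 2, "page_type": "item"},
]
next_actions = next_action_counts(views)
```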
36. Case IV: Most Frequent Next Action
37. Case V: Purchase Co-Occurrence
• People who bought X also bought Y
• List of products most frequently bought in the same orders as a user-specified list of products
• Usage: Provides behavioral insights that would not surface in sales metrics
38. Case V: Purchase Co-Occurrence
SELECT siteid, eventdate, userid, sessionid, ip, timestamp, ordernumber, prods
FROM (
  SELECT siteid, eventdate, userid, sessionid, ip, timestamp,
    ordernumber, prods.productid as productid,
    -- number of products in the order that match the listed products
    sum(case when find_in_set(prods.productid, 'P1,P2,P3') > 0 then 1 else 0 end)
      OVER (PARTITION BY purchase_complete_page.ordernumber
        rows between unbounded preceding and unbounded following) as matches,
    collect_set(prods.productid)
      OVER (PARTITION BY purchase_complete_page.ordernumber
        rows between unbounded preceding and unbounded following) as prods,
    -- row_number() with an ordering, so that r = 1 keeps exactly one row per order
    row_number() OVER (PARTITION BY purchase_complete_page.ordernumber
      ORDER BY prods.productid) as r
  FROM purchases
    LATERAL VIEW explode(purchase_complete_page.productspurchased) prodTable as prods
  WHERE eventdate >= $P{startdate} and
    eventdate <= $P{enddate} and
    siteid = $P{siteid}
) t
WHERE matches >= 3 and r = 1
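The follow-up aggregation (which other products show up in orders that contain all the listed products) is easy to sketch in Python with sets. The order data and the function name are hypothetical; only the idea of requiring every listed product to be present comes from the query.

```python
from collections import Counter

def co_purchased(orders, wanted):
    """For orders containing every product in `wanted`, count how often each
    other product appears alongside them."""
    wanted = set(wanted)
    counts = Counter()
    for products in orders.values():
        basket = set(products)
        if wanted <= basket:               # order matches all listed products
            counts.update(basket - wanted)
    return counts

# Hypothetical orders keyed by order number.
orders = {
    "o1": ["P1", "P2", "P3", "P9"],
    "o2": ["P1", "P2"],
    "o3": ["P3", "P2", "P1", "P7", "P9"],
}
also_bought = co_purchased(orders, ["P1", "P2", "P3"])
```

Ranking the resulting counts, or applying a cutoff threshold, gives the "people who bought X also bought Y" list.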
40. Solution: Current
History of this project (started by Harish Butani)
• First provided this functionality on top of Hive
• See the Github project for details, and the Hadoop Summit talk from Harish Butani on this
• Had more functions and features, but not ideal
• So started to fold into Hive in November 2012
• 3 patches for HQL: see Jira 896
• A separate 'windowing & ptf' Hive branch
Hive Journey
• Available as HiveQL
• Currently part of Hive 0.11
• Equivalent to functionality provided by Postgres
• Differences are documented in Jira 4197
41. Solution: Next
Solidify Infrastructure
• Performance improvements
• Dynamic Registration of PTFs.
More Functions
• Candidate Frequent Itemsets: key process in Market Basket Analysis
• TimeLine: another kind of time series analysis, based on a
RichRelevance use case.
42. Solution: Future
Use PTF mechanism to integrate:
• R as R script PTF
• Mahout functions as Mahout PTF
• Groovy script PTF
Query structure:
Select ….
From Rscript(
  'r script'
  on Npath(args…
    On Flights..
  )
)
• Npath identifies interesting incidents
• Use R to make the final decision
[Diagram labels: Reduce Task, Rows, Join, PTF, FileSink, Partition, R Data Frame, rFn., rJava, rEngine]
Multi-pass PTF Operator:
• Enable iterative algorithms: Clustering, Market Basket Analysis, Graph traversal etc.
Editor's Notes
Hive – data warehouse system for Hadoop
How Harish & I met and we decided to collaborate
How we plan to go over stuff
Nuggets or data points
1.5 PB: not as big as Yahoo or Facebook, but huge from a retail industry perspective
Site Optimization and others are just a few of the use cases which can be solved by leveraging clickstream analytics
Hive usage at {rr}
So the picture in your mind should be:
- The user specifies a Function in SQL anywhere a Table can appear
- Behind the scenes: at runtime the Function is responsible for taking a Partition & returning a Partition.
Or:
- The user specifies one or more Windowing expressions
- Behind the scenes the internal Windowing Table Function processes the data, partition by partition.
Windowing and PTF infrastructure is the same
Npath: get the example from Hive
- One last thing, a quick picture of runtime
- Here is how PTFs fit into the Hive flow.
- A Query is translated into a set of Jobs by the Hive Driver.
- Within each task, one or more SQL Operators are executed.
- These operate on a stream of rows.
- For PTFs a new PTF Operator gets injected into the reduce side.
- It collects rows in a partition into a Partition object and invokes the PTF Function.
- Whose job is to provide an output Partition, whose rows get injected back into the stream of rows.
Fluent way to do things
RANK function: the inner query selects a certain set of fields, partitions the data by sessionId and sorts views in that session by timestamp, or the order in which they occurred, starting with the first one. This query then only selects the first event of that session, and that comes from rank = 1
Outer query groups the data by page_type and applies the count aggregate function to the sessionId
Example just does a count
Landing events are pages where referral id is not NULL
Google landing events in a session item page - non bounce page
Sessions which have one row, one where rank() = 1
If you want to compute by a session using a time – you are computing a difference between the first & last – FIRST & LAST value
Highlighting that the window does not have to be a number range; it can be value based
In a row in a session you want to look ahead: what some one time every activity
Timeline function – Table Functions lot more leeway: some kind of pathing just like NPATH
How is it different from last one - Lead function - cannot pivot the value 0 fundamental pattern are the same
How about the following: if I understand the schema, the query below should give you the Orders and the products purchased that contain all the listed products.
So say the products you are looking for are 'P1,P2,P3', then the sum will give you a count of the products in this Order that match one of the listed products.
The having clause will filter out all Orders that don't have at least 3 matches (i.e. matching all the listed products).
The r = 1 condition will return 1 row per order.
The o/p is of the form: OrderNumber, {products in order as a set}, other details…
Can of course return each product in the Order as a separate row if you want to do more aggregation, e.g. count the orders that these products appear in and then rank them or set up a cutoff threshold etc.
Notes: R and SQL
This would bring a different way
Pull data into R
Push R functionality where data is?
Who is thinking about this future?