Hive at LinkedIn
- 3. ©2013 LinkedIn Corporation. All Rights Reserved.
Agenda
LinkedIn Data and its Ecosystem
Performance Improvements – Avro
User experiences
- 4.
LinkedIn Data Sources
Event Data
– Page Views
– Clicks
– Search queries
Database Data
– Profile (Users & Companies)
– Connections
External Data
– Salesforce, DoubleClick
- 5.
Data Ecosystem at LinkedIn
Where's the Data at LinkedIn?
[Diagram labels: Member Data (Profiles); Espresso and RDBMS; External Partner Data; Member Activity (page views, button clicks); Kafka Topics; Front-end Serving Systems; Member-Facing Systems]
Lots of cool stuff not in this picture!
© 2013 LinkedIn, 24 June 2013
- 10.
Data in Hadoop
Almost all LinkedIn data is stored in Hadoop
Tools used
– Hive/HCatalog
– Pig
– Java MapReduce
– Azkaban
- 11.
Hive Usage
Use-cases
– Ad-hoc query
– Reporting
– Building Platforms
Segmentation Engine
Experimentation Engine
Users
– Data Scientists
– Business Analytics
– Security team
– Product team
- 12.
Hive Challenges
Performance
– Faster query execution
Efficient MR* execution plan
– Effective resource usage
– Ensure cluster stability
- 13.
LinkedIn Hive Initiatives
Make HCatalog work and deploy it [Ongoing]
Hive Performance Improvement (Avro data reading) [Ongoing]
Stabilize HiveServer2 at LinkedIn [About to Start]
Expand the scope of HCatalog metadata [Planning]
- 14.
HCatalog Initiatives
Expand the scope of metadata
– Who creates this data?
– What are the inputs?
Helpful for building data lineage
– Who is the maintainer of the data?
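To illustrate how the metadata questions above could support data lineage, here is a minimal Python sketch. The `DatasetMeta` record, its field names (`creator`, `maintainer`, `inputs`) and the dataset names are all hypothetical; none of them come from HCatalog's API or from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMeta:
    """Hypothetical metadata record; not an HCatalog type."""
    name: str
    creator: str      # which process creates this data
    maintainer: str   # who to contact about this data
    inputs: list = field(default_factory=list)  # names of upstream datasets


def lineage(name, catalog):
    """Return every upstream dataset of `name`, nearest first (breadth-first)."""
    seen, order, queue = set(), [], list(catalog[name].inputs)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        if current in catalog:
            queue.extend(catalog[current].inputs)
    return order


# Invented example catalog: a report built from sessions and profiles.
catalog = {
    "page_views": DatasetMeta("page_views", "kafka-etl", "tracking-team"),
    "profiles": DatasetMeta("profiles", "db-snapshot", "data-infra"),
    "sessions": DatasetMeta("sessions", "sessionize-job", "analytics",
                            inputs=["page_views"]),
    "daily_report": DatasetMeta("daily_report", "report-job", "analytics",
                                inputs=["sessions", "profiles"]),
}

print(lineage("daily_report", catalog))
```

With the creator and inputs recorded per dataset, walking the `inputs` links is enough to answer "where did this data come from", and the `maintainer` field says whom to contact at each step.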
- 16.
What is the Problem?
Reading an Avro record takes a long time.
– 52 μs/record
Found the hotspot using VisualVM
- 17.
Improvement #1
Reduce the number of Schema.equals() calls
Schema equality checks are required primarily for evolved schemas.
The solution includes caching to avoid unnecessary, expensive calls
Results
– Trunk read overhead: 52 μs/record
– After this patch, read overhead: 32 μs/record
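The slides do not show the patch itself, so the following is only a sketch of the general technique: memoizing the result of an expensive equality check per pair of schema objects. The `Schema` class here is a stand-in written for this example (it is not Avro's `Schema`), and `SchemaEqualityCache` is a hypothetical name.

```python
class Schema:
    """Stand-in for a record schema; NOT the Avro Schema class."""
    eq_calls = 0  # counts deep comparisons, for demonstration only

    def __init__(self, fields):
        self.fields = tuple(fields)

    def deep_equals(self, other):
        # Represents the expensive field-by-field comparison.
        Schema.eq_calls += 1
        return self.fields == other.fields


class SchemaEqualityCache:
    """Memoize deep_equals results per (reader, writer) object pair."""

    def __init__(self):
        self._results = {}

    def equal(self, a, b):
        if a is b:  # same object: no comparison needed
            return True
        key = (id(a), id(b))
        hit = self._results.get(key)
        if hit is None:
            # Store a and b too, so their ids cannot be reused by new objects.
            hit = (a.deep_equals(b), a, b)
            self._results[key] = hit
        return hit[0]


reader = Schema([("member_id", "long"), ("page", "string")])
writer = Schema([("member_id", "long"), ("page", "string")])
cache = SchemaEqualityCache()

# Simulate reading many records that share one reader/writer schema pair.
results = [cache.equal(reader, writer) for _ in range(10_000)]
print(all(results), Schema.eq_calls)
```

The point of the sketch is that the deep comparison runs once per distinct schema pair rather than once per record, which is where a per-record saving would come from.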
- 18.
Improvement #2
Reduce extra data transformations
Solution is to provide custom object inspectors
Results
– Current read overhead: 52 μs/record
– After this patch, read overhead: 30 μs/record
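As a rough illustration of why an inspector can avoid extra transformations, the Python sketch below contrasts copying every field of a record into a new row with reading only the requested field from the record in place. Both classes are hypothetical stand-ins written for this example; they are not Hive's ObjectInspector interfaces, and the slides do not describe the actual patch at this level of detail.

```python
class EagerDeserializer:
    """Copies every field of a record into a new row (the extra transformation)."""

    def __init__(self, columns):
        self.columns = columns
        self.fields_copied = 0

    def deserialize(self, record):
        self.fields_copied += len(self.columns)
        return [record[c] for c in self.columns]


class RecordInspector:
    """Hypothetical inspector: reads a field straight from the native record."""

    def __init__(self):
        self.fields_read = 0

    def get_field(self, record, column):
        self.fields_read += 1
        return record[column]


columns = ["member_id", "page", "referrer", "user_agent"]
records = [
    {"member_id": i, "page": "/home", "referrer": "", "user_agent": "x"}
    for i in range(1000)
]

# A query that touches a single column, e.g. summing member_id.
eager = EagerDeserializer(columns)
total_eager = sum(eager.deserialize(r)[0] for r in records)

inspector = RecordInspector()
total_lazy = sum(inspector.get_field(r, "member_id") for r in records)

print(total_eager == total_lazy, eager.fields_copied, inspector.fields_read)
```

Both paths produce the same answer, but the inspector path touches one field per record instead of rebuilding the whole row.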
- 19.
Final Results
[Bar chart, vertical axis 0 to 60]
– Trunk: 55
– Improvement #1: 32
– Improvement #2: 30
– Combined: 11
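The relative gains can be worked out from the four chart values (55, 32, 30, 11). The short Python sketch below does that arithmetic; it assumes the values are read overhead in μs/record, as on the preceding slides, since the chart itself carries no unit. The helper name `reduction_pct` is introduced here for illustration.

```python
# Chart values; the unit (μs/record) is an assumption taken from earlier slides.
overhead_us = {"Trunk": 55, "Improvement #1": 32, "Improvement #2": 30, "Combined": 11}


def reduction_pct(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


baseline = overhead_us["Trunk"]
summary = {
    name: (round(reduction_pct(baseline, value), 1), round(baseline / value, 2))
    for name, value in overhead_us.items()
    if name != "Trunk"
}

for name, (pct, speedup) in summary.items():
    print(f"{name}: {pct}% less overhead, {speedup}x faster")
```

On these numbers the combined patches cut per-record overhead by 80%, a 5x improvement over trunk.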
- 21.
Hive User Base at LinkedIn
Out of all our Hadoop users:
– 56% never used Hive
– 44% use Hive
– 27% primarily use Hive
32% of Hive jobs were from ad-hoc queries
- 22.
Who uses Hive and who doesn’t
[Bar chart: Hive adoption among Hadoop users by job title, axis 0% to 100%. Job titles shown: Data Scientists, Engineers, Product Managers, Customer Support Specialists, Analysts]
- 23.
Top concerns about Hive
Not friendly for long/complex workflows
Performance, especially for ad-hoc queries
Steep learning curve for tuning
Data/UDFs unavailability
Editor's Notes
- Hive: ad-hoc queries and reporting, business analytics. Pig: ETL pipelines, production workflows. MR: highly specialized applications. Azkaban: LinkedIn workflows.
- Which process. Data operations can detect the root cause. Email, HTTP address.
- Context of the problem