Hive at LinkedIn

mislam77

Hive efforts at LinkedIn and the experiences of Hive users. Presented by Mohammad Islam, Mark Wagner, and Karthik Ramasamy.

Technology

©2013 LinkedIn Corporation. All Rights Reserved.
Hive at LinkedIn

Agenda
• LinkedIn Data and its Ecosystem
• Performance Improvements – Avro
• User experiences

Where's the Data at LinkedIn?
[Diagram, dated 24 June 2013. Labels: Member Data (Profiles), Espresso and RDBMS; External Partner Data; Member Activity (page views, button clicks), Kafka Topics; Front-end Serving Systems; Member-facing systems. Note on slide: "Lots of cool stuff not in this picture!"]

Data Ecosystem at LinkedIn
[Diagram; surviving label: Member Facing Systems]

Data in Hadoop
• Almost all LinkedIn data is stored in Hadoop
• Tools used
– Hive/HCatalog
– Pig
– Java MapReduce
– Azkaban

Hive Usage
• Use cases
– Ad-hoc queries
– Reporting
– Building platforms
  • Segmentation Engine
  • Experimentation Engine
• Users
– Data scientists
– Business analytics
– Security team
– Product team

Hive Challenges
• Performance
– Faster query execution
  • Efficient MR* execution plan
– Effective resource usage
– Ensure cluster stability

LinkedIn Hive Initiatives
• Make HCatalog work and deploy it [Ongoing]
• Hive performance improvement (Avro data reading) [Ongoing]
• Stabilize HiveServer2 at LinkedIn [About to start]
• Expand the scope of HCatalog metadata [Planning]

HCatalog Initiatives
• Expand the scope of metadata
– Who creates this data?
– What are the inputs?
  • Helpful to create data lineage
– Who is the maintainer of the data?

What is the Problem?
• Reading an Avro record takes a long time
– 52 μs/record
• Found the hotspot using VisualVM

Improvement #1
• Reduce the number of Schema.equals() calls
• Schema equality checks are required primarily for evolved schemas
• Solution includes caching to avoid unnecessary expensive calls
• Results
– Trunk read overhead: 52 μs/record
– Read overhead after this patch: 32 μs/record
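
The caching idea on this slide can be illustrated with a minimal sketch. It is not the actual Hive or Avro patch: the `Schema` class below is a hypothetical stand-in with an artificially counted deep comparison, and `SchemaEqualityCache` is an illustrative name. The sketch only shows how remembering the result for a given pair of schema objects lets repeated per-record checks skip the expensive comparison.

```python
class Schema:
    """Hypothetical stand-in for a schema whose deep comparison is costly."""

    deep_compares = 0  # counts how often the expensive path actually runs

    def __init__(self, fields):
        self.fields = tuple(fields)

    def deep_equals(self, other):
        Schema.deep_compares += 1
        return self.fields == other.fields


class SchemaEqualityCache:
    """Remembers the comparison result for each pair of schema objects."""

    def __init__(self):
        self._results = {}

    def equal(self, a, b):
        if a is b:
            return True  # same object, nothing to compare
        key = (id(a), id(b))
        entry = self._results.get(key)
        if entry is None:
            # Keep references to both schemas so their ids are not reused
            # by other objects while the cached entry is alive.
            entry = (a, b, a.deep_equals(b))
            self._results[key] = entry
        return entry[2]


# Simulate checking the same writer/reader schema pair once per record.
writer = Schema(["member_id", "page"])
reader = Schema(["member_id", "page"])
cache = SchemaEqualityCache()
results = [cache.equal(writer, reader) for _ in range(1000)]
```

In this sketch the deep comparison runs once for the 1,000 simulated records; every later check is a dictionary lookup.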

Improvement #2
• Reduce extra data transformations
• Solution is to provide custom object inspectors
• Results
– Current read overhead: 52 μs/record
– Read overhead after this patch: 30 μs/record
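
One plausible reading of "reduce extra data transformations" is sketched below, under stated assumptions. The slide does not describe the patch in detail, so everything here is hypothetical: `CountingRecord`, `eager_convert`, and `DirectInspector` are illustrative names, not Hive or Avro APIs. The sketch contrasts converting a whole record into an intermediate list before use with an inspector that reads the requested field directly from the original record.

```python
class CountingRecord:
    """Hypothetical record that counts how many field reads it serves."""

    def __init__(self, values):
        self._values = dict(values)
        self.reads = 0

    def get(self, name):
        self.reads += 1
        return self._values[name]


def eager_convert(record, field_names):
    """Transform the whole record into a plain list before any field is used."""
    return [record.get(name) for name in field_names]


class DirectInspector:
    """Answers field lookups against the original record, with no copy."""

    def get_field(self, record, name):
        return record.get(name)


fields = ["member_id", "page", "timestamp", "referrer"]
row = {"member_id": 42, "page": "/home", "timestamp": 1372032000, "referrer": None}

# Eager path: every field is copied even though only one is needed.
eager_record = CountingRecord(row)
eager_value = eager_convert(eager_record, fields)[fields.index("page")]

# Inspector path: only the requested field is touched.
direct_record = CountingRecord(row)
direct_value = DirectInspector().get_field(direct_record, "page")
```

Both paths return the same value; the eager path reads all four fields and builds a list, while the inspector path reads one field.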

Final Results
[Bar chart, vertical axis 0 to 60: Trunk 55, Improvement #1 32, Improvement #2 30, Combined 11]
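
The chart values (55, 32, 30, 11) can be turned into relative gains with a little arithmetic. The sketch below assumes the values are read overheads in the same unit as the preceding slides (μs/record); the chart itself shows 55 for trunk where the earlier slides quote 52, and the chart figure is used here as given.

```python
# Chart values as shown on the slide; the unit is assumed to be μs/record.
overhead = {
    "Trunk": 55,
    "Improvement #1": 32,
    "Improvement #2": 30,
    "Combined": 11,
}

baseline = overhead["Trunk"]

# For each configuration: (speedup factor vs. trunk, percent reduction vs. trunk).
summary = {
    name: (
        round(baseline / value, 2),
        round(100 * (baseline - value) / baseline, 1),
    )
    for name, value in overhead.items()
}

for name, (speedup, reduction) in summary.items():
    print(f"{name}: {speedup}x faster, {reduction}% less overhead")
```

With these inputs the combined patches come out at a 5x reduction in per-record overhead relative to trunk, or 80% less overhead.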

Hive User Base at LinkedIn
Out of all our Hadoop users:
• 56% never used Hive
• 44% use Hive
• 27% primarily use Hive
32% of Hive jobs were from ad-hoc queries

Who uses Hive and who doesn’t
[Chart: Hive adoption among Hadoop users by job title, axis 0% to 100%. Job titles: Data Scientists, Engineers, Product Managers, Customer Support Specialists, Analysts]

Top concerns about Hive
• Not friendly for long/complex workflows
• Performance, especially for ad-hoc queries
• Steep learning curve for tuning
• Unavailability of data/UDFs

LinkedIn Data Sources (slide 4)
• Event Data
– Page views
– Clicks
– Search queries
• Database Data
– Profile (Users & Companies)
– Connections
• External Data
– Salesforce, DoubleClick

Editor's Notes

Hive: ad-hoc queries, reporting, business analytics
Pig: ETL pipelines, production workflows
MR: highly specialized applications
Azkaban: LinkedIn workflows
Which process
Data operation can detect root cause
Email, HTTP address
Context of the problem
