HBase Global Indexing to support large-scale data ingestion at Uber

  1. HBase Global Indexing to support large-scale data ingestion @ Uber May 21, 2019
  2. Danny Chen ● Engineering Manager on Hadoop Data Platform team ● Leading Data Ingestion team ● Previously worked on the storage team (Manhattan) ● Enjoy playing basketball, biking, and spending time with my kids.
  3. Uber Apache Hadoop Platform Team Mission Build products to support reliable, scalable, easy-to-use, compliant, and efficient data transfer (both ingestion & dispersal) as well as data storage leveraging the Apache Hadoop ecosystem. Apache Hadoop is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of this mark.
  4. Overview ● High-Level Ingestion & Dispersal introduction ● Different types of workloads ● Need for Global Index ● How Global Index Works ● Generating Global Indexes with HFiles ● Throttling HBase Access ● Next Steps
  5. High Level Ingestion/Dispersal Introduction
  6. Hadoop Data Ecosystem at Uber Apache Hadoop Data Lake Schemaless Analytical Processing Apache Kafka, Cassandra, Spark, and HDFS logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks. Data Ingestion Data Dispersal
  7. Hadoop Data Ecosystem at Uber
  8. Different Types of Workloads
  9. Bootstrap ● One time only at beginning of lifecycle ● Large amounts of data ● Millions of QPS throughput ● Need to finish in a matter of hours ● NoSQL stores cannot keep up
  10. Incremental ● Dominates lifecycle of Hive table ingestion ● Incremental upstream changes from Kafka or other data sources. ● 1000’s QPS per dataset ● Reasonable throughput requirements for NoSQL stores
  11. Cell vs Row Changes
  12. Need for Global Index
  13. Requirements for Global Index ● Large amounts of historical data ingested in short amount of time ● Append only vs Append-plus-update ● Data layout and partitioning ● Bookkeeping for data layout ● Strong consistency ● High Throughput ● Horizontally scalable ● Required a NoSQL store
  14. ● Decision was to use HBase ● Trade Availability for Consistency ● Automatic Rebalancing of HBase tables via region splitting ● Global view of dataset via master/slave architecture
  15. How Global Index Works
  16. Generating Global Indexes
  17. Batch and One Time Index Upload
  18. Data Model For Global Index
  19. Spark & RDD Transformations for index generation
  20. HFile Upload Process
  21. HFile Index Job Tuning ● Explicitly register classes with Kryo Serialization ● Reduce 3 shuffle stages to one ● Proper HFile Size ● Proper Partition Counting Size ● 13 TB index data with 54 billion indexes ○ 2 hours to generate indexes ○ 10 min to load
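The "Proper HFile Size / Proper Partition Counting" tuning bullets above boil down to one-partition-per-HFile arithmetic. A minimal sketch of that calculation, where the total size comes from the slide (13 TB of index data) but the target HFile size is an assumed illustrative figure, not Uber's actual setting:

```python
# Hypothetical sketch of the partition-count arithmetic behind the tuning
# bullets: one Spark partition writes one output HFile, and the partition
# count is chosen so each HFile stays near a target size.
def hfile_partition_count(total_bytes: int, target_hfile_bytes: int) -> int:
    """Number of partitions so each writes roughly one target-sized HFile."""
    return max(1, -(-total_bytes // target_hfile_bytes))  # ceiling division

total = 13 * 1024**4   # ~13 TB of index data (figure from the slide)
target = 10 * 1024**3  # assumed ~10 GB per HFile (illustrative only)
partitions = hfile_partition_count(total, target)
```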
  22. Throttling HBase Access
  23. The need for throttling HBase Access
  24. Horizontal Scalability & Throttling
  25. Next Steps
  26. Next Steps ● Handle non-append-only data during bootstrap ● Explore other indexing solutions
  27. Useful Links open-source/
  28. Other Dataworks Summit Talks Marmaray: Uber’s Open-sourced Generic Hadoop Data Ingestion and Dispersal Framework Wednesday at 11 am
  29. Attribution Kaushik Devarajaiah Nishith Agarwal Jing Li
  30. Positions available: Seattle, Palo Alto & San Francisco email : We are hiring!
  31. Thank you Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed. All recipients of this document are notified that the information contained herein includes proprietary information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. Questions: email Follow our Facebook page:

Editor's Notes

  1. Lots of effort into making a completely self-serve onboarding process. Analytical users with little technical knowledge of Spark, Hadoop, Hive, etc. will still be able to take advantage of our platform. Our assertion is that when relevant data is discoverable in the appropriate data stores for analytical purposes, there can be substantial gains in efficiency and value for your business. Marmaray is critical for ensuring data is in the appropriate data store. Familiarity with the suite of tools in our Hadoop ecosystem enables many potential use cases for extracting insights out of raw data.
  2. Completion of the Hadoop Ecosystem of tools at Uber and original vision of the Data Processing Platform Heatpipe/Watchtower produce quality schematized data Ingest the data via Marmaray Orchestrate jobs via Workflow Management System to run analytics and generate derived datasets, or build models using Michelangelo Disperse the data using Marmaray to stores with low latency semantics What sets it apart Generic ingestion framework Not tightly coupled to any source or sink Shouldn’t be coupled to a specific source or a specific sink (product teams focus on this)
  3. Dividing bootstrap and incremental allows us to choose a KV store that can scale for incremental-phase indexing but not necessarily for bootstrapping of data.
  4. HBase automatically rebalances tables within a cluster by splitting up key ranges when a region gets too large. Can also load balance by having new regions moved to other servers The master-slave architecture enables getting a global view of the spread of a dataset across the cluster, which we utilize in customizing dataset specific throughputs to our HBase cluster.
  5. During incremental ingestion we work in mini batches. It is the job of the work unit calculator to provide the required level of throttling.
  6. We work in mini batches. It is the job of the work unit calculator to provide the required level of throttling.
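The work-unit-calculator idea in the two notes above can be sketched as a simple budget check: cap how many pending records one mini batch may ingest so the resulting HBase index lookups stay under a per-dataset QPS budget. All names and numbers here are illustrative assumptions, not Uber's implementation:

```python
# Hedged sketch of a "work unit calculator": throttling is achieved by
# sizing the mini batch, not by slowing individual requests.
def work_unit_size(pending_records: int, qps_budget: int, batch_seconds: int) -> int:
    """Records allowed in this mini batch under the QPS budget."""
    return min(pending_records, qps_budget * batch_seconds)

# e.g. 5M pending records, a 2000 QPS budget, 10-minute batches
assert work_unit_size(5_000_000, 2_000, 600) == 1_200_000
```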
  7. Our Big Data ecosystem’s model of indexes stored in HBase contains entities shown in green that help identify files that need to be updated corresponding to a given record in an append-plus-update dataset. The layout of index entries in HFiles lets us sort based on key value and column.
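The data model described above can be sketched as a key-value mapping from record key to the data file that must be rewritten when an update arrives. This is an assumed minimal layout for illustration, not the exact schema from the talk:

```python
# Minimal sketch of the global-index data model: HBase row key = record
# key; the entry identifies the file holding the current copy of that
# record in an append-plus-update dataset. A dict stands in for HBase.
index = {}  # record_key -> {"file_id": ..., "version": ...}

def upsert_index(record_key, file_id, version):
    """Point the record at a newer file; ignore stale (older) versions."""
    entry = index.get(record_key)
    if entry is None or version > entry["version"]:
        index[record_key] = {"file_id": file_id, "version": version}

def file_to_update(record_key):
    """Which data file must be rewritten for an update to this record."""
    entry = index.get(record_key)
    return entry["file_id"] if entry else None
```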
  8. This is for the one time upload case FlatMapToMair transformation in Apache Spark does not preserve the ordering of entries, so a partition isolated sort is performed. The partitioning is unchanged to ensure each partition still corresponds to a non-overlapping key range.
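The partition-isolated sort in the note above can be sketched in a few lines: since the partitioning already gives each partition a non-overlapping key range, sorting each partition independently restores the ordering that FlatMapToPair lost, with no cross-partition shuffle. Plain Python lists stand in for Spark partitions here:

```python
# Sketch of a partition-isolated sort: each partition's (key, value)
# pairs are sorted in place, and no data moves between partitions, so
# the non-overlapping key ranges are preserved.
def sort_within_partitions(partitions):
    """Sort each partition independently; partition boundaries unchanged."""
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]
```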
  9. HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process. HFile upload can be severely affected by region splitting, so we avoid splits during the load: the HBase table is pre-split into as many regions as there are HFiles, with non-overlapping key ranges, so each HFile fits within a separate HBase region. With pre-splitting, the upload takes about 10 minutes even for tens of TB.
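The pre-split step described in the note above amounts to deriving region split points from the HFile key boundaries. A hedged sketch, assuming the HFiles are already sorted into non-overlapping key ranges and identified by their first key:

```python
# Sketch of deriving HBase region split points from HFile boundaries:
# using every HFile's first key (except the lowest) as a split point
# yields one region per HFile, so no region splits occur mid-upload.
def split_points(hfile_first_keys):
    """Region split points from the first key of each non-overlapping HFile."""
    keys = sorted(hfile_first_keys)
    return keys[1:]
```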
  10. HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process.
  11. Three Apache Spark jobs corresponding to three different datasets access their respective HBase index tables, creating load on the HBase region servers hosting these tables.
  12. Adding more servers to the HBase cluster for a single dataset that is using global index linearly correlates with a QPS increase, although the dataset’s QPSFraction remains constant.
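The QPSFraction behavior in the note above can be expressed as one line of arithmetic: a dataset's absolute QPS is its fixed fraction of total cluster capacity, so adding region servers raises that QPS linearly while the fraction stays constant. Per-server capacity here is an assumed illustrative figure:

```python
# Sketch of the QPSFraction model: dataset QPS scales linearly with the
# number of region servers while its assigned fraction is unchanged.
def dataset_qps(num_servers: int, qps_per_server: int, qps_fraction: float) -> float:
    return num_servers * qps_per_server * qps_fraction
```

Doubling the servers doubles the dataset's QPS at the same fraction, which matches the linear correlation the note describes.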
  13. Explore other indexing solutions to possibly merge bootstrap and incremental indexing solutions for easier maintenance.