Crossing the Chasm: Hadoop for the Enterprise
Sanjay Radia – Hortonworks Founder & Architect
Formerly Hadoop Architect @ Yahoo! (4 years @ Yahoo!)
@srr (@hortonworks)
© Hortonworks Inc. 2011
June 29, 2011

Crossing the Chasm (Geoffrey A. Moore)
  Apache Hadoop grew rapidly, charting new territories in features, abstractions, APIs, scale, fault tolerance, multi-tenancy, operations …
    Small number of early customers who needed a new platform
    Provide Hadoop as a service to make adoption easy
  Today: dramatic growth in adoption and customer base
    Growth of Hadoop stack and applications
    New requirements and expectations
  [Figure: adoption curve; labels: Early Adopters, Early Majority, Late Majority, Mission Critical]

Crossing the Chasm: Overview
  How the Chasm is being crossed
    Security
    SLAs & Predictability
    Scalability
    Availability & Data Integrity
    Backward Compatibility
    Quality & Testing
  Fundamental architectural improvements
    Federation & MR.next
    Adapt to changing, sometimes unforeseen, needs
    Fuel innovation and rapid development
  The Community Effect

Security
  Early Gains
    No authorization or authentication requirements
    Added permissions and passed client-side userid to server (0.16)
      Addresses accidental deletes by another user
    Service Authorization (0.18, 0.20)
  Issues: stronger authorization required
    Shared clusters – multiple tenants
    Critical data
    New categories of users (financial)
    SOX compliance
  Our Response
    Authentication using Kerberos (0.20.203)
    10 person-year effort by Yahoo!

SLAs and Predictability
  Issue: customers uncomfortable with shared clusters
    Customers traditionally plan for peaks with dedicated HW
    Dedicated clusters had poor utilization
  Response: Capacity scheduler (0.20)
    Guaranteed capacities in a multi-tenant shared cluster
      Almost like dedicated hardware
      Each organization is given queue(s) with a guaranteed capacity; the organization
        controls who is allowed to submit jobs to its queues
        sets the priorities of jobs within its queue
        creates sub-queues (0.21) for finer-grained control within its capacity
    Unused capacity given to tasks in other queues
      Better than a private cluster – access to unused capacity when in a crunch
    Resource limits for tasks – deals with misbehaved apps
  Response: FairShare Scheduler (0.20)
    Focus is fair share of resources, but does have pools

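The guaranteed-capacity idea on the slide above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Capacity Scheduler's actual algorithm: the function name `allocate_slots`, the percentage-based guarantees, and the round-robin lending of idle slots are all hypothetical choices made for the example.

```python
def allocate_slots(total_slots, guaranteed_pct, demand):
    """Toy model: each queue first gets min(demand, guarantee);
    idle slots are then lent to queues that still have unmet demand.

    guaranteed_pct: queue name -> guaranteed share of the cluster, in percent
    demand:         queue name -> number of slots the queue currently wants
    """
    # Step 1: honor the guarantee, but never hand out more than a queue asks for.
    alloc = {
        q: min(demand[q], total_slots * pct // 100)
        for q, pct in guaranteed_pct.items()
    }
    spare = total_slots - sum(alloc.values())

    # Step 2: lend unused capacity, one slot at a time, round-robin
    # (an assumed policy chosen for simplicity).
    while spare > 0:
        hungry = [q for q in sorted(alloc) if alloc[q] < demand[q]]
        if not hungry:
            break
        for q in hungry:
            if spare == 0:
                break
            alloc[q] += 1
            spare -= 1
    return alloc


if __name__ == "__main__":
    guarantees = {"ads": 50, "search": 30, "research": 20}
    # "ads" is mostly idle, so its unused guarantee is lent to the other queues.
    print(allocate_slots(100, guarantees, {"ads": 20, "search": 60, "research": 40}))
```

When every queue is at peak demand, each one falls back to exactly its guarantee, which is the "almost like dedicated hardware" property; when a queue is idle, the others can exceed their guarantee, which is the advantage over a private cluster.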
Scalability
  Early Gains
    Simple design allowed rapid improvements
      Single master, namespace in RAM, simpler locking
      Cluster size improvements: 1K to 2K to 4K
      Vertical scaling: tuned GC + efficient memory usage
    Archive file system – reduce files and blocks (0.20)
  Current Issues
    Growth of files and storage limited by single NN (0.20)
      Only an issue for very, very large clusters
    JobTracker does not scale beyond 30K tasks – needs redesign
  Our Response
    RW locks in NN (0.22)
    MR.next – complete rewrite of MR servers (JT, TT) – 100K tasks (0.23)
    Federation: horizontal scaling of namespace – billion files (0.23)
    NN that keeps only part of the namespace in memory – trillion files (0.23.x)

HDFS Availability & Data Integrity: Early Gains
  Simple design, Java, storage fault tolerance
    Java – saved from pointer errors that lead to data corruption
    Simplicity – subset of POSIX – random writers not supported
    Storage: rely on the OS’s file system rather than use raw disk
    Storage fault tolerance: multiple replicas, active monitoring
    Single Namenode master
      Persistent state: multiple copies + checkpoints
      Restart on failure
  How well did it work?
    Lost 650 blocks out of 329 M on 10 clusters with 20K nodes in 2009
      82% abandoned open file (append bug, fixed in 0.21)
      15% files created with single replica (data reliability not needed)
      3% due to roughly 7 bugs that were then fixed (0.21)
    Over the last 18 months, 22 failures on 25 clusters
      Only 8 would have benefitted from HA failover!! (0.23 failures per cluster year)
    NN is very robust and can take a lot of abuse
      NN is resilient against overload caused by misbehaving apps

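To put the block-loss figures above in perspective, the arithmetic can be worked through directly. The sketch below uses only the numbers quoted on the slide; the function and constant names are hypothetical, introduced just for this example.

```python
# Figures quoted on the slide (2009, 10 clusters, 20K nodes)
LOST_BLOCKS = 650
TOTAL_BLOCKS = 329_000_000

# Share of the lost blocks attributed to each cause, as stated on the slide
CAUSE_SHARE = {
    "abandoned open file (append bug)": 0.82,
    "file created with a single replica": 0.15,
    "other bugs (roughly 7)": 0.03,
}


def lost_per_million(lost, total):
    """Blocks lost per million blocks stored."""
    return lost / total * 1_000_000


def blocks_by_cause(lost, shares):
    """Approximate number of lost blocks behind each cause."""
    return {cause: lost * share for cause, share in shares.items()}


if __name__ == "__main__":
    rate = lost_per_million(LOST_BLOCKS, TOTAL_BLOCKS)
    print(f"lost per million blocks: {rate:.2f}")
    for cause, n in blocks_by_cause(LOST_BLOCKS, CAUSE_SHARE).items():
        print(f"{cause}: about {n:.0f} blocks")
```

The loss rate works out to roughly 2 blocks per million stored, and by the slide's own breakdown the large majority of those trace back to a single append bug rather than to the replication design.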
HDFS Availability & Data Integrity: Response
  Data Integrity
    Append/flush/sync redesign (0.21)
    Pipeline recruits new replicas rather than just removing them on failures (0.23)
  Improving Availability of NN
    Faster HDFS restarts
      NN bounce in 20 minutes (0.23)
      Federation allows smaller NNs (0.23)
    Federation will significantly improve NN isolation, hence availability (0.23)
  Why did we wait this long for HA NN?
    The failure rates did not demand making this a high priority
    Failover requires corner cases to be correctly addressed
      Correct fencing of shared state during failover is critical
      Getting it wrong can lead to corruption of data and reduce availability!!
    Many factors impact availability, not just failover

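The point about fencing can be made concrete with a small sketch. The slide does not say how fencing would be implemented, so the epoch-number scheme below is an assumption: it is one common, generic way to fence a stale writer, not the HA NN design. The class and method names (`SharedEditLog`, `become_active`, `append`) are hypothetical.

```python
class StaleWriterError(Exception):
    """Raised when a node that is no longer active tries to write."""


class SharedEditLog:
    """Toy shared state that accepts writes only from the current active node."""

    def __init__(self):
        self.epoch = 0      # increases every time a node takes over
        self.entries = []

    def become_active(self):
        """Called by a node taking over; returns the epoch it must present on writes."""
        self.epoch += 1
        return self.epoch

    def append(self, writer_epoch, entry):
        # Fencing check: a writer holding an old epoch was active before the
        # latest failover and must not modify shared state any more.
        if writer_epoch != self.epoch:
            raise StaleWriterError(
                f"writer epoch {writer_epoch} is not current epoch {self.epoch}"
            )
        self.entries.append(entry)


if __name__ == "__main__":
    log = SharedEditLog()
    old_active = log.become_active()
    log.append(old_active, "mkdir /a")

    new_active = log.become_active()   # failover happens here
    log.append(new_active, "mkdir /b")

    try:
        log.append(old_active, "delete /a")   # old node has not noticed the failover
    except StaleWriterError as err:
        print("rejected:", err)
```

Without the check in `append`, the old node's late write would be interleaved with the new node's edits, which is the kind of corruption the slide warns about.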
HDFS Availability & Data Integrity: Response: HA NN
  Active work has started on HA NN (failover)
    HA NN – detailed design (HDFS-1623)
      Community effort
    HDFS-1971, 1972, 1973, 1974, 1975, 2005, 2064, 1073
  HA: prototype work
    Backup NN (0.21)
    Avatar NN (Facebook)
    HA NN prototype using Linux HA (Yahoo!)
    HA NN prototype with Backup NN and block report replicator (eBay)
  HA is the highest priority for 0.23.x

MapReduce: Fault Tolerance and Availability
  Early Gains: fault tolerance of tasks and compute nodes
  Current Issues: loss of job queue if the JobTracker is restarted
  Our Response
    MR.next designed with fault tolerance and availability
    HA Resource Manager (0.23.x)
    Loss of Resource Manager – degraded mode – recover via restart or failover
      Apps continue with their current resources
      App Manager can reschedule with current resources
      New apps cannot be submitted or launched; new resources cannot be allocated
    Loss of an App Manager – recovers
      App is restarted and state is recovered
    Loss of tasks and nodes – recovers
      Recovered as in old MapReduce

Backward Compatibility
  Early Gains
    Early success stemmed from a philosophy of ship early and often, resulting in changing APIs
      Data and metadata compatibility always maintained
    The early customers paid the price
      Current customers reap the benefits of more mature interfaces
  Issues
    Increased adoption leads to increased expectations of backward compatibility

Backward Compatibility: Response
  Interface classification – audience and stability tags (0.21)
    Patterned on enterprise-quality software process
  Evolve interfaces but maintain backward compatibility
    Added newer, forward-looking interfaces – old interfaces maintained
  Test for compatibility
    Run old jars of automation tests, real Yahoo! applications
  Applications adopting higher abstractions (Pig, Hive)
    Insulates them from lower-level primitive interfaces
  Wire compatibility (HADOOP-7347)
    Maintain compatibility with current protocol (Java serialization)
    Adapters for addressing future discontinuity
      e.g. serialization or protocol change
    Moved to ProtocolBuf for data transfer protocol

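The audience and stability tags mentioned above can be illustrated with a small sketch. Hadoop itself expresses these tags as Java annotations (InterfaceAudience and InterfaceStability); the Python decorator below is only an analogy, and every name in it (`classified`, `is_compatibility_promised`, the example functions) is hypothetical.

```python
AUDIENCES = {"public", "limited-private", "private"}
STABILITIES = {"stable", "evolving", "unstable"}


def classified(audience, stability):
    """Attach audience and stability tags to a function or class."""
    if audience not in AUDIENCES or stability not in STABILITIES:
        raise ValueError(f"unknown classification: {audience}/{stability}")

    def tag(obj):
        obj.audience = audience
        obj.stability = stability
        return obj

    return tag


def is_compatibility_promised(api):
    """Assumed rule for this sketch: only public + stable interfaces
    carry a backward-compatibility promise to outside users."""
    return (
        getattr(api, "audience", "private") == "public"
        and getattr(api, "stability", "unstable") == "stable"
    )


@classified(audience="public", stability="stable")
def open_file(path):
    """An interface applications may depend on across releases."""
    return f"handle:{path}"


@classified(audience="private", stability="unstable")
def rebalance_internal():
    """An internal helper that may change without notice."""


if __name__ == "__main__":
    for api in (open_file, rebalance_internal):
        print(api.__name__, api.audience, api.stability,
              is_compatibility_promised(api))
```

Tagging each interface this way makes the compatibility contract explicit: a tool or reviewer can check which interfaces applications are allowed to rely on before a release changes them.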
Testing & Quality
  Nightly Testing
    Against 1200 automated tests on 30 nodes
    Against live data and live applications
  QE Certification for Release
    Large variety and scale tests on 500 nodes
    Performance benchmarking
    QE HIT integration testing of whole stack
  Release Testing
    Sandbox cluster – 3 clusters each with 400 to 1K nodes
      Major releases: 2 months testing on actual data – all production projects must sign off
    Research clusters – 6 clusters (non-revenue production jobs) (4K nodes)
      Major releases – minimum 2 months before moving to production
      0.25 million to 0.5 million jobs per week
      If it clears research, then it is mostly fine in production
  Release
    Production clusters – 11 clusters (4.5K nodes)
      Revenue generating, stricter SLAs

Fundamental Architecture Changes that cut across several issues
  [Figure: "Coupled One-to-One" diagram; labels: Job Manager, Resource Scheduler, Storage Resources, Compute Resources, MapReduce, HDFS Namesystem]
  HDFS storage: mostly a separate layer – but one customer: one NN
    Federation generalizes the layer
  MapReduce – compute resource scheduling tightly coupled to MapReduce job management
    MapReduce.next separates the layers

Fundamental Architecture Changes that cut across several issues
  [Figure: "Layered One-to-Many" diagram; labels: Resource Scheduler, Compute Resources, Storage Resources, HDFS Namesystem (several), Alternate NN Implementation, MR App, MR App with different version MR lib, MR tmp, MPI App, HBase]
  Scalability, Isolation, Availability
  Generic lower layer: first-class support for new applications on top
    MR tmp, HBase, MPI
  Layering facilitates faster development of new work
    NN that caches namespace – a few months of work
    New implementations of MR App manager
  Compatibility: support multiple versions of MR
    Tenants upgrade at their own pace – crucial for shared clusters

The Community Effect
  Some projects are done entirely by teams at Yahoo!, FB or Cloudera
  But several projects are joint work
    Yahoo! & FB on NN scalability and concurrency, especially in the face of misbehaved apps
    Edits log v2 and refactoring edits log (Cloudera and Yahoo!/Hortonworks)
      HDFS-1073, 2003, 1557, 1926
    NN HA – Yahoo!/Hortonworks, Cloudera, FB, eBay
      HDFS-1623, 1971, 1972, 1973, 1974, 1975, 2005
    Features to support HBase: FB, Cloudera, Yahoo!, and the HBase community
  Expect to see rapid improvements in the very near future
    Further scalability – NN that caches part of the namespace
    Improved IO performance – DN performance improvements
    Wire compatibility – wire protocols, operational improvements
    New App Managers for MR.next
    Continued improvement of management and operability

Hadoop is Successfully Crossing the Chasm
  Hadoop used in enterprises for revenue-generating applications
  Apache Hadoop is improving at a rapid rate
    Addressing many issues including HA
    Fundamental design improvements to fuel innovation
      The might of a large, growing developer community
  Battle tested on large clusters and a variety of applications
    At Yahoo!, Facebook and the many other Hadoop customers
    Data integrity has been a focus from the early days
  A level of testing that even the large commercial vendors cannot match!
  Can you trust your data to anything less?

Q & A
  Hortonworks @ Hadoop Summit
    1:45pm: Next Generation Apache Hadoop MapReduce
      Community track by Arun Murthy
    2:15pm: Introducing HCatalog (Hadoop Table Manager)
      Community track by Alan Gates
    4:00pm: Large Scale Math with Hadoop MapReduce
      Applications and Research track by Tsz-Wo Sze
    4:30pm: HDFS Federation and Other Features
      Community track by Suresh Srinivas and Sanjay Radia

About Hortonworks
  Mission: Revolutionize and commoditize the storage and processing of big data via open source
  Vision: Half of the world’s data will be stored in Apache Hadoop within five years
  Strategy: Drive advancements that make Apache Hadoop projects more consumable for the community, enterprises and ecosystem
    Make Apache Hadoop easy to install, manage and use
    Improve Apache Hadoop performance and availability
    Make Apache Hadoop easy to integrate and extend

Thank You.

Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Último

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Último (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Crossing the Chasm

  • 1. Crossing the Chasm: Hadoop for the Enterprise. Sanjay Radia, Hortonworks Founder & Architect; formerly Hadoop Architect @ Yahoo! (4 years @ Yahoo!). @srr (@hortonworks). © Hortonworks Inc. 2011. June 29, 2011.
  • 2. Crossing the Chasm (Geoffrey A. Moore). Apache Hadoop grew rapidly, charting new territories in features, abstractions, APIs, scale, fault tolerance, multi-tenancy, operations … A small number of early customers who needed a new platform; Hadoop provided as a service to make adoption easy. Today: dramatic growth in adoption and customer base; growth of the Hadoop stack and applications; new requirements and expectations. Diagram labels: Mission Critical, Late Majority, Early Majority, Early Adopters.
  • 3. Crossing the Chasm: Overview. How the chasm is being crossed: Security; SLAs & Predictability; Scalability; Availability & Data Integrity; Backward Compatibility; Quality & Testing. Fundamental architectural improvements (Federation & MR.next): adapt to changing, sometimes unforeseen, needs; fuel innovation and rapid development. The Community Effect.
  • 4. Security. Early Gains: no authorization or authentication requirements; added permissions and passed client-side userid to server (0.16), which addresses accidental deletes by another user; Service Authorization (0.18, 0.20). Issues: stronger authorization required; shared clusters with multiple tenants; critical data; new categories of users (financial), SOX compliance. Our Response: authentication using Kerberos (0.20.203), a 10 person-year effort by Yahoo!
  • 5. SLAs and Predictability. Issue: customers uncomfortable with shared clusters; customers traditionally plan for peaks with dedicated HW; dedicated clusters had poor utilization. Response: Capacity Scheduler (0.20): guaranteed capacities in a multi-tenant shared cluster, almost like dedicated hardware. Each organization is given queue(s) with a guaranteed capacity; it controls who is allowed to submit jobs to its queues, sets the priorities of jobs within its queues, and creates sub-queues (0.21) for finer-grained control within its capacity. Unused capacity is given to tasks in other queues; better than a private cluster, with access to unused capacity when in a crunch. Resource limits for tasks deal with misbehaved apps. Response: FairShare Scheduler (0.20): focus is fair share of resources, but does have pools.
  • 6. Scalability. Early Gains: simple design allowed rapid improvements (single master, namespace in RAM, simpler locking); cluster size improvements: 1K → 2K → 4K; vertical scaling: tuned GC + efficient memory usage; archive file system to reduce files and blocks (0.20). Current Issues: growth of files and storage limited by single NN (0.20), only an issue for very large clusters; JobTracker does not scale beyond 30K tasks and needs redesign. Our Response: RW locks in NN (0.22); MR.next, a complete rewrite of MR servers (JT, TT), 100K tasks (0.23); Federation: horizontal scaling of namespace, billion files (0.23); NN that keeps only part of the namespace in memory, trillion files (0.23.x).
  • 7. HDFS Availability & Data Integrity: Early Gains. Simple design, Java, storage fault tolerance. Java: saved from pointer errors that lead to data corruption. Simplicity: subset of POSIX; random writers not supported. Storage: rely on the OS's file system rather than use raw disk. Storage fault tolerance: multiple replicas, active monitoring. Single Namenode master: persistent state kept as multiple copies + checkpoints; restart on failure. How well did it work? Lost 650 blocks out of 329 M on 10 clusters with 20K nodes in 2009: 82% abandoned open files (append bug, fixed in 0.21); 15% files created with a single replica (data reliability not needed); 3% due to roughly 7 bugs that were then fixed (0.21). Over the last 18 months: 22 failures on 25 clusters; only 8 would have benefited from HA failover!! (0.23 failures per cluster year). NN is very robust and can take a lot of abuse; NN is resilient against overload caused by misbehaving apps.
  • 8. HDFS Availability & Data Integrity: Response. Data Integrity: append/flush/sync redesign (0.21); pipeline recruits new replicas rather than just removing them on failures (0.23). Improving Availability of NN: faster HDFS restarts, NN bounce in 20 minutes (0.23); Federation allows smaller NNs (0.23); Federation will significantly improve NN isolation, hence availability (0.23). Why did we wait this long for HA NN? The failure rates did not demand making this a high priority. Failover requires corner cases to be correctly addressed; correct fencing of shared state during failover is critical, otherwise failover can lead to corruption of data and reduce availability!! Many factors impact availability, not just failover.
  • 9. HDFS Availability & Data Integrity: Response: HA NN. Active work has started on HA NN (failover). HA NN detailed design (HDFS-1623). Community effort: HDFS-1971, 1972, 1973, 1974, 1975, 2005, 2064, 1073. HA prototype work: Backup NN (0.21); Avatar NN (Facebook); HA NN prototype using Linux HA (Yahoo!); HA NN prototype with Backup NN and block report replicator (eBay). HA is the highest priority for 23.x.
  • 10. MapReduce: Fault Tolerance and Availability. Early Gains: fault tolerance of tasks and compute nodes. Current Issues: loss of job queue if JobTracker is restarted. Our Response: MR.next designed with fault tolerance and availability; HA Resource Manager (0.23.x). Loss of Resource Manager: degraded mode, recover via restart or failover; apps continue with their current resources; App Manager can reschedule with current resources; new apps cannot be submitted or launched; new resources cannot be allocated. Loss of an App Manager: recovers; app is restarted and state is recovered. Loss of tasks and nodes: recovers, as in old MapReduce.
  • 11. Backward Compatibility. Early Gains: early success stemmed from a philosophy of ship early and often, resulting in changing APIs; data and metadata compatibility always maintained. The early customers paid the price; current customers reap the benefits of more mature interfaces. Issues: increased adoption leads to increased expectations of backward compatibility.
  • 12. Backward Compatibility: Response. Interface classification: audience and stability tags (0.21), patterned on enterprise-quality software process. Evolve interfaces but maintain backward compatibility: added newer forward-looking interfaces, old interfaces maintained. Test for compatibility: run old jars of automation tests and real Yahoo! applications. Applications adopting higher abstractions (Pig, Hive) are insulated from lower primitive interfaces. Wire compatibility (HADOOP-7347): maintain compatibility with current protocol (Java serialization); adapters for addressing future discontinuity, e.g. serialization or protocol change; moved to Protocol Buffers for data transfer protocol.
  • 13. Testing & Quality. Nightly Testing: against 1200 automated tests on 30 nodes; against live data and live applications. QE Certification for Release: large variety and scale tests on 500 nodes; performance benchmarking; QE HIT integration testing of whole stack. Release Testing: Sandbox cluster: 3 clusters, each with 400-1K nodes; major releases get 2 months of testing on actual data, and all production projects must sign off. Research clusters: 6 clusters (non-revenue production jobs) (4K nodes); major releases spend a minimum of 2 months here before moving to production; 0.25 million to 0.5 million jobs per week; if a release clears research, it is mostly fine in production. Release Production clusters: 11 clusters (4.5K nodes); revenue generating, stricter SLAs.
  • 14. Fundamental Architecture Changes that cut across several issues. Diagram (coupled, one-to-one): Job Manager, Resource Scheduler, Storage Resources, Compute Resources, MapReduce, HDFS Namesystem. HDFS storage: mostly a separate layer, but one customer: one NN; Federation generalizes the layer. MapReduce: compute resource scheduling is tightly coupled to MapReduce job management; MapReduce.next separates the layers.
  • 15. Fundamental Architecture Changes that cut across several issues. Diagram (layered, one-to-many): Resource Scheduler, Compute Resources, Storage Resources, multiple HDFS Namesystems, Alternate NN Implementation, HBase, MR tmp, MR App, MR App with different version of MR lib, MPI App. Scalability, Isolation, Availability. Generic lower layer: first-class support for new applications on top (MR tmp, HBase, MPI). Layering facilitates faster development of new work: NN that caches the namespace, a few months of work; new implementations of MR App Manager. Compatibility: support multiple versions of MR; tenants upgrade at their own pace, crucial for shared clusters.
  • 16. The Community Effect. Some projects are done entirely by teams at Yahoo!, FB or Cloudera, but several projects are joint work: Yahoo! & FB on NN scalability and concurrency, esp. in the face of misbehaved apps; edits log v2 and refactoring of the edits log (Cloudera and Yahoo!/Hortonworks): HDFS-1073, 2003, 1557, 1926; NN HA (Yahoo!/Hortonworks, Cloudera, FB, eBay): HDFS-1623, 1971, 1972, 1973, 1974, 1975, 2005; features to support HBase: FB, Cloudera, Yahoo!, and the HBase community. Expect to see rapid improvements in the very near future: further scalability (NN that caches part of the namespace); improved IO performance (DN performance improvements); wire compatibility (wire protocols, operational improvements); new App Managers for MR.next; continued improvement of management and operability.
  • 17. Hadoop is Successfully Crossing the Chasm. Hadoop is used in enterprises for revenue-generating applications. Apache Hadoop is improving at a rapid rate: addressing many issues including HA; fundamental design improvements to fuel innovation; the might of a large, growing developer community. Battle tested on large clusters and a variety of applications at Yahoo!, Facebook and the many other Hadoop customers. Data integrity has been a focus from the early days. A level of testing that even the large commercial vendors cannot match! Can you trust your data to anything less?
  • 18. Q & A. Hortonworks @ Hadoop Summit: 1:45pm: Next Generation Apache Hadoop MapReduce (Community track) by Arun Murthy; 2:15pm: Introducing HCatalog (Hadoop Table Manager) (Community track) by Alan Gates; 4:00pm: Large Scale Math with Hadoop MapReduce (Applications and Research track) by Tsz-Wo Sze; 4:30pm: HDFS Federation and Other Features (Community track) by Suresh Srinivas and Sanjay Radia.
  • 19. About Hortonworks. Mission: revolutionize and commoditize the storage and processing of big data via open source. Vision: half of the world’s data will be stored in Apache Hadoop within five years. Strategy: drive advancements that make Apache Hadoop projects more consumable for the community, enterprises and ecosystem; make Apache Hadoop easy to install, manage and use; improve Apache Hadoop performance and availability; make Apache Hadoop easy to integrate and extend.
  • 20. Thank You.
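
The guaranteed-capacity idea from slide 5 (each queue gets its guarantee, and unused capacity flows to queues that still have demand) can be illustrated with a small sketch. This is not the Hadoop Capacity Scheduler's actual algorithm or API: the function name `allocate`, the queue names, and the one-slot-at-a-time redistribution rule are all assumptions made for illustration.

```python
def allocate(total_slots, queues):
    """Illustrative guaranteed-capacity allocation (hypothetical, not Hadoop code).

    queues maps a queue name to (guaranteed_percent, demanded_slots).
    Returns a dict mapping each queue name to the slots it is granted.
    """
    # Step 1: every queue receives its demand, capped at its guarantee.
    guarantee = {q: total_slots * pct // 100 for q, (pct, _) in queues.items()}
    grant = {q: min(queues[q][1], guarantee[q]) for q in queues}
    spare = total_slots - sum(grant.values())

    # Step 2: hand spare slots, one at a time, to the queue with unmet
    # demand that is currently least over its guarantee (assumed rule).
    while spare > 0:
        hungry = [q for q in queues if grant[q] < queues[q][1]]
        if not hungry:
            break  # all demand is satisfied; remaining slots stay idle
        pick = min(
            hungry,
            key=lambda q: grant[q] / guarantee[q] if guarantee[q] else float("inf"),
        )
        grant[pick] += 1
        spare -= 1
    return grant


if __name__ == "__main__":
    # Hypothetical tenants: "search" is idle, so its unused share is lent out.
    demo = {"ads": (50, 80), "search": (30, 10), "research": (20, 40)}
    for name, slots in allocate(100, demo).items():
        print(f"{name}: {slots} slots")
```

In this sketch a queue never receives less than `min(demand, guarantee)`, which mirrors the "almost like dedicated hardware" claim, while a busy queue can exceed its guarantee only by borrowing slots that another queue is not using.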

Editor's Notes

  1. Sanjay Radia: been on the Hadoop project at Yahoo! for the last 4 years. These 4 years have been a blast. Very proud to be a Y employee; Yahoo!'s leadership has made Hadoop the success it is today. (Forward thinking in spinning out Hortonworks.) I am very excited to be part of the team at Hortonworks and continuing to grow Apache Hadoop as an open-source project.
  2. Early customers at Yahoo!. A key idea was to provide Hadoop as a service; otherwise, adoption would not have been so rapid. The customer starts focusing their application on Hadoop rather than figuring out how to get the capex and convince IT to deploy Hadoop.
  3. Some early decisions, why, and how well they worked. Choices/tradeoffs based on immediate customer needs, time, resources, operational needs. Goal: gain acceptance, evolve and grow customer base. Fundamental architectural improvements that cut across various issues. Conclude with the community effect.
  4. Stronger security driven by multiple tenants in shared clusters. Security will be critical for new enterprise users, especially those with financial data. A 10 person-year effort, done completely by Yahoo!, and a major milestone for Apache Hadoop.
  5. Early customers happy with a new platform that let them solve problems they could not solve otherwise. But many new customers … Unused capacity: esp. effective if applications peak at different times. Mix production and non-production jobs. Fairness not a goal: customers are guaranteed the capacity they have paid for. NN under federation also provides isolation.
  6. Scaling is enough to meet the needs of very large customers like FB and Y; other customers should be more than okay. Before you need more namespace, it will be there: a NN that stores only a partial namespace in memory.
  7. Data: can I read what I wrote, is the service available? When I asked one of the original authors of GFS if there were any decisions they would revisit: random writers. Simplicity is key. Raw disk: file systems take time to stabilize; we can take advantage of ext4, xfs or zfs.
  8. Pipeline – useful for slow appender
  9. Certification once ready for release. HDFS (over 92% coverage): balancer, block replication, corruption, fsck, security, command lines, viewfs, faults and injection, checkpointing; Federation testing, ready to move to release testing. MR: dist cache, capacity scheduler, speculative exec, block listing, decommissioning, UI testing, failures and injections. Scale testing: 4 TT per node. Performance: sort, scan, compression, gridmixv3, slive, ... HIT integration of the whole stack (HDFS, MR, Oozie, Pig, Hive, HCat). Sandbox: customer application validation, release vote for 23.0. Research: larger variety of jobs and load than production clusters.
  10. So far I have been explaining how we have been addressing specific issues. But we have made fundamental changes to the architecture of both HDFS and MR: fundamental changes that cut across issues.
  11. HDFS: the change was fairly simple. MR: redesign; allocation of compute resources becomes a general layer; separate the Scheduler from the App/Job Manager.
  12. New Append, Security, Federation, MR.next: Yahoo!. Raid, High tide: FB. HBase improvements: FB & Cloudera.
  13. The fundamental design improvements will accelerate the rate of improvement. If there is a feature that customers need, it will be provided quickly.
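
To put the data-integrity figures quoted on slide 7 in perspective (650 blocks lost out of 329 M in 2009, split 82% / 15% / 3% by cause), the arithmetic can be worked through in a few lines. The input numbers come from the slide; the variable names and the rounding are illustrative assumptions.

```python
# Figures quoted on slide 7; the variable names are illustrative only.
lost_blocks = 650
total_blocks = 329_000_000

# Share of lost blocks attributed to each cause, as stated on the slide.
causes = {
    "abandoned open file (append bug, fixed in 0.21)": 0.82,
    "files created with a single replica": 0.15,
    "roughly 7 other bugs (fixed in 0.21)": 0.03,
}

loss_rate = lost_blocks / total_blocks

if __name__ == "__main__":
    print(f"overall loss rate: {loss_rate:.2e}")
    print(f"about 1 block lost per {round(total_blocks / lost_blocks):,} stored")
    for cause, share in causes.items():
        # Approximate block counts; the slide gives only percentages.
        print(f"{cause}: about {round(lost_blocks * share)} blocks")
```

The overall rate works out to roughly two lost blocks per million stored, and by the slide's own breakdown most of those trace back to a single append bug.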