SlideShare a Scribd company logo
1 of 35
Download to read offline
Extending the Enterprise Data Warehouse with Hadoop

                 Robert Lancaster and Jonathan Seidman
                                  Chicago Data Summit
                                         April 26 | 2011
Who We Are


•  Robert Lancaster
  –  Solutions Architect, Hotel Supply Team
  –  rlancaster@orbitz.com
  –  @rob1lancaster
•  Jonathan Seidman
  –  Lead Engineer, Business Intelligence/Big Data Team
  –  Co-founder/organizer of Chicago Hadoop User Group
     (http://www.meetup.com/Chicago-area-Hadoop-User-
     Group-CHUG)
  –  jseidman@orbitz.com
  –  @jseidman



                                                          page 2
Launched: 2001, Chicago, IL




                              page 3
Why are we using Hadoop?

 Stop me if you’ve heard this before…




                                        page 4
On Orbitz alone we do millions of searches and transactions
 daily, which leads to hundreds of gigabytes of log data
 every day.




                                                        page 5
Hadoop provides us with efficient, economical,
 scalable, and reliable storage and processing of
 these large amounts of data.


 $ per TB




                                                    page 6
And…


Hadoop places no constraints on how data is
 processed.




                                              page 7
Before Hadoop




                page 8
With Hadoop




              page 9
Access to this non-transactional data enables a number of
applications…




                                                            page 10
Optimizing Hotel Search




                          page 11
Recommendations




                  page 12
Page Performance Tracking




                            page 13
Cache Analysis
100.00%
                        72% of queries are                                                        Queries
                        singletons and make up
90.00%                                                                                            Searches
                        nearly a third of total
                        search volume.
80.00%                                                                                            Reverse Running Total
                                                                                                  (Searches)
 71.67%
                                                                                                  Reverse Running Total
70.00%                                                                                            (Queries)


60.00%
                                                                                   A small number of
                                                                                   queries (3%) make
50.00%                                                                             up more than a third
                                                                                   of search volume.
40.00%
                                                           34.30%
 31.87%

30.00%


20.00%


10.00%
                                                          2.78%

 0.00%
          1     2   3       4     5     6     7   8   9      10     11   12   13     14     15    16      17   18     19   20




                                                                                                                           page 14
User Segmentation




                    page 15
All of this is great, but…


Most of these efforts are driven by development
 teams.


The challenge now is to unlock the value in this data
 by making it more available to the rest of the
 organization.




                                                        page 16
“Given the ubiquity of data in modern organizations, a data
warehouse can keep pace today only by being “magnetic”:
attracting all the data sources that crop up within an
organization regardless of data quality niceties.”*




             *MAD Skills: New Analysis Practices for Big Data


                                                              page 17
In a better world…	





                        page 18
Integrating Hadoop with the Enterprise Data Warehouse

                   Robert Lancaster and Jonathan Seidman
                                    Chicago Data Summit
                                           April 26 | 2011
The goal is a unified view of the data, allowing us to use
the power of our existing tools for reporting and analysis.




                                                              page 20
BI vendors are working on integration with Hadoop…




                                                     page 21
And one more reporting tool…




                               page 22
Example Processing Pipeline for Web Analytics Data




                                                     page 23
Aggregating data for import into Data Warehouse




                                                  page 24
Example Use Case: Beta Data Processing




                                     page 25
Example Use Case – Beta Data Processing




                                          page 26
Example Use Case – Beta Data Processing Output




                                                 page 27
Example Use Case: RCDC Processing




                                    page 28
Example Use Case – RCDC Processing




                                     page 29
Example Use Case: Click Data Processing




                                      page 30
Click Data Processing – Current DW Processing




Web
                                           Data
Server	

 Web                                       Cleansing
  Web
 Server	

   Logs   ETL             DW     (Stored            DW
  Servers
                                           procedure)

                          3 hours          2 hours            ~20%
                                                             original
                                                               data
                                                               size



                                                                   page 31
Click Data Processing – New Hadoop Processing




Web                        Data
Server	

 Web                       Cleansing
  Web
 Server	

   Logs   HDFS   (MapReduce)      DW
  Servers




                                                 page 32
Conclusions


•  Market is still immature, but Hadoop has already become a
   valuable business intelligence tool, and will become an
   increasingly important part of a BI infrastructure.
•  Hadoop won’t replace your EDW, but any organization with a
   large EDW should at least be exploring Hadoop as a
   complement to their BI infrastructure.
•  Use Hadoop to offload the time and resource intensive
   processing of large data sets so you can free up your data
   warehouse to serve user needs.
•  The challenge now is making Hadoop more accessible to non-
   developers. Vendors are addressing this, so expect rapid
   advancements in Hadoop accessibility.



                                                                page 33
Oh, and also…


•  Orbitz is looking for a Lead Engineer for the BI/Big Data team.
•  Go to http://careers.orbitz.com/ and search for IRC19035.




                                                                     page 34
References


•  MAD Skills: New Analysis Practices for Big Data, Jeffrey
   Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and
   Caleb Welton, 2009




                                                              page 35

More Related Content

What's hot

Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IEdureka!
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesDataWorks Summit
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?Hortonworks
 
Flexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinFlexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinDmitriy Ryaboy
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
NTT Data - Shinichi Yamada - Hadoop World 2010
NTT Data - Shinichi Yamada - Hadoop World 2010NTT Data - Shinichi Yamada - Hadoop World 2010
NTT Data - Shinichi Yamada - Hadoop World 2010Cloudera, Inc.
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopEdureka!
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInHari Shankar Sreekumar
 
Building a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's PerspectiveBuilding a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's PerspectiveGeekNightHyderabad
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduceRyan Tabora
 
VMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware HadoopVMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware HadoopVMUG IT
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 

What's hot (20)

Introduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -IIntroduction to Big data & Hadoop -I
Introduction to Big data & Hadoop -I
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Hadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data ArchitecturesHadoop Powers Modern Enterprise Data Architectures
Hadoop Powers Modern Enterprise Data Architectures
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?
 
Flexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant TwinFlexible In-Situ Indexing for Hadoop via Elephant Twin
Flexible In-Situ Indexing for Hadoop via Elephant Twin
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
NTT Data - Shinichi Yamada - Hadoop World 2010
NTT Data - Shinichi Yamada - Hadoop World 2010NTT Data - Shinichi Yamada - Hadoop World 2010
NTT Data - Shinichi Yamada - Hadoop World 2010
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedIn
 
Building a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's PerspectiveBuilding a Data Lake - An App Dev's Perspective
Building a Data Lake - An App Dev's Perspective
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
VMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware HadoopVMUGIT UC 2013 - 08a VMware Hadoop
VMUGIT UC 2013 - 08a VMware Hadoop
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 

Viewers also liked

Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatternsAnurag S
 
Your Path to Big Data Sucess
Your Path to Big Data SucessYour Path to Big Data Sucess
Your Path to Big Data SucessCloudera, Inc.
 
SITNL 2015 - Big Data Small Pockets
SITNL 2015 - Big Data Small PocketsSITNL 2015 - Big Data Small Pockets
SITNL 2015 - Big Data Small PocketsJan van Ansem
 
BotPrize 2014 Results. Human-Like Bots Competition at IEEE CIG
BotPrize 2014 Results. Human-Like Bots Competition at IEEE CIGBotPrize 2014 Results. Human-Like Bots Competition at IEEE CIG
BotPrize 2014 Results. Human-Like Bots Competition at IEEE CIGAccenture Analytics
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010Jonathan Seidman
 
Trivadis TechEvent 2016 DWH Modernization – in the Age of Big Data by Gregor ...
Trivadis TechEvent 2016 DWH Modernization – in the Age of Big Data by Gregor ...Trivadis TechEvent 2016 DWH Modernization – in the Age of Big Data by Gregor ...
Trivadis TechEvent 2016 DWH Modernization – in the Age of Big Data by Gregor ...Trivadis
 
Integrating BI - Data Warehouse and Big Data
Integrating BI - Data Warehouse and Big DataIntegrating BI - Data Warehouse and Big Data
Integrating BI - Data Warehouse and Big DataAccenture Analytics
 
Edw Data Arc
Edw Data ArcEdw Data Arc
Edw Data ArcAlex CK
 
EDW Data Model Storming for Integration of NoSQL and RDBMS by Daniel Upton
EDW Data Model Storming for Integration of NoSQL and RDBMS by Daniel UptonEDW Data Model Storming for Integration of NoSQL and RDBMS by Daniel Upton
EDW Data Model Storming for Integration of NoSQL and RDBMS by Daniel UptonDaniel Upton
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache ApexApache Apex
 
Big Data vs Data Warehousing
Big Data vs Data WarehousingBig Data vs Data Warehousing
Big Data vs Data WarehousingThomas Kejser
 
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformBig Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformCaserta
 
Intro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big DataIntro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big DataApache Apex
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about..."Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...Kai Wähner
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsCloudera, Inc.
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution Hortonworks
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 

Viewers also liked (20)

Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatterns
 
Your Path to Big Data Sucess
Your Path to Big Data SucessYour Path to Big Data Sucess
Your Path to Big Data Sucess
 
SITNL 2015 - Big Data Small Pockets
SITNL 2015 - Big Data Small PocketsSITNL 2015 - Big Data Small Pockets
SITNL 2015 - Big Data Small Pockets
 
BotPrize 2014 Results. Human-Like Bots Competition at IEEE CIG
BotPrize 2014 Results. Human-Like Bots Competition at IEEE CIGBotPrize 2014 Results. Human-Like Bots Competition at IEEE CIG
BotPrize 2014 Results. Human-Like Bots Competition at IEEE CIG
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 
Trivadis TechEvent 2016 DWH Modernization – in the Age of Big Data by Gregor ...
Trivadis TechEvent 2016 DWH Modernization – in the Age of Big Data by Gregor ...Trivadis TechEvent 2016 DWH Modernization – in the Age of Big Data by Gregor ...
Trivadis TechEvent 2016 DWH Modernization – in the Age of Big Data by Gregor ...
 
Integrating BI - Data Warehouse and Big Data
Integrating BI - Data Warehouse and Big DataIntegrating BI - Data Warehouse and Big Data
Integrating BI - Data Warehouse and Big Data
 
Edw Data Arc
Edw Data ArcEdw Data Arc
Edw Data Arc
 
EDW Data Model Storming for Integration of NoSQL and RDBMS by Daniel Upton
EDW Data Model Storming for Integration of NoSQL and RDBMS by Daniel UptonEDW Data Model Storming for Integration of NoSQL and RDBMS by Daniel Upton
EDW Data Model Storming for Integration of NoSQL and RDBMS by Daniel Upton
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
The EDW Ecosystem
The EDW EcosystemThe EDW Ecosystem
The EDW Ecosystem
 
Big Data vs Data Warehousing
Big Data vs Data WarehousingBig Data vs Data Warehousing
Big Data vs Data Warehousing
 
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platformBig Data 2.0: ETL & Analytics: Implementing a next generation platform
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
 
Intro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big DataIntro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big Data
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Modernise your EDW - Data Lake
Modernise your EDW - Data LakeModernise your EDW - Data Lake
Modernise your EDW - Data Lake
 
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about..."Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 

Similar to Extending Enterprise Data Warehouse Hadoop

Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with HadoopChicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with HadoopCloudera, Inc.
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Jonathan Seidman
 
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...Cloudera, Inc.
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitzRaghu Kashyap
 
Mass tlc presentation menninger
Mass tlc presentation    menningerMass tlc presentation    menninger
Mass tlc presentation menningerMassTLC
 
Mass tlc presentation menninger
Mass tlc presentation    menningerMass tlc presentation    menninger
Mass tlc presentation menningerMassTLC
 
Konrad Feldman - Big Data and The Future of Advertising and Marketing - SIC2012
Konrad Feldman - Big Data and The Future of Advertising and Marketing - SIC2012Konrad Feldman - Big Data and The Future of Advertising and Marketing - SIC2012
Konrad Feldman - Big Data and The Future of Advertising and Marketing - SIC2012Seattle Interactive Conference
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big DecisionsInnoTech
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Jonathan Seidman
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataSteve Watt
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptalmaraniabwmalk
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Sample Paper.doc.doc
Sample Paper.doc.docSample Paper.doc.doc
Sample Paper.doc.docbutest
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop siliconsudipt
 

Similar to Extending Enterprise Data Warehouse Hadoop (20)

Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with HadoopChicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
Chicago Data Summit: Extending the Enterprise Data Warehouse with Hadoop
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
Hadoop World 2011: Extending Enterprise Data Warehouse with Hadoop - Jonathan...
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitz
 
Mass tlc presentation menninger
Mass tlc presentation    menningerMass tlc presentation    menninger
Mass tlc presentation menninger
 
Mass tlc presentation menninger
Mass tlc presentation    menningerMass tlc presentation    menninger
Mass tlc presentation menninger
 
Big Data
Big DataBig Data
Big Data
 
Konrad Feldman - Big Data and The Future of Advertising and Marketing - SIC2012
Konrad Feldman - Big Data and The Future of Advertising and Marketing - SIC2012Konrad Feldman - Big Data and The Future of Advertising and Marketing - SIC2012
Konrad Feldman - Big Data and The Future of Advertising and Marketing - SIC2012
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big Decisions
 
Addressing dm-cloud
Addressing dm-cloudAddressing dm-cloud
Addressing dm-cloud
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big Data
 
Steve Watt Presentation
Steve Watt PresentationSteve Watt Presentation
Steve Watt Presentation
 
Introduction Big data
Introduction Big data  Introduction Big data
Introduction Big data
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Sample Paper.doc.doc
Sample Paper.doc.docSample Paper.doc.doc
Sample Paper.doc.doc
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
 
Big Data & Data Mining
Big Data & Data MiningBig Data & Data Mining
Big Data & Data Mining
 

More from Jonathan Seidman

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Jonathan Seidman
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_finalJonathan Seidman
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Jonathan Seidman
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Jonathan Seidman
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Jonathan Seidman
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 

More from Jonathan Seidman (9)

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_final
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 

Recently uploaded

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Recently uploaded (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Extending Enterprise Data Warehouse Hadoop

  • 1. Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster and Jonathan Seidman Chicago Data Summit April 26 | 2011
  • 2. Who We Are •  Robert Lancaster –  Solutions Architect, Hotel Supply Team –  rlancaster@orbitz.com –  @rob1lancaster •  Jonathan Seidman –  Lead Engineer, Business Intelligence/Big Data Team –  Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User- Group-CHUG) –  jseidman@orbitz.com –  @jseidman page 2
  • 4. Why are we using Hadoop? Stop me if you’ve heard this before… page 4
  • 5. On Orbitz alone we do millions of searches and transactions daily, which leads to hundreds of gigabytes of log data every day. page 5
  • 6. Hadoop provides us with efficient, economical, scalable, and reliable storage and processing of these large amounts of data. $ per TB page 6
  • 7. And… Hadoop places no constraints on how data is processed. page 7
  • 8. Before Hadoop page 8
  • 9. With Hadoop page 9
  • 10. Access to this non-transactional data enables a number of applications… page 10
  • 12. Recommendations page 12
  • 14. Cache Analysis 100.00% 72% of queries are Queries singletons and make up 90.00% Searches nearly a third of total search volume. 80.00% Reverse Running Total (Searches) 71.67% Reverse Running Total 70.00% (Queries) 60.00% A small number of queries (3%) make 50.00% up more than a third of search volume. 40.00% 34.30% 31.87% 30.00% 20.00% 10.00% 2.78% 0.00% 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 page 14
  • 15. User Segmentation page 15
  • 16. All of this is great, but… Most of these efforts are driven by development teams. The challenge now is to unlock the value in this data by making it more available to the rest of the organization. page 16
  • 17. “Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic”: attracting all the data sources that crop up within an organization regardless of data quality niceties.”* *MAD Skills: New Analysis Practices for Big Data page 17
  • 18. In a better world… page 18
  • 19. Integrating Hadoop with the Enterprise Data Warehouse Robert Lancaster and Jonathan Seidman Chicago Data Summit April 26 | 2011
  • 20. The goal is a unified view of the data, allowing us to use the power of our existing tools for reporting and analysis. page 20
  • 21. BI vendors are working on integration with Hadoop… page 21
  • 22. And one more reporting tool… page 22
  • 23. Example Processing Pipeline for Web Analytics Data page 23
  • 24. Aggregating data for import into Data Warehouse page 24
  • 25. Example Use Case: Beta Data Processing page 25
  • 26. Example Use Case – Beta Data Processing page 26
  • 27. Example Use Case – Beta Data Processing Output page 27
  • 28. Example Use Case: RCDC Processing page 28
  • 29. Example Use Case – RCDC Processing page 29
  • 30. Example Use Case: Click Data Processing page 30
  • 31. Click Data Processing – Current DW Processing Web Data Server Web Cleansing Web Server Logs ETL DW (Stored DW Servers procedure) 3 hours 2 hours ~20% original data size page 31
  • 32. Click Data Processing – New Hadoop Processing Web Data Server Web Cleansing Web Server Logs HDFS (MapReduce) DW Servers page 32
  • 33. Conclusions •  Market is still immature, but Hadoop has already become a valuable business intelligence tool, and will become an increasingly important part of a BI infrastructure. •  Hadoop won’t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to their BI infrastructure. •  Use Hadoop to offload the time and resource intensive processing of large data sets so you can free up your data warehouse to serve user needs. •  The challenge now is making Hadoop more accessible to non- developers. Vendors are addressing this, so expect rapid advancements in Hadoop accessibility. page 33
  • 34. Oh, and also… •  Orbitz is looking for a Lead Engineer for the BI/Big Data team. •  Go to http://careers.orbitz.com/ and search for IRC19035. page 34
  • 35. References •  MAD Skills: New Analysis Practices for Big Data, Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and Caleb Welton, 2009 page 35