HADOOP AND YOUR DATA
WAREHOUSE
About Caserta Concepts
• Technology services company with expertise in data analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Higher Education
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
When it comes to building efficient Data
Solutions, We wrote the book!
• Time-tested proven solutions
• Staging, cleaning, integrating, delivering
• Traditional data warehousing and Big Data
warehouses
• Best practices to extract data from scattered
sources, munge and discover valuable
business information
• Sub-systems offered as project accelerators
• Comprehensive guidance to our clients to
build and populate big data solutions that
ensure quality and integrity
Authors, Innovators, Leaders
Community
Hadoop and Your Data Warehouse
•The last 2 or 3 years have been more disruptive from a
data management perspective than the previous 20!
•The advent of new technologies and modern data
engineering concepts has shaken traditional
concepts to their core
Proprietary Information
What Is a Data Warehouse?
Good question!
In the traditional world there are several competing, almost
religious, approaches to data warehouse design.
I think we can all agree:
•A central repository of integrated data from one or
more disparate sources
•Used for Reporting and Analysis
•Reliability, Trust → Data Governance
Data Governance
A program consisting of:
• Metadata
• Security
• Data Quality
• Master Data Management
• Information Lifecycle Management (aka retention)
…with supporting processes, procedures and
organizational support
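Of the items in this list, Information Lifecycle Management is the most mechanical. A minimal sketch of a retention check, assuming a hypothetical per-dataset retention period in days (the policy table, dataset names, and function name are illustrative and not from the deck):

```python
from datetime import date, timedelta

# Hypothetical retention policy: dataset name -> days to keep (illustrative values).
RETENTION_DAYS = {"web_logs": 90, "orders": 365 * 7}


def is_expired(dataset: str, created: date, today: date) -> bool:
    """Return True when data created on `created` is past its retention window."""
    keep_for = timedelta(days=RETENTION_DAYS[dataset])
    return today - created > keep_for


if __name__ == "__main__":
    # Web logs from 1 Jan checked on 1 Jun are older than 90 days.
    print(is_expired("web_logs", date(2024, 1, 1), date(2024, 6, 1)))
```

A real ILM program would also cover access rights; this sketch only shows the retention half.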
How Do You Build a Data Warehouse?
•Design – Top Down, Bottom Up
• Customer Interviews and requirements gathering
• Data Profiling
•Extract, Transform, Load (ETL) data from source to data
warehouse
•Create Facts and Dimensions
•Put a BI tool on top
•Develop reports
•Data Governance
The Traditional Conversation
• Kimball vs. Inmon
• Dimensional vs. 3rd normal form
• What hardware do we need (that will be ready in 6 months)
• Oracle vs. SQL Server, Postgres or MySQL if we were brave
(and cheap)
• Which ETL tool should we BUY → Informatica, DataStage?
• Which BI tool should we put on top → Business Objects,
Cognos?
The New Conversation
• Do we need a Data Warehouse at all?
• If we do, does it need to be relational?
• Should we leverage Hadoop or NoSQL?
• Which platform and language are we going to code in?
• Which bleeding-edge Apache project should we put in
production?
So Why Change?
New technologies are great and all... but what drives our
adoption of new technologies and techniques?
• Data has changed – semi-structured, unstructured, sparse,
and evolving schemas
• Volumes have changed → GB to TB to PB workloads
• Cracks in the Armor of Traditional Data Warehousing
approach!
AND MOST IMPORTANTLY:
Companies that innovate and leverage their data win!
Cracks in the Armor
• Onboarding new data is difficult!
• Rigidity and Data Governance
• Disconnect from business requirement:
“Hey – I need to analyze some new source”
Conform and analyze the data
Load it into dimensional models
Build a semantic layer nobody is going to use
Create a dashboard we hope someone will notice
...and then you can have at it 3-6 months later to see if it has
value!
And then there is…
70% FAILURE RATE
• Semi-scientific analysis suggests the majority of data
analytics projects fail...
• And of those that don’t fail, only a fraction are deemed a
“success”, others just finish!
• Data is just REALLY hard, especially without the right
strategy
What do we think the Data Governance failure rate is?
+= Data Scientist
• New breed of data consumers
• They love the conformed clean warehouse data
• But they also are responsible for new insights
• Data not yet modeled in the data warehouse → source
data
• New Data Sources
• Workloads that are not well supported by traditional facts and
dimensions: network analysis, text analytics, and many
more...
Traditional Warehousing All Wrong?
NO!
The concept of a Data Warehouse is sound:
• Consolidating data from disparate source systems
• Clean and conformed reference data
• Clean and integrated business facts
• Data governance (a more pragmatic version)
We can be more successful by acknowledging the EDW
can’t solve all problems.
So what’s missing?
The Data Lake
A storage and processing layer for all data
• Store anything: source data, semi-structured,
unstructured, structured
• Keep it as long as needed
• Support a number of processing workloads
• Scale-out
...and here is where Hadoop
can help us!
Hadoop Powers the Data Lake
Hadoop provides us:
• Distributed storage → HDFS
• Resource Management → YARN
• Many workloads, not just MapReduce
...but we need to think Holistic Data Strategy
[Figure: layered pyramid. Layers: Big Data Warehouse; Data Science Workspace; Data Lake – Integrated Sandbox; Landing Area – Source Data in “Full Fidelity”]
• Data Governance is tunable and pragmatic
• Some analytics are suited for the Data Warehouse, while
many are not
About those layers
• The Hadoop Data Lake has different governance
demands at each tier.
• Only the top tier of the pyramid is fully governed.
• We refer to this as the Trusted tier of the Big Data
Warehouse.
[Figure: pyramid annotated layer by layer]
• Big Data Warehouse: fully data governed (trusted); user community arbitrary queries and reporting
• Data Science Workspace: agile business insight through data munging, machine learning, blending with external data, development of to-be BDW facts
• Data Lake – Integrated Sandbox: data is ready to be turned into information: organized, well defined, complete
• Landing Area – Source Data in “Full Fidelity”: raw machine data collection, collect everything
Governance callouts on the figure:
• Metadata → Catalog
• ILM → who has access, how long do we “manage it”
• Data Quality and Monitoring → monitoring of completeness of data
Why we need “Tunable” Data Governance
•Dumping data into Hadoop with no repeatable
process, procedure, or data governance will create
a mess
• No Data Conformance
• No Master Data Management
• No Data Quality processes
• No Trust
...the alternative is applying Data Governance too
rigidly?
Peeling back the layers…
Landing
•Source data in its full fidelity
•Programmatically Loaded
•Partitioned for data processing
•No governance other than catalog and ILM (Security
and Retention)
•Consumers: Data Scientists, ETL Processes,
Applications
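The "programmatically loaded" and "partitioned for data processing" points can be sketched as a path convention. This is a hypothetical layout: the `/landing` root, the `source`/`dataset`/`load_date` keys, and the function name are assumptions for illustration, not part of the deck. It uses the Hive-style `key=value` directory naming:

```python
from datetime import date
from pathlib import PurePosixPath

LANDING_ROOT = PurePosixPath("/landing")  # hypothetical root directory


def landing_path(source: str, dataset: str, load_date: date) -> str:
    """Build the partition directory for one day's load of one source dataset."""
    return str(
        LANDING_ROOT
        / f"source={source}"
        / f"dataset={dataset}"
        / f"load_date={load_date.isoformat()}"
    )


if __name__ == "__main__":
    print(landing_path("crm", "accounts", date(2014, 3, 1)))
    # prints /landing/source=crm/dataset=accounts/load_date=2014-03-01
```

Partitioning by load date keeps each batch separately addressable, which suits reprocessing by ETL jobs more than ad hoc querying.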
Data Lake
•Enriched, lightly integrated
•Data is accessible in the Hive Metastore
• Either processed into tabular relations
• Or via Hive SerDes directly upon raw data
•Partitioned for data access
•Governance additionally includes a guarantee of
completeness
•Consumers: Data Scientists, ETL Processes,
Applications, Data Analysts
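One way to read the "guarantee of completeness" is a reconciliation of row counts between the source and what was loaded. A minimal sketch under that assumption; the function name and the idea of keying counts by partition are illustrative, not taken from the deck:

```python
from typing import Dict, List


def incomplete_partitions(source_counts: Dict[str, int],
                          loaded_counts: Dict[str, int]) -> List[str]:
    """Return partitions whose loaded row count differs from the source count.

    A partition that is absent from `loaded_counts` counts as incomplete.
    """
    return sorted(
        partition
        for partition, expected in source_counts.items()
        if loaded_counts.get(partition) != expected
    )


if __name__ == "__main__":
    source = {"2014-03-01": 100, "2014-03-02": 50}
    loaded = {"2014-03-01": 100, "2014-03-02": 48}
    print(incomplete_partitions(source, loaded))  # prints ['2014-03-02']
```

Row counts are the cheapest completeness signal; checksums or key-level comparisons would be stricter alternatives.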
Side Note – Unstructured Data
• A structure must be extracted/applied in just about every
case imaginable before analysis can be performed.
• Full data governance can only be applied to “Structured”
data
• This can include materialized endpoints such as files or
tables OR projections such as a Hive table
Governed structured data must have:
• A known schema with Metadata
• A known and certified lineage
• A monitored, quality-tested, managed process for ingestion
and transformation
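The three requirements above can be treated as a checklist on a catalog entry. A minimal sketch, assuming a hypothetical `DatasetRecord` type whose field names are invented here for illustration:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry; field names are illustrative."""
    name: str
    schema_documented: bool   # a known schema with metadata
    lineage_certified: bool   # a known and certified lineage
    process_monitored: bool   # monitored, quality-tested, managed ingestion/transformation


def governance_gaps(record: DatasetRecord) -> List[str]:
    """List which of the three requirements the dataset does not yet meet."""
    gaps = []
    if not record.schema_documented:
        gaps.append("schema")
    if not record.lineage_certified:
        gaps.append("lineage")
    if not record.process_monitored:
        gaps.append("process")
    return gaps


if __name__ == "__main__":
    clicks = DatasetRecord("clicks", True, False, True)
    print(governance_gaps(clicks))  # prints ['lineage']
```

An empty list would mean the dataset meets all three conditions listed on this slide.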
Data Science Workspace
•No barrier for onboarding and analysis of new data
•Blending of new data with entire Data Lake,
including the Big Data Warehouse
•No governance other than ILM
•Consumers: Data Scientists Only!
Big Data Warehouse
•Data is Fully Governed
•Data is Structured
•Partitioned/tuned for data access
•Governance includes a guarantee of completeness
and accuracy
•Consumers: Data Scientists, ETL Processes,
Applications, Data Analysts, and Business Users
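The four layers differ mainly in which governance controls apply, which is what "tunable" governance amounts to. A minimal sketch that records the controls named on the layer slides; the control labels and function name are assumptions, and "fully governed" is simplified here to the four controls the slides mention by name:

```python
from typing import Dict, Set

# Controls per layer, following the layer slides. Labels are illustrative.
TIER_CONTROLS: Dict[str, Set[str]] = {
    "landing": {"catalog", "ilm"},
    "data_lake": {"catalog", "ilm", "completeness"},
    "data_science_workspace": {"ilm"},
    # Simplification: "fully governed" reduced to the controls named in the deck.
    "big_data_warehouse": {"catalog", "ilm", "completeness", "accuracy"},
}


def missing_controls(tier: str, applied: Set[str]) -> Set[str]:
    """Return the controls a tier requires that have not been applied yet."""
    return TIER_CONTROLS[tier] - applied


if __name__ == "__main__":
    print(missing_controls("data_lake", {"catalog", "ilm"}))  # prints {'completeness'}
```

Keeping the policy in one table makes it easy to see what a dataset must gain before it moves to a more governed layer.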
The Refinery
[Figure: layers BDW, Data Science Workspace, Data Lake, Landing Area; labels “Cool new data” and “New Insights”]
•The feedback loop between Data Science and Data
Warehouse is critical
•Successful work products of data science must graduate
into the appropriate layers of the Data Lake
So where does this Big Data Warehouse Live?
Per Martin Fowler (http://martinfowler.com):
“Polyglot Persistence - where any decent sized
enterprise will have a variety of different data
storage technologies for different kinds of data.
There will still be large amounts of it managed in
relational stores, but increasingly we'll be first asking
how we want to manipulate the data and only then
figuring out what technology is the best bet for it…”
Abridged Version: Use the right tool for the job!
Polyglot Warehouse
We promote the concept that the Big Data
Warehouse may live in one or more platforms
•Full Hadoop Solutions
•Hadoop plus MPP or Relational
Supplemental technologies:
•NoSQL: Columnar, Key-value, Time-series, Graph
•Search Technologies
Hadoop Data Warehouse
•Hadoop is the platform for the entire data lake
including the Big Data Warehouse
•Serves as the Data Lake and “Refinery”
•Query engines such as Hive and Impala provide SQL
support
Hadoop + Relational
•Hadoop is the platform for the Data Lake and
Refinery
•The Active Set is federated out into MPP or
Relational Platforms → Presentation Layer
•Serves as a good model when there is an existing MPP
or Relational Data Warehouse in place
On the Cloud
AWS and other cloud providers present a very
powerful design pattern:
•S3 serves as the storage layer for the Data Lake
•EMR (Elastic MapReduce, AWS's managed Hadoop) provides the
Refinery; most clusters can be ephemeral
•The Active Set is stored in Redshift (MPP) or
Relational Platforms
Replace a massive on-premises footprint with only a
handful of machines!
In Summary
•The principles of Data Warehousing still
make sense
•Recognize gaps in feature/functionality of the
Relational Database, and traditional Data
Warehousing
•Believe in the Data Lake and accept Tunable
Governance
•Think Polyglot Warehouse and use the right
tool for the job
Thank You
Elliott Cordo
Chief Architect
elliott@casertaconcepts.com

Big Data Analytics on the Cloud
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 

Último

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Hadoop and Your Data Warehouse

  • 1. HADOOP AND YOUR DATA WAREHOUSE
  • 2. About Caserta Concepts • Technology services company with expertise in data analysis: • Big Data Solutions • Data Warehousing • Business Intelligence • Core focus in the following industries: • eCommerce / Retail / Marketing • Financial Services / Insurance • Healthcare / Higher Education • Established in 2001: • Increased growth year-over-year • Industry recognized work force • Strategy, Implementation • Writing, Education, Mentoring • Data Science & Analytics • Data on the Cloud • Data Interaction & Visualization
  • 3. When it comes to building efficient Data Solutions, We wrote the book! • Time-tested proven solutions • Staging, cleaning, integrating, delivering • Traditional data warehousing and Big Data warehouses • Best practices to extract data from scattered sources, munge and discover valuable business information • Sub-systems offered as project accelerators • Comprehensive guidance to our clients to build and populate big data solutions that ensure quality and integrity Authors, Innovators, Leaders
  • 5. Hadoop and Your Data Warehouse • The last 2 or 3 years have been more disruptive from a data management perspective than the past 20! • The advent of new technologies and modern data engineering concepts has shaken traditional concepts to their core
  • 6. What is a Data Warehouse? Good question! In the traditional world there are several competing, almost religious, approaches to their design. I think we can all agree: • A central repository of integrated data from one or more disparate sources • Used for reporting and analysis • Reliability, Trust → Data Governance
  • 7. Data Governance A program consisting of: • Metadata • Security • Data Quality • Master Data Management • Information Lifecycle Management (aka retention) …with supporting processes, procedures and organizational support
  • 8. How Do You Build a Data Warehouse? • Design: Top Down, Bottom Up • Customer interviews and requirements gathering • Data profiling • Extract, Transform, Load (ETL) data from source to data warehouse • Create Facts and Dimensions • Put a BI tool on top • Develop reports • Data Governance
  • 9. The Traditional Conversation • Kimball vs. Inmon • Dimensional vs. 3rd normal form • What hardware do we need (that will be ready in 6 months)? • Oracle vs. SQL Server, or Postgres or MySQL if we were brave (and cheap) • Which ETL tool should we BUY → Informatica, DataStage? • Which BI tool should we sit on top → Business Objects, Cognos?
  • 10. The New Conversation • Do we need a Data Warehouse at all? • If we do, does it need to be relational? • Should we leverage Hadoop or NoSQL? • Which platform and language are we going to code in? • Which bleeding-edge Apache project should we put in production?
  • 11. So Why Change? New technologies are great and all... but what drives our adoption of new technologies and techniques? • Data has changed: semi-structured, unstructured, sparse and evolving schema • Volumes have changed → GB to TB to PB workloads • Cracks in the armor of the traditional Data Warehousing approach! AND MOST IMPORTANTLY: Companies that innovate and leverage their data win!
  • 12. Cracks in the Armor • Onboarding new data is difficult! • Rigidity and Data Governance • Disconnect from business requirement: “Hey, I need to analyze some new source”. Conform and analyze the data; load it into dimensional models; build a semantic layer nobody is going to use; create a dashboard we hope someone will notice ...and then you can have at it 3-6 months later to see if it has value!
  • 13. And then there is… 70% FAILURE RATE • Semi-scientific analysis has proven the majority of data analytics projects fail... • And of those that don’t fail, only a fraction are deemed a “success”; others just finish! • Data is just REALLY hard, especially without the right strategy. What do we think the Data Governance failure rate is?
  • 14. += Data Scientist • New breed of data consumers • They love the conformed clean warehouse data • But they also are responsible for new insights • Data not yet modeled in the data warehouse → source data • New data sources • Workloads that are not supported by traditional facts and dimensions: network analysis, text analytics, and many more...
  • 15. Traditional Warehousing All Wrong? NO! The concept of a Data Warehouse is sound: • Consolidating data from disparate source systems • Clean and conformed reference data • Clean and integrated business facts • Data governance (a more pragmatic version) We can be more successful by acknowledging the EDW can’t solve all problems.
  • 16. So what’s missing? The Data Lake: a storage and processing layer for all data • Store anything: source data, semi-structured, unstructured, structured • Keep it as long as needed • Support a number of processing workloads • Scale-out ...and here is where Hadoop can help us!
  • 17. Hadoop Powers the Data Lake Hadoop provides us: • Distributed storage → HDFS • Resource management → YARN • Many workloads, not just MapReduce
  • 18. ...but we need to think Holistic Data Strategy. Tiers shown: Big Data Warehouse; Data Science Workspace; Data Lake (Integrated Sandbox); Landing Area (Source Data in “Full Fidelity”) • Data Governance is tunable and pragmatic • Some analytics are suited for the Data Warehouse, while many are not
  • 19. About those layers • Landing Area (Source Data in “Full Fidelity”): raw machine data collection, collect everything • Data Lake (Integrated Sandbox): data is ready to be turned into information: organized, well defined, complete • Data Science Workspace: agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts • Big Data Warehouse: fully data governed (trusted); user community arbitrary queries and reporting • Governance callouts on the lower tiers: Metadata → Catalog; ILM → who has access, how long do we “manage it”; Data Quality and Monitoring → monitoring of completeness of data • The Hadoop Data Lake has different governance demands at each tier. • Only the top tier of the pyramid is fully governed. • We refer to this as the Trusted tier of the Big Data Warehouse.
  • 20. Why we need “Tunable” Data Governance • Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess • No Data Conformance • No Master Data Management • No Data Quality processes • No Trust ...the alternative is applying Data Governance too rigidly?
  • 21. Peeling back the layers… Landing • Source data in its full fidelity • Programmatically loaded • Partitioned for data processing • No governance other than catalog and ILM (Security and Retention) • Consumers: Data Scientists, ETL Processes, Applications
  • 22. Data Lake • Enriched, lightly integrated • Data is accessible through the Hive Metastore • Either processed into tabular relations • Or via Hive SerDes directly upon raw data • Partitioned for data access • Governance additionally includes a guarantee of completeness • Consumers: Data Scientists, ETL Processes, Applications, Data Analysts
  • 23. Side Note: Unstructured Data • A structure must be extracted/applied in just about every case imaginable before analysis can be performed. • Full data governance can only be applied to “Structured” data. • This can include materialized endpoints such as files or tables OR projections such as a Hive table. Governed structured data must have: • A known schema with metadata • A known and certified lineage • A monitored, quality-tested, managed process for ingestion and transformation
  • 24. Data Science Workspace • No barrier for onboarding and analysis of new data • Blending of new data with entire Data Lake, including the Big Data Warehouse • No governance other than ILM • Consumers: Data Scientists Only!
  • 25. Big Data Warehouse • Data is Fully Governed • Data is Structured • Partitioned/tuned for data access • Governance includes a guarantee of completeness and accuracy • Consumers: Data Scientists, ETL Processes, Applications, Data Analysts, and Business Users
  • 26. The Refinery (diagram labels: Landing Area, Data Lake, Data Science Workspace, BDW; “Cool new data”, “New Insights”) • The feedback loop between Data Science and Data Warehouse is critical • Successful work products of science must graduate into the appropriate layers of the Data Lake
  • 27. So where does this Big Data Warehouse live? Per Martin Fowler (http://martinfowler.com): “Polyglot Persistence - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it…” Abridged version: Use the right tool for the job!
  • 28. Polyglot Warehouse We promote the concept that the Big Data Warehouse may live in one or more platforms: • Full Hadoop solutions • Hadoop plus MPP or Relational. Supplemental technologies: • NoSQL: Columnar, Key-value, Time series, Graph • Search technologies
  • 29. Hadoop Data Warehouse • Hadoop is the platform for the entire data lake including the Big Data Warehouse • Serves as the Data Lake and “Refinery” • Query engines such as Hive and Impala provide SQL support
  • 30. Hadoop + Relational • Hadoop is the platform for the Data Lake and Refinery • The Active Set is federated out into MPP or Relational platforms → Presentation Layer • Serves as a good model when there is an existing MPP or Relational Data Warehouse in place
  • 31. On the Cloud AWS and other cloud providers present a very powerful design pattern: • S3 serves as the storage layer for the Data Lake • EMR (Elastic MapReduce, AWS's managed Hadoop) provides the Refinery; most clusters can be ephemeral • The Active Set is stored in Redshift (MPP) or Relational platforms. Replace a massive on-premises footprint with only a handful of machines!
  • 32. In Summary • The principles of Data Warehousing still make sense • Recognize gaps in feature/functionality of the Relational Database and traditional Data Warehousing • Believe in the Data Lake and accept Tunable Governance • Think Polyglot Warehouse and use the right tool for the job
  • 33. Thank You Elliott Cordo Chief Architect elliott@casertaconcepts.com
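
The tier-by-tier governance described on slides 21, 22, 24, and 25 can be written down as a small lookup structure. The following is only a sketch: the `Control`, `Tier`, and `TIERS` names are hypothetical, and treating "fully governed" as "every listed control applies" is an assumption, not something the deck specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Control(Enum):
    """Governance controls named on the slides (labels are illustrative)."""
    CATALOG = "metadata catalog"
    ILM = "information lifecycle management (security and retention)"
    COMPLETENESS = "guarantee of completeness"
    ACCURACY = "guarantee of accuracy"


@dataclass(frozen=True)
class Tier:
    name: str
    controls: frozenset
    consumers: frozenset


# Controls and consumers follow slides 21, 22, 24, and 25.
TIERS = {
    "landing": Tier(
        "Landing Area",
        frozenset({Control.CATALOG, Control.ILM}),
        frozenset({"data scientist", "etl process", "application"}),
    ),
    "data_lake": Tier(
        "Data Lake",
        frozenset({Control.CATALOG, Control.ILM, Control.COMPLETENESS}),
        frozenset({"data scientist", "etl process", "application", "data analyst"}),
    ),
    "workspace": Tier(
        "Data Science Workspace",
        frozenset({Control.ILM}),
        frozenset({"data scientist"}),
    ),
    "warehouse": Tier(
        "Big Data Warehouse",
        frozenset(Control),  # assumption: fully governed means every control applies
        frozenset({"data scientist", "etl process", "application",
                   "data analyst", "business user"}),
    ),
}


def is_trusted(tier_key: str) -> bool:
    """A tier counts as trusted only when every control applies to it."""
    return TIERS[tier_key].controls == frozenset(Control)


def readable_tiers(consumer: str) -> list:
    """Return the keys of the tiers a given consumer type may read."""
    return [key for key, tier in TIERS.items() if consumer in tier.consumers]
```

Under these assumptions, `readable_tiers("business user")` returns only the warehouse tier, which matches the deck's point that business users work against the trusted layer while data scientists can reach everything.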

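Slide 23 lists three things governed structured data must have, and slide 26 says successful work products must graduate into the appropriate layer. One way to picture that gate is a simple checklist, sketched below. The `DatasetRecord` type and its field names are hypothetical; the deck does not describe how these properties are actually recorded or verified.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry for a dataset in the lake."""
    name: str
    schema_documented: bool = False   # a known schema with metadata
    lineage_certified: bool = False   # a known and certified lineage
    pipeline_monitored: bool = False  # monitored, quality-tested, managed ingestion


def missing_requirements(record: DatasetRecord) -> list:
    """List the slide 23 requirements the dataset does not yet meet."""
    checks = [
        ("known schema with metadata", record.schema_documented),
        ("known and certified lineage", record.lineage_certified),
        ("monitored, quality-tested, managed process", record.pipeline_monitored),
    ]
    return [label for label, ok in checks if not ok]


def can_graduate(record: DatasetRecord) -> bool:
    """A dataset may graduate to the governed tier once nothing is missing."""
    return not missing_requirements(record)
```

For example, a record created as `DatasetRecord("clickstream", schema_documented=True)` would still be held back, with the lineage and pipeline items reported as missing.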
Editor's Notes

  1. We focused our attention on building a single version of the truth. We mainly applied data governance to the EDW itself and a few primary supporting systems, like MDM. We had a fairly restrictive set of tools for using the EDW data (enterprise BI tools), so it was easier to GOVERN how the data would be used.