PwC Advisory
Apache Hadoop Summit 2016
The Future of Apache Hadoop
An Enterprise Architecture View
www.pwc.com/unlockdatapossibilities
2
Presenters
Oliver Halter
Partner, Information Strategy and Big Data
oliver.halter@pwc.com
Ritesh Ramesh
Chief Technologist, Global Data and Analytics
ritesh.ramesh@pwc.com
3
Contents
1. Trends
2. Challenges
3. Opportunities
4. Accelerating adoption through a Capability-Driven Approach
5. Real-life case studies / lessons learned
4
PwC's global data & analytics surveys & trends
Sources: PwC, 2016 Global CEO Survey, January 2016; PwC, Global Data and Analytics Survey: Big Decisions™, 2016
73% say data and analytics technologies generate the greatest return in terms of engagement with wider stakeholders.
32% (nearly one in three) said developing or launching new products and services is their leading ‘big decision’. Does your data and analytics capability effectively support you?
5
Although we are increasingly seeing the use of Hadoop among mainstream companies, key barriers still remain to its holistic success and adoption as an enterprise platform
An enterprise is a complex system of components
Adoption Barriers
1. Incoherent enterprise view
2. Overcrowded technology ecosystem
3. Lack of user centricity
4. Siloed ownership
6
We believe external market forces will propel enterprises to
embrace the Data Lake as a foundation of their data, analytics and
emerging technology strategies
External forces converging on the Enterprise Data Lake:
1. Internet of Things
2. Artificial Intelligence
3. Digital
4. Modern Data Management
5. Analytics
6. Cyber Security

The data lake in turn underpins emerging technology platforms and four business outcomes:
1. Grow the business
2. Optimize spend
3. Innovate
4. Mitigate risks

Connecting the dots between the various strategic technology initiatives within the enterprise is going to be critical to capitalize on the opportunity.
7
There are many opportunities to innovate and accelerate enterprise adoption of Hadoop by abstracting sophistication behind simplicity and a superior end-user experience
Existing innovations enabling acceleration:
1. Cloud-based marketplaces and solutions
2. Third-party, on-demand ‘smart’ data wrangling solutions leveraging high-performance components in Hadoop
3. Open-source analytics and AI libraries
4. Third-party ‘Hadoop in a box’ integrated solutions
5. Vendor distributions and developer communities – well established

Opportunities to close the gaps:
1. Data extraction and semantic text analytics libraries for complex data structures – nested XMLs, PDFs and unstructured data
2. Model management and integration tools facilitating seamless interoperability or migration from existing technology investments (data warehouses and applications)
3. Bringing visualization to the data stored in Hadoop with native libraries and third-party tools
4. Adaptive and dynamic workload management
5. Native data masking and encryption features
8
Jumpstart/accelerate the Hadoop journey with these 4 core tenets:
1. Capability Driven
2. Heterogeneous
3. Right Fit
4. Flexible Operating Model

PwC’s Next Generation Information Architecture* spans: third-party tool integration; cloud interoperability; legacy integration; data migration; on-premise and cloud deployment; in-memory, disk-based and NoSQL data stores; support model; training; use cases/demand intake; services catalog; business adoption; innovation; platform monetization; analytics application development; and enterprise data management.

*https://www.pwc.com/us/infoarchitecture
9
Tenet 1: Capability Driven
Focus on capturing the current and future information and analytics needs of every business
function and external partners to drive the architecture
PwC’s Data Lake Capability Framework:
1. Data Ingestion
2. Data Quality/Integration
3. Data Architecture
4. Metadata Management
5. Analytics/Reporting/Visualization
6. Data Access
7. Security
8. Governance/Organization
Data Ingestion: Ability to ingest data in batch and real-time modes in various forms – databases, files, streams and queues.
Data Quality/Integration: Modern data management technologies (ELT-based, data wrangling etc.) used for cleansing, standardizing and integrating data from multiple internal and external sources, leveraging the scalable computing platform.
Data Architecture: Ability to manage and store data in normalized or denormalized structures on disk or in-memory, in row, columnar or column-family data stores (Hive, Spark, HBase, RDBMS etc.) depending on the use cases.
Metadata Management: Ability to track data sources ingested into the data lake, and to track data lineage and provenance of storage and processing activities.
Analytics/Reporting/Visualization: Metrics, tools and processes required to visualize and comprehend data stored in the data stores in the form of reports, dashboards and scorecards for business users.
Data Access: Ability to access stored data from the platform through a consistent and secure API.
Security: Capabilities to secure personally identifiable information in the next-generation platform and create role-based access for business users.
Governance/Organization: Centralized and coordinated management of projects/activities, managing change and communication of key milestones and business benefits.
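The ingestion capability above distinguishes batch and real-time modes across databases, files, streams and queues. A minimal sketch of such a dual-mode dispatcher in plain Python follows; the source names, type labels and handler bodies are illustrative assumptions, not part of PwC's framework:

```python
# Sketch of a dual-mode ingestion dispatcher.
# Source types and handlers are illustrative, not a real framework.

BATCH_SOURCES = {"database", "file"}    # pulled on a schedule
STREAM_SOURCES = {"stream", "queue"}    # consumed continuously


def ingest_batch(source_name):
    # Placeholder: a real implementation might run a bulk extract
    # (e.g. Sqoop for databases) into the data lake.
    return f"batch-load:{source_name}"


def ingest_stream(source_name):
    # Placeholder: a real implementation might attach a Kafka or
    # Flume consumer to the source.
    return f"stream-consume:{source_name}"


def ingest(source_name, source_type):
    """Route a source to the batch or real-time ingestion path."""
    if source_type in BATCH_SOURCES:
        return ingest_batch(source_name)
    if source_type in STREAM_SOURCES:
        return ingest_stream(source_name)
    raise ValueError(f"unknown source type: {source_type}")


if __name__ == "__main__":
    print(ingest("loans_db", "database"))   # batch-load:loans_db
    print(ingest("clickstream", "stream"))  # stream-consume:clickstream
```

The point of the sketch is only the routing decision: the same capability surface serves both scheduled bulk loads and continuous consumers.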
10
Tenet 2: Heterogeneous
A hybrid set of both traditional and emerging technologies and platforms to acquire, store, interlock and analyze internal and external data will be the norm going forward. Design for simplicity and iteratively build your modular architecture with transition states towards the target
Illustrative model from a national retailer (a mix of emerging and traditional, licensed and open-source components):

Sources of known value: sales transactions, customer, product, physical assets
Sources of unproven value: call center, social media, web clickstream, mobile interactions
Data ingestion layer: ETL connectors, Sqoop, Kafka, Flume
Enterprise Data Lake: HDFS, Spark (RDDs), HBase, Hive (Parquet), data wrangling, match-merge services, metadata management
Enterprise data warehouse: ELT, relational schemas, data exchange
Data analytics/visualization: standardized reporting, on-demand/ad hoc analytics, modeling, API-based apps
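The ingestion layer in the retailer model pairs source types with tools: relational extracts via Sqoop or ETL connectors, event streams via Kafka, log-style feeds via Flume. A hedged pure-Python sketch of that routing; the mapping rules are our reading of the diagram, not the client's actual logic:

```python
# Sketch: map each source in the retailer model to a plausible
# ingestion tool. Tool names mirror the diagram (Sqoop, Kafka, Flume,
# ETL connectors); the routing rules are illustrative assumptions.

def pick_ingestion_tool(source):
    """Return a plausible ingestion tool for a source descriptor."""
    kind = source["kind"]
    if kind == "relational":
        # Structured sources of known value, e.g. sales transactions.
        return "Sqoop"
    if kind == "event_stream":
        # High-volume interactions, e.g. web clickstream.
        return "Kafka"
    if kind == "log_file":
        # Semi-structured logs, e.g. call center records.
        return "Flume"
    return "ETL connector"  # fallback for packaged/legacy feeds


sources = [
    {"name": "sales_transactions", "kind": "relational"},
    {"name": "web_clickstream", "kind": "event_stream"},
    {"name": "call_center_logs", "kind": "log_file"},
]

plan = {s["name"]: pick_ingestion_tool(s) for s in sources}
print(plan)
```

In a heterogeneous architecture the value of writing this down explicitly is that the tool choice per source becomes a reviewable decision rather than an accident of whichever team onboarded the feed.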
11
Tenet 3: Right Fit
Enterprises need to develop a decision model that identifies the mix of ‘right fit’ open-source and commercial solution components, hosted either in the cloud or on-premise, based on functionality and business needs
Illustrative decision model:

On-premise: Build or buy? Vendor distribution? Constraints? Base platform or end-to-end stack? Third-party cloud/tools? Security? Cloud integration? Pre-requisites (hardware, drivers, software interoperability).

Cloud: Build or buy? Cloud vendor? Vendor distribution (IaaS)? Which native services (PaaS)? Third-party cloud/tools? Security? On-premise integration? Pre-requisites (hardware, drivers, software interoperability).
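A decision model like this is easy to operationalize as a checklist per deployment option, so gaps are visible before a platform choice is made. A small sketch; the question lists paraphrase the branches above, and the evaluator itself is our illustrative addition, not PwC's model:

```python
# Sketch: encode the on-premise vs. cloud decision branches as
# checklists and report which questions remain unanswered.
# Question wording and structure are illustrative assumptions.

QUESTIONS = {
    "on_premise": [
        "Build or buy?",
        "Vendor distribution?",
        "Base platform or end-to-end stack?",
        "Third-party cloud/tools?",
        "Security?",
        "Cloud integration?",
    ],
    "cloud": [
        "Build or buy?",
        "Cloud vendor?",
        "Vendor distribution (IaaS)?",
        "Which native services (PaaS)?",
        "Security?",
        "On-premise integration?",
    ],
}


def unanswered(option, answers):
    """Return the checklist questions for `option` with no answer yet."""
    return [q for q in QUESTIONS[option] if q not in answers]


answers = {"Build or buy?": "buy", "Security?": "Kerberos + Ranger"}
print(unanswered("cloud", answers))
```

Running the evaluator per option gives a concrete list of open decisions to resolve with stakeholders before committing to a stack.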
12
Tenet 4: Flexible Operating Model
Recognizes the sophistication and analytics maturity at a business function level and enables
the required capabilities with the necessary skills, processes, tools and support
Business Operating Model
1. Business alignment on how the Hadoop environment will operate. This includes defining:
- Services catalog
- Service level agreements
- Tracking usage, benefits and costs
- User onboarding and training
2. Defining the business architecture:
- Identify capability areas and opportunities to inform the big data strategy
- Use case evaluation (risk, feasibility and business case)
- Prioritization criteria
- Demand/intake process
- Business roadmap

Technology Operating Model
1. Technology alignment on how the Hadoop environment will operate. This includes defining:
- Access model (self-service vs. controlled)
- Data acquisition and classification strategy
- Organization (develop vs. support)
- Technical skills training
2. Defining the technology architecture:
- Architecture guiding principles
- Leading practices for data acquisition, management and delivery
- Reference architecture with solution patterns for the various use cases
- Storage and infrastructure planning
- Security model
13
Five step strategic approach to build a strong data lake foundation
1. Capabilities: Leveraging the client’s stated capabilities and PwC’s capability framework with business interviews, analytical capabilities are captured and documented.
2. Use Case Specifications: Define success criteria, information sources, dimensionality and information delivery mechanism for each use case. Each use case must be mapped to a set of capabilities.
3. Platform Architecture & Operating Model: Define end-to-end architecture components (‘lego blocks’) mapped to the capabilities identified, with leading practices for ingestion, management, analytics and visualization. Identifies the organization, process and support structure required for agility.
4. Architecture Patterns: Depict the architecture pattern at the use case level; leverage the logical architecture ‘lego blocks’ and show the information flow, respective technology components and integration touch points with the client’s systems.
5. Strategic Roadmap for Execution: Organize the initiatives in a sequenced roadmap with scope, duration and dependencies under various themes.
14
Case Study #1 – Financial Services Provider – Risk Modeling for their Loans Portfolio

Current State
• Lack of an integrated architecture and scalable technology infrastructure contributed to data management challenges
• The business analytics and modeling teams were looking for more self-sufficiency and process agility
• Lacked program leadership and program management discipline, specifically for third-party services and solution providers
• Data acquisition and management processes lacked a consistent design and architecture and were heavily siloed on an application-by-application basis

Current process (ad hoc analysis, 8-10 hours): SAS sources two CSV files (~3M rows of data in total); aggregation logic is performed and CSV data files are exported to Tableau; no capability to look back past the last month of data.

Future State
• The client developed a next-generation information management and analytics platform that was more business-centric, with an operating model that enables agility, self-service, faster data management and deep analytics for business stakeholders
• The data processing window was reduced from 8-10 hours to less than 30 minutes
• Business users were able to access more granular historical data for ad hoc analysis and analytics models

Future process (ad hoc analysis, < 30 minutes): On the Hadoop Distributed File System, aggregation and data transformation logic is performed using HiveQL on 67M records and 36 columns (14.7 GB of data in Hive, 16.3 GB in memory in Spark SQL); Tableau sources live data via Spark SQL, with response times between 2 seconds and ~1 minute per filter.
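The future-state process pushes aggregation logic down into HiveQL over Spark-backed tables instead of exporting pre-aggregated CSVs. As a hedged illustration of the shape of query involved, here sqlite3 stands in for Hive, and the table schema, column names and rows are invented for the example, not the client's 67M-row dataset:

```python
import sqlite3

# Stand-in for a Hive table of loan records; schema and sample data
# are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (region TEXT, status TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [("east", "current", 1200.0),
     ("east", "late", 300.0),
     ("west", "current", 800.0)],
)

# A HiveQL-style aggregation: GROUP BY with COUNT and SUM, the kind of
# logic the case study moved from CSV exports into Hive/Spark SQL so
# that BI tools query live, granular data instead of monthly extracts.
rows = conn.execute(
    "SELECT region, COUNT(*) AS n, SUM(balance) AS total "
    "FROM loans GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 2, 1500.0), ('west', 1, 800.0)]
```

The same SELECT would run largely unchanged in HiveQL or Spark SQL; the architectural difference is where it executes and how much history is queryable.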
Any trademarks included are trademarks of their respective owners and are not affiliated with, nor endorsed by, PricewaterhouseCoopers LLP, its subsidiaries or affiliates.
15
Case Study # 2 – Leading Retail Distribution Company – Trade
Promotion Effectiveness
500k SKUs, 250k customers, 5k suppliers, 6k fleets

Current State
• On-premise, rigid infrastructure with serial data processing and limited capacity
• Delayed data availability, reducing applicability to impactful business decisions
• No integration with third-party data, causing pain points with vendor collaboration and data access

Future State
• Flexible, scalable, cloud-based infrastructure enabling multi-stream data processing
• Near real-time data availability via Apache Spark data processing, providing valuable insights for decision making
• Easily supported visualization and reporting platforms accessible by internal users and vendors with simple access controls
Any trademarks included are trademarks of their respective owners and are not affiliated with, nor endorsed by, PricewaterhouseCoopers LLP, its subsidiaries or affiliates.
16
How is PwC creating awareness and driving adoption in the market?
• Thought leadership / independent research
• Strategic alliances: Google, Microsoft, Oracle, SAP
• Data & Analytics @Scale – client delivery
17
Closing Thoughts
• We believe external market forces will propel enterprises to embrace the Data Lake as a
foundation of their data, analytics and emerging technology strategies
• Although barriers remain for adoption by mainstream enterprises, there are ample
opportunities for innovation and acceleration by abstracting sophistication with
simplicity and superior end user experience
• Enterprises should follow 4 core tenets* while developing their Next Generation
Information Architecture Platform
• Keep the 5-step strategic ‘capability driven’ approach in mind!
• Thanks for attending the session – please contact us with any questions!
© 2016 PwC. All rights reserved. PwC refers to the US member firm or one of its subsidiaries or affiliates, and may sometimes refer to the PwC
network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details.

 

The Future of Apache Hadoop: An Enterprise Architecture View

  • 1. PwC Advisory, Apache Hadoop Summit 2016
    The Future of Apache Hadoop: An Enterprise Architecture View
    www.pwc.com/unlockdatapossibilities
  • 2. Presenters
    Oliver Halter, Partner, Information Strategy and Big Data (oliver.halter@pwc.com)
    Ritesh Ramesh, Chief Technologist, Global Data and Analytics (ritesh.ramesh@pwc.com)
  • 3. Contents
    1. Trends
    2. Challenges
    3. Opportunities
    4. Accelerating adoption through a capability-driven approach
    5. Real-life case studies and lessons learnt
  • 4. PwC's global data & analytics surveys & trends
    73%: data and analytics technologies generate the greatest return in terms of engagement with wider stakeholders.
    32%: nearly one in three said developing or launching new products and services is their leading 'big decision'. Does your data & analytics function effectively support you?
    Sources: PwC, 2016 Global CEO Survey, January 2016; PwC, Global Data and Analytics Survey: Big Decisions™, 2016.
  • 5. Although we increasingly see Hadoop in use among mainstream companies, key barriers remain to its holistic success and adoption as an enterprise platform. An enterprise is a complex system of components.
    Adoption barriers:
    1. Incoherent enterprise view
    2. Overcrowded technology ecosystem
    3. Lack of user centricity
    4. Siloed ownership
  • 6. We believe external market forces will propel enterprises to embrace the data lake as the foundation of their data, analytics and emerging-technology strategies.
    Emerging technology platforms and market forces: 1. Internet of Things; 2. Artificial Intelligence; 3. Digital; 4. Modern Data Management; 5. Analytics; 6. Cyber Security.
    Enterprise data lake outcomes: 1. Grow the business; 2. Optimize spend; 3. Innovate; 4. Mitigate risks.
    Connecting the dots between the various strategic technology initiatives within the enterprise is going to be critical to capitalize on the opportunity.
  • 7. There are many opportunities to innovate and accelerate enterprise adoption of Hadoop by abstracting sophistication behind simplicity and a superior end-user experience.
    Existing innovations enabling acceleration:
    1. Cloud-based marketplaces and solutions
    2. Third-party, on-demand 'smart' data-wrangling solutions leveraging high-performance components in Hadoop
    3. Open-source analytics and AI libraries
    4. Third-party 'Hadoop in a box' integrated solutions
    5. Well-established vendor distributions and developer communities
    Opportunities to close the gaps:
    1. Data extraction and semantic text-analytics libraries for complex data structures (nested XML, PDFs and unstructured data)
    2. Model-management and integration tools facilitating seamless interoperability with, or migration from, existing technology investments (data warehouses and applications)
    3. Bringing visualization to the data stored in Hadoop with native libraries and third-party tools
    4. Adaptive and dynamic workload management
    5. Native data-masking and encryption features
  • 8. Jumpstart and accelerate the Hadoop journey with the four core tenets of PwC's Next Generation Information Architecture*:
    1. Capability driven
    2. Heterogeneous
    3. Right fit
    4. Flexible operating model
    Supporting concerns from the reference architecture include: third-party tool integration; cloud interoperability, legacy integration and data migration; on-premise vs. cloud, in-memory vs. disk-based storage, and NoSQL store types; and a support model covering training, use cases and demand intake, a services catalog, business adoption, innovation, platform monetization, analytics application development and enterprise data management.
    * https://www.pwc.com/us/infoarchitecture
  • 9. Tenet 1: Capability driven. Focus on capturing the current and future information and analytics needs of every business function and external partner to drive the architecture.
    PwC's Data Lake Capability Framework:
    1. Data ingestion: ability to ingest data in batch and real-time modes and in various forms (databases, files, streams and queues).
    2. Data quality/integration: modern data-management technologies (ELT-based, data wrangling, etc.) used for cleansing, standardizing and integrating data from multiple internal and external sources, leveraging the scalable computing platform.
    3. Data architecture: ability to manage and store data in normalized or denormalized structures, on disk or in memory, in row, columnar or column-family data stores (Hive, Spark, HBase, RDBMS, etc.), depending on the use cases.
    4. Metadata management: ability to track data sources ingested into the data lake, along with data lineage and the provenance of storage and processing activities.
    5. Analytics/reporting/visualization: metrics, tools and processes required to visualize and comprehend the stored data in the form of reports, dashboards and scorecards for business users.
    6. Data access: ability to access stored data from the platform through a consistent and secure API.
    7. Security: capabilities to secure personally identifiable information in the next-generation platform and create role-based access for business users.
    8. Governance/organization: centralized and coordinated management of projects and activities, managing change and communicating key milestones and business benefits.
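As a toy illustration of the metadata-management capability above, the sketch below registers ingested datasets and walks their derivation lineage in plain Python. The `MetadataCatalog` class and the dataset names are invented for this example; a real data lake would use a dedicated metadata tool rather than hand-rolled code.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """A registered dataset in the lake, with its provenance."""
    name: str
    source: str                              # originating system or "lake"
    derived_from: list = field(default_factory=list)

class MetadataCatalog:
    """Minimal catalog: register ingested data, then trace lineage."""
    def __init__(self):
        self._datasets = {}

    def register(self, name, source, derived_from=()):
        self._datasets[name] = Dataset(name, source, list(derived_from))

    def lineage(self, name):
        """Walk derived_from links back to the original sources."""
        chain = []
        def walk(n):
            chain.append(n)
            for parent in self._datasets[n].derived_from:
                walk(parent)
        walk(name)
        return chain

# Register a raw ingest, a cleansed copy, and a derived report.
catalog = MetadataCatalog()
catalog.register("raw_sales", source="pos_system")
catalog.register("clean_sales", source="lake", derived_from=["raw_sales"])
catalog.register("sales_report", source="lake", derived_from=["clean_sales"])
print(catalog.lineage("sales_report"))
# → ['sales_report', 'clean_sales', 'raw_sales']
```

The point of the sketch is only that lineage becomes a cheap query once every ingestion and transformation step is registered, which is what the capability asks for.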
  • 10. Tenet 2: Heterogeneous. A hybrid set of traditional and emerging technologies and platforms to acquire, store, interlock and analyze internal and external data will be the norm going forward. Design for simplicity and iteratively build a modular architecture with transition states towards the target.
    Illustrative model from a national retailer (a mix of traditional licensed, emerging licensed, emerging open-source and licensed-plus-open-source components):
    - Sources of known value: sales transactions, customer, product, physical assets
    - Sources of unproven value: call center, social media, web clickstream, mobile interactions
    - Data ingestion layer: ETL connectors, Sqoop, Kafka, Flume
    - Enterprise data lake: HDFS, Spark (RDDs), HBase, Hive (Parquet), data wrangling, match-merge services, metadata management
    - Enterprise data warehouse: ETL/ELT, relational schemas, data exchange
    - Data analytics/visualization: standardized reporting, on-demand/ad hoc analytics, modeling, API-based applications
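To make the ingestion-layer choices in the retailer model concrete, here is a toy Python routing table pairing source types with the ingestion tools named on the slide (Sqoop, Kafka, Flume, generic ETL connectors). The source-type-to-tool mapping is a simplification for discussion, not a recommendation, and the source names are illustrative.

```python
# Toy routing of retailer sources to the ingestion tools named on the slide.
# The source-type-to-tool mapping is a simplification, not a recommendation.
INGESTION_ROUTES = {
    "database": "Sqoop",     # bulk transfer between an RDBMS and HDFS
    "stream": "Kafka",       # durable publish/subscribe for event streams
    "log_files": "Flume",    # agent-based collection of log data into HDFS
}

def route(source_type):
    """Pick an ingestion path, falling back to generic ETL connectors."""
    return INGESTION_ROUTES.get(source_type, "ETL connectors")

# Sources loosely taken from the illustrative retailer model above.
sources = {
    "sales_transactions": "database",   # source of known value
    "web_clickstream": "stream",        # source of unproven value
    "call_center": "log_files",
    "physical_assets": "flat_files",    # no dedicated route: generic ETL
}
plan = {name: route(kind) for name, kind in sources.items()}
print(plan)
```

A real heterogeneous architecture would of course weigh volume, latency and format per source, but the routing-table shape is the same idea the diagram conveys.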
  • 11. Tenet 3: Right fit. Enterprises need to develop a decision model that identifies the mix of 'right fit' open-source and commercial solution components, hosted in the cloud or on premise, based on functionality and business needs.
    Illustrative decision questions:
    - On premise: build or buy? Which vendor distribution? What constraints? Base platform or end-to-end stack? Third-party cloud/tools? Security? Cloud integration? Pre-requisites (hardware, drivers, software interoperability)?
    - Cloud: build or buy? Which cloud vendor? Vendor distribution (IaaS) or native services (PaaS)? Third-party cloud/tools? Security? On-premise integration? Pre-requisites (hardware, drivers, software interoperability)?
  • 12. Tenet 4: Flexible operating model. Recognizes the sophistication and analytics maturity at the business-function level and enables the required capabilities with the necessary skills, processes, tools and support.
    Business operating model:
    1. Business alignment on how the Hadoop environment will operate, including defining: a services catalog; service-level agreements; tracking of usage, benefits and costs; user onboarding and training.
    2. Defining the business architecture: identify capability areas and opportunities to inform the big-data strategy; use-case evaluation (risk, feasibility and business case); prioritization criteria; demand/intake process; business roadmap.
    Technology operating model:
    1. Technology alignment on how the Hadoop environment will operate, including defining: the access model (self-service vs. controlled); data acquisition and classification strategy; organization (develop vs. support); technical skills training.
    2. Defining the technology architecture: architecture guiding principles; leading practices for data acquisition, management and delivery; a reference architecture with solution patterns for the various use cases; storage and infrastructure planning; security model.
  • 13. Five-step strategic approach to build a strong data lake foundation:
    1. Capabilities: leveraging the client's stated capabilities and PwC's capability framework, with business interviews, analytical capabilities are captured and documented.
    2. Use-case specifications: define success criteria, information sources, dimensionality and the information-delivery mechanism for each use case. Each use case must be mapped to a set of capabilities.
    3. Platform architecture and operating model: define end-to-end architecture components ('lego blocks') mapped to the identified capabilities, with leading practices for ingestion, management, analytics and visualization; identify the organization, processes and support structure required for agility.
    4. Architecture patterns: depict the architecture pattern at the use-case level, leveraging the logical architecture 'lego blocks' and showing the information flow, the respective technology components and the integration touch points with the client's systems.
    5. Strategic roadmap for execution: organize the initiatives into a sequenced roadmap with scope, duration and dependencies under various themes.
  • 14. Case study 1: financial services provider, risk modeling for their loans portfolio.
    Current state:
    - Lack of an integrated architecture and scalable technology infrastructure contributed to data-management challenges.
    - The business analytics and modeling teams were looking for more self-sufficiency and process agility.
    - The organization lacked program leadership and program-management discipline, specifically for third-party services and solution providers.
    - Data acquisition and management processes lacked a consistent design and architecture and were heavily siloed on an application-by-application basis.
    Current process (ad hoc analysis, 8-10 hours): SAS sources two CSV files (~3M rows of data in total); aggregation logic is performed and CSV data files are exported to Tableau; no capability to look back past the last month of data.
    Future process (ad hoc analysis, under 30 minutes): aggregation and data-transformation logic performed using HiveQL on 67M records and 36 columns (14.7 GB of data in Hive, 16.3 GB in memory in Spark SQL) on the Hadoop Distributed File System, surfaced in Tableau; response time between 2 seconds and ~1 minute per filter, sourcing live data via Spark SQL.
    Future state:
    - The client developed a next-generation information management and analytics platform that was more business-centric, with an operating model enabling agility, self-service, faster data management and deep analytics for business stakeholders.
    - The data-processing window was reduced from 8-10 hours to less than 30 minutes.
    - Business users were able to access more granular historical data for ad hoc analysis and analytics models.
    Any trademarks included are trademarks of their respective owners and are not affiliated with, nor endorsed by, PricewaterhouseCoopers LLP, its subsidiaries or affiliates.
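To show the shape of the SQL aggregation this future process relies on, here is a sketch using Python's built-in sqlite3 module as a small stand-in for the HiveQL/Spark SQL layer. The loans schema, column names and figures are invented for illustration and bear no relation to the client's actual data or queries.

```python
import sqlite3

# The case study moves an 8-10 hour CSV aggregation into HiveQL/Spark SQL.
# This sketch shows the *shape* of such a roll-up, with sqlite3 standing in
# for Hive; the loans schema and numbers are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (region TEXT, status TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [("east", "current", 1000.0), ("east", "default", 250.0),
     ("west", "current", 800.0), ("west", "current", 400.0)],
)

# Portfolio risk roll-up: total exposure and default share per region.
rows = conn.execute(
    """SELECT region,
              SUM(balance) AS exposure,
              SUM(CASE WHEN status = 'default' THEN balance ELSE 0 END)
                  / SUM(balance) AS default_share
         FROM loans
        GROUP BY region
        ORDER BY region"""
).fetchall()
print(rows)
```

The same GROUP BY pattern, expressed in HiveQL over partitioned Parquet tables, is what lets the 67M-row workload run in minutes instead of hours: the engine parallelizes the scan and aggregation rather than streaming one CSV through a desktop tool.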
  • 15. Case study 2: leading retail distribution company, trade promotion effectiveness (500k SKUs, 250k customers, 5k suppliers, 6k fleets).
    Current state:
    - On-premise, rigid infrastructure with serial data processing and limited capacity.
    - Delayed data availability, reducing applicability to impactful business decisions.
    - No integration with third-party data, causing pain points with vendor collaboration and data access.
    Future state:
    - Flexible, scalable, cloud-based infrastructure enabling multi-stream data processing.
    - Near-real-time data availability via Apache Spark data processing, providing valuable insights for decision making.
    - Easily supported visualization and reporting platforms accessible by internal users and vendors with simple access controls.
    Any trademarks included are trademarks of their respective owners and are not affiliated with, nor endorsed by, PricewaterhouseCoopers LLP, its subsidiaries or affiliates.
  • 16. How PwC is creating awareness and driving adoption in the market:
    - Thought leadership and independent research
    - Strategic alliances: Google, Microsoft, Oracle, SAP
    - Data & Analytics @Scale client delivery
  • 17. Closing thoughts
    - We believe external market forces will propel enterprises to embrace the data lake as the foundation of their data, analytics and emerging-technology strategies.
    - Although barriers to adoption by mainstream enterprises remain, there are ample opportunities for innovation and acceleration by abstracting sophistication behind simplicity and a superior end-user experience.
    - Enterprises should follow the four core tenets* while developing their next-generation information architecture platform.
    - Keep the five-step, 'capability driven' strategic approach in mind!
    - Thanks for attending the session; please contact us with any questions.
  • 18. © 2016 PwC. All rights reserved. PwC refers to the US member firm or one of its subsidiaries or affiliates, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details.