Right Focus and On Target

2. Agenda
Analyze & Define
• Progression of Analytics
• The new phenomenon – Big Data
• Big Data Defined

Technology Discussion
• Big Data technology – Hadoop
• Big Data – big savings – Hadoop

Use Cases
• What can we solve with Big Data – examples
• What is next? Where are the opportunities?
3. Progression of Analytics – So Far…
• Structured, known data
• Traditional tooling – ETL, data marts, DW, RDBMS
• Growth – normal, incremental, with archiving
• Less cross-functional integration
• More tactical than strategic
• Sizes from GBs to TBs
• Data architects vs. functional roles
6. The New Phenomenon – Big Data
1. No to "fit-for-all", yes to "fit-for-purpose"
2. Proliferation of data sources – variety of data
3. Proliferation of the volume of data
4. The demand for speed (velocity) of data
5. The demand for high value and accuracy (veracity) of information
6. Massively parallel processing
7. Commodity servers vs. specialized servers

A DATA-DRIVEN BUSINESS is THE SMART BUSINESS
7. Big Data Definition

• A high volume of data, growing by more than 50% every year
• High-speed streaming and machine-generated data
• A variety of data sources, both inside the enterprise and in the external data around it
• Data collections requiring huge storage (typically 100 TB or more), where an RDBMS is inefficient

The five Vs: Volume, Velocity, Variety, Value, Veracity – data that is meaningful.
8. Big Data Definition

Big Data is the new art and science, using massively parallel processing (MPP) technology, of collecting, storing, processing, distributing, and analyzing data with any of the attributes of high volume, high velocity, and high variety, to extract high value and greater accuracy (veracity).

IBM says BIG DATA means:
1. Volume (terabytes → zettabytes)
2. Variety (structured → semi-structured → unstructured)
3. Velocity (batch → streaming data)
9. Big Data Technologies – Typical Stack

Big Data stack layers:
• Big Data infrastructure
• Data manipulation & management
• Data analysis & mining
• Predictive & prescriptive analysis
• Process automation & decision support systems
10. Big Data Technologies – SMAQ

The SMAQ stack (Storage, MapReduce, and Query):

Query – user-friendly analytics:
1. Pig (simple query language)
2. Hive (similar to SQL)
3. Cascading (workflow)
4. Mahout (machine learning)
5. ZooKeeper (coordination service)

MapReduce – data distribution & management across nodes in batch mode:
1. Hadoop MapReduce
2. Alternatives – BashReduce, Disco Project, Spark, GraphLab, Storm, HPCC (LexisNexis)

Storage – distributed, non-relational:
1. HBase (columnar DB)
2. HDFS – Hadoop Distributed File System
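To make the MapReduce layer of the SMAQ stack concrete, here is a minimal word-count sketch in plain Python. It imitates the three framework phases (map, shuffle/sort, reduce) that Hadoop runs across nodes; the function names and the sample input lines are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, like a streaming mapper."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle/sort: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big savings", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # 3
```

In real Hadoop the map and reduce functions run in parallel on different nodes and the shuffle moves data over the network; the logic per record is the same as above.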
11. Big Data – Big Savings – Economics

ROI on the Big Data approach (with Hadoop): TCO per 1 TB of storage
• Traditional RDBMS: $37,000
• Hadoop: only $2,000
Source: American Institute for Analytics
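The per-terabyte figures above make the savings easy to work out. A small sketch, using the slide's quoted costs and a hypothetical 100 TB data estate:

```python
# Per-terabyte TCO figures quoted on the slide (American Institute for Analytics)
RDBMS_COST_PER_TB = 37_000   # traditional RDBMS
HADOOP_COST_PER_TB = 2_000   # Hadoop on commodity servers

def storage_tco(terabytes, cost_per_tb):
    """Total cost of ownership for a given data volume."""
    return terabytes * cost_per_tb

data_tb = 100  # hypothetical cluster size, not from the slide
savings = storage_tco(data_tb, RDBMS_COST_PER_TB) - storage_tco(data_tb, HADOOP_COST_PER_TB)
print(f"Savings on {data_tb} TB: ${savings:,}")  # Savings on 100 TB: $3,500,000
```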
12. Where Is the Market on Big Data – Current State

• Infrastructure / framework / analytics software
• Horizontal solutions such as EDW
• Verticals: healthcare, retail, government / public sector, education & human capital, health sciences / genomics, telecommunications / services, energy & utilities, e-commerce / marketing, media & entertainment

[Chart: Big Data market in $B, 2010–2015. Source: IDC 2011]
14. Pure Big Data Implementation – Architecture

[Diagram: data sources – web logs, images & videos, social media, documents, structured data – feed through connectors / adapters into a Big Data / Hadoop platform, which drives reporting, OLAP, modeling, predictive, and prescriptive analytics.]

Barriers:
• Disruption to existing analytics?!
• Roadmap / methodology
• Certainty of costs
• Hadoop / Big Table can replace traditional EDWs!!
18. Big Data Opportunities

Some gaps & opportunities:
• Real-time analysis (perhaps using SAP HANA, etc.!!)
• User interface (UI) frameworks
• App development for Big Data on the cloud (multi-tenancy)
• Security & data governance
• Cross-application integration
• Industry standards
20. Our Big Data Strategy at a Glance

Business Focus
• Identify data needs
• Identify business issues
• Lay out data dependencies between functions
• Resolve competing priorities
• Clearly lay out the levels of data and cross-functional requirements

Stakeholder Focus
• Identify the stakeholders
• Align best practices with the project
• Plan out the objectives, scope, and timelines
• Identify the KPIs, reports, dashboards, and predictive & prescriptive analyses to be delivered

Technology Focus
• Find synergies in current technology
• Take stock of existing "technology assets" relevant to Big Data
• Assess your current capabilities and architecture
• Identify the resources and minimize "specialties" to exploit synergies with the existing resource pool
• Lay out a development methodology to streamline delivery

Process Focus
• Establish clear data flows
• Identify the Data Governance execution process – people, processes, mechanisms
• Design the process to be business-focused rather than IT-focused
• Clearly establish the measures to achieve – accuracy, repeatability, agility, and accountability (reconcilability)
21. Our Execution Approach – Agile Methodology

An agile approach to reduce risk:
• Close coordination between the customer and the developer
• Small incremental steps make testing easier and more manageable, and avoid surprises
• Early recovery from expectation mismatches
• Clarity on design understanding and regular communication with users
• Early warning about risks through regular status reports
• Full knowledge transfer
Progression of Analytics – 3 minutes
The new phenomenon – Big Data – 4 minutes
Big Data Defined – 3 minutes
Where is the technology – 5 minutes
What can we solve with Big Data – example case studies – 5 minutes
What is next? Where are the opportunities? – 10 minutes
Internal information – known questions and answers, known structures, structured data types, known volumes, mostly transactional data
Master data is very well defined. Storage: typical data warehouses and data marts using batch processing, traditional ETL, and relational databases
Data growth is incremental, with regular archival
Mostly reporting with a little mining – largely descriptive; predictive analysis is very light
Cross-functional integration of data is very limited, structured around customers, services & products, logistics, etc.
Functional and technical responsibilities are clearly demarcated – mostly data engineers / architects at the back end supporting business analysts / users
Most reports merely measure tactics – they support strategy rather than inducing one
Data sizes are in the gigabyte-to-terabyte range; the approach becomes inefficient and costly beyond a certain size
Narrow & focused business missions – not "fit-for-all" but "fit-for-purpose"
The need to discover more – facts, relationships, indicators, patterns, trends, and pointers that probably could not be discovered before, by cross-integrating data from various sources
The need to capture and store data, not just collect it
Proliferation of data sources – variety of data: multi-dimensional data, streaming data, geo-spatial data, social networking data, internal data (RDBMS), video & image data, text data (logs etc.), time-series data, genomics
Proliferation of the volume of data (now into petabytes and above): internet / intranet, social networks (Facebook & Twitter), mobile devices, smart home devices, smart systems (utilities etc.), media & entertainment
The demand for speed (velocity) of the data collected, understood, processed, and distributed: accessibility – where, when, who, and how; time value – real time or not; increased speeds of consumption and of data generation
Demand for high value and accuracy (veracity) of information
Advent of massively parallel processing technology – availability of Hadoop / MapReduce-style open-source and packaged technologies
Affordability of infrastructure – commodity servers vs. specialized servers
Hadoop enables a computing solution that is:
Scalable – new nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
Cost-effective – Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
Flexible – Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Fault-tolerant – when you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
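The fault-tolerance point rests on block replication: HDFS keeps each block on several nodes (three by default), so a node failure leaves live copies that work can be redirected to. A minimal simulation of that placement logic, with hypothetical node and block names:

```python
import random

REPLICATION_FACTOR = 3  # HDFS default

def place_blocks(blocks, nodes, rf=REPLICATION_FACTOR):
    """Assign each block to rf distinct nodes, roughly as the NameNode does."""
    return {block: random.sample(nodes, rf) for block in blocks}

def surviving_copies(placement, failed_node):
    """After a node failure, list the replicas still reachable for each block."""
    return {block: [n for n in replica_nodes if n != failed_node]
            for block, replica_nodes in placement.items()}

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(["blk_0", "blk_1"], nodes)
after_failure = surviving_copies(placement, "node3")

# Every block keeps at least rf - 1 = 2 live replicas, so processing can
# continue from the surviving copies without data loss.
assert all(len(replicas) >= 2 for replicas in after_failure.values())
```

The real system also re-replicates under-replicated blocks in the background to restore the replication factor; this sketch only shows why a single failure is survivable.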
The word of the hour is "SMART"!!
Smart business – a targeted value proposition. Businesses are under pressure to maximize their investments (a focused approach, not a one-size-fits-all methodology).
Targeted value proposition: targeted advertisement, tailored menus, focused initiatives, individualized attention, non-impersonal messaging, efficient governance, greater accuracy.
Targeted advertisement, tailored menus, focused initiatives, individualized attention, non-impersonal messaging, efficient governance, greater accuracy.
Businesses want to gain competitive advantage by being able to take action based on timely, relevant, complete, and accurate information, rather than one-fit-for-all solutions.
Within the immense volume, variety, and velocity of data produced today lie new information, facts, relationships, indicators, and pointers that either could not practically be discovered in the past or simply did not exist before.
The market has just started picking up.
There is a large gap in vertical solutions.
The biggest gap is in Big Data services.
Hardware & software components already seem to be available.
• Adapting to real-time analysis (perhaps using HANA!!)
• Development of industry standards
• Development of a universal schema for metadata and cataloging
• Tools to support security & data governance
• Support for cloud-ification (multi-tenancy)
• Support for data lineage
• A framework for cross-application integration
• Support for testing
• An automated & configurable monitoring and management console
• User interface (UI) frameworks
Business Focus
• Identify data needs for strategic business functions
• Identify the business issues that need to be solved with Big Data
• Lay out data dependencies between functions
• Resolve competing priorities
• Clearly lay out the levels of data and cross-functional requirements

Technology Focus
• Identify the right technology to align with the current landscape, for synergies in technology
• Take stock of existing "technology assets" relevant to Big Data
• Assess your current capabilities and architecture to support your goals, and select the deployment strategy that best fits your Big Data questions
• Identify the resources and minimize "specialties" to exploit synergies with the existing resource pool
• Lay out a development methodology to streamline delivery

Stakeholder Focus
• Clearly identify the stakeholders at all levels of data consumption
• Present best practices and align them with the project
• Plan out the objectives, scope, and timelines
• Identify the KPIs, reports, dashboards, and predictive & prescriptive analyses to be delivered

Process Focus
• Establish clear data flows from collection to consumption of data
• Identify the Data Governance execution process – people, processes, mechanisms
• Design the process to be business-focused rather than IT-focused
• Clearly establish the measures to achieve – accuracy, repeatability, agility, and accountability (reconcilability)