- Enterprise organizations have both legacy and emerging solutions
- Optimizing the solution for the right audience and the right use-cases is critical for adoption across the user-base
3. Data Analytics – Basic Concepts
• Business Intelligence
o Using the available data to make factual business decisions
o “WHAT” is happening to your business right now?
• Business Analytics
o Steps that lead up to a business decision
o Data Mining - the process of looking for trends, patterns, or other useful information within a dataset
o Diagnostic analytics - “WHY” is something happening right now?
o Predictive analytics - “WHAT” will happen in the future?
o Prescriptive analytics - “WHAT” should be done next?
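The descriptive/predictive split above can be illustrated with a minimal sketch (plain Python; the revenue series and function names are hypothetical, for illustration only): a descriptive/BI view reports what is happening now, while a naive least-squares trend extrapolates what will happen next.

```python
# Minimal sketch contrasting descriptive (BI) and predictive analytics
# on a hypothetical monthly revenue series (illustrative numbers only).

def describe(values):
    """Descriptive/BI view: WHAT is happening right now."""
    return {"latest": values[-1], "average": sum(values) / len(values)}

def forecast_next(values):
    """Naive predictive view: WHAT will happen next,
    via a least-squares linear trend over the series."""
    n = len(values)
    xs = list(range(n))
    mx, my = sum(xs) / n, sum(values) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, values))
             / sum((x - mx) ** 2 for x in xs))
    return my + slope * (n - mx)

revenue = [10.0, 12.0, 14.0, 16.0]   # hypothetical data
print(describe(revenue))             # {'latest': 16.0, 'average': 13.0}
print(forecast_next(revenue))        # 18.0
```

Diagnostic and prescriptive analytics go further (explaining the trend, and recommending an action), but even this toy split shows why the two classes of question need different tooling.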
4. Enterprise Analytics Landscape
• Enterprises typically have Users categorized broadly as -
o Business users – most interested in current metrics, fiscal trends, dashboards
o Engineering users – most interested in diagnostics (find needle-in-haystack),
deep-analytics
o An enterprise analytics solution stack should cover the self-service needs of the above broad user-base
• Existing Data-stores Have Varying Use-cases
o Representing specialized data (application specific)
o Organizational units having independent solutions (IT, Engineering, Support, etc.)
o Data architecture demands (BI tool backend, Datamarts, OLTP/OLAP etc)
• Enter Hadoop Datalake…
o Answering “WHY” you need a Hadoop Datalake in your analytics landscape is critical
o What short- and long-term goals need to be met?
o Not meant to be a one-stop-shop solution that replaces existing databases and workflows
o An enterprise has several types of users (by broad skill level) - a self-service solution stack should cater to this broad user base with a mix of several tools
5. Understanding Existing Data-Stores
• Business Analytics system (Analytical Cubes)
o Data characteristic: Structured data of pre-computed measures
o Access interface/tool: Currently SQL Server
o Good for standard business metrics of current and fiscal trends
• Decision Support system / Datamart
o Data characteristic: Structured data as star schema with Dims and Facts
o Access interface/tool: Currently Oracle
o Good for interactive adhoc reporting
• Big Data system (Datalake)
o Data characteristic: Structured and semi-structured data at event granularity
o Access interface/tool: Hive, M/R, Datameer
o Good for diagnostic mining and general adhoc reports at scale
• Raw Data
o Data characteristic: Original data persisted in its incoming form
o Access interface/tool: HDFS (M/R), NFS (scripts), REST
o Useful for ELT to feed into other data sources
• Granularity spectrum: the Datamart/Cube side holds a lower-granularity subset of the source data, while the Datalake/Raw side holds the highly granular and complete dataset
6. End User Categories and Expectations
• Advanced Users (Data Engineers/Scientists)
o Usage characteristics: Enhance and persist the data-model, develop deep-insights workflows
o Interface characteristics: Frameworks, APIs
o Sample tools: Map-reduce, Hive, Pig, Spark, R, Programmatic (JDBC..)
• Technical Analysts
o Usage characteristics: Generate adhoc and canned reports
o Interface characteristics: SQL and transformation-workflow based tools
o Sample tools: Oracle, SQL-Server, Hive, R, Vertica, Teradata, Datameer, Tableau, PDI
• Exec-users (Non-Technical)
o Usage characteristics: Consume predefined metrics, dashboards, drag-n-drop what-if analysis
o Interface characteristics: Visual, natural-language based tools
o Sample tools: Tableau, OBIEE, PBA, Excel, Microstrategy, Search UI
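The category-to-tool mapping above can be sketched as a simple lookup (the categories and tool lists come from the table; the data structure and key names are illustrative, not from the deck):

```python
# Sketch of "right tool for right user": map each user category from the
# table above to its interface style and sample tools.

USER_TOOL_MATRIX = {
    "advanced": {
        "interface": "Frameworks, APIs",
        "tools": ["Map-reduce", "Hive", "Pig", "Spark", "R", "JDBC"],
    },
    "technical_analyst": {
        "interface": "SQL and transformation-workflow based tools",
        "tools": ["Oracle", "SQL-Server", "Hive", "R", "Vertica",
                  "Teradata", "Datameer", "Tableau", "PDI"],
    },
    "exec": {
        "interface": "Visual, natural-language based tools",
        "tools": ["Tableau", "OBIEE", "PBA", "Excel",
                  "Microstrategy", "Search UI"],
    },
}

def tools_for(category):
    """Return the sample tools provisioned for a user category."""
    return USER_TOOL_MATRIX[category]["tools"]
```

A provisioning workflow would consult such a matrix when granting platform access, rather than giving every user the same interface.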
7. User and Use-case Requirement Considerations
• Demarcate target Users – Provision right Tool to right Users/Use-cases
– Not all users can or should be given a Hadoop Datalake interface in a self-service model
– No one tool can fit all use-cases
• Build a consolidated view of existing data sources covering the most
common domain objects, to target the “BI”-based self-service model
• Data architecture - Data-layout and Data-model for the above
“Consolidated view”
– Star-schema vs Analytic Cube vs Flat OLTP schema
– MPP Analytic Database vs OLAP Cube vs DSS
– Traversing and Finding Metadata - Search interface to find entities, attributes and data
– Documentation covering data-model and data-dictionary
• Performance considerations
– High Performance and Concurrency support backend for interfacing BI Tools
– Scalable environment for batch, mining use-cases
– Interactive programmatic platform for data engineering
• Miscellaneous Operational Considerations (slide7)
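The “Traversing and Finding Metadata” consideration above can be sketched as a tiny in-memory catalog search (the entity and attribute names here are hypothetical; a real deployment would index the data dictionary behind a search interface):

```python
# Hypothetical in-memory metadata catalog with a simple search interface,
# sketching the "find entities, attributes and data" requirement.

CATALOG = {
    "orders":    ["order_id", "customer_id", "order_date", "amount"],
    "customers": ["customer_id", "region", "segment"],
}

def search_metadata(term):
    """Return (entity, attribute) pairs whose name contains the term.
    An entity-level hit is reported as (entity, None)."""
    term = term.lower()
    hits = []
    for entity, attrs in CATALOG.items():
        if term in entity:
            hits.append((entity, None))
        hits.extend((entity, attr) for attr in attrs if term in attr)
    return hits
```

Even this toy version shows why a search interface beats documentation alone: one query surfaces every entity and attribute that mentions a domain term.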
9. Objectives For Holistic Analytics Platform
• Establish a self-service Analytics platform to cover BI and
Analytics use-cases for Internal users
• Support 3Vs of User types and Access patterns
o Volume of data
o Variety of Users (Programmatic and Non-technical)
o Variety of Queries (Adhoc, Not pre-defined)
o Velocity (Interactive query response, Dashboarding)
• Design Principles
o Embrace the ideology that “one tool doesn’t fit all use-cases and user preferences”
o Ease of use (front-end interface and backend data-model)
o Improved performance of query response times
10. Datalake Analytics Platform – Conceptual View
(Layered architecture diagram; layers and annotations summarized below)
• DataStore Layer: MPP/Analytic Database (with extended datamodel), PUAT Datamart, Hive, HDFS
• Processing Engine Layer: Spark, Cloudera Search
• Viz. and Data Access Layer: BI Tool front-end, Hue UI (Hive, Search), Search front-end, Spark CLI / Hive JDBC (programmatic access), Datameer (non-programmatic)
• Engineering focused Self-serve Reporting (Analysts, Data engineers, Data scientists)
o Focus on data processing & integration frameworks
o Adhoc data mining, complex data transformations, machine learning
o 25-50 concurrent users
• Business focused Self-serve Reporting (Analysts, Execs, non-technical audience)
o Focus on visualization & metrics (not data processing)
o Support adhoc and canned self-service reports
o 100+ concurrent users
11. Datalake Analytics Platform – Technology View
(Architecture diagram; recoverable components and flows summarized below)
• Source data: HDFS (original source), structured config feed, latest system snapshot (raw), latest week raw & structured, other sources…
• Processing workflows: Spark Data Prep FW, daily M/R HDFS transforms, Cloudera Search indexing prep FW, Data-Prep/Transform (SnapLogic/Datameer), Data-Prep/Filter & Import (SnapLogic), on-demand parsed content, raw data export, published extended schema
• Data stores: HDFS (transformed) with time-based SeqFile layout and system-based Parquet layout, Vertica MPP Analytic DB (12-month window, flattened star-schema), Datamart, SSAS
• Access interfaces, by use-case:
o Adhoc SQL queries - Hive/Impala via Hue UI / edge-node CLI
o Text search & search analytics - Cloudera Search Hue UI
o Self-serve BI reporting - Tableau/Pentaho BA, ZoomData
o Statistical analytics - Spark CLI/MLlib, DistributedR
o On-demand data transformations - SnapLogic/Datameer
• Legend (from diagram): Existing Components, New Components, Processing Workflows, Other
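The “time based layout” for transformed data in HDFS can be sketched as date-partitioned directory paths. The base path and the year=/month=/day= scheme below are assumptions for illustration (a common Hive-style convention), not the deck’s actual layout:

```python
from datetime import date

def time_partition_path(base, day):
    """Build a Hive-style date-partitioned HDFS path for one day's data.
    The year=/month=/day= scheme is an assumed, common convention."""
    return (f"{base}/year={day.year:04d}"
            f"/month={day.month:02d}/day={day.day:02d}")

# A daily transform job would write its output under, e.g.:
print(time_partition_path("/datalake/transformed/events", date(2016, 3, 7)))
# /datalake/transformed/events/year=2016/month=03/day=07
```

Laying files out this way lets Hive/Impala prune partitions by date predicate instead of scanning the full dataset.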
12. Evolving Other Operational Requirements
Agility and Productivity for End users
User Types:
- Business users (typically from Sales, Product management, other execs)
- Engineering users (Developers, QA, Technical support engineers, Analysts, Data scientists)
- Semi/non-technical users - easy-to-use drag-n-drop interface
- Advanced users - programmatic and SQL-based interfaces
Business users workflows:
- Self-service - answer “What” questions
- Analytic Database - consolidated data model supporting quick visualization, good performance, and a lower learning curve
Engineering users workflows:
- Self-service - answer “Why” and “What next” questions
Ease of access to Data
- Abstracting data complexities, provisioning prepared data to cover standard use-cases
- Query response times, data mobility (transfer) issues
Understanding the Dataset
- Documentation, Catalog, Data Dictionary, Data Exploration
Monitoring and Governance
- Monitor & recover user and system jobs/service failures
- Analytics on Analytics - user and system behaviour
- Data quality, security, etc.
Improved Performance considerations
- High-performance, high-concurrency platform for user interactions via BI Tools
- Scalable environment for batch and mining use-cases
- Interactive programmatic platform for data engineering
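The “monitor & recover job failures” requirement above can be sketched as a bounded retry wrapper that also records each attempt for “analytics on analytics”. This is a minimal illustration, not the platform’s actual mechanism; real deployments would lean on the scheduler’s own retry and alerting:

```python
# Minimal sketch of job-failure recovery: retry a job a bounded number
# of times, logging each attempt for later behavioural analytics.

def run_with_retry(job, max_attempts=3, log=None):
    """Run `job` (a zero-arg callable), retrying on exception.
    Appends ('ok'|'fail', attempt_number) records to `log` if given."""
    log = log if log is not None else []
    for attempt in range(1, max_attempts + 1):
        try:
            result = job()
            log.append(("ok", attempt))
            return result
        except Exception:
            log.append(("fail", attempt))
    raise RuntimeError(f"job failed after {max_attempts} attempts")
```

The attempt log is the raw material for the “analytics on analytics” bullet: aggregating it per job and per user reveals chronically flaky workflows.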
Abbreviations:
CLI – Command-Line Interface
MLlib – Machine Learning library
Data Prep FW – Data Preparation Framework
MPP – Massively Parallel Processing
BI – Business Intelligence