This document outlines Oracle's third generation Information Management Reference Architecture. It defines key concepts like the Raw Data Reservoir for storing immutable raw data, and the Foundation Data Layer for standardized enterprise data. It describes logical components like the Data Factory for ingestion and interpretation, and the Access and Performance Layer for enabling queries. It also provides design patterns for different use cases including a Discovery Lab, Information Platform, and Real-Time Event processing. Overall the architecture aims to practically manage all types of data at scale to maximize information value.
4. Introduction
This PPT documents the main architectural components of Oracle's Information Management Reference Architecture.
The architecture is intended to be practical and pragmatic. Many of the ideas and experiences that inform the approach date back almost 20 years at Oracle and are based on real-world customer experiences.
We define Information Management to mean the following. Please note that
this definition embraces all types and forms of data as well as embracing
aspects such as Information Discovery and Governance:
“Information Management is the means by which an organisation maximises the efficiency
with which it plans, collects, organises, uses, controls, stores, disseminates, and disposes
of its Information, and through which it ensures that the value of that information is
identified and exploited to the maximum extent possible”
3rd Evolution of Oracle's Information Management Reference Architecture
5. Oracle’s Information Management Reference Architecture (3rd Edition)
What's changed?
• More relevant to a Big Data oriented audience
• Better representation of pragmatic customer projects
• Includes a Raw data store as part of the architecture
• Shows the effort / cost to store and interpret data that separates schema-on-read and schema-on-write approaches
• Aligned to Analytics 3.0
• Consistent with Oracle's engineering efforts
6. Aligning analytical requirements and IM architecture
Enabling Analytics 3.0 with a pragmatic architecture
Analytics 1.0
• Reporting with limited use of descriptive analytics
• Limited range of tabular data
• Batch oriented analysis
• Analysis bolted onto a limited set of business processes
Analytics 2.0
• Firms "Competing on Analytics"
• Extended analytics to larger and less structured datasets
• Emergence of Big Data into the commercial world
• Recognition of the Data Science role in commercial orgs.
Analytics 3.0
• Platform for monetisation
• Deeper analysis & more data
• Faster test-do-learn iterations
• Different types of data & wider business process coverage
• Analysts focus on discovery and driving business value
• "Agile" with operational elements incorporated into design patterns
Adapted from Tom Davenport material
7. Oracle’s Information Management Reference Architecture (3rd Edition)
"All those layers and definitions in your Reference Architecture, I just don't get it… and it looks complicated!"
– Hadoop developer knee-deep in complex MapReduce code
What's changed? Business Trends, Technology Trends, Data Trends.
9. Conceptual View
[Diagram: Input Events → Event Engine → Actionable Events; Structured Enterprise Data and Other Data → Data Reservoir, Data Factory and Enterprise Information Store → Reporting → Actionable Information; Discovery Lab → Actionable Insights; Output Events & Data; swim-lanes for Execution, Innovation and Discovery]
10. Component Outline
Data Engine – respond to R/T events in an appropriate and/or optimised fashion
Data Reservoir – raw data reservoir, typically event data at the lowest grain
Data Factory – managed ETL onto, within and between platforms
Enterprise Data – data stores for Information Management
Reporting – BI tools and infrastructure components
Discovery Lab – platform, data and tools to support the discovery process
Execution – things you do every day
Innovation – innovation to drive tomorrow's business
Line of Governance
Discovery Output – possible outputs include new knowledge, mining models / parameters, scored data…
12. Design Pattern: Discovery Lab
Specific focus on identifying commercial value for exploitation
Small group of highly skilled individuals (aka Data Scientists)
Iterative development approach – data oriented NOT development oriented
Wide range of tools and techniques applied
Data provisioned through the Data Factory or own ETL
Typically separate infrastructure, but could also be a unified Reservoir if resources are managed effectively
13. Design Pattern: Information Platform
Build the next generation Information Management platform
Either a Business Strategy driven or an IT cost / capability driven initiative
Initial project may be specifically linked to lower data grain or retention, BUT it is the platform as a whole that forms the required solution
Platform for consolidating other IM assets onto
Key issues relate to differences in procurement, development process, governance and skills
Discovery Lab may be implemented as a pragmatic initial POV.
14. Design Pattern : Data Application
14. Design Pattern: Data Application
Big Data technologies applied to a specific business problem, e.g. genome sequence analysis using BLAST, or log data from pharmaceutical production plant and machinery required for traceability
Limited or no integration to the broader Information Management estate
Specific solution, so non-functional requirements have less impact on solution quality or long-term costs
Platform costs and scalability are important considerations
15. Design Pattern: Information Solution
Specific solution based on Big Data technologies requiring broader integration to the wider Information Management estate, e.g. an ETL pre-processor for the DW, or affordably storing a lower level of grain
Non-functional requirements are more critical in this solution
Scalable integration to the IM estate is an important factor for success
Analysis may take place in the Reservoir, or the Reservoir may be used only as an aggregator
16. Design Pattern: Real-Time Events
Real-Time optimisation of events
May take place at multiple locations between the place of data origination and the Data Centre – requiring careful design and implementation
May include Next-Best-Activity, declarative rules and Data Mining technologies to optimise decisions, i.e. optimise across declarative, data mining, customer preference & business-defined rules
May include considerations for personal preferences and privacy (e.g. opt-out) for customer-related events
Common component seen across many industries & markets, e.g. connected vehicle
17. Design Pattern against component usage map
Design pattern outlines:
– Discovery Lab: data science lab; assess the value of the data
– Information Platform: next generation information platform to align IM capability with business strategy
– Data Application: addressing a specific data problem in Hadoop with no broader integration required
– Information Solution: addressing a specific data problem but requiring broader enterprise-wide integrations, e.g. ETL pre-processing, Event Store at lower grain than the existing DW
– R/T Events: execution platform to respond to R/T events
Examples:
– Discovery Lab: Gov. Healthcare; Mobile operator
– Information Platform: Spanish Bank (business led); UK Gov. Dept. (tech. led)
– Data Application: Pharma Genome project; Pharma production archive
– Information Solution: Investment Bank – trade risk; Mobile Operator – ETL processing
– R/T Events: Mobile operator – location-based offers
Component usage:
– Data Engine: Possible / Yes
– Data Reservoir: Yes / Yes / Yes
– Data Factory: Yes / Yes / Yes
– Enterprise Data: Yes
– Reporting: Yes
– Discovery Lab: Yes / Implied / Alternative approach to Reservoir + Factory above
19. Information Management – Logical View
Data Sources / Data Ingestion: methods and process to load data into our managed data store and manage data quality
• Contemporary Information Management solutions must be able to ingest any type of data, from any source, in any format, via any mechanism and at any frequency, e.g. flat file loads, streaming…
• The data may be highly unstructured, mono-structured or highly poly-structured.
• Data will vary in volume and in Data Quality.
• Operational isolation should be considered to ensure operational applications will continue in the event of the loss of the Information Management system.
[Diagram: Data Engines & Poly-structured sources (Content, Docs, Web & Social Media, SMS); Structured Data Sources (Operational Data, COTS Data, Master & Ref. Data, Streaming & BAM)]
20. Information Management – Logical View
Information Ingestion
[Diagram: Data Ingestion (methods and process to load data and manage Data Quality) → Managed Data (all data under management) → Information Interpretation (methods and process needed to access information); Load / Query]
• Data structure and processing required to load data into managed data stores.
• Shape represents the work done on the data to load it and/or process it between layers.
• Layer may include a file mechanism where required to facilitate loading (e.g. Fuse fs or ZFS for operational isolation and file concatenation).
• Normal rules of micro-batch, taking all the data, and KISS principles are recommended.
• DQ and loading stats are presented through BI dashboards as a non-judgemental mechanism to improve DQ.
• Data may be landed in the Ingestion layer to facilitate loading but is not typically stored for any length of time, e.g. raw data loaded from web logs, with sessionised data then loaded to Raw. Another example is data used to manage CDC, which may be stored in this layer.
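The web-log example above can be sketched in a few lines. This is a hypothetical illustration of sessionising raw events in the Ingestion layer before they land in Raw; the 30-minute inactivity threshold, field shapes and user ids are all assumptions, not part of the architecture itself.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold

def sessionise(events):
    """Group raw (user_id, timestamp) click events into sessions.

    `events` is an iterable of (user_id, datetime) pairs, e.g. parsed
    from web-log lines landed in the Ingestion layer. Returns a dict
    mapping user_id -> number of sessions. Illustrative sketch only.
    """
    last_seen = {}
    sessions = {}
    for user, ts in sorted(events, key=lambda e: e[1]):
        prev = last_seen.get(user)
        if prev is None or ts - prev > SESSION_GAP:
            sessions[user] = sessions.get(user, 0) + 1  # new session starts
        last_seen[user] = ts
    return sessions

# Example micro-batch: two users, one with a 45-minute gap.
t0 = datetime(2014, 1, 1, 9, 0)
batch = [
    ("u1", t0), ("u1", t0 + timedelta(minutes=5)),
    ("u1", t0 + timedelta(minutes=50)),   # gap > 30 min -> second session
    ("u2", t0 + timedelta(minutes=1)),
]
print(sessionise(batch))  # {'u1': 2, 'u2': 1}
```

In line with the bullets above, the raw log lines themselves would not be retained in this layer; only the sessionised output moves on to the Raw store.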
21. Information Management – Logical View
Data Interpretation
[Diagram: Data Ingestion → Managed Data (all data under management) → Information Interpretation; Load / Query, as on the previous slide]
• Methods and processes required to access information in each of the stores.
• Shape represents the cost of interpreting the data under management.
• For schema-on-read, the cost may include the Avro schema, SerDe or reader class, as well as the associated processing code to select, filter and process the data.
• For schema-on-write, the cost is represented only by the complexity of the SQL required to access the data – typically more complex for 3NF than for a dimensional query.
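The contrast above can be made concrete with a minimal sketch. The raw text, field names and schema are invented for illustration; the point is only where the interpretation cost sits – in every consumer for schema-on-read, versus paid once at load time for schema-on-write.

```python
import csv
from io import StringIO

# Raw data at rest: uninterpreted delimited text (schema-on-read side).
RAW = "2014-01-03,acme,120.5\n2014-01-04,acme,98.0\n2014-01-04,beta,55.0\n"

# The schema is applied only when the data is read -- the "reader class"
# cost lives with every consumer. Field names are purely illustrative.
SCHEMA = [("day", str), ("customer", str), ("amount", float)]

def read_with_schema(raw_text):
    """Interpret raw text rows using the schema, at query time."""
    for row in csv.reader(StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, row)}

# Schema-on-read query: select + filter + process, all in consumer code.
acme_total = sum(r["amount"] for r in read_with_schema(RAW)
                 if r["customer"] == "acme")
print(acme_total)  # 218.5

# Schema-on-write equivalent: the parse/validate work was paid at load
# time, so the "query" is just a lookup over already-typed rows.
LOADED = list(read_with_schema(RAW))   # stands in for a relational table
beta_total = sum(r["amount"] for r in LOADED if r["customer"] == "beta")
print(beta_total)  # 55.0
```

Every additional schema-on-read consumer repeats the `read_with_schema` work (and must keep its copy of the schema consistent), which is exactly the cost the interpretation shape represents.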
22. Information Management – Logical View
Data Layers – cost, quality and concurrency trade-off
Managed Data:
– Access & Performance Layer: past, current and future interpretations of enterprise data; structured to support agile access & navigation
– Foundation Data Layer: immutable modelled data in Business Process Neutral form, abstracted from business process changes
– Raw Data Reservoir: immutable raw data reservoir; raw data at rest is not interpreted
Moving up the layers: increasing enrichment, increasing data quality, increasing formalisation of definition, reducing concurrency costs.
• Data under management includes 3 key layers – the Raw, Foundation, and Access and Performance layers.
• Data is normally loaded into the Raw and Foundation layers, BUT BI Apps loads data directly into the APL, and federated warehouses may well also load data at aggregate level from federated operating companies.
• The Data Factory is responsible for loading and then managing data between layers.
• Work is done to elevate the data between layers – typically further enriching it and improving data quality.
• Work done in processing the data between the layers significantly reduces query costs, i.e. higher levels of concurrency can be sustained for the same processing power.
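The elevation of data between the layers can be sketched end to end. This is a toy illustration only – the MSISDN fields, the DQ rule and the monthly aggregation are assumptions standing in for real Data Factory processing – but it shows how work done between layers converts a costly event scan into a cheap lookup.

```python
# Raw layer: immutable events at the lowest grain (fields illustrative).
raw_events = [
    {"msisdn": "07700 900001", "type": "SMS", "ts": "2014-01-03T10:00"},
    {"msisdn": "07700900001",  "type": "sms", "ts": "2014-01-03T11:00"},
    {"msisdn": "07700 900002", "type": "SMS", "ts": "2014-01-05T09:30"},
    {"msisdn": "",             "type": "SMS", "ts": "2014-01-05T09:31"},  # fails DQ
]

def to_foundation(events):
    """Elevate Raw -> Foundation: standardise values, enforce quality."""
    out = []
    for e in events:
        msisdn = e["msisdn"].replace(" ", "")
        if not msisdn:            # DQ rule: subscriber id must be present
            continue
        out.append({"msisdn": msisdn, "type": e["type"].upper(),
                    "month": e["ts"][:7]})
    return out

def to_apl(foundation):
    """Elevate Foundation -> APL: pre-aggregate for cheap, concurrent queries."""
    agg = {}
    for e in foundation:
        key = (e["msisdn"], e["month"], e["type"])
        agg[key] = agg.get(key, 0) + 1
    return agg

apl = to_apl(to_foundation(raw_events))
# A dashboard query is now a single lookup instead of an event scan:
print(apl[("07700900001", "2014-01", "SMS")])  # 2
```

Note that the raw events are never modified: the layers above can be rebuilt from them at any time, which is the property the immutability bullets rely on.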
23. Information Management – Logical View
Data Layers – Analytical processing
[Diagram: Raw Data Reservoir → Foundation Data Layer → Access & Performance Layer, with analytical processing at each layer: OLAP, Data Mining, Statistics, Text Mining, Image Processing and other analytical processing. Moving up the layers: increasing enrichment, increasing data quality, increasing formalisation of definition, reducing concurrency costs.]
• Analytical processing capabilities of Hadoop and the RDBMS are used to elevate data between layers, as previously described.
• These analytical capabilities can also be leveraged by tools that access the data directly. Typically this would be a Data Scientist performing Discovery Lab operations, or BI tools and services processing data using a model previously defined by the Data Scientist.
24. Information Management – Logical View
Data Layers – Raw Data Reservoir
[Diagram: Raw Data Reservoir / Foundation Data Layer / Access & Performance Layer, with layer descriptions as previously defined]
• Immutable data store with data at the lowest level of grain.
• Typically implemented in Hadoop or NoSQL for cost reasons, but not always.
• May be:
– queried directly,
– used to derive base-level data for the Foundation Layer. Data may be represented logically in Foundation, or physically – as the store is immutable – BUT this affects ILM policy,
– or used to derive values or aggregates for the Access and Performance Layer (e.g. a propensity score or total monthly SMSs).
25. Information Management – Logical View
Data Layers – Foundation Data Layer
[Diagram: Raw Data Reservoir / Foundation Data Layer / Access & Performance Layer, with layer descriptions as previously defined]
• Immutable, integrated and standardised store of enterprise-class data. Stuff the business has agreed on and organises around.
• Data at the lowest level of grain of value for Enterprise data.
• Stored in a business process neutral fashion to avoid the data maintenance tasks needed to keep in step with current business interpretations.
• Typically close to 3NF. Special attention to modelling hierarchies, flexible entity attributions, customer / supplier etc.
• ONLY implemented in relational technology, BUT this could be logical, as previously noted for the Raw Data Reservoir.
• May be queried directly by a select few individuals. Wider access to detail data is provided through views in the APL, often with VPD implemented to prevent queries against antecedent data.
• Data in the Foundation Layer should be retained for as long as possible.
• Consideration should be given to retaining data in the Raw Data Reservoir rather than archiving.
26. Information Management – Logical View
Data Layers – Access and Performance Layer
[Diagram: Raw Data Reservoir / Foundation Data Layer / Access & Performance Layer, with layer descriptions as previously defined]
• Layer facilitates access, navigation and performance of queries.
• Allows for multiple interpretations of data from the Foundation Layer or Raw Data Reservoir.
• Most structures can be thrown away and re-built from scratch based on the Foundation Layer and Raw Reservoir.
• The exception is derived and aggregate data, which may have to be retained if the underlying data/mechanism is archived.
• Most users presenting information in a standardised fashion on dashboards and reports will access this layer only.
27. Information Management – Logical View
Data Factory ingestion flow
[Diagram: Data Engines & Poly-structured sources (Content, Docs, Web & Social Media, SMS) and Structured Data Sources (Operational Data, COTS Data, Master & Ref. Data, BAM Data) → Data Ingestion (Batch & Real-Time: ETL / ELT, CDC, Stream, File Ops.) → Raw Data Reservoir / Foundation Data Layer / Access & Performance Layer → Information Interpretation]
• Data destined for the Raw Data Reservoir may be loaded directly (e.g. through Flume) or may be stored temporarily in fs prior to loading (e.g. Fuse fs).
• Relational data is ingested via the most appropriate mechanism before persisting in the Foundation Data Layer (the usual rules apply…).
• Ideally micro-batch, using the simplest mechanism possible.
• Only data of agreed quality is loaded into the FDL.
• For efficient relational loading, data may be pre-staged in fs so that a large number of small files can be concatenated.
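The small-file concatenation mentioned above can be sketched very simply. This is a minimal, hypothetical illustration – file names and layout are invented, and a real Data Factory would add manifests, checksums and atomic renames around the same idea.

```python
import os
import tempfile

def concat_small_files(paths, out_path):
    """Concatenate many small landed files into one larger staged file,
    so a single bulk load replaces thousands of tiny ones. Sketch only."""
    with open(out_path, "w") as out:
        for p in sorted(paths):          # stable order for repeatability
            with open(p) as f:
                out.write(f.read())

# Demo: three tiny landed files become one staged load file.
staging = tempfile.mkdtemp()
parts = []
for i in range(3):
    p = os.path.join(staging, f"part-{i}.csv")
    with open(p, "w") as f:
        f.write(f"row-{i}\n")
    parts.append(p)

merged = os.path.join(staging, "staged.csv")
concat_small_files(parts, merged)
with open(merged) as f:
    print(f.read().splitlines())  # ['row-0', 'row-1', 'row-2']
```

The same pattern applies whether the target is a relational bulk loader or HDFS, where many small files are similarly expensive.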
28. Information Management – Logical View
Data Factory intra-data processing flow
[Diagram: Data Ingestion → Raw Data Reservoir / Foundation Data Layer / Access & Performance Layer → Information Interpretation]
Flow shown:
1. Data to be formalised is extracted from the HDFS store and loaded into the Foundation Data Layer, e.g. where Flume/HDFS is being used as an ETL pre-processor for Enterprise Data, or where HDFS data is being logically modelled in the Foundation Layer.
2. Data is re-structured and/or aggregated to facilitate access by users and business processes.
3. Data may also be re-structured and/or aggregated directly from the HDFS store where there are no specific requirements to manage Enterprise Data in a more formal data store over time.
29. Information Management – Logical View
Information Provisioning – BI & Data Science Components
[Diagram: data layers fronted by Virtualisation & Query Federation, serving Enterprise Performance Management, Pre-built & Ad-hoc BI Assets, Information Services and Data Science]
• Data Virtualisation and the various components used to access the data are as per our previous view on BI tools.
• Data Virtualisation is a key component that helps to deliver tool independence, services integration and a future-state roadmap.
• Big Data has focused considerable attention on Data Science.
• Analytical capabilities are delivered through analytical processing in the data layers, with Advanced Analytical Tools used to drive those capabilities.
• Data Mining in particular often involves complex data processing to flatten data into a longitudinal form. This derived data and the model results are typically written to a project-based sandbox.
• Agile discovery is often best served through a separate Discovery Lab infrastructure (see later details).
30. Information Management – Logical View
Information Provisioning – BI flows
[Diagram: data layers fronted by Virtualisation & Query Federation, serving Enterprise Performance Management, Pre-built & Ad-hoc BI Assets, Information Services and Data Science]
1. Typical access mechanism for Enterprise data via Access and Performance Layer structures.
2. Access to Foundation Layer data for specific functions, processes and users only.
3. Data interpretation & DQ assured through encoded logic, Avro, SerDe, FileReader, HCat etc.
4. Diagonal flows show how data can be joined between layers as well as accessed directly, e.g. Raw data can be queried directly through the Hive connector, or joined to the RDBMS data and queried.
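The "diagonal" join in flow 4 can be illustrated with a toy example. Everything here is invented for illustration – customer ids, log format and the error-by-segment question – but it shows the shape of joining schema-on-read raw detail to already-typed relational reference data at query time.

```python
# Relational side: customer reference data as already-typed rows
# (stands in for an RDBMS table; names illustrative).
customers = {
    "c1": {"name": "Acme Ltd", "segment": "enterprise"},
    "c2": {"name": "Beta Co",  "segment": "smb"},
}

# Raw side: uninterpreted log lines, as a Hive external table would
# expose them -- schema applied on read.
raw_lines = [
    "c1|2014-01-03|error",
    "c2|2014-01-03|ok",
    "c1|2014-01-04|error",
]

def read_raw(lines):
    """Apply a schema to raw delimited lines at query time."""
    for line in lines:
        cust, day, status = line.split("|")
        yield {"cust": cust, "day": day, "status": status}

# Diagonal query: join raw detail to relational reference data.
errors_by_segment = {}
for event in read_raw(raw_lines):
    if event["status"] == "error":
        seg = customers[event["cust"]]["segment"]
        errors_by_segment[seg] = errors_by_segment.get(seg, 0) + 1

print(errors_by_segment)  # {'enterprise': 2}
```

In practice the same join would be pushed through the virtualisation layer or a Hive connector rather than hand-written per consumer, which is the point of flow 3's encoded interpretation logic.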
31. Information Management – Logical View
Data / Information Quality
[Diagram: data layers fronted by Virtualisation & Query Federation, serving Enterprise Performance Management, Pre-built & Ad-hoc BI Assets, Information Services and Data Science]
Quality of data at rest is assured by a number of factors in addition to the underlying quality of data at source:
– File and event handling to ensure data is not missed (e.g. missing log files detected by log file sequence numbering).
– The processing of data between the Raw and FDL / APL layers. This can be seen as a DQ firewall ensuring only data of known and acceptable quality is loaded. Typically this involves an element of synchronisation, as some data will need to be held back until the required reference data is available, due to the micro-batch incremental loading approach.
Quality of information presented to downstream tools and services is determined by:
– Model quality, understanding, and performance of provisioning from the modelled layers.
– Consistency of definition, code quality and query performance when accessing Hadoop data (e.g. HR code, Avro definition…).
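The log-file sequence-numbering check mentioned above is simple to sketch. The numbering scheme and values are assumptions; the idea is just that gaps in the received sequence identify files the DQ firewall should chase before declaring a load window complete.

```python
def missing_sequences(received):
    """Given the sequence numbers of log files received in a load
    window, return the gaps -- files that never arrived. Sketch only;
    assumes a contiguous per-source numbering scheme."""
    if not received:
        return []
    lo, hi = min(received), max(received)
    return sorted(set(range(lo, hi + 1)) - set(received))

# Files 1042 and 1045 never arrived from the source system:
print(missing_sequences([1040, 1041, 1043, 1044, 1046]))  # [1042, 1045]
```

A real implementation would also track the expected high-water mark per source, since a gap at the end of the window (files not yet sent) is invisible to a min/max check like this one.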
32. Information Management – Logical View
Data Reservoir & Enterprise Information Store
[Diagram: Data Engines & Poly-structured sources (Content, Docs, Web & Social Media, SMS) and Structured Data Sources (Operational Data, COTS Data, Master & Ref. Data, Streaming & BAM) → Data Ingestion → Raw Data Reservoir / Foundation Data Layer / Access & Performance Layer → Information Interpretation → Virtualisation & Query Federation → Enterprise Performance Management, Pre-built & Ad-hoc BI Assets, Information Services and Data Science. Layer descriptions as previously defined.]
34. Analysis Processing & Delivery
Discovery Lab & Data Science Tooling
Information Management – Logical View: Discovery Lab data flow
Data Science (primary toolset): Statistics Tools; Data & Text Mining Tools; Faceted Query Tools; Programming & Scripting; Data Modelling Tools; Query & Search Tools; Data Quality & Profiling; Graphical rendering tools
Pre-built Intelligence Assets / Intelligence Analysis Tools: Ad Hoc Query & Analysis Tools; OLAP Tools; Forecasting & Simulation Tools; Reporting Tools; Dashboards & Reports; Scorecards; Charts & Graphs
[Diagram: Data Reservoir & Enterprise Data connected via the Data Factory to project sandboxes (Projects 1–3, each with a data store and analytical processing) in the Discovery Lab; the Data Scientist works across both toolsets via Virtualisation & Information Services]
Data Factory flow:
1. The Data Factory is responsible for access provisioning to data, or for replication (all or a sample) to a Sandbox in the Discovery Lab.
2. Direct connection from Data Science tools to the analysis sandbox. Data Science tools read and write data from/to project sandboxes.
3. The Data Scientist can also access standard dashboards, reports and KPIs through the Data Virtualisation layer.
36. Real-Time Data Engine – Logical View
[Diagram: From Input Events → Mediation → Privacy Filter → Data Transform → Rules & Models → Next Best Action → To Event Subscribers (Events / Data). Real-Time Data Store holds Reference Data, Models & Rules, Privacy Data and Analytics. Business Activity Monitoring provides Real-Time event monitoring.]
37. Real-Time Data Engine
Components:
– Mediation: message mediation service
– Privacy Filter: privacy filter for event data, i.e. apply customer-specified privacy and preference filters to the data stream
– Data Transform: transformation of the message data to its outbound form
– Rules & Models: apply declarative rules and models to the data stream to detect events for further downstream processing
– Next Best Action: Next Best Activity (NBA) event detection and processing. NBA typically also includes control group management and global optimisation of rules
– Real-Time Data Store: local data store – local persistence of rules and metadata
– BAM: Business Activity Monitoring
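Two of the components above – the Privacy Filter and Rules & Models – can be sketched as a tiny pipeline. The opt-out set, event fields and the "promo-zone top-up" rule are all invented for illustration; a real engine would evaluate declarative rules and mining models, not hard-coded conditions.

```python
# Customer privacy preferences (opt-outs), as held in the Privacy Data
# store -- identifiers and event shapes are illustrative.
OPTED_OUT = {"cust-2"}

def privacy_filter(events):
    """Drop events for customers who have opted out of processing."""
    return [e for e in events if e["customer"] not in OPTED_OUT]

def apply_rules(events):
    """Declarative-rule sketch: flag a Next-Best-Action candidate when
    a customer enters a promoted cell with a low balance."""
    actions = []
    for e in events:
        if e["cell"] == "promo-zone" and e["balance"] < 5.0:
            actions.append({"customer": e["customer"],
                            "action": "top-up-offer"})
    return actions

incoming = [
    {"customer": "cust-1", "cell": "promo-zone", "balance": 2.5},
    {"customer": "cust-2", "cell": "promo-zone", "balance": 1.0},  # opted out
    {"customer": "cust-3", "cell": "suburbs",    "balance": 1.0},
]

# Mediation -> Privacy Filter -> Rules & Models -> Next Best Action
actions = apply_rules(privacy_filter(incoming))
print(actions)  # [{'customer': 'cust-1', 'action': 'top-up-offer'}]
```

Applying the privacy filter before the rules stage matters: the opted-out customer's event is never evaluated, which is the behaviour the opt-out bullet on the Real-Time Events pattern calls for.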
38. Real-Time data engine flows
Describe each of the data flows
[Diagram: From Input Events → To Event Subscribers (Events / Data); Reference Data, Models & Rules, Privacy Data; Event Analytics; R/T Event Monitoring]
To Do
41. Information Management Reference Architecture
Key differences from 2.0 to 3.0 of the Architecture
• The Interpretation layer shows the relative cost of reading data depending on its location.
• The previous staging layer is now split into Data Ingestion and the Raw store.
• The Ingestion layer includes methods and processes to load data and manage Data Quality. Shape represents the relative cost of these processes, i.e. from none for HDFS to lots in the APL.
• The Raw Reservoir is typically at the lowest level of grain – often lower than the enterprise cares about, and so may not have been included in previous representations.
• Discovery Lab: renamed from Knowledge Discovery to Discovery Lab but otherwise unchanged. The role of Discovery Labs is becoming more central, though, so additional operational guidance will be added.
• Still an immutable store, but may be physically implemented in relational or non-relational technologies.
45. Data discovery for the Enterprise
Discovery and monetising steps have different requirements
Discovery phase:
– Unbounded discovery
– Self-service sandbox
– Wide toolset
– Agile methods
Promotion to Exploitation:
– Commercial exploitation
– Narrower toolset
– Integration to operations
– Non-functional requirements
– Code standardisation & governance
[Chart: Business Value vs Time / Effort, moving from the Discovery phase (understanding of the data) through Governance to Commercial Exploitation]
46. To monetise fully you need to standardise
It's smart to standardise as part of Governance
The discovery process requires a broad toolset; standardisation is essential for commercial exploitation.
Sustainability depends on standardisation / rationalisation:
– Reduced training burden
– Reduced support costs
– Reduced license costs
– Ongoing agility & alignment
Data Discovery Toolset → Data Exploitation Toolset (rationalised components):
– Oracle standard deployment: Cloudera CDH, Oracle, NoSQL; Mammoth, Yarn, EM plug-in; MR, Hive, Pig, Impala, Accum.; Flume NG, Oozie; …
– Optional additions (corporate standard): Oracle Connectors; additional corporate standard components
Standardised Hadoop Zoo / standardised deployment
The closer you are to monetising data, the more organised the data should be. Hadoop minimises the penalty for not being organised, i.e. not understanding your data. (Data Management, Data Profiling, Descriptive statistics, Graphical Analysis)
Many of our customers have already developed Hadoop-based solutions in a pre-production setting by downloading the software from the internet and running it on a virtualised Linux server, often on a laptop.
If the audience is very pro Big Data, lay on the first explanation thick – talk about TRADITIONAL systems and how ETL can be very slow to put in place because of the need to agree the process with the business, build a common understanding of the data, how it must be integrated, etc. Schema-on-read is the opposite – it is very fast to value, BUT the cost of ETL is carried by each system that accesses the data. Data quality is a function of the program that accesses the data. Time also has a bearing here; use the example of the recent changes to Hadoop and the deprecation of large numbers of Java classes.
Ken was also at Zynga, and is also ex-Siebel. His point about the way they have included Analysts in their product teams is a key one regarding Analytics 3.0. Also, at Zynga more than 50% of the data was held in flex fields – it's a shame nobody told them how to model this kind of system!