Red Hat's document discusses using JBoss Data Virtualization to gain better insights from big data. It describes challenges with existing data integration approaches as data sources grow in size, type and location. Red Hat's big data strategy is to reduce the information gap by making all data easily consumable for analytics. JBoss Data Virtualization software virtually unifies data across sources and exposes it to applications through standard interfaces. The demonstration shows integrating social media sentiment data from Hadoop with sales data from MySQL to analyze movie ticket and merchandise sales.
1. GAIN BETTER INSIGHTS FROM BIG DATA
USING RED HAT JBOSS DATA VIRTUALIZATION
Red Hat Corporation
January 5, 2014
2. Red Hat is…
“By running tests and executing numerous examples for specific teams, we were able to prove […] not
only would the solution work, but it will perform better & at a fraction of the costs.”
MICHAEL BLAKE, Director, Systems & Architecture
RED HAT Confidential
3. Agenda
● Data challenges getting bigger
● Red Hat Big Data Strategy and Platform
● Data Virtualization Overview
● Customer Use Case for Big Data integration using Data Virtualization
● Demo
● Q&A
4. Data Driven Economy
Data is becoming the new raw material of business: an economic input almost on a par with capital and labor.
“Every day I wake up and ask, ‘how can I flow data better, manage data better, analyze data better?’”
CIO - Wal-Mart
5. Data Challenges Getting Bigger
Big Data, Cloud, and Mobile
Existing Data Integration approaches are not sufficient
● Extracting and moving data adds latency and cost
● Every project solves data access and integration in a different way
● Solutions are tightly coupled to data sources
● Poor flexibility and agility
Diagram: How to align? Integration complexity sits between data consumers and data sources.
Data consumers (constant change): BI Reports, Operational Reports, Enterprise Applications, SOA Applications, Mobile Applications
Data sources (siloed & complex): Hadoop, NoSQL, Cloud Apps, Data Warehouse & Databases, Mainframe, XML, CSV & Excel Files, Enterprise Apps
6. Business Objective
Turn Data into Actionable Information
Only 28% of users have any meaningful data access
Over 70% of BI project effort lies in the integration of source data
● Reduce costs for finding and accessing highly fragmented data
● Improve time to market for new products and services by simplifying data access and integration
● Deliver IT solution agility necessary to capitalize on constantly changing market conditions
● Transform fragmented data into actionable information that delivers competitive advantage
7. Red Hat’s Big Data Strategy
● Reduce the information gap by cost-effectively making ALL data easily consumable for analytics
Diagram: Data to Actionable Information Cycle (Data, Capture, Process, Integrate, Analytics)
8. Red Hat Big Data Platform
● Middleware: Hadoop integration with JBoss Data Virtualization
● Platform: RHEL platform integration & optimization
● Apache Hadoop on Fedora: Fedora Big Data SIG
● Hadoop distributions
● Storage: Hadoop on Red Hat Storage
● Cloud / Virtualization: Hadoop on OpenStack
9. Red Hat Big Data Platform
● Platform: RHEL platform integration & optimization
● Middleware: Hadoop integration with JBoss Data Virtualization
● Apache Hadoop on Fedora: Fedora Big Data SIG
● Hadoop distributions
● Storage: Hadoop on Red Hat Storage
● Cloud / Virtualization: Hadoop on OpenStack
10. What does Data Virtualization software do?
Turn Fragmented Data into Actionable Information
Data Virtualization software virtually unifies data spread across various disparate sources and makes it available to applications as a single consolidated data source.
The data virtualization software implements a three-step process to bridge data sources and data consumers:
● Connect: Fast access to data from diverse data sources
● Compose: Easily create unified virtual data models and views by combining and transforming data from multiple sources
● Consume: Expose consistent information to data consumers in the right form through standard data access methods
Diagram: Data consumers (BI Reports, SOA Applications) get easy, real-time information access to a virtual consolidated data source. The Data Virtualization Software layer (Consume, Compose, Connect; Virtualize, Abstract, Federate) sits over siloed & complex data sources: Oracle DW, SAP, Hadoop, Salesforce.com
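The three steps on this slide can be sketched in a few lines of Python. This is only an illustration of the Connect, Compose, Consume idea, not the JBoss Data Virtualization API: the source contents (`CRM_CSV`, `ORDERS`), the function names, and the `customer_orders` view are all hypothetical.

```python
import csv
import io

# Hypothetical source 1: a CSV export (stands in for a file-based source).
CRM_CSV = "customer_id,name\n1,Acme\n2,Globex\n"

# Hypothetical source 2: records from an application, already parsed as dicts.
ORDERS = [
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 1, "total": 100.0},
    {"customer_id": 2, "total": 75.5},
]


def connect_crm():
    """Connect: read rows from the CSV source at query time."""
    for row in csv.DictReader(io.StringIO(CRM_CSV)):
        yield {"customer_id": int(row["customer_id"]), "name": row["name"]}


def connect_orders():
    """Connect: read rows from the application source at query time."""
    yield from ORDERS


def customer_orders():
    """Compose: a virtual view that joins both sources on customer_id.

    Nothing is copied into a new store; the view is computed on demand.
    """
    totals = {}
    for order in connect_orders():
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["total"]
    for customer in connect_crm():
        yield {
            "customer_id": customer["customer_id"],
            "name": customer["name"],
            "order_total": totals.get(customer["customer_id"], 0.0),
        }


def consume(view):
    """Consume: hand the view to any caller in one consistent shape."""
    return list(view())


rows = consume(customer_orders)
for row in rows:
    print(row)
```

The caller of `consume` never sees that one source is a file and the other is an application; that separation is the point the slide is making.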
11. Turn Fragmented Data into Actionable Information
Data Consumers: Mobile Applications; ESB, ETL; BI Reports & Analytics; SOA Applications & Portals
JBoss Data Virtualization (with Design Tools and Dashboard) provides easy, real-time information access:
● Consume: Standards-based data provisioning (JDBC, ODBC, SOAP, REST, OData)
● Compose: Unified Virtual Database / Common Data Model (Unified Customer View, Unified Product View, Unified Supplier View)
● Connect: Native Data Connectivity
Supporting capabilities: Optimization, Caching, Security, Metadata (Virtualize, Abstract, Federate)
Data Sources (siloed & complex): Hadoop, NoSQL, Cloud Apps, Data Warehouse & Databases, Mainframe, XML, CSV & Excel Files, Enterprise Apps
12. JBoss Data Virtualization:
Supported Data Sources
Enterprise RDBMS:
• Oracle
• IBM DB2
• Microsoft SQL Server
• Sybase ASE
• MySQL
• PostgreSQL
• Ingres
Enterprise EDW:
• Teradata
• Netezza
• Greenplum
Hadoop:
• Apache
• HortonWorks
• Cloudera
• More coming…
Office Productivity:
• Microsoft Excel
• Microsoft Access
• Google Spreadsheets
Specialty Data Sources:
• ModeShape Repository
• Mondrian
• MetaMatrix
• LDAP
NoSQL:
• JBoss Data Grid
• MongoDB
• More coming…
Enterprise & Cloud Applications:
• Salesforce.com
• SAP
Technology Connectors:
• Flat Files, XML Files, XML over HTTP
• SOAP Web Services
• REST Web Services
• OData Services
13. Key New Features and Capabilities
● Data connectivity enhancements
– Hadoop Integration (Hive – Big Data)
– NoSQL (MongoDB – Tech Preview) and JBoss Data Grid
– OData support (SAP integration)
● Developer Productivity improvements
– New VDB Designer 8 and integration with JBoss Developer Studio v7
– Enhanced column level security
– VDB import/reuse, and native queries
● Simplify deployment and packaging
– Requires JBoss EAP only; included with subscription
– Remove dependency on SOA Platform
● Business Dashboard
– New rapid data reporting/visualization capability
14. JBoss Data Virtualization – Use Cases
● Self-Service Business Intelligence
The virtual, reusable data model provides a business-friendly representation of data, allowing users to interact with their data without having to know the complexities of their database or where the data is stored, and allowing multiple BI tools to acquire data from a centralized data layer. Gain better insights from Big Data using JBoss Data Virtualization to integrate with existing information sources.
● 360° Unified View
Deliver a complete view of master & transactional data in real time. The virtual data layer serves as a unified, enterprise-wide view of business information that improves users’ ability to understand and leverage enterprise data.
● Agile SOA Data Services
A data virtualization layer delivers the missing data services layer to SOA applications. JBoss Data Virtualization increases agility and loose coupling through virtual data stores that do not require touching the underlying sources, and through data services that encapsulate the data access logic and allow multiple business services to acquire data from a centralized data layer.
● Regulatory Compliance
A data virtualization layer delivers data firewall functionality. JBoss Data Virtualization improves data quality via centralized access control, a robust security infrastructure, and a reduction in physical copies of data, thus reducing risk. Furthermore, the metadata repository catalogs enterprise data locations and the relationships between the data in various data stores, enabling transparency and visibility.
15. Big Data integration use case
Retail Customer Use Case
Gain Better Insight from Big Data for Intelligent Inventory Management
● Objective:
– Right merchandise, at right time and price
● Problem:
– Cannot utilize social data and sentiment analysis with their inventory and purchase management system
● Solution:
– Leverage JBoss Data Virtualization to mash up sentiment analysis data with inventory and purchasing system data. Leverage BRMS to optimize pricing and stocking decisions.
Diagram: Analytical Apps and JBoss BRMS (Data Driven Decision Management) consume data from JBoss Data Virtualization (Consume, Compose, Connect), which connects to Hive (Sentiment Analysis), Inventory Databases, and the Purchase Mgmt Application.
16. Better Together - Big Data and Data Virtualization
Hadoop not another Silo - Customers Combine Multiple Technologies
● Combine structured and unstructured analysis
– Augment data warehouse with additional external sources, such as social media
● Combine high velocity and historical analysis
– Analyze and react to data in motion; adjust models with deep historical analysis
● Reuse structured data for analysis
– Experimentation and ad-hoc analysis with structured data
17. Better Together - Big Data and Data Virtualization
Capture, Process and Integrate Data Volume, Velocity, Variety
● Integrate & Analyze: BI Analytics (historical, operational, predictive); SOA Composite Applications
● Data Integration: JBoss Data Virtualization
● Capture & Process: In-memory Cache (JBoss Data Grid); Messaging and Event Processing (JBoss A-MQ and JBoss BRMS); Hadoop
● Data: Structured Data, Streaming Data, Semi-Structured Data
● Infrastructure: Red Hat Storage; Red Hat Enterprise Linux & Virtualization
18. Consider...
Inconsistent, Incomplete Information
Uninformed, Delayed Decisions
Costly Business Risk and Exposure
How would your organization change…
● If data were readily reusable in place rather than requiring significant effort to build new intermediary data tiers?
● If data could be repurposed quickly into new applications and business processes?
● If all applications and business processes could get all of the information needed in the form needed, where needed and when needed?
19. Red Hat JBoss Middleware
ACCELERATE, INTEGRATE, AUTOMATE
Business Process Management:
• JBoss BRMS
• JBoss BPM Suite
Application Integration:
• JBoss A-MQ
• JBoss Fuse
• JBoss Fuse Service Works
Data Integration:
• JBoss Data Virtualization
Foundation:
• JBoss EAP
• JBoss Web Server
• JBoss Data Grid
Management Tools:
• JBoss Operations Network
Development Tools:
• JBoss Developer Studio
User Interaction:
• JBoss Portal
21. Demo Scenario
● Objective:
– Determine if sentiment data from the first week of the Iron Man 3 movie is a predictor of sales
● Problem:
– Cannot utilize social data and sentiment analysis with sales management system
● Solution:
– Leverage JBoss Data Virtualization to mash up sentiment analysis data with ticket and merchandise sales data on MySQL into a single view of the data.
Diagram: Excel Powerview and the DV Dashboard analyze the aggregated data exposed by JBoss Data Virtualization (Consume, Compose, Connect).
SOURCE 1: Hive/Hadoop contains Twitter data including sentiment
SOURCE 2: MySQL data that includes ticket and merchandise sales
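A minimal sketch of the mashup idea, using Python's built-in `sqlite3` module in place of the real systems. Two attached in-memory databases stand in for Hive and MySQL; the table names, columns, and sample values are invented for illustration and are not the demo's actual schema or data.

```python
import sqlite3

# One connection plays the role of the virtualization layer.
conn = sqlite3.connect(":memory:")

# Two separately attached databases stand in for the two physical sources.
conn.execute("ATTACH DATABASE ':memory:' AS hive")   # stand-in for Hive/Hadoop
conn.execute("ATTACH DATABASE ':memory:' AS mysql")  # stand-in for MySQL

# Hypothetical schema and sample values, for illustration only.
conn.execute("CREATE TABLE hive.tweets (day TEXT, sentiment INTEGER)")
conn.executemany(
    "INSERT INTO hive.tweets VALUES (?, ?)",
    [("day-1", 1), ("day-1", 1), ("day-1", -1),
     ("day-2", 1), ("day-2", -1), ("day-2", -1)],
)
conn.execute("CREATE TABLE mysql.sales (day TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO mysql.sales VALUES (?, ?, ?)",
    [("day-1", "ticket", 900.0), ("day-1", "merchandise", 300.0),
     ("day-2", "ticket", 600.0), ("day-2", "merchandise", 150.0)],
)

# A TEMP view may reference tables in any attached database, so it can act
# as the single combined view over both sources.
conn.execute("""
    CREATE TEMP VIEW sentiment_vs_sales AS
    SELECT s.day AS day,
           t.net_sentiment AS net_sentiment,
           SUM(s.amount) AS total_sales
    FROM mysql.sales AS s
    JOIN (SELECT day, SUM(sentiment) AS net_sentiment
          FROM hive.tweets
          GROUP BY day) AS t
      ON t.day = s.day
    GROUP BY s.day, t.net_sentiment
""")

# Consumers query the view as if it were one table.
results = conn.execute(
    "SELECT day, net_sentiment, total_sales FROM sentiment_vs_sales ORDER BY day"
).fetchall()
for day, net_sentiment, total_sales in results:
    print(day, net_sentiment, total_sales)
```

A reporting tool pointed at `sentiment_vs_sales` would see daily sentiment next to daily sales without knowing that the rows come from two different stores.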
22. Demonstration System Requirements
• JDK
– Oracle JDK 1.6, 1.7 or OpenJDK 1.6 or 1.7
• JBoss Data Virtualization v6 Beta
– http://jboss.org/products/datavirt.html
• JBoss Developer Studio
– http://jboss.org/products
• JBoss Integration Stack Tools (Teiid)
– https://devstudio.jboss.com/updates/7.0-development/integration-stack/
• Slides, Code and References for demo
– https://github.com/DataVirtualizationByExample/Mashup-with-Hive-and-MySQL
• Hortonworks Data Platform (A VM for testing Hive/Hadoop)
– http://hortonworks.com/products/hdp-2/#install
• Red Hat Storage
– http://www.redhat.com/products/storage-server/
59. Why Red Hat for Big Data?
● Transform ALL data into actionable information
– Cost Effective, Comprehensive Platform
– Community based Innovation
– Enterprise Class Software and Support
Diagram: Data to Actionable Information Cycle (Data, Capture, Process, Integrate, Information)
60. Red Hat Big Data Platform
● Middleware: Hadoop integration with JBoss Data Virtualization
● Platform: RHEL platform integration & optimization
● Apache Hadoop on Fedora: Fedora Big Data SIG
● Hadoop distributions
● Storage: Hadoop on Red Hat Storage
● Cloud / Virtualization: Hadoop on OpenStack
Today the collaboration between Red Hat and SAP continues.
Engineers from both companies are working towards a common target — enhancing the interoperability of JBoss Enterprise middleware with the existing SAP landscape. Specifically, Red Hat and SAP are collaborating on development efforts for tools that are designed to simplify the integration of SAP data and business processes with other enterprise data and applications.
The aim of such integration, of course, is a more intelligent enterprise — one that can maximize the value of your data assets in accelerating business decisions.
To remember the pragmatic definition of big data, think SPA — the three questions of big data:
Store. Can you capture and store the data?
Process. Can you cleanse, enrich, and analyze the data?
Access. Can you retrieve, search, integrate, and visualize the data?
Easy data accessibility through standard interfaces, e.g. SQL, Web Services, etc.
Exposes non-relational sources as relational
Read and write data in place
Real time access
No data replication/duplication required
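As one illustration of exposing a non-relational source as relational, the sketch below flattens nested documents into flat rows that a SQL-style consumer could query. The document shape, the column names, and the `flatten_orders` helper are hypothetical; this is not how any particular product implements the mapping.

```python
# Hypothetical documents, shaped like records from a document store.
DOCUMENTS = [
    {"_id": "c1", "name": "Acme",
     "orders": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]},
    {"_id": "c2", "name": "Globex", "orders": []},
]

# Column names of the relational shape presented to consumers.
COLUMNS = ("customer_id", "name", "sku", "qty")


def flatten_orders(documents):
    """Yield one flat tuple per nested order, repeating the parent fields.

    Customers without orders still produce one row, with None standing in
    for SQL NULL, much like a LEFT OUTER JOIN would.
    """
    for doc in documents:
        orders = doc.get("orders") or [None]
        for order in orders:
            if order is None:
                yield (doc["_id"], doc["name"], None, None)
            else:
                yield (doc["_id"], doc["name"], order["sku"], order["qty"])


table = list(flatten_orders(DOCUMENTS))
for row in table:
    print(dict(zip(COLUMNS, row)))
```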
So let’s define the attributes of a data virtualization solution. The first thing a data virtualization product does is virtualize the data, regardless of where it is. It makes the data look as if it were in one place, so applications don’t need to know where the data is, because the data virtualization software handles that for you.
The second thing data virtualization does is federate the data. You’re running a query that spans multiple databases or data warehouses, and you want that query to run efficiently and with optimum performance. To do that, you need a variety of techniques, such as caching and pushdown optimization, and you need knowledge of the source databases to make the whole environment run as smoothly and efficiently as possible.
Thirdly, it abstracts the data into the format of choice. It conforms the data so that it’s in a consistent format, regardless of the native structure or syntax of the data. One point I should make here is that you don’t want a tool that forces a particular format on you. What you want is a format that suits your business, rather than one that is imposed on you. So the data virtualization tool itself needs to be agile and flexible, in the sense of being able to provide a data format that suits you.
And then the fourth requirement is to present the data in a consistent fashion. It doesn’t matter whether it’s a business intelligence application, a mash-up, or a regular application; whatever it is, you want to be able to present the data in a consistent format to the business and to participating applications.
Imagine if all the up-to-date data you need to take informed action were available to you on demand as one unified source. This is the capability provided by data virtualization software.
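To make the pushdown optimization mentioned in the federation step concrete, here is a small sketch that compares filtering at the source with filtering in the federation layer. The `Source` class and its row counter are hypothetical; real engines decide what to push down based on source metadata and capabilities.

```python
class Source:
    """Hypothetical data source that counts how many rows it ships out."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_transferred = 0

    def scan(self, predicate=None):
        """Return rows; if a predicate is given, apply it at the source."""
        selected = [r for r in self._rows if predicate is None or predicate(r)]
        self.rows_transferred += len(selected)
        return selected


def query_without_pushdown(source, predicate):
    # Pull every row from the source, then filter in the federation layer.
    return [r for r in source.scan() if predicate(r)]


def query_with_pushdown(source, predicate):
    # Send the filter to the source so only matching rows are transferred.
    return source.scan(predicate)


def is_emea(row):
    return row["region"] == "EMEA"


# Invented sample data: every tenth row belongs to the EMEA region.
rows = [{"region": "EMEA" if i % 10 == 0 else "APAC", "amount": i}
        for i in range(1000)]

naive, pushed = Source(rows), Source(rows)

# Both strategies return the same answer.
assert query_without_pushdown(naive, is_emea) == query_with_pushdown(pushed, is_emea)

print(naive.rows_transferred)   # rows shipped when filtering in the middle tier
print(pushed.rows_transferred)  # rows shipped when the source does the filtering
```

The result is identical either way; what changes is how much data has to leave the source, which is why knowledge of each source's capabilities matters to the federation layer.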
The data virtualization software provides a three-step process to connect data sources and data consumers:
Connect: Fast access to data from disparate systems (databases, files, services, applications, etc.) with disparate access methods and storage models.
Compose: Easily create a reusable, unified common data model and virtual data views by combining and transforming data from multiple sources.
Consume: Seamlessly expose the unified, virtual data model and views in real time through a variety of open-standard data access methods to support different tools and applications.
JBoss Data Virtualization software implements all three steps internally while hiding the complexity of data access methods, transformation, and data merge logic from information consumers.
This enables organizations to acquire actionable, unified information when they want it and the way they want it, i.e. at business speed.