Speaker: Matt Casters, Chief Architect & PDI/Kettle Project Founder at Pentaho
Video: http://www.youtube.com/watch?v=r7BEp-C60bQ&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=8
Traditionally, data is delivered to business analytics tools through a relational database. However, there are cases where that can be inconvenient, for example when the volume of data is just too high or when you can't wait until the database tables are updated.
This presentation by Pentaho Kettle founder Matt Casters will demonstrate a solution of data 'Blending', which allows a data integration user to create a transformation capable of delivering data directly to Pentaho - and other - business analytics tools. Matt will demonstrate taking data from Cassandra, and blending it with other data from both SQL and NoSQL sources, and then visualizing that data. Matt will explain how it becomes possible to create a virtual "database" with "tables" where the data actually comes from a transformation step.
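The "virtual database" idea — transformation output queried as if it were a table — can be sketched in miniature. PDI exposes this via its Data Services JDBC layer; the snippet below only simulates the concept with an in-memory SQLite table standing in for the service, and all step names and data are invented for illustration.

```python
import sqlite3

def transformation_step():
    # Stand-in for a Kettle transformation step that would really be
    # reading and blending Cassandra / SQL / NoSQL sources.
    yield ("acme", 120.0)
    yield ("globex", 75.5)

# Load the step's output rows into a queryable "virtual table", so a
# BI tool could hit them with plain SQL instead of a database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_revenue (customer TEXT, revenue REAL)")
conn.executemany("INSERT INTO customer_revenue VALUES (?, ?)",
                 transformation_step())

rows = conn.execute(
    "SELECT customer, revenue FROM customer_revenue WHERE revenue > 100"
).fetchall()
print(rows)  # [('acme', 120.0)]
```

The point is that the consumer only sees a table and SQL; where the rows actually come from (a transformation, not stored tables) is invisible to it.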
C* Summit EU 2013: Blending Cassandra Data Into The Mix
1. Blending Cassandra Data Into the mix
Matt Casters| Chief Architect, Data Integration at Pentaho
Kettle Project Founder
#CASSANDRAEU
CASSANDRASUMMIT
EU
2. What we will discuss today…
* About Pentaho
* Blended Big Data Integration
* Demo
* Takeaway & QA
4. Pentaho Mission
Enabling the future of analytics
Modern unified business analytics and data integration platform
• Full spectrum of advanced analytics for all key roles
• Embeddable, cloud-ready analytics
• Big data blending for analytics in real-time environments
• Broadest and deepest big data integration
Innovation through open source
• Open, pluggable, purpose-built for the future
• Early, sustained leadership in the big data ecosystem with technology innovation
Critical mass achieved
• Over 1,200 commercial customers
• Over 10,000 production deployments
5. Pentaho and Cassandra
* ETL and Analytics that complement Cassandra
* Create data transformations from source systems into Cassandra, and from Cassandra to target systems, via drag and drop
* Quickly visualize and explore data inside Cassandra with Pentaho Data Services
* Deeper Cassandra/Pentaho integration in development
* Keep up with the latest Cassandra developments
* Provide underlying API compatibility layer
6. The New Reality
Simplified Analysis for all Users
[Diagram: existing and new data infrastructure and processes — Billing, Social Media, Customer Analytics, Web, Location, Network]
ANY Data: Relational, Operational, Big Data, data sources not yet anticipated…
ANY Environment: Data warehouses, Data marts, Stack vendors, Cloud, Embedded
ANY Analytics: Reports, Dashboards, Visualizations, Discovery, Predictive
7. Pentaho 5.0: Architected for the Future
Simplified analytics experience for all users
• Simplified Analytics Experience
• Blended Big Data
• Enterprise Big Data Integration
8. Basic Cassandra Use Case
Enterprise Customer Data Store
• Visual ETL development with Pentaho Data Integration
• Reporting, Dashboards, Visualization and Data Discovery with full-spectrum analytics
System scope: Source Systems … -> Pentaho Data Integration -> Enterprise Data Store -> Pentaho Data Integration -> Target Systems, with Pentaho Analytics (Reporting, Dashboards, Visualization, Discovery) on the data store
14. Analytics on Cassandra: Two Approaches
• Direct access: Cassandra cluster -> PDI Data Services -> Analytics
• Access via database: Cassandra cluster -> PDI ETL -> RDBMS -> Analytics
15. Direct Access to Cassandra Data
Extract -> Transform -> Present
Cassandra cluster -> PDI ETL -> Pentaho Operational Reports and Dashboards
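The Extract -> Transform -> Present flow above can be sketched with plain Python functions standing in for PDI steps. Everything here is a stand-in: the records, field names, and threshold are invented, and a real pipeline would use a Cassandra Input step rather than a hard-coded list.

```python
def extract():
    # Stand-in for a Cassandra Input step reading rows from the cluster.
    return [
        {"event": "call", "duration_s": 62},
        {"event": "call", "duration_s": 0},
        {"event": "sms", "duration_s": 0},
    ]

def transform(rows):
    # Keep only call events and convert duration to minutes.
    return [
        {"event": r["event"], "duration_min": r["duration_s"] / 60}
        for r in rows
        if r["event"] == "call"
    ]

def present(rows):
    # Stand-in for an operational report or dashboard.
    total = sum(r["duration_min"] for r in rows)
    return f"{len(rows)} calls, {total:.1f} min total"

report = present(transform(extract()))
print(report)  # 2 calls, 1.0 min total
```

In PDI each of these functions would be a drag-and-drop step, and the "Present" stage would be a Pentaho report or dashboard reading the transformation's output directly.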
17. Customer Value from Big Data
Monetizing big data-driven use cases drives the need to blend data
Drive incremental revenue
• Predict customer behavior across all channels
• Understand and monetize customer behavior
• Begin to monetize data as a service
Improve operational effectiveness
• Machines/sensors: predict failures, network attacks
• Financial risk management: reduce fraud, increase security
Reduce data warehouse cost
• Integrate new data sources without increased database cost
• Provide online access to ‘dark data’
18. Why Blending at the Source Matters
Customer Experience Analytics for loyalty and revenue
[Diagram: Customer, Provisioning and Billing data flow through an existing ETL tool or PDI into the EDW; Network and Location data flow through PDI into a NoSQL store; both feed Billing Analytics]
Call Detail Records from billing systems: Billing, Payment, Usage
Call Detail Records from the Network: Outages, Drops, Service Quality
Analyze quality of service:
• Network outages
• Dropped calls
• Poor quality
• Calls to support center
For profiles of customers:
• Up for renewal
• Profitable
• Multiple agreements/services
• In competitive area
Blend revenue-related and quality-of-service data together to find customers at risk
Determine best action to take:
• Billing Credit
• Customer Coupon
• No Action
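The blend on this slide — revenue-related data from the warehouse joined with quality-of-service data from a NoSQL store to find customers at risk — can be sketched as a simple join. All records, field names, and the dropped-call threshold below are invented for illustration.

```python
billing = [  # revenue-related records, as if from the EDW
    {"customer": "c1", "monthly_revenue": 90.0, "renewal_due": True},
    {"customer": "c2", "monthly_revenue": 20.0, "renewal_due": False},
]
network = [  # quality-of-service records, as if from the NoSQL store
    {"customer": "c1", "dropped_calls": 14},
    {"customer": "c2", "dropped_calls": 1},
]

# Blend the two sources on the customer key.
drops_by_customer = {r["customer"]: r["dropped_calls"] for r in network}

# A customer is "at risk" if they are up for renewal and have suffered
# poor service quality (hypothetical threshold: >10 dropped calls).
at_risk = [
    b["customer"]
    for b in billing
    if b["renewal_due"] and drops_by_customer.get(b["customer"], 0) > 10
]
print(at_risk)  # ['c1']
```

Neither source alone identifies c1: the warehouse knows the renewal is due, only the network store knows the service has been bad. The insight exists only in the blend.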
19. Accurate, Blended Big Data Analytics
Optimally stored data, blended when needed
• Just-in-time blending of data from multiple sources for a complete picture
• Connect, combine and transform data from multiple sources
• Query data directly from any transformation
• Access architected blends with the full spectrum of Pentaho Analytics
• Manage governance and security of data for ongoing accuracy
[Diagram: Customer, Provisioning and Billing data reach the EDW via an existing ETL tool or PDI; Network and Location data reach a NoSQL store via PDI; just-in-time blending feeds Analytics]
20. Bring More Big Data to Life
Adaptive Big Data Layer: broadest, deepest big data support
Broadest options for storing and blending data
• New analytic use case templates for Hadoop and Splunk
• Deeper NoSQL integration and direct reporting
• Hadoop high availability support with MapR
• Expanded big data integration
• New integrations: Redshift, Impala and Splunk
• New certifications: DataStax, Cassandra, Intel, Hortonworks, latest Cloudera, MapR, MongoDB, …
21. Demo!
• Demonstrate how to easily write to and read from Cassandra
• Demonstrate how to blend data
23. Pentaho 5.0 Key Takeaways
Meeting the demands of the big data-driven enterprise
• Analytics: simplified analytics experience with a new modern interface
• Blended Big Data: blended big data at the source for more accurate insights
• Enterprise Big Data Integration: enterprise-ready data integration and simplified embedding for any environment
Icons are nice and the build order is great! My suggestion for the top 3 icons on the left-hand side: Customer, Provisioning, Billing. Suggestion for the bottom 3 icons: Web, Network, Social Media. (Note: Location seems to be important to AT&T but we can just mention this.) I need to come up with an explanation for why the arrow below “Just in Time Integration” is bi-directional instead of just flowing to Analytics.
Let’s look at an example of blending at the source to better understand these points. Here we are looking at an example of Telco customer experience analytics. Customer experience analytics have the same goal in every industry: preventing customer churn and creating better loyalty in order to protect and grow revenue. After all, in this age of commoditization, service and fast response to product requests become the new differentiators driving loyalty in most industries. Telco customer allegiance comes mostly from satisfaction with calling plans and the quality and availability of service. Call detail records have long been created and derived from the operational systems for access by BI and reporting systems via warehousing, but they only make up part of the picture.
(Build click 2) Quality of service changes in real time depending on the network: was the customer able to connect, to hear, to remain connected without being dropped, and so on? This network-based data is usually captured in a Big Data source capable of handling the volume and unstructured nature of the data, and it must be blended with the Call Detail Record information to give the complete picture of a customer’s experience.
(Build click 3) With Pentaho, you can easily create architected, blended views across both the traditional Call Detail Records in the warehouse and the network data streaming into the Big Data/NoSQL store (MongoDB in this example) without sacrificing the governance or performance you expect. These blended views allow your analysts and customer call centers to get accurate, up-to-the-minute information in real time to determine the best action to take for each customer, maximizing their satisfaction and retaining them as loyal customers even when outages or other service quality issues occur.
Other solutions in the market talk about blending - but it’s not apples to apples. Blending “at the glass”, i.e. blending done by end users or analysts away from the source with no knowledge of the underlying semantics, often delivers inaccurate or even completely incorrect results, as there is no way to ensure that the chosen fields being matched truly do match. For instance, think what happens when someone matches two fields both named “revenue” in records that match on “customer”, but one is a monthly sum total and the other is a daily total – this won’t be apparent to that analyst since they are blending based on similar names. The analyst then runs a summation that adds the two together as the day’s total revenue from that customer. He/she will have unwittingly added the monthly figure into each day’s total, distorting the actual revenue generated from that customer dramatically. Your business then targets that customer as highly profitable and offers significant discounts to maintain their interest. Not only have you targeted the wrong customer and potentially ignored the real profitable customers in favor of him, but you’ve also now given him undeserved discounts. The net result lowers your revenue from this customer, and potentially loses you profitable others who were more deserving but left you in favor of competitors offering them discounts. You’ve made the wrong decision because the analytics themselves were inaccurate and incorrect. Your only choice to avoid this with tools that blend like this is to train every user and analyst on the semantics of the data to ensure reliable results – a solution that’s largely infeasible for most organizations as it would take far too much time and expense while impacting productivity. Even if you can take on this level of investment in training, you still face issues with the timeliness of the data, since these tools do not pull from the source systems. 
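The revenue mismatch described above is easy to make concrete with arithmetic. The figures below are invented: one system reports a monthly revenue total, the other a daily one, and a naive "blend at the glass" adds the two matching field names as if they were comparable.

```python
monthly_revenue = 3000.0  # from system A: a "revenue" field, whole-month total
daily_revenue = 100.0     # from system B: a "revenue" field, one day's total

# Naive blend on field name alone: the analyst sums the two "revenue"
# fields, silently folding a month of revenue into a single day.
wrong_daily_total = monthly_revenue + daily_revenue  # 3100.0 per "day"

# With knowledge of the semantics, the monthly figure is converted to a
# daily rate before blending (hypothetical 30-day month).
days_in_month = 30
right_daily_total = monthly_revenue / days_in_month + daily_revenue  # 200.0

print(wrong_daily_total, right_daily_total)  # 3100.0 200.0
```

The naive blend overstates the customer's daily revenue by more than 15x, which is exactly the distortion that leads to targeting and discounting the wrong customer.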
How do you know the data pulled is indeed the latest and therefore the most accurate on that level as well?
This “just in time”, architected blending delivers accurate big data analytics based on the blended data. You can connect to, combine, and even transform data from any of the multiple data stores in your hybrid data ecosystem into these blended views, then query the data directly via that view using the full spectrum of analytics in the Pentaho Analytics platform, including predictive analytics. Most importantly, since these blends are architected on the source data, you maintain all the rules of governance and security over the data while providing the ease of use and real time access needed for today’s agile analytics requirements. Sensitive data is kept from those who are not allowed to use or view it. You maintain full lifecycle and change management and control, so you can assure the blends being used meet changing requirements. You preserve auditability. Your blends are designed with full knowledge of the underlying data volumes and source system capabilities and constraints, preserving throughput and performance during analytic access and preventing the “query from hell”/”runaway query” problems prevalent in many federation tools. Combining the power of design via drag-and-drop across all data sources, including schemas generated on read from big data sources, with knowledge of the full data semantics - the real meaning, cardinality, and match-ability of fields and values in the data - means your business gets accurate results in its analytics, leading to optimized decisions and actions that can really impact your business positively and improve your results.
As part of our any-data, any-source capability, we now have the broadest and most sophisticated way to access big data sources. With specific big data templates we again reduce IT barriers (programming and coding) and allow users to access their data with ease. Our product also allows businesses to scale into and deploy big data as they see fit. And because any source really means any source, if a business tries Cassandra but then decides it prefers MongoDB, it can easily access both types of data, expanding its integration ability.