Pivotal Confidential–Internal Use Only
Journey to an Agile
Data-Driven Enterprise
Real Time Data Stream Processing using
Pivotal Big Data Suite (BDS)
Agenda
Ÿ  Problem Statement
Ÿ  Real Time Streaming Architecture
Ÿ  Problem Solution
Ÿ  Pivotal Big Data Suite
Ÿ  Demo Screenshots
Ÿ  Summary (Pivotal Differentiators)
Problem Statement
The problem and its solution are loosely based on the ACM DEBS 2015 Grand Challenge
http://www.debs2015.org/call-grand-challenge.html
Data Model
1.  Taxi data is streamed for the New York region
2.  Data contains details such as taxi number, pickup time, dropoff time, pickup and dropoff lat/long, fare, and taxes
3.  The area is divided into squares. Each square is 1 km x 1 km
Find out EVERY 10 SECONDS
a.  Inconsistent data
b.  Top 10 areas where taxis are plying the most (report the starting and ending area and the number of taxis that traveled between these areas; each area is a 1 km x 1 km square)
c.  Total data processed, and the time taken to process the data in memory for a 10-second window
d.  Free taxis available in different areas (only 50 taxis)
Analytical Queries
a.  Which taxi drivers are not reporting data correctly
b.  Top 10 taxi drivers earning the most
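The grid and the "top 10" requirement above can be sketched in a few lines of Python. This is only an illustration, not the code used in the demo: the grid origin, the `Trip` fields, and the consistency rules are assumptions made for the example, and the longitude scaling uses a simple cosine approximation at the origin latitude.

```python
import math
from collections import Counter
from typing import NamedTuple, Tuple

# Hypothetical north-west corner of the grid; not taken from the slides.
ORIGIN_LAT = 41.0
ORIGIN_LON = -74.5
CELL_KM = 1.0
KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


class Trip(NamedTuple):
    taxi_id: str
    pickup_lat: float
    pickup_lon: float
    dropoff_lat: float
    dropoff_lon: float
    fare: float


def to_cell(lat: float, lon: float) -> Tuple[int, int]:
    """Map a coordinate to the (row, col) index of a 1 km x 1 km square."""
    km_south = (ORIGIN_LAT - lat) * KM_PER_DEG_LAT
    km_east = (lon - ORIGIN_LON) * KM_PER_DEG_LAT * math.cos(math.radians(ORIGIN_LAT))
    return (math.floor(km_south / CELL_KM), math.floor(km_east / CELL_KM))


def is_consistent(trip: Trip) -> bool:
    """Illustrative rules only: taxi id present, no zero coordinates, fare >= 0."""
    coords = (trip.pickup_lat, trip.pickup_lon, trip.dropoff_lat, trip.dropoff_lon)
    return bool(trip.taxi_id) and all(c != 0.0 for c in coords) and trip.fare >= 0


def top_routes(trips, n=10):
    """Count (start cell, end cell) pairs and return the n most frequent."""
    routes = Counter(
        (to_cell(t.pickup_lat, t.pickup_lon), to_cell(t.dropoff_lat, t.dropoff_lon))
        for t in trips
        if is_consistent(t)
    )
    return routes.most_common(n)
```

Calling `top_routes(window_trips)` once per 10-second window would give the ranked routes for that window; inconsistent records are skipped rather than counted.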
Pivotal Real Time Analytics Architecture
Products used in implementing solution
Real Time Analytics Demo - Flow
1.  Net Pkts → SpringXD (Fast Ingestion): Data is streamed to a network port where SpringXD is listening.
2.  SpringXD → Spark Streaming (In Memory analytics): The data stream is forwarded to Spark Streaming, which collects data for every 10-second window (10s moving window) and applies business logic to each batch of data: filter (incomplete data), transformation, distance calculation, sorting.
3.  Spark Streaming → Gemfire (In memory data store) and Terminal Output: After processing, Spark uploads various aggregated metrics to Gemfire for fast retrieval.
4.  SpringXD → Pivotal HD (long term storage): SpringXD allows multicasting the data stream. One stream goes to Pivotal HD for analytics purposes.
5.  Gemfire → Webapp: A PHP webapp then shows and refreshes data every 10 seconds.
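The shape of this flow can be imitated in plain Python. The sketch below is a stand-in only: a list plays the role of long-term storage, a dict plays the role of the in-memory store, and the record keys (`taxi`, `fare`) and metric names are hypothetical. It does not use the SpringXD, Spark, or Gemfire APIs.

```python
import time
from collections import defaultdict


def multicast(record, sinks):
    """Deliver one record to every sink (each sink is a callable)."""
    for sink in sinks:
        sink(record)


def process_window(records, store):
    """Per-window steps: filter incomplete records, aggregate, sort, publish."""
    started = time.perf_counter()
    # Filter: drop records that have a missing or empty field.
    complete = [r for r in records if all(v not in (None, "") for v in r.values())]
    # Business logic: count trips per taxi.
    trips_by_taxi = defaultdict(int)
    for r in complete:
        trips_by_taxi[r["taxi"]] += 1
    # Sorting: most trips first, ties broken by taxi id.
    ranked = sorted(trips_by_taxi.items(), key=lambda kv: (-kv[1], kv[0]))
    # Publish the aggregated metrics to the store for fast retrieval.
    store["records_total"] = len(records)
    store["records_incomplete"] = len(records) - len(complete)
    store["trips_by_taxi"] = ranked
    store["processing_seconds"] = time.perf_counter() - started
    return store


def run_demo(incoming):
    archive = []   # stand-in for long-term storage
    window = []    # records collected for the current window
    metrics = {}   # stand-in for the in-memory data store
    for record in incoming:
        multicast(record, [archive.append, window.append])
    process_window(window, metrics)
    return archive, metrics
```

Every record reaches the archive, including incomplete ones, while only complete records contribute to the published metrics. That mirrors the split between the long-term storage branch and the in-memory analytics branch.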
Real Time Analytics Demo – Tools used
Ÿ  SpringXD è Data Ingestion
Ÿ  Spark Streaming è In-memory stream computing
Ÿ  Gemfire è In- memory data store
Ÿ  HAWQ è Analytic SQL queries on Hadoop
Ÿ  Google charts è PHP based web application
Pivotal Big Data Suite Is a Complete Tool Set for Data-Driven Enterprises
Agile, Open Source Data Storage and Processing
Analytics-optimized, ODP-based Hadoop distribution
In-memory, distributed processing from Apache
Scale-out analytics pipeline management with data ingestion and processing
“As we move to combine all the data generated by our activity and to leverage advanced analytics in real time, there’s no better way to do that than through the flexibility and choice provided by Pivotal’s Big Data Suite.”
– Sylvain LeBorne, EVP Data Platforms
Advanced Analytics Power, Speed and Flexibility
Leading analytic data warehouse
Most advanced analytical SQL engine on Hadoop
“100X performance improvement analyzing trends among 500 million job postings.”
Low Latency, Resilient Data Stores and Messaging
Distributed, in-memory database for high-scale NoSQL applications
In-memory data structure server for fast read and write applications
Robust messaging for high-scale applications
Leverage Big Data Suite data services within Pivotal Cloud Foundry applications
Deploy and manage Big Data Suite with Pivotal Cloud Foundry Foundation
“300% improvement in ticket-serving capacity led to 30% increase in e-ticket sales.”
Pivotal Big Data Suite Differentiators
Ÿ  Open Data Platform
–  Pivotal and Hortonworks are first two members
–  Focused on building ODP core on which Hadoop distributions will work
–  Governed by an open governance model
–  Flexibility to work on any Hadoop distribution using ODP core
–  Faster releases and third-party products certifications than any single vendor
Ÿ  Suite of Analytical Products
–  HAWQ
–  Greenplum
–  MADLib
–  PivotalR
–  Graphlab
Spring XD Value
•  Unified agile experience for
–  Data Ingestion
–  Real-time Analytics
–  Workflow Orchestration
–  Data Export
•  Built on existing assets
–  Spring Integration
–  Spring Batch
–  Spring Data (Redis, GemFire, Hadoop)
•  XD = 'eXtreme Data'
–  or 'x' as a variable (big, fast, diverse)
Streams
Spring XD
Sources: HTTP, Tail, File, Mail, Twitter, Gemfire, Syslog, TCP, UDP, JMS, RabbitMQ, MQTT, Trigger, Kafka, JDBC, Reactor TCP/UDP
Processors: Filter, Transformer, Object-to-JSON, JSON-to-Tuple, Splitter, Aggregator, HTTP Client, Groovy Scripts, Java Code, JPMML Evaluator
Sinks: File, HDFS, JDBC, Mongo, TCP, Log, Mail, RabbitMQ, Gemfire, Splunk, MQTT, Dynamic Router, Counters, Redis, Kafka
What is Spark Streaming?
Ÿ  Extends Spark for doing large scale stream processing
Ÿ  Scales to 100s of nodes and achieves second scale latencies
Ÿ  Efficient and fault-tolerant stateful stream processing
Ÿ  Integrates with Spark’s batch and interactive processing
Ÿ  Provides a simple batch-like API for implementing complex
algorithms
Discretized Stream Processing
Run a streaming computation as a series of very small,
deterministic batch jobs
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
•  Chop up the live stream into batches of X seconds
•  Spark treats each batch of data as RDDs and processes them using RDD operations
•  Finally, the processed results of the RDD operations are returned in batches
•  Batch sizes as low as ½ second, latency ~ 1 second
•  Potential for combining batch processing and streaming processing in the same system
Pivotal HD Value
•  Cost-based Query Optimizer
•  ANSI SQL Compliant
•  Linear, incremental scalability on commodity/appliance hardware
•  Deep Analytic OLAP Queries
•  Petabyte Data Storage & Management
•  Low latency updates and transactions
•  Active-active deployment across WAN
[Diagram labels: OLAP, OLTP, SQL, HDFS]
Pivotal HAWQ Value
Ÿ  ORCA – New Query Optimizer
Ÿ  Open Data Format (Parquet)
Ÿ  Additional Analytics
–  PL/PGSQL, PL/R, PL/PYTHON
Ÿ  Security – Kerberos authentication
support
Ÿ  Updated Diagnostic Tools
Ÿ  Automated High Availability
GemFire – The Enterprise Data Fabric
–  A distributed, memory-based data management platform
–  Gartner → In-Memory Data Grid (IMDG)
–  ACID transactional behaviour on IMDG
–  Provides continuous availability, high performance, and linear scalability for data-intensive applications
–  Allows for configurable data consistency
–  Event-driven data architecture
[Diagram: Enterprise data consuming application; Pivotal GemFire Data Fabric (Reliable Notification, High Scalability, WAN Distribution, Continuous Querying, Parallel Execution, Continuous Availability, Low Latency, Data Durability); Conventional data storage systems (File, Database, Other data storage system)]
DEMO SCREENSHOTS
Note - All products are installed on 1 VM
Reports the total number of streams processed, the total number of streams that lack data, and how much time Spark took to process the data collected in a 10-second window. This data is retrieved from Gemfire.
Reports the top routes and the number of trips on those routes, with pickup and dropoff times. This data covers the last 10 seconds and refreshes every 10 seconds. This data is retrieved from Gemfire.
Changed data in the next 10-second window. This data is retrieved from Gemfire.
Visual representation of the top three routes. Number 1 is blue, number 2 is green, and number 3 is pink. A straight pin marks the origin square and a tilted pin marks the ending square. This data is retrieved from Gemfire.
Changed data in the next 10-second window. This data is retrieved from Gemfire.
Top 50 taxis available at various coordinates in the last 10 seconds of data processed. This data is retrieved from Gemfire.
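One way such a list could be derived from a window of trip records is sketched below. The slides do not say how "available" is determined, so the rule used here is an assumption: a taxi is treated as free at its most recent dropoff location within the window. The tuple layout and function name are hypothetical.

```python
def free_taxis(trips, limit=50):
    """Return up to `limit` taxis with their latest dropoff position.

    `trips` is an iterable of (taxi_id, dropoff_time, lat, lon) tuples.
    Assumption: a taxi counts as free where it last dropped off a passenger.
    """
    latest = {}
    for taxi_id, dropoff_time, lat, lon in trips:
        # Keep only the most recent dropoff seen for each taxi.
        if taxi_id not in latest or dropoff_time > latest[taxi_id][0]:
            latest[taxi_id] = (dropoff_time, lat, lon)
    # Most recently freed taxis first.
    newest_first = sorted(latest.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(taxi, lat, lon) for taxi, (_, lat, lon) in newest_first[:limit]]
```

The `limit` default of 50 matches the "only 50 taxis" constraint in the problem statement.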
Changed data in the next 10-second window. This data is retrieved from Gemfire.
This showcases the power of HAWQ. Recall that the streamed data is also written to HDFS. We use SQL queries to query the data stored on HDFS.
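The kind of analytical query meant here, such as the "top 10 earning" question from the problem statement, might look like the sketch below. It uses Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; it is not HAWQ, and the `trips` table and its columns are hypothetical.

```python
import sqlite3


def top_earning_taxis(rows, limit=10):
    """Rank taxis by total fare using plain SQL.

    `rows` is an iterable of (taxi_id, fare, taxes) tuples loaded into an
    in-memory table that stands in for the data stored on HDFS.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (taxi_id TEXT, fare REAL, taxes REAL)")
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
    query = """
        SELECT taxi_id, SUM(fare) AS total_fare
        FROM trips
        GROUP BY taxi_id
        ORDER BY total_fare DESC
        LIMIT ?
    """
    result = conn.execute(query, (limit,)).fetchall()
    conn.close()
    return result
```

The query itself is standard SQL (grouping, aggregation, ordering), which is the class of statement the slide refers to when it mentions SQL over data stored on HDFS.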
Pivotal Big Data Suite
ü Open Data Platform
ü Suite of Analytical Products
ü Team of Data Scientists
ü Open Source Commitment
ü Enterprise Level Support
ü Best in class In-Memory Data grid solution
ü ONE platform for apps, data and mobile services

LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 

Real-Time Taxi Analytics with Pivotal Big Data Suite

  • 1. Pivotal Confidential – Internal Use Only. Journey to an Agile Data-Driven Enterprise: Real Time Data Stream Processing using Pivotal Big Data Suite (BDS)
  • 2. Agenda: Problem Statement; Real Time Streaming Architecture; Problem Solution; Pivotal Big Data Suite; Demo Screenshots; Summary (Pivotal Differentiators)
  • 3. Problem Statement: the problem and its solution are loosely based on the ACM DEBS 2015 Grand Challenge, http://www.debs2015.org/call-grand-challenge.html
  • 4. Problem Statement. Data model: (1) taxi data is streamed for the New York region; (2) each record contains details such as taxi number, pickup time, dropoff time, pickup and dropoff lat/long, fare, and taxes; (3) the area is divided into squares, each 1 km x 1 km. Find out EVERY 10 SECONDS: (a) inconsistent data; (b) the top 10 areas where taxis are plying the most (report the starting and ending area and the number of taxis that traveled in these areas; each area is a 1 km x 1 km square); (c) total data processed, and the time taken to process the data in memory for a 10-second window; (d) free taxis available in different areas (only 50 taxis). Analytical queries: (a) which taxi driver is not reporting data correctly; (b) the top 10 highest-earning taxi drivers
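
The 1 km x 1 km grid in the data model can be illustrated with a small sketch. The slides do not say where the grid starts or how coordinates are converted, so the origin `ORIGIN_LAT`/`ORIGIN_LON`, the flat-earth approximation, and the function name `grid_cell` are all assumptions made for illustration only.

```python
import math

# Hypothetical north-west corner of the grid; the slides do not specify one.
ORIGIN_LAT = 41.0
ORIGIN_LON = -74.5
CELL_KM = 1.0            # each square is 1 km x 1 km, as in the problem statement
KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def grid_cell(lat, lon):
    """Return (row, col) of the square containing the point.

    Rows grow southwards and columns grow eastwards from the origin.
    Uses a simple equirectangular approximation, which is an assumption
    of this sketch and only reasonable over a city-sized area.
    """
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(ORIGIN_LAT))
    row = int((ORIGIN_LAT - lat) * KM_PER_DEG_LAT // CELL_KM)
    col = int((lon - ORIGIN_LON) * km_per_deg_lon // CELL_KM)
    return row, col
```

A trip's route can then be keyed as the pair `(grid_cell(pickup), grid_cell(dropoff))`, which is one way to identify the starting and ending areas the report asks for.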
  • 5. Pivotal Real Time Analytics Architecture
  • 6. Products used in implementing the solution
  • 7. Real Time Analytics Demo - Flow. Spring XD (fast ingestion): data is streamed as network packets to a network port where Spring XD is listening.
  • 8. Real Time Analytics Demo - Flow. Spring XD (fast ingestion): data is streamed as network packets to a network port where Spring XD is listening. Spark Streaming (in-memory analytics, 10 s moving window): the data stream is forwarded to Spark Streaming, which collects data for each 10-second window and applies business logic to the batch: filtering of incomplete data, transformation, distance calculation, and sorting.
  • 9. Real Time Analytics Demo - Flow. Spring XD (fast ingestion): data is streamed as network packets to a network port where Spring XD is listening. Spark Streaming (in-memory analytics, 10 s moving window): the data stream is forwarded to Spark Streaming, which collects data for each 10-second window and applies business logic to the batch: filtering of incomplete data, transformation, distance calculation, and sorting. GemFire (in-memory data store): after processing, Spark uploads various aggregated metrics to GemFire for fast retrieval. Terminal output.
  • 10. Real Time Analytics Demo - Flow. Spring XD (fast ingestion): data is streamed as network packets to a network port where Spring XD is listening. Spark Streaming (in-memory analytics, 10 s moving window): the data stream is forwarded to Spark Streaming, which collects data for each 10-second window and applies business logic to the batch: filtering, transformation, distance calculation, and sorting. GemFire (in-memory data store): after processing, Spark uploads various aggregated metrics to GemFire for fast retrieval. Terminal output. Pivotal HD (long-term storage): Spring XD allows multicasting the data stream, and one stream goes to Pivotal HD for analytics.
  • 11. Real Time Analytics Demo - Flow. Spring XD (fast ingestion): data is streamed as network packets to a network port where Spring XD is listening. Spark Streaming (in-memory analytics, 10 s moving window): the data stream is forwarded to Spark Streaming, which collects data for each 10-second window and applies business logic to the batch: filtering, transformation, distance calculation, and sorting. GemFire (in-memory data store): after processing, Spark uploads various aggregated metrics to GemFire for fast retrieval. Terminal output. Pivotal HD (long-term storage): Spring XD allows multicasting the data stream, and one stream goes to Pivotal HD for analytics. Webapp: a PHP web app then shows the data and refreshes it every 10 seconds.
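
The per-window business logic in this flow (drop incomplete records, count routes, report the top ones) might look roughly like the sketch below. It is plain Python rather than Spark code, and the field names in `REQUIRED` and the function `process_window` are hypothetical; the slides do not show the actual implementation.

```python
from collections import Counter

# Hypothetical field names; the slides only list the kinds of data per record.
REQUIRED = ("taxi", "pickup_cell", "dropoff_cell", "fare")


def process_window(records, top_n=10):
    """Summarise one 10-second batch of trip records (dicts).

    Returns the total record count, the number of incomplete records,
    and the most frequent (pickup_cell, dropoff_cell) routes.
    """
    # A record is treated as complete when every required field is present.
    complete = [r for r in records
                if all(r.get(k) is not None for k in REQUIRED)]
    routes = Counter((r["pickup_cell"], r["dropoff_cell"]) for r in complete)
    return {
        "total": len(records),
        "incomplete": len(records) - len(complete),
        "top_routes": routes.most_common(top_n),
    }
```

In the demo, a summary of this kind is what Spark would upload to GemFire after each window so that the web app can read it quickly.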
  • 12. Real Time Analytics Demo – Tools used: Spring XD → data ingestion; Spark Streaming → in-memory stream computing; GemFire → in-memory data store; HAWQ → analytic SQL queries on Hadoop; Google Charts → PHP-based web application
  • 13. Pivotal Big Data Suite Is a Complete Tool Set for Data-Driven Enterprises
  • 14. Agile, Open Source Data Storage and Processing: analytics-optimized, ODP-based Hadoop distribution; in-memory, distributed processing from Apache; scale-out analytics pipeline management with data ingestion and processing. “As we move to combine all the data generated by our activity and to leverage advanced analytics in real time, there’s no better way to do that than through the flexibility and choice provided by Pivotal’s Big Data Suite.” – Sylvain LeBorne, EVP Data Platforms
  • 15. Advanced Analytics Power, Speed and Flexibility: leading analytic data warehouse; most advanced analytical SQL engine on Hadoop. “100X performance improvement analyzing trends among 500 million job postings.”
  • 16. Low Latency, Resilient Data Stores and Messaging: distributed, in-memory database for high-scale NoSQL applications; in-memory data structure server for fast read and write applications; robust messaging for high-scale applications; leverage Big Data Suite data services within Pivotal Cloud Foundry applications; deploy and manage Big Data Suite with Pivotal Cloud Foundry Foundation. “300% improvement in ticket-serving capacity led to 30% increase in e-ticket sales.”
  • 17. Pivotal Big Data Suite Differentiators. Open Data Platform: Pivotal and Hortonworks are the first two members; focused on building an ODP core on which Hadoop distributions will work; governed by an open governance model; flexibility to work on any Hadoop distribution using the ODP core; faster releases and third-party product certifications than any single vendor. Suite of analytical products: HAWQ, Greenplum, MADlib, PivotalR, GraphLab
  • 18. Spring XD Value. Unified agile experience for data ingestion, real-time analytics, workflow orchestration, and data export. Built on existing assets: Spring Integration, Spring Batch, Spring Data (Redis, GemFire, Hadoop). XD = 'eXtreme Data', or 'x' as a variable (big, fast, diverse)
  • 19. Spring XD Streams. Sources: HTTP, Tail, File, Mail, Twitter, GemFire, Syslog, TCP, UDP, JMS, RabbitMQ, MQTT, Trigger, Kafka, JDBC, Reactor TCP/UDP. Processors: Filter, Transformer, Object-to-JSON, JSON-to-Tuple, Splitter, Aggregator, HTTP Client, Groovy Scripts, Java Code, JPMML Evaluator. Sinks: File, HDFS, JDBC, Mongo, TCP, Log, Mail, RabbitMQ, GemFire, Splunk, MQTT, Dynamic Router, Counters, Redis, Kafka
  • 20. What is Spark Streaming? Extends Spark for large-scale stream processing; scales to 100s of nodes and achieves second-scale latencies; efficient and fault-tolerant stateful stream processing; integrates with Spark’s batch and interactive processing; provides a simple batch-like API for implementing complex algorithms
  • 21. Discretized Stream Processing: run a streaming computation as a series of very small, deterministic batch jobs. Chop up the live stream into batches of X seconds; Spark treats each batch of data as RDDs and processes them using RDD operations; finally, the processed results of the RDD operations are returned in batches. Diagram labels: live data stream, Spark Streaming, batches of X seconds, Spark, processed results
  • 22. Discretized Stream Processing: run a streaming computation as a series of very small, deterministic batch jobs. Batch sizes as low as ½ second, latency ~ 1 second; potential for combining batch processing and streaming processing in the same system. Diagram labels: live data stream, Spark Streaming, batches of X seconds, Spark, processed results
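
The idea of chopping a live stream into batches of X seconds can be sketched without Spark. This is only an illustration of the discretization step, not Spark Streaming's API: the function name `discretize` and the `(timestamp, payload)` event shape are assumptions, and the events are assumed to arrive sorted by timestamp.

```python
from itertools import groupby


def discretize(events, batch_seconds):
    """Group (timestamp, payload) events into consecutive fixed-length batches.

    Yields (batch_index, [payloads]) where batch_index is the timestamp
    divided by the batch length, rounded down. Events must be sorted by
    timestamp, since groupby only merges adjacent items.
    """
    def batch_of(event):
        return int(event[0] // batch_seconds)

    for index, group in groupby(events, key=batch_of):
        yield index, [payload for _, payload in group]
```

Each yielded batch plays the role of one small, deterministic batch job's input; with `batch_seconds=10` it corresponds to the 10-second window used in the demo.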
  • 23. Pivotal HD Value: cost-based query optimizer; ANSI SQL compliant; linear, incremental scalability on commodity/appliance hardware; deep analytic OLAP queries; petabyte data storage and management; low-latency updates and transactions; active-active deployment across WAN. Diagram labels: OLAP, OLTP, SQL, HDFS
  • 24. Pivotal HAWQ Value: ORCA, a new query optimizer; open data format (Parquet); additional analytics (PL/pgSQL, PL/R, PL/Python); security (Kerberos authentication support); updated diagnostic tools; automated high availability
  • 25. GemFire – The Enterprise Data Fabric: a distributed, memory-based data management platform; Gartner category: In-Memory Data Grid (IMDG); ACID transactional behaviour on the IMDG; provides continuous availability, high performance, and linear scalability for data-intensive applications; allows for configurable data consistency; event-driven data architecture
  • 26. GemFire – The Enterprise Data Fabric. Diagram labels: enterprise data-consuming application; Pivotal GemFire data fabric (reliable notification, high scalability, WAN distribution, continuous querying, parallel execution, continuous availability, low latency, data durability); conventional data storage systems (file, database, other data storage system)
  • 27. DEMO SCREENSHOTS. Note: all products are installed on 1 VM
  • 28. Reports the total number of streams processed, the total number of streams that lack data, and how much time Spark took to process the data collected in the 10-second window. This data is retrieved from GemFire. Reports the top routes and the number of trips on those routes, with pickup and dropoff times. This data covers the last 10 seconds and refreshes every 10 seconds. This data is retrieved from GemFire
  • 29. Changed data in the next 10-second window. This data is retrieved from GemFire
  • 30. Changed data in the next 10-second window. This data is retrieved from GemFire
  • 31. Visual representation of the top three routes: number 1 is blue, number 2 is green, number 3 is pink. The straight pin marks the origin square and the tilted pin marks the ending square. This data is retrieved from GemFire
  • 32. Changed data in the next 10-second window. This data is retrieved from GemFire
  • 33. Top 50 taxis available at various coordinates in the last 10 seconds of data processed. This data is retrieved from GemFire
  • 34. Changed data in the next 10-second window. This data is retrieved from GemFire
  • 35. This showcases the power of HAWQ. Recall that the streamed data is written to HDFS as well; SQL queries are then run against the data stored on HDFS.
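
As a rough illustration of the kind of analytical SQL query meant here, the sketch below answers the "top earning taxi drivers" question from the problem statement. It uses Python's built-in `sqlite3` as a stand-in, not HAWQ, and the table `trips`, its columns, and the reading of "earning" as summed fare are all assumptions; the slides do not show the actual schema or queries.

```python
import sqlite3


def top_earners(rows, limit=10):
    """Return the highest-earning drivers as (driver, total_fare) pairs.

    rows: iterable of (driver, fare) tuples. The schema is hypothetical,
    and an in-memory SQLite database stands in for HAWQ on HDFS.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (driver TEXT, fare REAL)")
    conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)
    query = """
        SELECT driver, SUM(fare) AS total_fare
        FROM trips
        GROUP BY driver
        ORDER BY total_fare DESC
        LIMIT ?
    """
    result = conn.execute(query, (limit,)).fetchall()
    conn.close()
    return result
```

The aggregate-then-rank shape (GROUP BY, ORDER BY, LIMIT) is standard SQL, so a query of this form should carry over to an ANSI SQL engine with little change.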
  • 37. Pivotal Big Data Suite: Open Data Platform; suite of analytical products; team of data scientists; open source commitment; enterprise-level support; best-in-class in-memory data grid solution; ONE platform for apps, data and mobile services