Real-Time Taxi Analytics with Pivotal Big Data Suite

Pivotal Confidential–Internal Use Only
Journey to an Agile Data-Driven Enterprise
Real-Time Data Stream Processing Using Pivotal Big Data Suite (BDS)

Agenda
•  Problem Statement
•  Real Time Streaming Architecture
•  Problem Solution
•  Pivotal Big Data Suite
•  Demo Screenshots
•  Summary (Pivotal Differentiators)

Problem Statement
The problem and its solution are loosely based on the ACM DEBS 2015 Grand Challenge:
http://www.debs2015.org/call-grand-challenge.html

Problem Statement
Data Model
1.  Taxi data is streamed for the New York region.
2.  Each record contains details such as taxi number, pickup time, dropoff time, pickup and dropoff lat/long, fare, and taxes.
3.  The area is divided into squares. Each square is 1 km x 1 km.
Find out EVERY 10 SECONDS
a.  Inconsistent data
b.  The top 10 areas where taxis are plying the most (report the starting and ending area and the number of taxis that traveled between these areas; each area is a 1 km x 1 km square)
c.  Total data processed, and the time taken to process the data in memory for a 10-second window
d.  Free taxis available in different areas (only 50 taxis)
Analytical Queries
a.  Which taxi drivers are not reporting data correctly
b.  The top 10 taxi drivers earning the most
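
The grid and "top 10 areas" requirements above can be sketched as follows. This is a minimal illustration, not the demo's actual code: the grid origin, the record field names (pickup_lat and so on) and the flat-earth conversion are all assumptions, since the slides do not specify them.

```python
import math
from collections import Counter

# Assumed values for illustration only; the slides give no grid origin.
ORIGIN_LAT = 41.0         # hypothetical north-west corner latitude
ORIGIN_LON = -74.5        # hypothetical north-west corner longitude
CELL_KM = 1.0             # cell edge length from the problem statement
KM_PER_DEG_LAT = 111.32   # approximate km per degree of latitude


def grid_cell(lat, lon):
    """Return (row, col) of the 1 km x 1 km square containing a point.

    Rows grow southwards and columns grow eastwards from the origin.
    Uses a flat-earth approximation, which is adequate at city scale.
    """
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(ORIGIN_LAT))
    row = int((ORIGIN_LAT - lat) * KM_PER_DEG_LAT // CELL_KM)
    col = int((lon - ORIGIN_LON) * km_per_deg_lon // CELL_KM)
    return row, col


def top_routes(trips, n=10):
    """Count trips per (start cell, end cell) pair; return the n most frequent."""
    counts = Counter(
        (grid_cell(t["pickup_lat"], t["pickup_lon"]),
         grid_cell(t["dropoff_lat"], t["dropoff_lon"]))
        for t in trips
    )
    return counts.most_common(n)
```

Calling top_routes on the trips collected in one 10-second window would give the route ranking for that window.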

Pivotal Real Time Analytics Architecture

Products used in implementing solution

Real Time Analytics Demo - Flow
Components: Net Pkts, SpringXD (Fast Ingestion), Spark Streaming (In Memory analytics, 10s moving window), Gemfire (In memory data store), Pivotal HD (long term storage), Webapp, Terminal Output
1.  Data is streamed to a network port where SpringXD is listening.
2.  The data stream is forwarded to Spark Streaming, which collects data for every 10-second window and applies business logic to each batch of data: filter (incomplete data), transformation, distance calculation, sorting.
3.  After processing, Spark uploads various aggregated metrics to Gemfire for fast retrieval.
4.  SpringXD allows multicasting of the data stream. One stream goes to Pivotal HD for analytics purposes.
5.  A PHP webapp then shows the data and refreshes it every 10 seconds.
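
The per-window business logic in this flow (filter incomplete data, distance calculation, sorting, plus the processed/rejected/timing metrics) might look roughly like the sketch below. It is plain Python rather than the Spark job used in the demo, and the field names and the choice of the haversine formula are assumptions made for illustration.

```python
import math
import time

# Hypothetical field names; the slides only describe the fields informally.
REQUIRED_FIELDS = ("taxi_id", "pickup_time", "dropoff_time",
                   "pickup_lat", "pickup_lon",
                   "dropoff_lat", "dropoff_lon", "fare")


def is_complete(record):
    """A record is usable only if every required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def process_batch(records):
    """Filter, enrich and sort one window of records; return summary metrics."""
    started = time.perf_counter()
    complete = [r for r in records if is_complete(r)]
    # Copy each record so the caller's input is left unmodified.
    trips = [
        dict(r, distance_km=haversine_km(r["pickup_lat"], r["pickup_lon"],
                                         r["dropoff_lat"], r["dropoff_lon"]))
        for r in complete
    ]
    trips.sort(key=lambda r: r["distance_km"], reverse=True)
    return {
        "total": len(records),
        "rejected": len(records) - len(complete),
        "elapsed_s": time.perf_counter() - started,
        "trips": trips,
    }
```

The returned dictionary corresponds to the kind of aggregated metrics that the flow describes being uploaded to the in-memory store after each window.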

Real Time Analytics Demo – Tools used
•  SpringXD → Data Ingestion
•  Spark Streaming → In-memory stream computing
•  Gemfire → In-memory data store
•  HAWQ → Analytic SQL queries on Hadoop
•  Google Charts → PHP-based web application

Pivotal Big Data Suite Is a Complete Tool Set for Data-Driven Enterprises

Agile, Open Source Data Storage and Processing
•  Analytics-optimized, ODP-based Hadoop distribution
•  In-memory, distributed processing from Apache
•  Scale-out analytics pipeline management with data ingestion and processing
“As we move to combine all the data generated by our activity and to leverage advanced analytics in real time, there’s no better way to do that than through the flexibility and choice provided by Pivotal’s Big Data Suite.”
– Sylvain LeBorne, EVP Data Platforms

Advanced Analytics Power, Speed and Flexibility
•  Leading analytic data warehouse
•  Most advanced analytical SQL engine on Hadoop
“100X performance improvement analyzing trends among 500 million job postings.”

Low Latency, Resilient Data Stores and Messaging
•  Distributed, in-memory database for high-scale NoSQL applications
•  In-memory data structure server for fast read and write applications
•  Robust messaging for high-scale applications
•  Leverage Big Data Suite data services within Pivotal Cloud Foundry applications
•  Deploy and manage Big Data Suite with Pivotal Cloud Foundry Foundation
“300% improvement in ticket-serving capacity led to 30% increase in e-ticket sales.”

Pivotal Big Data Suite Differentiators
•  Open Data Platform
–  Pivotal and Hortonworks are the first two members
–  Focused on building the ODP core on which Hadoop distributions will work
–  Governed by an open governance model
–  Flexibility to work on any Hadoop distribution using the ODP core
–  Faster releases and third-party product certifications than any single vendor
•  Suite of Analytical Products
–  HAWQ
–  Greenplum
–  MADlib
–  PivotalR
–  GraphLab

Spring XD Value
•  Unified agile experience for
–  Data Ingestion
–  Real-time Analytics
–  Workflow Orchestration
–  Data Export
•  Built on existing assets
–  Spring Integration
–  Spring Batch
–  Spring Data (Redis, GemFire, Hadoop)
•  XD = 'eXtreme Data'
–  or 'x' as a variable (big, fast, diverse)

Streams (Spring XD)
Sources: HTTP, Tail, File, Mail, Twitter, Gemfire, Syslog, TCP, UDP, JMS, RabbitMQ, MQTT, Trigger, Kafka, JDBC, Reactor TCP/UDP
Processors: Filter, Transformer, Object-to-JSON, JSON-to-Tuple, Splitter, Aggregator, HTTP Client, Groovy Scripts, Java Code, JPMML Evaluator
Sinks: File, HDFS, JDBC, Mongo, TCP, Log, Mail, RabbitMQ, Gemfire, Splunk, MQTT, Dynamic Router, Counters, Redis, Kafka

What is Spark Streaming?
•  Extends Spark for doing large-scale stream processing
•  Scales to 100s of nodes and achieves second-scale latencies
•  Efficient and fault-tolerant stateful stream processing
•  Integrates with Spark’s batch and interactive processing
•  Provides a simple batch-like API for implementing complex algorithms

Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
•  Chop up the live stream into batches of X seconds
•  Spark treats each batch of data as RDDs and processes them using RDD operations
•  Finally, the processed results of the RDD operations are returned in batches
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
•  Batch sizes as low as ½ second, latency ~ 1 second
•  Potential for combining batch processing and streaming processing in the same system
Pivotal HD Value
•  Cost-based Query Optimizer
•  ANSI SQL Compliant
•  Linear, incremental scalability on commodity/appliance hardware
•  Deep Analytic OLAP Queries
•  Petabyte Data Storage & Management
•  Low latency updates and transactions
•  Active-active deployment across WAN
[Diagram labels: OLAP, OLTP, SQL, HDFS]

Pivotal HAWQ Value
•  ORCA – New Query Optimizer
•  Open Data Format (Parquet)
•  Additional Analytics
–  PL/pgSQL, PL/R, PL/Python
•  Security – Kerberos authentication support
•  Updated Diagnostic Tools
•  Automated High Availability

GemFire – The Enterprise Data Fabric
–  A distributed, memory-based data management platform
–  Classified by Gartner as an In-Memory Data Grid (IMDG)
–  ACID transactional behaviour on IMDG
–  Provides continuous availability, high performance, and linear scalability for data-intensive applications
–  Allows for configurable data consistency
–  Event-driven data architecture

GemFire – The Enterprise Data Fabric
[Diagram: Enterprise data consuming application; Pivotal GemFire Data Fabric; conventional data storage systems (File, Database, Other data storage system)]
Capabilities shown: Reliable Notification, High Scalability, WAN Distribution, Continuous Querying, Parallel Execution, Continuous Availability, Low Latency, Data Durability

DEMO SCREENSHOTS
Note - All products are installed on 1 VM

Reports the total number of streams processed, the total number of streams that lack data, and how much time Spark took to process the data collected in a 10-second window. This data is retrieved from Gemfire.
Reports the top routes and the number of trips on those routes, with pickup and dropoff time. This data covers the last 10 seconds and refreshes every 10 seconds. This data is retrieved from Gemfire.

The data as it has changed in the next 10-second window. This data is retrieved from Gemfire.

Visual representation of top three
routes.
Number 1 is blue
Number 2 is green
Number 3 is pink
Straight pin is origin square and
tilted pin is ending square.
This data is retrieved from Gemfire

Top 50 taxis available at various
coordinates in the last 10 seconds of
data processed. This data is

This showcases the power of HAWQ. Recall that the streamed data is also written to HDFS, and we use SQL queries to query the data stored on HDFS.
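
As an illustration of the kind of analytical SQL involved, the sketch below answers the "top 10 taxi drivers earning the most" question from the problem statement. It uses Python's built-in sqlite3 module as a stand-in so that the example is self-contained. The table name trips and its columns are assumptions, and the demo itself ran its queries through HAWQ against HDFS data, not SQLite.

```python
import sqlite3

# Hypothetical schema; the slides do not show the actual table definition.
TOP_EARNERS_SQL = """
    SELECT driver_id, SUM(fare) AS total_fare
    FROM trips
    GROUP BY driver_id
    ORDER BY total_fare DESC
    LIMIT 10
"""


def top_earners(rows):
    """Load (driver_id, fare, taxes) rows into an in-memory table and rank drivers."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE trips (driver_id TEXT, fare REAL, taxes REAL)")
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
        return conn.execute(TOP_EARNERS_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("d1", 12.5, 1.0), ("d2", 30.0, 2.5), ("d1", 8.0, 0.5), ("d3", 5.0, 0.5)]
    for driver_id, total_fare in top_earners(sample):
        print(driver_id, total_fare)
```

The GROUP BY / ORDER BY / LIMIT shape of the query is standard SQL, so a similar statement would be expected to work on an ANSI SQL engine over the long-term store.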

Pivotal Big Data Suite
✓ Open Data Platform
✓ Suite of Analytical Products
✓ Team of Data Scientists
✓ Open Source Commitment
✓ Enterprise Level Support
✓ Best-in-class in-memory data grid solution
✓ ONE platform for apps, data and mobile services
