Big Data, a recent phenomenon. Everyone talks about it, but do you really know what Big Data is? Join our four-part series about Big Data and you will get answers to your questions!
We will cover Introduction to Big Data and available platforms which we can use to deal with Big Data. And in the end, we are going to give you an insight into the possible future of dealing with Big Data.
After the two previous episodes you know the basics about Big Data. Yet, it might get a bit more complicated than that. Usually when you have to deal with data which is generated in real-time. In this case, you are dealing with Big Stream.
This episode of our series will be focussed on processing systems capable of dealing with Big Streams. But analysing data lacking graphical representation will not be very convenient for us. And this is where we have to use a platform capable of visualising Big Graphs. All these topics will be covered in today’s presentation.
#CHEDTEB
www.chedteb.eu
1. Big Stream
Processing Systems
&
Big Graphs
Based on presentations during
Brno Data Week 2018
by prof Sherif Sakr
Created by: Tichý, T. & Luhan, J.
(Feb. 2019)
https://www.chedteb.eu/
2. Static vs
Streaming
Data
Computation
• Today, in several applications data is continuously produced
(e.g., user activity logs, web logs, sensors, database transactions, ...).
• Streaming processing engines analyze data while it arrives.
• The main goal of stream processing is to decrease the overall latency
to obtain results.
Big stream
3. Big streams
• In 2010, Walmart reported that it was handling more than 1 million
customer transactions every hour.
• The New York Stock Exchange (NYSE) reported trading more than 800
million shares on a typical day in October 2012.
• By the end of 2011, there were about 30 billion Radio-Frequency
Identification (RFID) tags.
• In all of these applications and domains, there is a crucial requirement
to collect, process and analyse big streams of data in a real time
fashion.
Big stream
4. Can we use Hadoop for Big Streams?
• From the stream-processing point of view, the main limitation
of Hadoop is that it was designed so that the entire output of
each map and reduce task is materialized into a local file before
it can be consumed by the next stage.
• This materialization step enables the implementation of a
simple and elegant checkpoint/restart fault-tolerance
mechanism. But it causes a significant delay for jobs with real-
time processing requirements.
Big stream: Processing systems
5. Apache Storm
• Storm is a real-time distributed computing framework for reliably
processing unbounded data streams.
• Storm is a project which was created by Nathan Marz and his team at
BackType, and released as open source in 2011 after BackType was
acquired by Twitter.
• Part of Apache Incubator since September 2013.
• Provides general primitives to do real time computations.
Big stream: Processing systems
6. Big graphs
• While it is great that we can analyse a huge
amout of data, it would not be useful without
some kind of a graphical presentation of this
data.
• BigData = BigGraphs.
• We use a lot of algorithms to visualize our data
Big graphs
7. Examples of graph
processing algorithms
• PageRank
• Triangle Counting
• Connected Components
• Random Walk
• Graph Coloring
• Community Detection
• and many others
Big graphs
8. Main
challenges of
graph
processing
Data is dynamic -> No way of doing "schema on write"
Structure driven computation -> Poor Memory Locality
and Data Transfer Issues
Algorithms are explorative and iterative
Combinatorial explosion of datasets -> Relationships
Grow Exponentially and Limited Scalability
Irregular Structure -> Challenging Graph Partitioning
and Limited Parallelism
Big graphs
9. Can we use Hadoop for Big Graphs?
• MapReduce does not directly support iterative
algorithms.
• Invariant graph-topology-data re-loaded and re-processed
at each iteration -> wasting I/O, network bandwidth, and
CPU.
• Materializations of intermediate results at every
MapReduce iteration harm performance.
Big graphs
10. An Overview
of Big Graph
Processing
Systems
Big graphs: Processing systems
11. Google Pregel
• The first BSP-based implementation for graph processing
• Communication through message passing (usually sent along
the outgoing edges from each vertex) + Shared-Nothing
• Advantages:
• No locks -> message-based communication
• No semaphores -> global synchronization
• Iteration isolation -> massively parallelizable
Big graphs: Processing systems
12. GraphX
• A distributed graph engine built on top of Spark;
• GraphX extends Sparks Resilient Distributed Dataset (RDD)
abstraction to introduce the Resilient Distributed Graph
(RDG).
• The GraphX RDG leverages advances in distributed graph
representation and exploits the graph structure to minimize
communication and storage overhead.
Big graphs: Processing systems
13. GraphX
• One system for the entire graph pipeline. Unlike other graph
processing systems, the GraphX API enables the composition
of graphs.
• Tables and graphs are views of the same physical data.
• Each view has its own operators that exploit the semantics of
the view to achieve efficient execution.
Big graphs: Processing systems
14. To be
continued
Our last episode of our series will
be focussed on machine learning
in Big Data and challenges we
have to face in Big Data.