Data Streaming 
Raja Chiky – raja.chiky@isep.fr
About me 
¡ Associate professor in Computer Science – LISITE-RDI 
¡ Research interest: Data stream mining, scalability and resource optimization in distributed architectures (e.g., cloud architectures), recommender systems
¡ Research field: Large scale data management 
1. Real-time and distributed processing of various data sources
2. Use semantic technologies to add a semantic layer
3. Recommender systems and collaborative data mining
4. Optimizing resources in large scale systems
[Figure: inputs are heterogeneous and static data, and heterogeneous and dynamic data streams (sensors)]
2
OUTLINE 
¡ Context: Big Data 
¡ What is a data stream ? 
¡ Data stream management systems 
¡ Basic approximate algorithms 
¡ Big Data: Distributed Systems 
¡ Semantic Data Streaming 
¡ Conclusion 
3
New era 
4
5 Big Data: Buzzword!
6 
08/12/2014 
Where is all this data coming 
from?
7 
More and More connected 
Things
8 So, what is Big Data?
Volume of data created worldwide
§ Dawn of time to 2003: 5 EB
§ 2012: 2.7 ZB
§ 2015: 10 ZB (E)
§ 1 YB = 10^24 Bytes
§ 1 ZB = 10^21 Bytes
§ 1 EB = 10^18 Bytes
§ 1 PB = 10^15 Bytes
§ 1 TB = 10^12 Bytes
§ 1 GB = 10^9 Bytes
Variety of data
§ Radio
§ TV
§ News
§ E-Mails
§ Facebook Posts
§ Tweets
§ Blogs
§ Photos
§ Videos (user and paid)
§ RSS feeds
§ Wikipedia
§ GPS data
§ RFID
§ POS Scanners
§ …
Velocity of data
§ Walmart handles 1M transactions per hour
§ Google processes 24 PB of data per day
§ AT&T transfers 30 PB of data per day
§ 90 trillion emails are sent per year
§ World of Warcraft uses 1.3 PB of storage
§ Facebook, when it had a user base of 900 M users, had 25 PB of compressed data
§ 400M tweets per day in June ’12
§ 72 hours of video are uploaded to YouTube every minute
Big Data Elements: Volume, Variety, Velocity + Veracity (IBM) - information uncertainty
Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla
[Figure: processing pipeline from data sources (sensors, databases) through gathering, store, and information to user interaction and output. Labels: static data, stream (big) data, ETL, batch processing, data warehouse, ad-hoc queries, semantic ETL, stream processing, continuous queries/business rules, analytics, knowledge enrichment, databases/triplestores (synopsis), real-time visual analytics, retro-action, load shedding]
9
10 Big Data : Velocity
Website logs
Network monitoring
Financial services
eCommerce
Traffic control
Weather forecasting
Power consumption
What is a data stream? 
11 
¡ Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered 
(implicitly by arrival time or explicitly by timestamp) sequence of items. It is 
impossible to control the order in which items arrive, nor is it feasible to locally 
store a stream in its entirety.” 
¡ Massive volumes of data, items arrive at a high rate.
12 
Applications of data stream 
processing 
¡ Data stream processing 
¡ Process queries (compute statistics, activate alarms) 
¡ Apply data mining algorithms 
¡ Requirements 
¡ Real-time processing 
¡ One-pass processing 
¡ Bounded storage (no complete storage of streams) 
¡ Possibly consider several streams
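The requirements above (one pass, bounded storage) can be illustrated with a minimal sketch that computes a running mean and variance and raises an alarm on a threshold. It uses Welford's method, which is one possible choice and is not named in the slides; the names `RunningStats`, `monitor` and the sample readings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """One-pass mean/variance (Welford's method) in O(1) memory."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; 0.0 until at least two items have been seen.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def monitor(stream, threshold: float):
    """Yield (value, running mean) each time a value exceeds the threshold."""
    stats = RunningStats()
    for x in stream:  # each item is seen exactly once and never stored
        stats.update(x)
        if x > threshold:
            yield x, stats.mean


readings = [10.0, 12.0, 11.0, 50.0, 9.0]  # hypothetical sensor values
alarms = list(monitor(readings, threshold=40.0))
```

Only three numbers are kept per stream, so memory stays constant however long the stream runs.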
13 
Applications of data stream 
processing 
¡ Let’s go deeper into some examples 
¡ Network management 
¡ Stock monitoring
14 Network management 
¡ Supervision of a computer network 
¡ Improvement of network configuration (hardware, software, 
architecture) 
¡ Detection of attacks 
¡ Measurements made on routers 
Network supervision center
Huge volume of data
High rate of arrivals
15 Network management
Network supervision center
Timestamp  Source    Destination  Duration  Bytes  Protocol
…          …         …            …         …      …
12342      10.1.0.2  16.2.3.7     12        20K    http
12343      18.6.7.1  12.4.0.3     16        24K    http
12344      12.4.3.8  14.8.7.4     26        58K    http
12345      19.7.1.2  16.5.5.8     18        80K    ftp
…          …         …            …         …      …
16 Network management 
Network supervision 
center 
Typical queries: 
- 100 most frequent (@S, @D) on router R1 … 
- How many different (@S, @D) seen on R1 but not R2 … 
- … during last month, last week, last day, last hour ?
17 Stock monitoring 
¡ Stream of price and sales volume of stocks over time 
¡ Technical analysis/charting for stock investors 
¡ Support trading decisions 
l Notify me when the price of IBM is above $83, and 
the first MSFT price afterwards is below $27. 
l Notify me when some stock goes up by at least 5% 
from one transaction to the next. 
l Notify me when the price of any stock increases 
monotonically for ≥30 min. 
l Notify me when the difference between the 
current price of a stock and its 10 day moving 
average is greater than some threshold value 
Source: Gehrke 07 and Cayuga application scenarios (Cornell University)
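As an illustration of the second query above (a rise of at least 5% from one transaction to the next), here is a minimal sketch of a continuous query in plain Python. It is not Cayuga syntax; the function name `price_jumps` and the tick values are hypothetical.

```python
def price_jumps(ticks, min_increase=0.05):
    """Yield (symbol, previous, current) when a price rises by at least
    `min_increase` between two consecutive transactions of the same stock."""
    last_price = {}  # one entry per symbol: bounded by the number of stocks
    for symbol, price in ticks:
        prev = last_price.get(symbol)
        if prev is not None and price >= prev * (1 + min_increase):
            yield symbol, prev, price
        last_price[symbol] = price


# Hypothetical transaction stream of (symbol, price) pairs.
ticks = [("IBM", 80.0), ("MSFT", 27.0), ("IBM", 85.0),
         ("MSFT", 27.5), ("IBM", 86.0)]
alerts = list(price_jumps(ticks))
```

The state is a single previous price per stock, so the query runs in one pass without storing the stream.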
18 Where is the problem? (1/2) 
¡ Example: 
05/12/2014 
[Figure: two bank withdrawals dated 12/05/2014, one of 50 € and one of 1000$, flagged as bank fraud]
¡ Join between several streams 
¡ Join between stream data and customer database 
¡ Generic tools for processing streams 
¡ Avoid the ‘Store’, ‘Compute’, ‘Delete’ approach 
¡ Solution: incremental computation and definition of temporal windows for joins
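A minimal sketch of the proposed solution, assuming two streams joined on a key within a temporal window. Tuples are buffered only while they are inside the window, and each arrival is matched incrementally against the other stream's buffer. The function name, the event layout and the sample events are hypothetical.

```python
from collections import defaultdict, deque


def window_join(events, window):
    """Symmetric join of two streams on a key within a time window.

    events: iterable of (timestamp, stream_id, key, value) ordered by
            timestamp, with stream_id in {0, 1}.
    Yields (key, value_from_stream_0, value_from_stream_1) for every pair
    of tuples whose timestamps are at most `window` apart.
    """
    buffers = (defaultdict(deque), defaultdict(deque))  # one per stream
    for ts, sid, key, value in events:
        # Incremental eviction: drop tuples that fell out of the window.
        for side in buffers:
            buf = side[key]
            while buf and ts - buf[0][0] > window:
                buf.popleft()
        # Match the new tuple against the other stream's live tuples.
        for _, other_value in buffers[1 - sid][key]:
            if sid == 0:
                yield key, value, other_value
            else:
                yield key, other_value, value
        buffers[sid][key].append((ts, value))


# Hypothetical events: stream 0 and stream 1 report card usage by city.
events = [(0, 0, "c1", "Paris"), (5, 1, "c1", "Rome"),
          (100, 1, "c1", "Oslo"), (103, 0, "c1", "Lima")]
matches = list(window_join(events, window=10))
```

This avoids the store, compute, delete cycle: nothing older than the window is kept for an active key. A production version would also drop the empty buffers of keys that never reappear.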
19 Where is the problem? (2/2) 
¡ Example: 
¡ 100 most frequent @S IP addresses on a router 
¡ Maintain a table of IP addresses with 
frequencies ? 
¡ Sampling the stream ? 
¡ Face high (and varying) rate of arrivals 
¡ Exact versus approximate answers
20 Examples of queries 
[Figure: frequency histogram over the elements 1 to 20]
¡ Find elements with frequency > 0.1%
¡ top-k
¡ Frequency of the element 3
¡ Total frequency of elements between 8 and 14
¡ Number of elements having a non-zero frequency
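For reference, the exact answers to these five query types can be computed from a full frequency table, which is what the approximate techniques later in the deck try to avoid storing. This is a small sketch on a hypothetical ten-item stream; the 25% threshold replaces the 0.1% of the slide only because the example is tiny.

```python
from collections import Counter

stream = [3, 8, 3, 14, 9, 3, 20, 8, 11, 3]  # hypothetical items in 1..20
freq = Counter(stream)                       # exact table: O(distinct items)
n = len(stream)

# Elements whose relative frequency exceeds a threshold.
heavy = sorted(v for v, c in freq.items() if c / n > 0.25)
# Top-k with k = 2.
top2 = freq.most_common(2)
# Frequency of the element 3.
point = freq[3]
# Total frequency of elements between 8 and 14.
range_total = sum(c for v, c in freq.items() if 8 <= v <= 14)
# Number of elements having a non-zero frequency.
distinct = len(freq)
```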
21
Data Stream Management Systems
DBMS vs. DSMS
Data model
  DBMS: Permanent updatable relations
  DSMS: Streams and permanent updatable relations
Storage
  DBMS: Data is stored on disk
  DSMS: Permanent relations are stored on disk; streams are processed on the fly
Query
  DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries)
  DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries)
Performance
  DBMS: Large volumes of data
  DSMS: Optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crash
22 Generic DSMS architecture
[Figure: generic DSMS architecture. Components: input monitor, query processor, query repository, working storage, summary storage, static storage, output buffer. Inputs: streaming inputs, user queries, updates to static data. Outputs: streaming outputs]
Golab & Özsu (2003)
23 Existing DSMS
24 
How to deal with Big Data 
Streams 
Distribution 
DSMS 
Sampling/Load Shedding/Sketch/…
25 
Approximate answers to 
queries 
¡ When ? 
¡ Queries needing unbounded memory 
¡ Too many queries/too rapid streams/too demanding response time requirements 
¡ CPU limit 
¡ Memory limit 
¡ Solution : approximate answers to queries 
¡ Sliding windows 
¡ Sampling and load shedding 
¡ Definition of synopsis
26 Approaches 
¡ Two approaches for handling such streams 
¡ Use a time window, and query the window as a static table 
¡ When you can’t store collected data, or to keep track of historical 
data 
¡ Sampling 
¡ Filtering 
¡ Counting
Windowing 
¡ Applying queries/mining tasks to the whole stream (from 
beginning to current time) 
¡ Applying queries/mining to a portion of the stream 
[Figure: timeline t from the beginning of the stream to the current date, with a window on the stream]
29 Windowing 
¡ Definition of windows of interest on streams 
¡ Fixed windows: September 2014 
¡ Sliding windows: last 3 hours 
¡ Landmark windows: from September 1st, 2014 
¡ Window specification 
¡ Physical : last 3 hours 
¡ Logical : last 1000 items 
¡ Refreshing rate 
¡ Rate of producing results (every item, every 10 items, every minute, 
…)
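A minimal sketch of a logical sliding window (last `size` items) with a refreshing rate (one result every `refresh` items). The function name and the choice of an average as the query are assumptions made for illustration.

```python
from collections import deque


def sliding_average(stream, size, refresh):
    """Average of the last `size` items, emitted every `refresh` items."""
    window = deque(maxlen=size)  # logical window: bounded by item count
    total = 0.0
    for i, x in enumerate(stream, start=1):
        if len(window) == size:
            total -= window[0]  # the oldest item is about to be evicted
        window.append(x)
        total += x
        if i % refresh == 0:  # refreshing rate: every `refresh` items
            yield i, total / len(window)


results = list(sliding_average(range(1, 11), size=4, refresh=5))
```

A physical window (for example the last 3 hours) follows the same pattern, with eviction driven by timestamps instead of item counts.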
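Sliding window
[Figure: timeline t from the beginning of the stream; the window ending at the current time tc moves to t'c at the next refreshment time, and results are produced at each refreshment]
Give me the last room where Axel has been in the last 10 minutes, updating results every minute
30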
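Sliding window vs. Tumbling window
[Figure: timeline t from the beginning of the stream; the window ending at tc is followed by a non-overlapping window ending at t'c, and results are produced at each refreshment time]
Give me the last room where Axel has been in the last 10 minutes, updating results every 10 minutes
31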
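The two queries about Axel differ only in the refresh period: a window whose range is longer than its slide overlaps with the previous one (sliding), while a window whose range equals its slide does not (tumbling). A minimal sketch, assuming timestamped sightings in seconds; the function name and the events are hypothetical, and evaluations due after the last event are not flushed.

```python
from collections import deque


def last_room(events, range_s, slide_s):
    """events: (timestamp_s, room) sightings ordered by time.

    Every `slide_s` seconds, report the most recent room seen strictly
    within the last `range_s` seconds (None if the window is empty)."""
    window = deque()
    next_eval = slide_s
    for ts, room in events:
        # Emit every evaluation that is due before this event is added.
        while next_eval <= ts:
            while window and window[0][0] <= next_eval - range_s:
                window.popleft()  # sighting is older than the window
            yield next_eval, (window[-1][1] if window else None)
            next_eval += slide_s
        window.append((ts, room))


events = [(30, "A"), (200, "B"), (650, "C"), (1300, "D")]
sliding = list(last_room(events, range_s=600, slide_s=60))    # every minute
tumbling = list(last_room(events, range_s=600, slide_s=600))  # every 10 min
```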
32 Sampling from data stream 
¡ Inputs: 
¡ Sample size k 
¡ Window size n >> k (alternatively, time duration m) (optionally) 
¡ Stream of data elements that arrive online 
¡ Output: 
¡ k elements chosen uniformly at random from the last n elements 
(alternatively, from all elements that have arrived in the last m time 
units) 
¡ Goal: 
¡ maintain a data structure that can produce the desired output at 
any time upon request 
¡ Challenge: 
¡ don’t know how long stream is 
¡ So when/how often to sample?
A simple, Unsatisfying 
Approach 
¡ Choose a random subset X={x1, …,xk}, X⊂{0,1,…,n-1} 
¡ The sample always consists of the non-expired elements whose 
indexes are equal to x1, …,xk (modulo n) 
¡ Only uses O(k) memory 
¡ Technically produces a uniform random sample of each 
window, but unsatisfying because the sample is highly periodic 
¡ Unsuitable for many real applications, particularly those with 
periodicity in the data 
34
35 Reservoir sampling 
¡ Classic online algorithm due to Vitter (1985) 
¡ Maintains a fixed-size uniform random sample 
¡ Size of the data stream need not be known in advance 
¡ Data structure: “reservoir” of k data elements 
¡ As the ith data element arrives: 
¡ Add it to the reservoir with probability p = k/i, discarding a randomly 
chosen data element from the reservoir to make room
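A minimal sketch of the procedure described above. The first k elements fill the reservoir directly, a detail the slide leaves implicit; the function name and the optional `rng` parameter are choices made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from a stream of unknown size."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)  # the first k elements fill the reservoir
        elif rng.random() < k / i:  # keep the i-th element with probability k/i
            reservoir[rng.randrange(k)] = item  # evict a random resident
    return reservoir


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```

Memory stays at k items, and the sample is uniform over everything seen so far, not over a sliding window.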
36 Chain Sampling 
v Chain Sampling method is for sequence based windows. 
v In this type of sampling when the ith element arrives it is chosen 
to become the sample with probability Min(i,n)/n. 
v If the ith element is chosen as the sample, the algorithm also 
selects the index of the element that will replace it when expires 
(assuming that it is still present in the sample when it expires). 
This index is picked uniformly at random from the range i+1…i 
+n, representing the range of indexes of the elements that will 
be active when the ith element expires. 
v When the element with the selected index arrives, the algorithm 
stores it in the memory and choses the index of the element 
that will replace it when it expires etc., building a chain of 
elements to use in case of the expiration of the current element 
in the sample.
37 Chain Sampling 
[Figure: four successive snapshots of chain sampling on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3]
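A minimal sketch of chain sampling for a sample of size one over the last n elements. It uses the acceptance probability 1/min(i, n) from the chain-sample algorithm of Babcock, Datar and Motwani; the class and attribute names are hypothetical.

```python
import random
from collections import deque


class ChainSampler:
    """Maintains one uniform sample over the last n elements of a stream."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.i = 0             # index of the latest element (1-based)
        self.chain = deque()   # (index, value); chain[0] is the current sample
        self.successor = None  # index of the element that will extend the chain

    def add(self, value):
        self.i += 1
        i, n = self.i, self.n
        if self.rng.random() < 1 / min(i, n):
            # The new element becomes the sample and starts a fresh chain.
            self.chain.clear()
            self.chain.append((i, value))
            self.successor = self.rng.randint(i + 1, i + n)
        elif i == self.successor:
            # Store the pre-selected replacement, then pick its own replacement.
            self.chain.append((i, value))
            self.successor = self.rng.randint(i + 1, i + n)
        # Drop the head once it has left the window of the last n elements.
        if self.chain[0][0] <= i - n:
            self.chain.popleft()

    def sample(self):
        return self.chain[0][1]


sampler = ChainSampler(n=5, rng=random.Random(1))
for v in [3, 5, 1, 4, 6, 2, 8, 5, 2, 3]:
    sampler.add(v)
current = sampler.sample()
```

A sample of size k can be obtained by running k independent samplers. The chain is never empty when its head expires, because the replacement index always falls inside the head's lifetime.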
Another Simple Approach: 
oversample 
¡ As each element arrives remember it with probability 
p = (ck/n) log n; otherwise discard it 
¡ Discard elements when they expire 
¡ When asked to produce a sample, choose k elements at 
random from the set in memory 
¡ Expected memory usage of O(k log n) 
¡ The algorithm can fail if fewer than k elements from a window are 
remembered; however with high probability this will not happen 
38
39 Min-wise Sampling 
¡ For each item, pick a random fraction between 0 and 1 
¡ Store item(s) with the smallest random tag (Nath et al. 2004) 
0.391 0.908 0.291 0.555 0.619 0.273 
Each item has same chance of least tag, so uniform 
Can run on multiple streams separately, then merge
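A minimal sketch of min-wise sampling for a single sampled item, including the merge of two independently processed streams. The function names are hypothetical, and each stream is assumed to be non-empty.

```python
import random


def minwise_sample(stream, rng):
    """Return the (tag, item) pair carrying the smallest random tag."""
    best = None
    for item in stream:
        tag = rng.random()  # random fraction between 0 and 1
        if best is None or tag < best[0]:
            best = (tag, item)
    return best


def merge(a, b):
    """Combine the samples of two streams: the smaller tag wins."""
    return min(a, b, key=lambda pair: pair[0])


rng = random.Random(7)
left = minwise_sample("abc", rng)    # hypothetical stream 1
right = minwise_sample("wxyz", rng)  # hypothetical stream 2
combined = merge(left, right)
```

Because every item has the same chance of drawing the smallest tag, the merged result is a uniform sample over the union of both streams.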
40 Hash function - Revision 
¡ Hash function: Any algorithm that maps data of arbitrary length 
to data of a fixed length. The values returned by a hash function 
are called hash values, hash codes, hash sums, checksums or 
simply hashes. 
41 Stream filtering 
¡ Compact data structure, by B.H. Bloom, 1970: 
¡ A bit array of size m, 
¡ A family H of k hash functions (MD5, SHA256, Murmur), 
¡ A set S of n items 
¡ False positive probability f = (1 - e^(-kn/m))^k 
¡ Example: m=10, hash=MD5, k=3
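A minimal sketch of a Bloom filter with the parameters named above (m bits, k hash functions), plus the false positive estimate f = (1 - e^(-kn/m))^k. The k hash functions are simulated by salting SHA-256 with an index, which is an assumption of this example and not the construction shown on the slide.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m  # bit array of size m

    def _positions(self, item):
        # Simulate k hash functions by salting one hash with the index j.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))


def false_positive_rate(m, k, n):
    """Approximate false positive probability after inserting n items."""
    return (1 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1000, k=3)
for ip in ["10.1.0.2", "18.6.7.1", "12.4.3.8"]:
    bf.add(ip)
rate = false_positive_rate(m=10, k=3, n=1)  # the slide's m and k, one item
```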
42 Sketches 
¡ Sketch 
¡ Synopsis structure taking advantage of high volumes of data 
¡ Provides an approximate result with probabilistic bounds 
¡ Random projections on smaller spaces (hash functions) 
¡ Many sketch structures: usually dedicated to a specialized task 
¡ Examples of sketch structures 
¡ COUNT (Flajolet 85) 
¡ COUNT SKETCH (Charikar et al. 04)
Difference between Sampling 
and Sketching 
43 
¡ Sample sees only those items which were selected to be in the 
sample; whereas the sketch sees the entire input, but is restricted 
to retain only a small summary of it 
¡ Not every problem can be solved with sampling 
¡ Example: counting how many distinct items in the stream 
¡ If a large fraction of items aren’t sampled, don’t know if they are all 
same or all different
44 
Sketches: COUNT (Flajolet 
Martin algorithm) 
¡ Goal 
¡ Number N of distinct values in a stream (for large N) 
¡ Example: number of distinct IP addresses going through a router 
¡ Sketch structure 
¡ SK: L bits initialized to 0 
0 0 0 0 0 0 0 0 
¡ H: hashing function transforming an element of the stream into L bits 
Example: H(18.6.7.1) = 0 0 1 0 1 0 1 0 
¡ H distributes uniformly elements of the stream on the 2^L possibilities
45 Sketches 
¡ Method 
¡ Maintenance and update of SK 
¡ For each new element e 
¡ Compute H(e) 
¡ Select the position of the leftmost 1 in H(e) 
¡ Force to 1 this position in SK 
SK          0 1 0 0 1 0 0 1 
H(18.6.7.1) 0 0 1 0 1 0 1 0 
New SK      0 1 1 0 1 0 0 1
46 Sketches 
¡ Result 
¡ Select the position R (0…L-1) of the leftmost 0 in SK 
¡ E(R) = log2 (φ*N) with φ = 0.77351… 
è Estimate the count by 2^R/φ 
¡ σ(R) = 1.12 
SK 1 1 1 0 1 0 0 0 
R 
To improve accuracy of this approximation algorithm, use multiple 
hash functions and use the average R instead.
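A minimal sketch of the COUNT sketch as described on the last three slides: set the position of the leftmost 1 of H(e), read R as the position of the leftmost 0 of SK, average R over several hash functions, and estimate 2^R/φ. The hash functions are simulated by salting SHA-256, L = 32 and 64 hash functions are arbitrary choices, and all names are hypothetical.

```python
import hashlib

PHI = 0.77351
L = 32  # number of bits of each hash value and of each SK


def leftmost_one(h, bits=L):
    """Position (0 = leftmost) of the first 1 bit of h as a `bits`-bit word."""
    return bits - h.bit_length() if h else None


class FMSketch:
    def __init__(self, num_hashes=64):
        self.sk = [[0] * L for _ in range(num_hashes)]  # one SK per hash

    def _hash(self, j, item):
        digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
        return int.from_bytes(digest[:4], "big")  # 4 bytes = L bits

    def add(self, item):
        for j, sk in enumerate(self.sk):
            pos = leftmost_one(self._hash(j, item))
            if pos is not None:
                sk[pos] = 1  # force this position to 1

    def estimate(self):
        def r(sk):  # position of the leftmost 0 in SK
            return sk.index(0) if 0 in sk else L
        avg_r = sum(r(sk) for sk in self.sk) / len(self.sk)
        return 2 ** avg_r / PHI


fm = FMSketch()
for i in range(2000):  # 2000 distinct hypothetical addresses
    fm.add(f"10.0.{i // 256}.{i % 256}")
estimate = fm.estimate()
```

Adding the same element again leaves every SK unchanged, which is why the structure counts distinct values and not occurrences.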
47 Sketches 
¡ COUNT SKETCH ALGORITHM (Charikar et al. 2004) 
¡ Goal 
¡ k most frequent elements in a stream (for large number N of distinct 
values) 
¡ Ex. 100 most frequent IP addresses going through a router 
Input stream 2, 0, 1, 3, 1, 2, 4, . . . 
Frequencies so far: f(0) = 1, f(1) = 2, f(2) = 2, f(3) = 1, f(4) = 1
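48 Sketches
[Figure: array of counters with B rows (1 … B) and t columns (1 … t), holding values such as +12, +7, +23, +15 in the first row and +78, +56, +66, +65 in the last; an arriving element e adds -1 or +1 to one counter in each column]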
49 Sketches 
¡ Sketch structure 
¡ h : hash function from [0, … , N-1] to [1, … , B] 
¡ s : hash function from [0, … , N-1] to {+1, -1} 
¡ Array of B counters: C[1], …, C[B] (with B << N) 
¡ Sketch maintenance 
when e arrives: C[h(e)] += s(e) 
¡ Use of sketch 
¡ Estimation of frequency of object e: n_e ≈ C[h(e)] · s(e) 
¡ In practice, t hash functions h_1, …, h_t and t hash functions s_1, …, s_t are used, with one array of counters C_j per pair: 
n_e ≈ median over j ∈ [1…t] of ( C_j[h_j(e)] · s_j(e) ) 
¡ Theoretical results on error depending on N, t and B.
50 Sketches 
¡ Algorithm 
¡ Maintenance of a list (e1, e2, …, ek) of the current k most frequent 
elements 
¡ For a new arriving element e 
¡ Add e to the sketch structure 
¡ Estimate frequency of e from the sketch structure 
¡ If f(e) > f(ek), remove ek and insert e into the list
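A minimal sketch of the count sketch structure and of the top-k maintenance loop described above. The pairs of hash functions (h_j, s_j) are simulated by salting SHA-256, the defaults t = 5 and B = 1024 are arbitrary, and all names are hypothetical.

```python
import hashlib
from statistics import median


class CountSketch:
    def __init__(self, t=5, b=1024):
        self.t, self.b = t, b
        self.counters = [[0] * b for _ in range(t)]  # t arrays of B counters

    def _hashes(self, j, item):
        digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.b  # h_j(e)
        sign = 1 if digest[8] & 1 else -1                    # s_j(e)
        return bucket, sign

    def add(self, item):
        for j in range(self.t):
            bucket, sign = self._hashes(j, item)
            self.counters[j][bucket] += sign

    def estimate(self, item):
        values = []
        for j in range(self.t):
            bucket, sign = self._hashes(j, item)
            values.append(self.counters[j][bucket] * sign)
        return median(values)


def top_k(stream, k, sketch):
    """Track the k elements with the highest estimated frequency."""
    top = {}  # element -> last estimated frequency, at most k entries
    for e in stream:
        sketch.add(e)
        f = sketch.estimate(e)
        if e in top or len(top) < k:
            top[e] = f
        else:
            weakest = min(top, key=top.get)
            if f > top[weakest]:  # f(e) > f(e_k): replace the weakest entry
                del top[weakest]
                top[e] = f
    return sorted(top, key=top.get, reverse=True)


stream = [f"x{i}" for i in range(20)] + ["a"] * 50 + ["b"] * 30 + ["c"] * 10
sketch = CountSketch()
leaders = top_k(stream, k=2, sketch=sketch)
```

The random signs make collisions cancel out on average, and the median over the t rows discards the few rows where a heavy element happens to collide.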
Distributed 
Systems 
Answering today's Big Data needs 
S4, Storm, SAMOA, … 
51
Yahoo S4 
¡ Simple Scalable Streaming System 
¡ Use a decentralized and symmetric architecture 
¡ All nodes are similar, no master node 
¡ Based on Processing Elements 
¡ Each PE has 4 components: 
1. Functionality defined by a PE class and associated configuration 
¡ input event handler processEvent() 
¡ output mechanism output() 
2. the types of events that it consumes 
3. the keyed attribute in those events 
4. the value of the keyed attribute in events which it consumes 
¡ Several PEs are available for standard tasks such as count, aggregate, 
join … 
¡ Custom PEs can easily be programmed 
¡ The PEs are automatically deployed on the Processing Nodes (machines): 
¡ Ensures load balancing of events, 
¡ Notifies appropriate PEs when an event comes in.
S4 Example: Word Count
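The word count example can be approximated in plain Python to show how keyed processing elements behave: one PE instance exists per value of the keyed attribute, and each event is routed to the instance that owns its key value. This is a single-process simulation and not the S4 API; `WordCountPE`, `KeyedDispatcher` and the method names are hypothetical stand-ins for processEvent() and output().

```python
class WordCountPE:
    """Hypothetical PE: one instance per distinct word (the keyed value)."""

    def __init__(self, word):
        self.word = word
        self.count = 0

    def process_event(self, event):  # stands in for processEvent()
        self.count += 1

    def output(self):                # stands in for output()
        return self.word, self.count


class KeyedDispatcher:
    """Routes each event to the PE instance owning the event's key value."""

    def __init__(self, pe_class, key):
        self.pe_class, self.key = pe_class, key
        self.instances = {}

    def dispatch(self, event):
        value = event[self.key]
        if value not in self.instances:  # first event with this key value
            self.instances[value] = self.pe_class(value)
        self.instances[value].process_event(event)


dispatcher = KeyedDispatcher(WordCountPE, key="word")
for word in "to be or not to be".split():
    dispatcher.dispatch({"word": word})
counts = dict(pe.output() for pe in dispatcher.instances.values())
```

In a distributed deployment the routing step would hash the key value to pick a processing node, which is what spreads the load across machines.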
Twitter Storm 
¡ Same principle as S4, with a simplified programming model 
¡ Storm provides realtime computation 
¡ Scalable 
¡ Guarantees no data loss 
¡ Extremely robust and fault-tolerant 
¡ Programming language agnostic 
¡ Concepts 
¡ Streams 
¡ Spouts 
¡ Bolts 
¡ Topologies
56 
Lambda architecture (by Nathan Marz) 
[Figure: the data flow feeds a batch layer (batch processing, precomputed views) and a speed layer (real-time stream processing); a serving layer answers queries] 
Source: Mathieu DESPRIEE (USI) 
Generic, scalable and fault-tolerant data processing architecture
Lambda architecture 
1. All data entering the system is dispatched to both the batch 
layer and the speed layer for processing. 
2. The batch layer has two functions: (i) managing the master 
dataset (an immutable, append-only set of raw data), and (ii) 
to pre-compute the batch views. 
3. The serving layer indexes the batch views so that they can be 
queried in an ad-hoc way. 
4. The speed layer compensates for the high latency of updates 
to the serving layer and deals with recent data only. 
5. Any incoming query can be answered by merging results from 
batch views and real-time views. 
http://lambda-architecture.net/
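The five steps above can be sketched with an in-memory page-view counter. This is a deliberately simplified model under stated assumptions: the batch recomputation is instantaneous, so the real-time view can be cleared as soon as the batch view is rebuilt, and the class and method names are hypothetical.

```python
from collections import Counter


class LambdaCounter:
    def __init__(self):
        self.master = []                # batch layer: append-only raw data
        self.batch_view = Counter()     # serving layer: precomputed view
        self.realtime_view = Counter()  # speed layer: recent data only

    def ingest(self, page):
        # Step 1: every event goes to both the batch and the speed layer.
        self.master.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        # Steps 2 and 3: recompute the batch view from the master dataset.
        self.batch_view = Counter(self.master)
        # Step 4: the speed layer only covers data newer than the batch view.
        self.realtime_view.clear()

    def query(self, page):
        # Step 5: merge results from the batch view and the real-time view.
        return self.batch_view[page] + self.realtime_view[page]


system = LambdaCounter()
for page in ["home", "home", "about"]:
    system.ingest(page)
before = system.query("home")
system.run_batch()
system.ingest("home")
after = system.query("home")
```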
58 Big Data Stream Mining 
Machine Learning 
  Distributed 
    Batch: Hadoop, Mahout 
    Stream: S4, Storm, SAMOA 
  Non Distributed 
    Batch: R, WEKA, … 
    Stream: MOA
59 
What is SAMOA? 
• NEW Software framework for mining distributed data streams 
• Big Data mining for evolving streams in REAL-TIME
SAMOA architecture 
[Figure: SAMOA layer (classifier methods, clustering methods, frequent pattern mining) on top of S4, Storm, …] 
¡ Use S4, Storm, or other distributed stream processing platform 
¡ Use MOA, or other streaming machine learning library 
¡ Easy to extend through PACKAGES
Semantic data 
streaming 
61
62 Too many data streams
63 
Type of data used in Big Data 
initiatives 
Internal data 
Traditional sources 
« New data » 
Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
64 BI: New Generation 
[Figure: heterogeneous and static data, and heterogeneous and dynamic data streams (sensors), become semantic data streams through interconnection and ontologies; semantic filtering and continuous queries feed monitoring, alerts, statistics, fault detection, etc., leading to decision, with retro-action]
65 
Semantic Web technologies for 
data stream 
¡ Annotate stream data with semantic metadata 
¡ Apply Linked Data principles to publish streaming data 
¡ Interlink streaming data with existing datasets 
¡ Integrate data stream processing + reasoning 
¡ Objectives : interoperability, automation, enrichment
66 Existing prototypes 
CQELS 
SPARQL-Stream 
EP-SPARQL 
C-SPARQL
67 Approaches 
In RDF Stream models (timestamps, events, time intervals, triple-based, graph-based, …)
68 Example: SRBench 
Q: Detect if a hurricane has been observed 
“A hurricane has a sustained wind (for more than 3 hours) of at 
least 33 metres per second or 74 miles per hour (119 km/h).” 
ASK 
WHERE { 
  STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 10800s SLIDE 600s] 
  { ?observation om-owl:procedure ?sensor ; 
                 om-owl:observedProperty weather:WindSpeed ; 
                 om-owl:result [ om-owl:floatValue ?value ] . } 
} 
GROUP BY ?sensor 
HAVING ( AVG(?value) >= "74"^^xsd:float ) 

ASK 
FROM STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 3h STEP 10m] 
WHERE { 
  ?observation om-owl:procedure ?sensor ; 
               om-owl:observedProperty weather:WindSpeed ; 
               om-owl:result [ om-owl:floatValue ?value ] . 
} 
GROUP BY ?sensor 
HAVING ( AVG(?value) >= "74"^^xsd:float )
69 
How to deal with Big Data 
Streams 
Cloud Computing 
DSMS 
Sampling/Load Shedding 
[1] and [2] 
[3] 
[1] J. Hoeksema & S. Kotoulas : High-performance Distributed Stream Reasoning using S4 (ISWC 2011) 
[2] D. L. Phuoc & al : Elastic and Scalable Processing of Linked Stream Data in the Cloud (ISWC 2013) 
[3] N. Jain & al : Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues. (DaEng 2013)
70 
Sampling Extensions for 
continuous SPARQL 
PREFIX vocab: <http://data-gov.tw.rpi.edu/vocab/p/8/> 
SELECT ?val 
WHERE { 
STREAM <C:/CQELS/streams/data-8.stream>[NOW][UNISAMPLING %80] 
{?rawData vocab:ozone_8hr_daily_max ?val } 
GRAPH <C:/CQELS/test.rdf> {?user sioc:account_of ?person} 
} 
Operators: [UNISAMPLING %{Sampling Percentage}] 
[RESSAMPLING %{Reservoir Size}] 
[CHNSAMPLING %{Window Size}] 
N. Jain & al : Sampling Semantic Data Stream: Resolving Overload and 
Limited Storage Issues. (DaEng 2013)
71 
Semantic Data Stream Load 
Shedding (1/2) 
[Figure: RDF graph of two sensor observations. Observation_AirTemperature_ID (type TemperatureObservation, observedProperty AirTemperature, procedure System_ID, samplingTime Instant_ID with inXSDDateTime 2004-08-08T06:25:00) has result MeasureData_AirTemperature_ID (type MeasureData, floatValue NN as double, uom fahrenheit). Observation_RelativeHumidity_ID (type RelativeHumidityObservation, observedProperty RelativeHumidity, procedure System_ID) has result MeasureData_RelativeHumidity_ID (type MeasureData, floatValue NN as double, uom percentage). Nodes (a), (b), (c), (d) and the two dotted triples (1) and (2) are marked]
RDF Triple approach: 
Effect of deleting the two triples (1) and (2) (dotted lines) in the graph: 
- Deleting the first RDF triple destroys the link connecting node (a) to node (b) 
- Deleting the second RDF triple destroys the link connecting node (c) to node (d) 
=> Nodes (b), (d) and all nodes connected to them become unreachable, even though their data is still in memory. This amounts to 6 RDF triples out of 18, i.e. 33% of the data being unusable
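The effect described above can be reproduced on a toy graph: after one linking triple is shed, the triples behind it remain in memory but can no longer be reached from the observation node. This is a minimal sketch with a hypothetical four-triple graph, far smaller than the 18 triples of the figure.

```python
from collections import defaultdict, deque


def reachable(triples, root):
    """Nodes reachable from `root` by following subject -> object links."""
    graph = defaultdict(list)
    for s, _, o in triples:
        graph[s].append(o)
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def usable(triples, root):
    """Triples whose subject can still be reached from `root`."""
    seen = reachable(triples, root)
    return [t for t in triples if t[0] in seen]


# Hypothetical observation graph.
triples = [
    ("obs", "procedure", "sensor"),
    ("obs", "result", "measure"),
    ("measure", "floatValue", "71.0"),
    ("measure", "uom", "fahrenheit"),
]
# Triple-level load shedding drops the link from "obs" to "measure".
kept = [t for t in triples if t != ("obs", "result", "measure")]
stranded = [t for t in kept if t not in usable(kept, "obs")]
```

Here two of the three remaining triples are stranded, which is the waste that shedding whole sub-graphs is meant to avoid.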
72 
Semantic Data Stream Load 
Shedding (2/2) 
RDF Graph approach : 
Effect of deleting sub-graphs such as those formed by nodes (b) or (d) and all nodes 
to which they are directly connected. 
=> Preserving the semantic level of the information 
=> Protecting the data consistency of the whole graph 
=> Enhancing the Semantic Data Stream systems 
73 
Conclusion: Big Data Stream 
challenges 
¡ Semantic Information aggregation 
¡ Information aggregation: “too much data to assimilate but not 
enough knowledge to act” 
¡ Distributed and real-time processing 
¡ Design of real-time and distributed algorithms for stream processing 
and information aggregation 
¡ Distribution and parallelization of data mining algorithms 
¡ Visual analytics and user modeling 
¡ Dynamic user model 
¡ Novel visualizations for very large datasets
74 
Thanks to 
Zakia Kazi Aoul, ISEP 
Marie-Aude Aufaure, ECP 
Fethi Belghaouti, ISEP-INT 
Georges Hébrail, EDF R&D 
Sylvain Lefebvre, ISEP 
Yousra Chabchoub, ISEP
Big Data: Volume, Variety, Velocity, Veracity, … Value
Linked Data: Web of data, Semantic Web
- A set of principles and good practices for linking, publishing and searching web data
- Structure and semantically enrich RDF data, with very high scalability -> Big Linked Data
Integrate, aggregate, analyze and visualize large data sets, whatever their type, provenance or flow speed …
Big Linked Data / Linked Big Data
Semantic Technologies Living Lab
Linked & Big Data Academic Chair
Our value proposition
– Semantic aggregation from textual and non-textual streams
– Manage semantic heterogeneity, real-time and distributed processing
– Ensure data quality and veracity
– Visual analytics

More Related Content

What's hot

Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesNatalino Busa
 
Multi media Data mining
Multi media Data miningMulti media Data mining
Multi media Data mininghome
 
Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)DheerajPachauri
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introductionamiyadash
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisVincenzo Gulisano
 
4.5 mining the worldwideweb
4.5 mining the worldwideweb4.5 mining the worldwideweb
4.5 mining the worldwidewebKrish_ver2
 
Dimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxDimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxRohanBorgalli
 
Google File System
Google File SystemGoogle File System
Google File Systemguest2cb4689
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Houw Liong The
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text MiningMichel Bruley
 
Parallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval SystemParallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval Systemvimalsura
 
Principles of Data Visualization
Principles of Data VisualizationPrinciples of Data Visualization
Principles of Data VisualizationEamonn Maguire
 
OS UNIT 1 PPT.pptx
OS UNIT 1 PPT.pptxOS UNIT 1 PPT.pptx
OS UNIT 1 PPT.pptxPRABAVATHIH
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATAGauravBiswas9
 
Distributed file system
Distributed file systemDistributed file system
Distributed file systemAnamika Singh
 

What's hot (20)

Streaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologiesStreaming computing: architectures, and tchnologies
Streaming computing: architectures, and tchnologies
 
Multi media Data mining
Multi media Data miningMulti media Data mining
Multi media Data mining
 
Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)Clustering for Stream and Parallelism (DATA ANALYTICS)
Clustering for Stream and Parallelism (DATA ANALYTICS)
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
4.5 mining the worldwideweb
4.5 mining the worldwideweb4.5 mining the worldwideweb
4.5 mining the worldwideweb
 
Resource management
Resource managementResource management
Resource management
 
Dimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptxDimension Reduction Introduction & PCA.pptx
Dimension Reduction Introduction & PCA.pptx
 
Google File System
Google File SystemGoogle File System
Google File System
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
Parallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval SystemParallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval System
 
Multimedia Mining
Multimedia Mining Multimedia Mining
Multimedia Mining
 
Principles of Data Visualization
Principles of Data VisualizationPrinciples of Data Visualization
Principles of Data Visualization
 
OS UNIT 1 PPT.pptx
OS UNIT 1 PPT.pptxOS UNIT 1 PPT.pptx
OS UNIT 1 PPT.pptx
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Chapter 3 - Processes
Chapter 3 - ProcessesChapter 3 - Processes
Chapter 3 - Processes
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Distributed file system
Distributed file systemDistributed file system
Distributed file system
 

Viewers also liked

Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
Computer Programming For Power Systems Analysts.
Computer Programming For Power Systems Analysts.Computer Programming For Power Systems Analysts.
Computer Programming For Power Systems Analysts.H. Kheir
 
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Yueshen Xu
 
Hash - A probabilistic approach for big data
Hash - A probabilistic approach for big dataHash - A probabilistic approach for big data
Hash - A probabilistic approach for big dataLuca Mastrostefano
 
Streaming from the cloud
Streaming from the cloudStreaming from the cloud
Streaming from the clouddmulford
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseAllen Day, PhD
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafkaconfluent
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream ProcessingZbigniew Jerzak
 
Migrating to Cloud - A Step by Step
Migrating to Cloud - A Step by Step Migrating to Cloud - A Step by Step
Migrating to Cloud - A Step by Step Imaginea
 

Introduction to Data streaming - 05/12/2014

  • 1. Data Streaming Raja Chiky – raja.chiky@isep.fr
  • 2. About me
      ◦ Associate professor in Computer Science, LISITE-RDI
      ◦ Research interests: data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems
      ◦ Research field: large-scale data management
          1. Real-time and distributed processing of various data sources
          2. Use of semantic technologies to add a semantic layer
          3. Recommender systems and collaborative data mining
          4. Optimizing resources in large-scale systems
      ◦ [Figure labels: heterogeneous and static data; heterogeneous and dynamic data streams; sensors]
  • 3. Outline
      ◦ Context: Big Data
      ◦ What is a data stream?
      ◦ Data stream management systems
      ◦ Basic approximate algorithms
      ◦ Big Data: distributed systems
      ◦ Semantic data streaming
      ◦ Conclusion
  • 5. Big Data: Buzzword!
  • 6. Where is all this data coming from? (08/12/2014)
  • 7. More and more connected things
  • 8. So, what is Big Data?
      ◦ Volume of data created worldwide: dawn of time to 2003: 5 EB; 2012: 2.7 ZB; 2015: 10 ZB (E)
          1 YB = 10^24 bytes; 1 ZB = 10^21 bytes; 1 EB = 10^18 bytes; 1 PB = 10^15 bytes; 1 TB = 10^12 bytes; 1 GB = 10^9 bytes
      ◦ Variety of data: radio, TV, news, e-mails, Facebook posts, tweets, blogs, photos, videos (user and paid), RSS feeds, Wikipedia, GPS data, RFID, POS scanners, …
      ◦ Velocity of data
          Walmart handles 1M transactions per hour
          Google processes 24 PB of data per day
          AT&T transfers 30 PB of data per day
          90 trillion emails are sent per year
          World of Warcraft uses 1.3 PB of storage
          Facebook, when it had a user base of 900M users, had 25 PB of compressed data
          400M tweets per day in June '12
          72 hours of video are uploaded to YouTube every minute
      ◦ Big Data elements: Volume, Variety, Velocity + Veracity (IBM): information uncertainty
      ◦ Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla
  • 9. [Figure: processing pipeline with the stages Data sources, Gathering Information, Store, User Interaction and Output. Data sources (sensors, databases) provide static data and (big) data streams. Static data goes through ETL and batch processing into a data warehouse; data streams go through semantic ETL and stream processing with continuous queries/business rules, load shedding and knowledge enrichment, and are kept in databases/triplestores (synopsis). Outputs: ad-hoc queries, analytics, real-time visual analytics, retro-action.]
  • 10. Big Data: Velocity. Sources: website logs, network monitoring, financial services, eCommerce, traffic control, weather forecasting, power consumption
  • 11. What is a data stream?
      ◦ Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”
      ◦ Massive volumes of data; items arrive at a high rate.
  • 12. Applications of data stream processing
      ◦ Data stream processing
          Process queries (compute statistics, activate alarms)
          Apply data mining algorithms
      ◦ Requirements
          Real-time processing
          One-pass processing
          Bounded storage (no complete storage of streams)
          Possibly consider several streams
  • 13. Applications of data stream processing. Let's go deeper into some examples:
      ◦ Network management
      ◦ Stock monitoring
  • 14. Network management
      ◦ Supervision of a computer network
          Improvement of network configuration (hardware, software, architecture)
          Detection of attacks
      ◦ Measurements made on routers and sent to a network supervision center
          Huge volume of data
          High rate of arrivals
  • 15. Network management: records arriving at the network supervision center
        Timestamp  Source     Destination  Duration  Bytes  Protocol
        …          …          …            …         …      …
        12342      10.1.0.2   16.2.3.7     12        20K    http
        12343      18.6.7.1   12.4.0.3     16        24K    http
        12344      12.4.3.8   14.8.7.4     26        58K    http
        12345      19.7.1.2   16.5.5.8     18        80K    ftp
        …          …          …            …         …      …
  • 16. Network management: typical queries at the network supervision center
      ◦ 100 most frequent (@S, @D) pairs on router R1 …
      ◦ How many different (@S, @D) pairs were seen on R1 but not on R2 …
      ◦ … during the last month, last week, last day, last hour?
  • 17. Stock monitoring
      ◦ Stream of price and sales volume of stocks over time
      ◦ Technical analysis/charting for stock investors
      ◦ Support trading decisions
          Notify me when the price of IBM is above $83, and the first MSFT price afterwards is below $27.
          Notify me when some stock goes up by at least 5% from one transaction to the next.
          Notify me when the price of any stock increases monotonically for ≥ 30 min.
          Notify me when the difference between the current price of a stock and its 10-day moving average is greater than some threshold value.
      ◦ Source: Gehrke 07 and Cayuga application scenarios (Cornell University)
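The second rule on this slide (a rise of at least 5% from one transaction to the next) can be sketched as a one-pass continuous query. This is only an illustration: the function name `price_jumps` and the `(symbol, price)` tick layout are assumptions, not part of the slides or of Cayuga.

```python
def price_jumps(ticks, threshold=0.05):
    """Yield (symbol, previous_price, price) whenever a stock rises by at
    least `threshold` (5% by default) from one transaction to the next.

    ticks: iterable of (symbol, price) pairs in arrival order (assumed layout).
    Only the last price per symbol is kept, so memory is bounded by the
    number of symbols, not by the length of the stream.
    """
    last_price = {}
    for symbol, price in ticks:
        previous = last_price.get(symbol)
        if previous is not None and previous > 0:
            if (price - previous) / previous >= threshold:
                yield symbol, previous, price
        last_price[symbol] = price


if __name__ == "__main__":
    ticks = [("IBM", 80.0), ("MSFT", 27.0), ("IBM", 88.0), ("MSFT", 27.5)]
    for alert in price_jumps(ticks):
        print(alert)
```

The state kept per stock is a single number, which is what makes the rule usable under the one-pass, bounded-storage requirements listed earlier.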
  • 18. Where is the problem? (1/2)
      ◦ Example: bank fraud detection [Figure: a bank withdrawal of 50 € and a bank withdrawal of 1000$, labelled with the dates 05/12/2014 and 12/05/2014]
      ◦ Join between several streams
      ◦ Join between stream data and the customer database
      ◦ Generic tools for processing streams
      ◦ Avoid the ‘Store’, ‘Compute’, ‘Delete’ approach
      ◦ Solution: incremental computation and definition of temporal windows for joins
  • 19. Where is the problem? (2/2)
      ◦ Example: the 100 most frequent source (@S) IP addresses on a router
          Maintain a table of IP addresses with frequencies?
          Sample the stream?
      ◦ Face a high (and varying) rate of arrivals
      ◦ Exact versus approximate answers
  • 20. Examples of queries [Figure: frequency histogram over the elements 1 to 20]
      ◦ Find elements with frequency > 0.1%
      ◦ Top-k
      ◦ Frequency of the element 3
      ◦ Total frequency of elements between 8 and 14
      ◦ Number of elements having a non-zero frequency
  • 21. Data Stream Management Systems: DBMS vs. DSMS
      ◦ Data model
          DBMS: permanent updatable relations
          DSMS: streams and permanent updatable relations
      ◦ Storage
          DBMS: data is stored on disk
          DSMS: permanent relations are stored on disk; streams are processed on the fly
      ◦ Query
          DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries)
          DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries)
      ◦ Performance
          DBMS: large volumes of data
          DSMS: optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing
  • 22. Generic DSMS architecture (Golab & Özsu, 2003) [Figure: streaming inputs enter an input monitor; a query processor works with working storage, summary storage and static storage and uses a query repository fed by user queries; results leave through an output buffer as streaming outputs; static data receives updates]
  • 24. How to deal with Big Data streams: distribution, DSMS, sampling/load shedding/sketch/…
  • 25. Approximate answers to queries
      ◦ When?
          Queries needing unbounded memory
          Too many queries, too rapid streams, or too strict response-time requirements
          CPU limit
          Memory limit
      ◦ Solution: approximate answers to queries
          Sliding windows
          Sampling and load shedding
          Definition of synopses
  • 26. Approaches: two ways of handling such streams
      ◦ Use a time window, and query the window as a static table
      ◦ When you can't store the collected data, or to keep track of historical data: sampling, filtering, counting
  • 27. Windowing
      ◦ Applying queries/mining tasks to the whole stream (from the beginning to the current time)
      ◦ Applying queries/mining tasks to a portion of the stream
      ◦ [Figure: time axis t from the beginning of the stream to the current date, with a window on the stream]
  • 28. Windowing
      ◦ Definition of windows of interest on streams
          Fixed windows: September 2014
          Sliding windows: last 3 hours
          Landmark windows: from September 1st, 2014
      ◦ Window specification
          Physical: last 3 hours
          Logical: last 1000 items
      ◦ Refreshing rate
          Rate of producing results (every item, every 10 items, every minute, …)
  • 29. Sliding window. Example query: give me the last room where Axel has been in the last 10 minutes, updating results every minute. [Figure: time axis t from the beginning of the stream; overlapping windows ending at the refreshment times tc and t'c each produce results]
  • 30. Sliding window vs. tumbling window. Example query: give me the last room where Axel has been in the last 10 minutes, updating results every 10 minutes. [Figure: time axis t from the beginning of the stream; non-overlapping windows ending at the refreshment times tc and t'c each produce results]
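The difference between the two refresh policies can be sketched with a single window operator in which only the refresh period changes. The function name `last_room_per_refresh`, the `(timestamp, room)` event layout, and the choice to emit a result only once an event at or after the refresh time has arrived are assumptions made for this illustration.

```python
from collections import deque


def last_room_per_refresh(events, window=600, slide=60):
    """Yield (refresh_time, last_room) for a time-based window.

    events: iterable of (timestamp_in_seconds, room), ordered by timestamp.
    window: window length in seconds (10 minutes in the slide example).
    slide:  refresh period in seconds. slide < window gives a sliding
            window; slide == window gives a tumbling window.
    A result covers the events with refresh_time - window < ts < refresh_time.
    """
    buffer = deque()          # events still inside the window
    next_refresh = None
    for ts, room in events:
        if next_refresh is None:
            next_refresh = ts + slide
        # Emit every refresh that is due before this event is taken into account.
        while ts >= next_refresh:
            while buffer and buffer[0][0] <= next_refresh - window:
                buffer.popleft()          # expire old events
            yield next_refresh, (buffer[-1][1] if buffer else None)
            next_refresh += slide
        buffer.append((ts, room))


if __name__ == "__main__":
    moves = [(0, "A"), (30, "B"), (100, "C"), (130, "D")]
    print(list(last_room_per_refresh(moves, window=600, slide=60)))    # sliding
    print(list(last_room_per_refresh(moves, window=100, slide=100)))   # tumbling
```

Only the events inside the current window are buffered, so memory depends on the window length and arrival rate rather than on the total length of the stream.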
  • 31. Sampling from a data stream
      ◦ Inputs
          Sample size k
          (Optionally) window size n >> k (alternatively, time duration m)
          Stream of data elements that arrive online
      ◦ Output
          k elements chosen uniformly at random from the last n elements (alternatively, from all elements that have arrived in the last m time units)
      ◦ Goal
          Maintain a data structure that can produce the desired output at any time upon request
      ◦ Challenge
          We don't know how long the stream is, so when and how often should we sample?
  • 32. A simple, unsatisfying approach
      ◦ Choose a random subset X = {x1, …, xk}, X ⊂ {0, 1, …, n-1}
      ◦ The sample always consists of the non-expired elements whose indexes are equal to x1, …, xk (modulo n)
      ◦ Only uses O(k) memory
      ◦ Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic
      ◦ Unsuitable for many real applications, particularly those with periodicity in the data
  • 33. Reservoir sampling
      ◦ Classic online algorithm due to Vitter (1985)
      ◦ Maintains a fixed-size uniform random sample
      ◦ The size of the data stream need not be known in advance
      ◦ Data structure: “reservoir” of k data elements
      ◦ As the ith data element arrives: add it to the reservoir with probability p = k/i, discarding a randomly chosen data element from the reservoir to make room
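A minimal Python sketch of the reservoir update rule. The function name `reservoir_sample` and the optional `rng` parameter are illustrative choices; filling the reservoir with the first k elements follows from p = k/i ≥ 1 while i ≤ k.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return a uniform random sample of k elements from a stream of
    unknown length, using O(k) memory and a single pass."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k elements enter the reservoir (p = k/i >= 1).
            reservoir.append(item)
        elif rng.random() < k / i:
            # Keep the ith element with probability k/i and evict a
            # uniformly chosen element to make room.
            reservoir[rng.randrange(k)] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), k=5))
```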
  • 34. Chain sampling
      ◦ Chain sampling is a method for sequence-based windows.
      ◦ When the ith element arrives, it is chosen to become the sample with probability 1/min(i, n).
      ◦ If the ith element is chosen as the sample, the algorithm also selects the index of the element that will replace it when it expires (assuming that it is still in the sample when it expires). This index is picked uniformly at random from the range i+1 … i+n, the indexes of the elements that will be active when the ith element expires.
      ◦ When the element with the selected index arrives, the algorithm stores it in memory and chooses the index of the element that will replace it when it expires, and so on, building a chain of elements to use when the current element in the sample expires.
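A minimal Python sketch of chain sampling for a sample of size one over the last n elements. It assumes the selection probability 1/min(i, n) of the standard chain-sampling formulation; the class name `ChainSampler` and its methods are illustrative.

```python
import random
from collections import deque


class ChainSampler:
    """Maintain one uniform random sample over the last n stream elements."""

    def __init__(self, n, rng=random):
        self.n = n
        self.rng = rng
        self.i = 0                # index of the latest element (1-based)
        self.chain = deque()      # (index, value); chain[0] is the sample
        self.next_index = None    # index of the element awaited as successor

    def add(self, value):
        self.i += 1
        i, n = self.i, self.n
        if self.rng.random() < 1.0 / min(i, n):
            # The new element becomes the sample; the old chain is dropped.
            self.chain = deque([(i, value)])
            self.next_index = self.rng.randint(i + 1, i + n)
        elif i == self.next_index:
            # The awaited successor arrived: extend the chain.
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + n)
        # Expire the head once it leaves the window; its successor takes over.
        while self.chain and self.chain[0][0] <= i - n:
            self.chain.popleft()

    def sample(self):
        return self.chain[0][1] if self.chain else None


if __name__ == "__main__":
    sampler = ChainSampler(n=5)
    for x in [3, 5, 1, 4, 6, 2, 8, 5, 2, 3]:
        sampler.add(x)
        print(x, "->", sampler.sample())
```

A sample of k elements can be obtained by running k independent instances, at the cost of possible duplicates.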
  • 35. Chain sampling [Figure: four successive steps of chain sampling on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3]
  • 36. Another simple approach: oversampling
      ◦ As each element arrives, remember it with probability p = (c·k/n)·log n; otherwise discard it
      ◦ Discard elements when they expire
      ◦ When asked to produce a sample, choose k elements at random from the set in memory
      ◦ Expected memory usage of O(k log n)
      ◦ The algorithm can fail if fewer than k elements from a window are remembered; however, with high probability this will not happen
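The oversampling steps above can be sketched as follows. The reading p = (c·k/n)·log n with a natural logarithm, the cap of p at 1, the default c = 2, and the class name `OversampleWindow` are all assumptions for this illustration.

```python
import math
import random
from collections import deque


class OversampleWindow:
    """Keep each element with probability p and draw k of the kept,
    non-expired elements on request (sequence-based window of size n)."""

    def __init__(self, n, k, c=2.0, rng=random):
        self.n, self.k, self.rng = n, k, rng
        # Assumed reading of the slide's formula, capped at 1.
        self.p = min(1.0, c * k * math.log(n) / n)
        self.i = 0
        self.kept = deque()   # (index, value) of remembered elements

    def add(self, value):
        self.i += 1
        if self.rng.random() < self.p:
            self.kept.append((self.i, value))
        # Discard remembered elements that have left the window.
        while self.kept and self.kept[0][0] <= self.i - self.n:
            self.kept.popleft()

    def sample(self):
        """Return k values, or None if fewer than k were remembered."""
        if len(self.kept) < self.k:
            return None
        return [v for _, v in self.rng.sample(list(self.kept), self.k)]


if __name__ == "__main__":
    w = OversampleWindow(n=1000, k=5)
    for x in range(10_000):
        w.add(x)
    print(w.sample())
```

Returning `None` makes the failure case explicit instead of silently returning a smaller sample.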
  • 37. Min-wise sampling (Nath et al. 2004)
      ◦ For each item, pick a random fraction between 0 and 1 as its tag (e.g. 0.391, 0.908, 0.291, 0.555, 0.619, 0.273)
      ◦ Store the item(s) with the smallest random tag
      ◦ Each item has the same chance of having the least tag, so the sample is uniform
      ◦ Can run on multiple streams separately, then merge
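A minimal sketch of min-wise sampling for a sample of size one, including the merge of summaries computed on separate streams. The function names `minwise_sample` and `merge` and the `(tag, item)` summary layout are assumptions for this illustration.

```python
import random


def minwise_sample(stream, rng=random):
    """Return (tag, item) for the item that drew the smallest random tag."""
    best_tag, best_item = None, None
    for item in stream:
        tag = rng.random()                 # random fraction in [0, 1)
        if best_tag is None or tag < best_tag:
            best_tag, best_item = tag, item
    return best_tag, best_item


def merge(a, b):
    """Merge two (tag, item) summaries by keeping the smaller tag.
    A summary of an empty stream is (None, None)."""
    if a[0] is None:
        return b
    if b[0] is None:
        return a
    return a if a[0] <= b[0] else b


if __name__ == "__main__":
    left = minwise_sample(["a", "b", "c"])
    right = minwise_sample(["d", "e"])
    print(merge(left, right))   # uniform sample over all five items
```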
  • 38. Hash function: revision
      ◦ Hash function: any algorithm that maps data of arbitrary length to data of a fixed length. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
  • 39. Stream filtering
      ◦ Bloom filter: a compact data structure by B. H. Bloom (1970)
          A bit array of size m
          A family H of k hash functions (MD5, SHA256, Murmur)
          A set S of n items
      ◦ False positive probability f = (1 - e^(-kn/m))^k
      ◦ Example: m = 10, hash = MD5, k = 3
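A minimal Bloom filter sketch with the false-positive formula from the slide. Deriving the k hash functions by salting SHA-256 with the function index is an assumption made here for brevity (the slide lists MD5, SHA256 and Murmur as candidates); the names `BloomFilter` and `false_positive_rate` are illustrative.

```python
import hashlib
import math


class BloomFilter:
    """Bit array of size m with k hash functions; no false negatives."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # k hash functions simulated by salting one hash with the index j.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(m, k, n):
    """f = (1 - e^(-kn/m))^k for n inserted items."""
    return (1 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    bf = BloomFilter(m=1000, k=3)
    bf.add("10.1.0.2")
    print("10.1.0.2" in bf, "18.6.7.1" in bf)
    print(false_positive_rate(m=1000, k=3, n=100))
```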
  • 40. Sketches
      ◦ Sketch
          Synopsis structure taking advantage of high volumes of data
          Provides an approximate result with probabilistic bounds
          Random projections on smaller spaces (hash functions)
      ◦ Many sketch structures exist, usually dedicated to a specialized task
      ◦ Examples of sketch structures
          COUNT (Flajolet 85)
          COUNT SKETCH (Charikar et al. 04)
  • 41. Difference between sampling and sketching
      ◦ A sample sees only those items which were selected to be in the sample, whereas a sketch sees the entire input but is restricted to retaining only a small summary of it
      ◦ Not every problem can be solved with sampling
          Example: counting how many distinct items there are in the stream
          If a large fraction of the items aren't sampled, we don't know whether they are all the same or all different
  • 42. Sketches: COUNT (Flajolet-Martin algorithm)
      ◦ Goal
          Number N of distinct values in a stream (for large N)
          Example: number of distinct IP addresses going through a router
      ◦ Sketch structure
          SK: L bits initialized to 0, e.g. 0 0 0 0 0 0 0 0
          H: hash function transforming an element of the stream into L bits, e.g. H(18.6.7.1) = 0 0 1 0 1 0 1 0
          H distributes the elements of the stream uniformly over the 2^L possibilities
  • 43. Sketches: COUNT, method
      ◦ Maintenance and update of SK. For each new element e:
          Compute H(e)
          Select the position of the leftmost 1 in H(e)
          Force this position to 1 in SK
      ◦ Example
          SK          = 0 1 0 0 1 0 0 1
          H(18.6.7.1) = 0 0 1 0 1 0 1 0
          New SK      = 0 1 1 0 1 0 0 1
  • 44. Sketches: COUNT, result
      ◦ Select the position R (0 … L-1) of the leftmost 0 in SK
      ◦ E(R) = log2(φ·N) with φ = 0.77351…, so the count is estimated by 2^R/φ
      ◦ σ(R) = 1.12
      ◦ Example: SK = 1 1 1 0 1 0 0 0 gives R = 3
      ◦ To improve the accuracy of this approximation algorithm, use multiple hash functions and use the average R instead.
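The maintenance and result steps of the COUNT sketch can be put together in a short Python sketch. It uses a single hash function; the use of SHA-256 as H, the default L = 32, and the function names are assumptions for this illustration. Positions are counted from the left, starting at 0, as in the slides.

```python
import hashlib

PHI = 0.77351


def fm_hash(item, L=32):
    """Hash an element to an L-bit integer (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << L)


def leftmost_one(h, L):
    """Position (0 = leftmost) of the first 1 in the L-bit string of h,
    or None if h has no bit set."""
    return L - h.bit_length() if h else None


def estimate_from_sketch(sk):
    """R is the position of the leftmost 0 in SK; the estimate is 2^R / phi."""
    R = sk.index(0) if 0 in sk else len(sk)
    return (2 ** R) / PHI


def fm_estimate(stream, L=32):
    """Estimate the number of distinct elements in one pass with L bits."""
    sk = [0] * L
    for e in stream:
        pos = leftmost_one(fm_hash(e, L), L)
        if pos is not None:
            sk[pos] = 1
    return estimate_from_sketch(sk)


if __name__ == "__main__":
    print(estimate_from_sketch([1, 1, 1, 0, 1, 0, 0, 0]))   # R = 3
    print(fm_estimate(f"10.0.{i // 256}.{i % 256}" for i in range(50_000)))
```

With a single hash function the estimate is always of the form 2^R/φ, so it can be off by a factor of two or more, which is why the slide recommends averaging R over several hash functions.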
  • 45. Sketches: COUNT SKETCH algorithm (Charikar et al. 2004)
      ◦ Goal
          k most frequent elements in a stream (for a large number N of distinct values)
          Example: 100 most frequent IP addresses going through a router
      ◦ [Figure: frequency histogram f(0) … f(4) for the input stream 2, 0, 1, 3, 1, 2, 4, …]
  • 46. Sketches [Figure: the sketch drawn as an array of signed counters with t rows and B columns; an arriving element e adds +1 or -1 to one counter in each row]
  • 47. Sketches: COUNT SKETCH, structure
      ◦ Sketch structure
          h: hash function from [0, …, N-1] to [1, …, B]
          s: hash function from [0, …, N-1] to {+1, -1}
          Array of B counters: C[1], …, C[B] (with B << N)
      ◦ Sketch maintenance when e arrives: C[h(e)] += s(e)
      ◦ Use of the sketch
          Estimation of the frequency of object e: n_e ≈ C[h(e)] · s(e)
          In practice, t hash functions h_j and t hash functions s_j are used: n_e ≈ median over j ∈ [1…t] of C_j[h_j(e)] · s_j(e)
      ◦ Theoretical results bound the error depending on N, t and B.
  • 48. Sketches: COUNT SKETCH, algorithm
      ◦ Maintain a list (e1, e2, …, ek) of the current k most frequent elements
      ◦ For a new arriving element e:
          Add e to the sketch structure
          Estimate the frequency of e from the sketch structure
          If f(e) > f(ek), remove ek and insert e into the list
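The sketch structure and the top-k maintenance loop can be sketched together in Python. Deriving the t pairs of hash functions from salted SHA-256 digests, the defaults t = 5 and B = 256, and the names `CountSketch` and `top_k` are assumptions for this illustration; buckets are indexed from 0 rather than 1.

```python
import hashlib
from statistics import median


class CountSketch:
    """t rows of B signed counters; frequency estimate = median over rows."""

    def __init__(self, t=5, B=256):
        self.t, self.B = t, B
        self.C = [[0] * B for _ in range(t)]

    def _hash(self, j, e):
        # Row j uses bucket hash h_j and sign hash s_j, both taken from one
        # salted digest (an assumed construction).
        d = hashlib.sha256(f"{j}:{e}".encode("utf-8")).digest()
        bucket = int.from_bytes(d[:8], "big") % self.B
        sign = 1 if d[8] & 1 else -1
        return bucket, sign

    def add(self, e):
        for j in range(self.t):
            bucket, sign = self._hash(j, e)
            self.C[j][bucket] += sign

    def estimate(self, e):
        values = []
        for j in range(self.t):
            bucket, sign = self._hash(j, e)
            values.append(self.C[j][bucket] * sign)
        return median(values)


def top_k(stream, k, sketch):
    """Track the k elements with the largest estimated frequency."""
    top = {}   # element -> estimated frequency at its last arrival
    for e in stream:
        sketch.add(e)
        f = sketch.estimate(e)
        if e in top or len(top) < k:
            top[e] = f
        else:
            weakest = min(top, key=top.get)
            if f > top[weakest]:
                del top[weakest]
                top[e] = f
    return sorted(top, key=top.get, reverse=True)


if __name__ == "__main__":
    stream = ["a"] * 50 + ["b"] * 30 + [str(i) for i in range(20)]
    print(top_k(stream, 2, CountSketch()))
```

Memory is t·B counters plus the k tracked elements, independent of the number N of distinct values.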
  • 49. Distributed systems answering today's Big Data needs: S4, Storm, SAMOA, …
  • 50. Yahoo S4
      ◦ Simple Scalable Streaming System
      ◦ Uses a decentralized and symmetric architecture
          All nodes are similar; there is no master node
      ◦ Based on Processing Elements (PEs)
      ◦ Each PE has 4 components:
          1. Functionality defined by a PE class and associated configuration: input event handler processEvent(), output mechanism output()
          2. The types of events that it consumes
          3. The keyed attribute in those events
          4. The value of the keyed attribute in the events which it consumes
      ◦ Several PEs are available for standard tasks such as count, aggregate, join, …
      ◦ Custom PEs can easily be programmed
      ◦ The PEs are automatically deployed on the Processing Nodes (machines), which:
          Ensure load balancing of events
          Notify the appropriate PEs when an event comes in
  • 52. Twitter Storm
      ◦ Same principle as S4, with a simplified programming model
      ◦ Storm provides real-time computation
          Scalable
          Guarantees no data loss
          Extremely robust and fault-tolerant
          Programming-language agnostic
      ◦ Concepts
          Streams
          Spouts
          Bolts
          Topologies
  • 53. Lambda architecture (by Nathan Marz): a generic, scalable and fault-tolerant data processing architecture [Figure: the data flow feeds a batch layer (batch processing, precomputed views) and a speed layer (real-time stream processing); the serving layer answers queries. Source: Mathieu DESPRIEE (USI)]
  • 54. Lambda architecture (http://lambda-architecture.net/)
      1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
      2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
      3. The serving layer indexes the batch views so that they can be queried in an ad-hoc way.
      4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
      5. Any incoming query can be answered by merging results from batch views and real-time views.
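The five points can be illustrated with a toy in-memory counter. This is a sketch of the data flow only, under the assumption that the query is a simple count per key; the class name `LambdaCounter` and its methods are hypothetical and not tied to any real framework.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture for the query 'how many events had this key?'."""

    def __init__(self):
        self.master = []                 # immutable, append-only raw data
        self.batch_view = Counter()      # precomputed by the batch layer
        self.realtime_view = Counter()   # speed layer: recent data only

    def ingest(self, key):
        # 1. Every event is dispatched to both layers.
        self.master.append(key)          # batch layer input
        self.realtime_view[key] += 1     # speed layer update

    def run_batch(self):
        # 2./3. Recompute the batch view from the whole master dataset;
        # the recent data it now covers is dropped from the real-time view.
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key):
        # 5. Merge results from the batch view and the real-time view.
        return self.batch_view[key] + self.realtime_view[key]


if __name__ == "__main__":
    system = LambdaCounter()
    for k in ["a", "a", "b"]:
        system.ingest(k)
    system.run_batch()
    system.ingest("a")
    print(system.query("a"))
```

In a real deployment the batch recomputation runs with high latency on a distributed store, which is exactly the gap the speed layer covers in point 4.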
  • 55. Big Data stream mining: machine learning tools
      ◦ Distributed
          Batch: Hadoop, Mahout
          Stream: S4, Storm, SAMOA
      ◦ Non-distributed
          Batch: R, WEKA, …
          Stream: MOA
  • 56. What is SAMOA?
      ◦ New software framework for mining distributed data streams
      ◦ Big Data mining for evolving streams in real time
  • 57. SAMOA architecture [Figure: SAMOA provides classifier methods, clustering methods and frequent pattern mining on top of S4, Storm, …]
      ◦ Use S4, Storm, or another distributed stream processing platform
      ◦ Use MOA, or another streaming machine learning library
      ◦ Easy to extend through packages
  • 59. Too many data streams
  • 60. Types of data used in Big Data initiatives ¡ [Figure labels: internal data, traditional sources, « new data »] Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
  • 61. BI: New Generation ¡ [Figure labels: heterogeneous and static data; heterogeneous and dynamic data streams (sensors); interconnection; ontologies; semantic data streams; semantic filtering and continuous queries; decision (monitoring, alerts, statistics, fault detection, etc.); retro-action]
  • 62. Semantic Web technologies for data streams ¡ Annotate stream data with semantic metadata ¡ Apply Linked Data principles to publish streaming data ¡ Interlink streaming data with existing datasets ¡ Integrate data stream processing + reasoning ¡ Objectives: interoperability, automation, enrichment
  • 63. Existing prototypes ¡ CQELS ¡ SPARQL-Stream ¡ EP-SPARQL ¡ C-SPARQL
  • 64. Approaches ¡ RDF stream models (timestamps, events, time intervals, triple-based, graph-based …)
  • 65. Example: SRBench ¡ Q: Detect if a hurricane has been observed. “A hurricane has a sustained wind (for more than 3 hours) of at least 33 metres per second or 74 miles per hour (119 km/h).” ¡ Query 1: ASK WHERE { STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 10800s SLIDE 600s] {?observation om-owl:procedure ?sensor ; om-owl:observedProperty weather:WindSpeed ; om-owl:result [ om-owl:floatValue ?value ] .} } GROUP BY ?sensor HAVING ( AVG(?value) >= "74"^^xsd:float ) ¡ Query 2: ASK FROM STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 3h STEP 10m] WHERE { ?observation om-owl:procedure ?sensor ; om-owl:observedProperty weather:WindSpeed ; om-owl:result [ om-owl:floatValue ?value ] . } GROUP BY ?sensor HAVING ( AVG(?value) >= "74"^^xsd:float )
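Both queries express the same windowed computation: every 10 minutes, average the wind speed per sensor over the last 3 hours and test it against 74. A minimal Python sketch of that logic follows. The tuple layout `(timestamp_s, sensor, value)`, the function names and the half-open window `(end - 3h, end]` are assumptions made for illustration; the actual window semantics depend on the engine that runs the query.

```python
from collections import defaultdict

RANGE_S = 3 * 3600   # window length, as in [RANGE 10800s] / [RANGE 3h]
SLIDE_S = 10 * 60    # evaluation period, as in [SLIDE 600s] / [STEP 10m]
THRESHOLD = 74.0     # same constant as in the HAVING clause


def hurricane_sensors(observations, window_end):
    """Sensors whose average value over (window_end - RANGE_S, window_end] >= THRESHOLD."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, sensor, value in observations:
        if window_end - RANGE_S < timestamp <= window_end:
            sums[sensor] += value
            counts[sensor] += 1
    # GROUP BY ?sensor HAVING AVG(?value) >= threshold
    return sorted(s for s in sums if sums[s] / counts[s] >= THRESHOLD)


def evaluate(observations, first_end, last_end):
    """Re-evaluate the window every SLIDE_S seconds between first_end and last_end."""
    return [
        (end, hurricane_sensors(observations, end))
        for end in range(first_end, last_end + 1, SLIDE_S)
    ]


if __name__ == "__main__":
    readings = [(t, "A", 80.0) for t in range(600, 11401, 600)]
    readings += [(t, "B", 50.0) for t in range(600, 11401, 600)]
    for end, sensors in evaluate(readings, 10800, 11400):
        print(end, sensors)
```

A streaming engine would maintain the sums incrementally instead of rescanning the observations; the rescan keeps the sketch short.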
  • 66. How to deal with Big Data Streams ¡ Cloud Computing, DSMS: [1] and [2] ¡ Sampling / Load Shedding: [3] ¡ [1] J. Hoeksema & S. Kotoulas: High-performance Distributed Stream Reasoning using S4 (ISWC 2011) [2] D. L. Phuoc et al.: Elastic and Scalable Processing of Linked Stream Data in the Cloud (ISWC 2013) [3] N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues (DaEng 2013)
  • 67. Sampling Extensions for continuous SPARQL ¡ PREFIX vocab: <http://data-gov.tw.rpi.edu/vocab/p/8/> SELECT ?val WHERE { STREAM <C:/CQELS/streams/data-8.stream>[NOW][UNISAMPLING %80] {?rawData vocab:ozone_8hr_daily_max ?val } GRAPH <C:/CQELS/test.rdf> {?user sioc:account_of ?person} } ¡ Operators: [UNISAMPLING %{Sampling Percentage}] [RESSAMPLING %{Reservoir Size}] [CHNSAMPLING %{Window Size}] ¡ Source: N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues (DaEng 2013)
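The first two operators name classic stream sampling techniques. The sketch below shows textbook versions of them, assuming that UNISAMPLING keeps each element independently with the given percentage and that RESSAMPLING is standard reservoir sampling (Algorithm R). How the operators are actually implemented inside CQELS is not stated on the slide, so this should be read as background rather than as the authors' code; chain sampling is left out to keep the example short.

```python
import random


def uniform_sample(stream, percentage, rng):
    """Keep each element independently with probability percentage / 100."""
    return [item for item in stream if rng.random() * 100 < percentage]


def reservoir_sample(stream, size, rng):
    """Algorithm R: a uniform sample of `size` elements from a stream of unknown length."""
    reservoir = []
    for index, item in enumerate(stream):
        if index < size:
            reservoir.append(item)        # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # uniform over 0..index, both ends included
            if slot < size:
                reservoir[slot] = item    # replace with probability size / (index + 1)
    return reservoir


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that runs are repeatable
    values = list(range(1000))
    print(len(uniform_sample(values, 80, rng)))
    print(reservoir_sample(values, 5, rng))
```

Uniform sampling bounds the expected fraction of data kept, while reservoir sampling bounds the memory used, which is why both are useful when a stream engine is overloaded or short on storage.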
  • 68. Semantic Data Stream Load Shedding (1/2) ¡ [Figure: RDF graph of an air-temperature observation and a relative-humidity observation (type, observedProperty, procedure, result, samplingTime, floatValue, uom), with four nodes labelled (a) to (d) and two triples, (1) and (2), drawn as dotted lines] ¡ RDF triple approach: effect of deleting triples (1) and (2) from the graph: - Deleting the first RDF triple destroys the link connecting node (a) to node (b) - Deleting the second RDF triple destroys the link connecting node (c) to node (d) => Nodes (b) and (d), and all nodes connected to them, become unreachable even though their data is still in memory. This represents 6 RDF triples out of 18, i.e. 33% of the data is unusable.
  • 69. Semantic Data Stream Load Shedding (2/2) ¡ RDF graph approach: delete whole sub-graphs, such as those formed by node (b) or (d) and all the nodes to which they are directly connected. => Preserves the semantic level of the information => Protects the data consistency of the whole graph => Enhances semantic data stream systems
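The difference between dropping single triples and dropping whole sub-graphs can be checked on a toy graph. The six triples below are a hypothetical, much smaller graph than the 18-triple one on the slides, and the helper names are invented for this sketch; it only illustrates why triple-level shedding can leave unreachable triples in memory while graph-level shedding does not.

```python
def reachable_triples(triples, roots):
    """Triples reachable from the root nodes by following subject -> object links."""
    by_subject = {}
    for triple in triples:
        by_subject.setdefault(triple[0], []).append(triple)
    seen = set(roots)
    stack = list(seen)
    kept = []
    while stack:
        node = stack.pop()
        for triple in by_subject.get(node, []):
            kept.append(triple)
            if triple[2] not in seen:
                seen.add(triple[2])
                stack.append(triple[2])
    return kept


def orphan_triples(triples, roots):
    """Triples still stored but no longer reachable from any root."""
    reachable = set(reachable_triples(triples, roots))
    return [t for t in triples if t not in reachable]


def shed_triple(triples, victim):
    """Triple approach: drop one triple and nothing else."""
    return [t for t in triples if t != victim]


def shed_subgraph(triples, node):
    """Graph approach: drop every triple that has `node` as subject or object."""
    return [t for t in triples if t[0] != node and t[2] != node]


GRAPH = [
    ("obs", "type", "TemperatureObservation"),
    ("obs", "result", "measure"),
    ("measure", "floatValue", "72.0"),
    ("measure", "uom", "fahrenheit"),
    ("obs", "samplingTime", "instant"),
    ("instant", "inXSDDateTime", "2004-08-08T06:25:00"),
]

if __name__ == "__main__":
    after_triple = shed_triple(GRAPH, ("obs", "result", "measure"))
    after_graph = shed_subgraph(GRAPH, "measure")
    print(len(orphan_triples(after_triple, ["obs"])))
    print(len(orphan_triples(after_graph, ["obs"])))
```

In the toy graph, removing the single `result` triple frees one triple but strands the two triples that describe the measurement, whereas removing the sub-graph around `measure` frees three triples and strands none.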
  • 70. Conclusion: Big Data Stream challenges ¡ Semantic Information aggregation ¡ Information aggregation: “too much data to assimilate but not enough knowledge to act” ¡ Distributed and real-time processing ¡ Design of real-time and distributed algorithms for stream processing and information aggregation ¡ Distribution and parallelization of data mining algorithms ¡ Visual analytics and user modeling ¡ Dynamic user model ¡ Novel visualizations for very large datasets
  • 71. Thanks to ¡ Zakia Kazi Aoul, ISEP ¡ Marie-Aude Aufaure, ECP ¡ Fethi Belghaouti, ISEP-INT ¡ Georges Hébrail, EDF R&D ¡ Sylvain Lefebvre, ISEP ¡ Yousra Chabchoub, ISEP
  • 72. Linked & Big Data Academic Chair ¡ Semantic Technologies Living Lab ¡ Big Data (Volume, Variety, Velocity, Veracity, … Value): integrate, aggregate, analyze and visualize large data sets, whatever their type, provenance or flow speed … ¡ Linked Data (Web of data, Semantic Web): a set of principles and good practices for linking, publishing and searching web data; structure and semantically enrich RDF data, with very high scalability -> Big Linked Data ¡ Big Linked Data, Linked Big Data ¡ Our value proposition: – Semantic aggregation from textual and non-textual streams – Manage semantic heterogeneity, real-time and distributed processing – Ensure data quality and veracity – Visual analytics