Advanced data science algorithms applied to scalable stream processing by David Piris and Ignacio García
2. Advanced data science algorithms
applied to scalable stream processing
David Piris Valenzuela
Nacho García Fernández
Ignacio.g.Fernandez@treelogic.com
@0xNacho
david.piris@treelogic.com
@davidpiris
3. About Treelogic
R&D-intensive company with the mission of adapting technological knowledge to
improve quality standards in our daily lives
8 ongoing H2020 projects (coordinating 3 of them)
8 ongoing FP7 projects (coordinating 5 of them)
Focused on providing Big Data analytics worldwide
Internal organization
Research lines
Big Data
Computer vision
Data science
Social Media Analysis
Security
ICT solutions
Security & Safety
Justice
Health
Transport
Financial Services
ICT tailored solutions
4. CONTENTS
1. WHY WE NEED BIG DATA
2. BIG DATA: SOLUTIONS
3. BIG DATA: REAL-TIME PROCESSING
4. INCREMENTAL ALGORITHMS
5. WHAT WE WANT
6. WHAT WE NEED
1. A stream processing engine
2. Online incremental algorithms
3. A distributed data storage system
4. A use case
5. A visualization layer
7. Why we need Big Data
Public- and private-sector companies store a huge amount of data
Countries with huge databases store data from
Population
Medical records
Taxes
Online transactions
Mobile transactions
Social Networks
Twitter alone generates 12 TB of tweets in a single day!
8. Why we need Big Data
2.5 exabytes of data are produced every day, the equivalent of:
530,000,000 songs
150,000,000 iPhones
5 million laptops
90 years of HD video
11. Big Data: Solutions
First, we can manage the whole historical repository and retrieve value from
the stored data
Batch architecture
MapReduce
Hadoop ecosystem
13. Big Data: Solutions
Batch processing with Hadoop takes a long time; the need to process
ingested data and display results as quickly as possible brought new
architectures and tools
Lambda architecture
Spark (memory vs. disk)
16. Big data: real-time processing
Faster results
Accurate results
Lower cost
Satisfied consumers
17. Big data: real-time processing
As previously said, we need to extract and visualize information in near real time…
18. Big data: real-time processing
Flink as the processing engine
Stream processing
Windowing with event-time semantics
Both streaming and batch processing
19. Big data: real-time processing
Kappa architecture
Batch layer removed
Only one codebase needs to be maintained
20. Big data: real-time processing
No need for a batch layer
Avoid using disk in the processing engine (latency)
23. Incremental algorithms
BI & BA people always want to perform common operations to retrieve
value from and visualize data
We have operational tools for relational or batch environments
How can we obtain the average of a data stream that changes every
second, every minute, or even every millisecond?
The common average operation is meant for a historical repository: input data
that does not change once we start the computation.
Do we have tools to make this possible in a real-time deployment?
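One answer is to maintain the aggregate incrementally: instead of re-scanning history, update a running result with each new element. A minimal sketch of an online mean (plain Python for illustration, not Flink code):

```python
class OnlineMean:
    """Running average updated one element at a time.

    Uses the recurrence mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n,
    so no past elements need to be stored.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x: float) -> float:
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


stream = [10.0, 20.0, 30.0]
m = OnlineMean()
for x in stream:
    avg = m.update(x)
# avg == 20.0, the same as sum(stream)/len(stream), but computed incrementally
```

Each update is O(1) in time and memory, which is exactly what a stream that changes every few milliseconds requires.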
25. Incremental algorithms
Flink gives us the chance to operate with a new window-processing concept:
we can decide on and configure "small time pieces", and perform
operations or manipulate data within that time span.
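To illustrate the idea, a tumbling window groups events into fixed time slices and aggregates each slice independently. This is a hand-rolled sketch in plain Python, not Flink's actual windowing API:

```python
from collections import defaultdict

def tumbling_window_avg(events, window_ms):
    """Average the values of timestamped events per fixed-size window.

    events: iterable of (timestamp_ms, value) pairs.
    Returns {window_start_ms: average of values in that window}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in events:
        start = (ts // window_ms) * window_ms  # the window this event falls into
        sums[start] += value
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sums}


events = [(0, 1.0), (400, 3.0), (600, 10.0), (900, 20.0)]
result = tumbling_window_avg(events, window_ms=500)
# window [0, 500) -> (1.0 + 3.0) / 2 = 2.0; window [500, 1000) -> 15.0
```

A real engine additionally handles out-of-order events and decides when a window can be closed; the grouping logic, however, is essentially this.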
27. Incremental algorithms
These algorithms consume streams of data and are able to update their
results in parallel, without the need to store the processed data
Using checkpoints in windowing allows us to store the result of the previous
window
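Parallel updates work because the per-worker state is mergeable: each worker keeps a small summary (here a count and a sum), and summaries combine associatively into the global result. Again a plain-Python sketch rather than Flink code:

```python
def local_summary(values):
    """Per-partition state for an average: (count, sum)."""
    count, total = 0, 0.0
    for v in values:
        count += 1
        total += v
    return count, total

def merge(a, b):
    """Combine two partial summaries; associative and commutative."""
    return a[0] + b[0], a[1] + b[1]

# Two workers process disjoint parts of the stream...
s1 = local_summary([1.0, 2.0, 3.0])
s2 = local_summary([4.0, 5.0])
# ...and their summaries merge into the global average.
count, total = merge(s1, s2)
global_avg = total / count  # identical to averaging all five values at once
```

The same (count, sum) pair is also what a window checkpoint needs to persist so the next window can continue from the previous result.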
32. Incremental algorithms
On the roadmap:
Standard deviation
Order by
Discretization
Contains
Split
Validate range values
Set a default value for a specific output
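Standard deviation, for instance, also has a well-known single-pass formulation (Welford's algorithm) that fits the same incremental model; a sketch in plain Python:

```python
import math

class OnlineStd:
    """Welford's single-pass algorithm for mean and standard deviation."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        """Population standard deviation of the values seen so far."""
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


s = OnlineStd()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    s.update(x)
# s.std() == 2.0 for this classic example
```

Unlike the naive two-pass formula, this never revisits old elements and is numerically stable, which is why it suits streaming deployments.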
34. Apache Flink vs Apache Spark
Flink:
Pure streams for all workloads
Optimizer
Low latency, high throughput
Global, session, time- and count-based window criteria
Provides automatic memory management
Spark:
Micro-batches for all workloads
No job optimizer
High latency compared to Flink
Time-based window criteria
Configurable memory management; Spark 1.6+ has moved towards automating memory management
41. Apache Kudu
Provides a combination of fast inserts/updates and efficient columnar
scans to enable real-time analytic workloads
A new storage engine that complements HDFS and HBase
Designed for use cases that require fast analytics on fast data
Low query latency
v1.0.1 was released on October 11, 2016
43. PROTEUS: a steel-making scenario
The steel industry is a key sector for the European community.
PROTEUS was introduced last year at Big Data Spain by Treelogic *
Hot strip mills (sometimes) produce steel with defects
Predict coil parameters (thickness, width, flatness) using real-time and historical data
Detecting defective coils at an early stage saves money: the production process can be
modified or stopped.
The proposed architecture is being validated in this project
7,870 variables at a frequency of 500 ms: data-in-motion
700,000 records per variable; 500 GB of time series and flatness maps: data-at-rest
* https://www.youtube.com/watch?v=EIH7HLyqhfE
46. Websockets
WebSocket is a computer communication protocol providing full-duplex
communication channels over a single TCP connection.
Much faster than HTTP request/response for continuous data exchange
Its API is standardized by the W3C
47. Apache Flink & Websockets
Data sinks consume DataSets and are used to store or return them.
Flink comes with a variety of built-in output formats, encapsulated behind
operations on the DataSet:
writeAsText()
writeAsFormattedText()
writeAsCsv()
print()
write()
We have developed a WebsocketSink that enables Flink to send its output to a
given websocket endpoint.
Based on the javax-websocket-client-api 1.1 spec.
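The shape of a custom sink is simple: it is invoked once per stream element and forwards that element somewhere. A language-agnostic sketch in plain Python (the real WebsocketSink is Java code against Flink's sink interface; the classes below are illustrative, not the project's actual code):

```python
import json

class Sink:
    """Minimal sink interface: invoked once per stream element."""
    def invoke(self, element):
        raise NotImplementedError

class CollectingSink(Sink):
    """Test double that records what would be pushed to the endpoint."""
    def __init__(self):
        self.sent = []

    def invoke(self, element):
        # A real WebsocketSink would serialize here and send over the socket.
        self.sent.append(json.dumps(element))

def run_stream(elements, sink):
    """Drive a (finite) stream through a sink, as the engine would."""
    for e in elements:
        sink.invoke(e)

sink = CollectingSink()
run_stream([{"coil": 1, "flatness": 0.92}, {"coil": 2, "flatness": 0.88}], sink)
# sink.sent now holds one JSON message per stream element
```

Keeping the sink this thin is what makes it reusable: the engine owns the streaming loop, and the sink only decides where each result goes, e.g. to a websocket feeding the visualization layer.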
54. How to get it all
https://github.com/proteus-h2020/proteus-docker
55. Advanced data science algorithms
applied to scalable stream processing
David Piris Valenzuela
Nacho García Fernández
Ignacio.g.Fernandez@treelogic.com
@0xNacho
david.piris@treelogic.com
@davidpiris