Building a real time Tweet map with Flink in six weeks
OSTMap
Fast PoC development with Flink
Proof of concept - an important tool in the industry
• A PoC is often necessary to show feasibility to customers
• A PoC touches several topics:
  • Scalability
  • Stream processing
  • Batch processing
  • Storage and querying of data
• OSTMap as an example PoC
Goals for OSTMap
• Increase customers' trust in big data technologies
• It is easy to build an application with current technologies
• With almost no experience
• Teach students big data technologies
• Recruiting
• Bring big data to the university
• Build a real-time application to view recent geotagged tweets on a map
• Search for terms and users, and show the matching tweets on a map
• Analytics:
• First data science jobs
• …
Industry in practice: IT-Ringvorlesung 2016
• A course at the University of Leipzig.
• Students work on projects of local companies
• Six students
• Over a period of six weeks, not full time
• Weekly meetings
• GitHub project: github.com/IIDP/OSTMap
Nico Graebling
Vincent Märkl
Hans Dieter Pogrzeba
Christopher Schott
Christopher Rost
Kevin Shrestha
Michael Schmeißer
Martin Grimmer
Matthias Kricke
mgm technology partners
We bring applications into production!
• Innovative software solution provider with application responsibility
• Specialist for highly scalable, transactional online applications
• Central lines of business: Insurance, E-Commerce, E-Government
• Founded in 1994
• 347 employees, 9 offices (2014)
• Revenue: €43.7 million (2014)
• Part of Allgeier SE
ScaDS
Competence center for scalable data services and solutions Dresden/Leipzig
• Bundles the Big Data research expertise of TU Dresden and Leipzig University
• Drive Big Data innovations
• Bring industry and science together
• Knowledge exchange and transfer
Walking skeleton
“A Walking Skeleton is a tiny implementation of the system that performs a small end-to-end function. It need not use the final architecture, but it should link together the main architectural components. The architecture and the functionality can then evolve in parallel.”
- Alistair Cockburn
GIF from http://blog.codeclimate.com/blog/2014/03/20/kickstart-your-next-project-with-a-walking-skeleton
Milestone 1
Read stream, store data as JSON files, show tweets, read data from JSON files
Milestone 2
Write to and read from Accumulo, show tweets on map, full table scans, slow visualization
Milestone 3
Term index, geotemporal index, UI improvements, clustering, …
OSTMap – stream, batch, storage and querying
[Figure: architecture overview with the labels "geotagged tweets" and "webservice"]
a) stream processing
b) batch processing
c) querying data
Stream processing of incoming data – first version
[Figure: Flink pipeline with the operators GeoTweetSource, DateExtraction, KeyGeneration and RawTweetSink; output: data for the time index]
This enabled us to build a slow term search and a slow map search via full table scans.
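A minimal sketch of what a KeyGeneration step could look like. It follows the row layout listed under Storage (timestamp and hash, 8b + 4b, read here as 8 + 4 bytes). The function name, the millisecond timestamp and the choice of CRC32 as hash are assumptions for illustration, not the project's actual code.

```python
import struct
import zlib


def raw_tweet_key(timestamp_ms: int, raw_json: str) -> bytes:
    """Build a 12-byte row key: 8-byte timestamp followed by a 4-byte hash.

    Assumption: CRC32 stands in for whatever hash the project uses.
    """
    tweet_hash = zlib.crc32(raw_json.encode("utf-8"))  # unsigned 32-bit value
    # ">qI" = big-endian signed 64-bit integer + unsigned 32-bit integer.
    # Big-endian keeps the byte order of non-negative timestamps chronological.
    return struct.pack(">qI", timestamp_ms, tweet_hash)


if __name__ == "__main__":
    older = raw_tweet_key(1_464_000_000_000, '{"text": "hello"}')
    newer = raw_tweet_key(1_464_000_060_000, '{"text": "hello"}')
    print(len(older), older < newer)
```

Because the timestamp leads the key, a sorted key-value store keeps the tweets in time order, which is what makes the table usable as a time index.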
Stream processing of incoming data – final version
[Figure: Flink pipeline with the operators GeoTweetSource, DateExtraction, KeyGeneration, RawTweetSink, TermExtraction (tokenizing), UserExtraction, TermIndexSink, LanguageExtraction, LanguageFrequencySink, GeoTemporalIndexCreation and GeoTemporalIndexSink; outputs: data for the time index, term index, language statistics and geotemporal index; further labels: 1 minute window, sum by language]
Now we were able to build a faster term and map search and a language frequency visualization.
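The language statistics branch (1 minute window, sum by language) can be imitated in a few lines. The sketch below is plain Python over a finite list, not the Flink API, and the function names are hypothetical. The bucket format YYYYMMDDhhmm is the one listed for the LanguageFrequency table.

```python
from collections import Counter
from datetime import datetime, timezone


def time_bucket(ts: datetime) -> str:
    """Truncate a timestamp to its one-minute bucket, formatted YYYYMMDDhhmm."""
    return ts.strftime("%Y%m%d%H%M")


def language_frequency(tweets):
    """Count tweets per (minute bucket, language tag).

    `tweets` is an iterable of (timestamp, language) pairs.
    """
    counts = Counter()
    for ts, lang in tweets:
        counts[(time_bucket(ts), lang)] += 1
    return counts


if __name__ == "__main__":
    utc = timezone.utc
    sample = [
        (datetime(2016, 6, 1, 12, 0, 5, tzinfo=utc), "de"),
        (datetime(2016, 6, 1, 12, 0, 40, tzinfo=utc), "de"),
        (datetime(2016, 6, 1, 12, 0, 59, tzinfo=utc), "en"),
        (datetime(2016, 6, 1, 12, 1, 2, tzinfo=utc), "de"),
    ]
    for (bucket, lang), count in sorted(language_frequency(sample).items()):
        print(bucket, lang, count)
```

A streaming engine emits each bucket as soon as its window closes instead of waiting for the whole input, but the grouping key and the sum are the same.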
Batch processing
• Initial creation of the term index and geotemporal index for already processed tweets
• Data export
• Other statistics, such as the area or distance a user covers with their tweets
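One way to compute the distance a user covers is to sum the great-circle distances between consecutive tweets. This is a sketch under assumptions: the haversine formula, a mean Earth radius of 6371 km and the function names are illustrative choices, not taken from the project.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def distance_covered_km(tweets):
    """Sum the distances between consecutive tweets of one user.

    `tweets` is a list of (timestamp, lat, lon) tuples in any order.
    """
    ordered = sorted(tweets)  # sort by timestamp
    total = 0.0
    for (_, lat1, lon1), (_, lat2, lon2) in zip(ordered, ordered[1:]):
        total += haversine_km(lat1, lon1, lat2, lon2)
    return total


if __name__ == "__main__":
    user_tweets = [(2, 0.0, 1.0), (1, 0.0, 0.0), (3, 0.0, 2.0)]
    print(round(distance_covered_km(user_tweets), 1))
```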
Storage
Table                     Row                         Column Family       Column Qualifier         Value
RawTweetData (TimeIndex)  timestamp, hash (8b + 4b)   -                   -                        raw tweet JSON
TermIndex                 term                        field (user, text)  RawTweetData key (12b)   -
LanguageFrequency         time bucket (YYYYMMDDhhmm)  language-tag        -                        tweet count (4b)
Accumulo table design
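The TermIndex layout (row = term, column family = field, column qualifier = RawTweetData key) can be illustrated as follows. The tokenizer (lowercase, split on non-word characters) and the function names are assumptions, and the entries are plain tuples rather than Accumulo mutations.

```python
import re


def tokenize(text: str):
    """Hypothetical tokenizer: lowercase and split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def term_index_entries(raw_key: bytes, user: str, text: str):
    """Return (row, column family, column qualifier) tuples for one tweet.

    The value stays empty, so the raw tweet key in the qualifier is all
    that is needed to look the tweet up in RawTweetData afterwards.
    """
    entries = [(user.lower(), "user", raw_key)]
    for term in sorted(set(tokenize(text))):
        entries.append((term, "text", raw_key))
    return entries


if __name__ == "__main__":
    for entry in term_index_entries(b"key-1", "Alice", "Hello Flink, hello map!"):
        print(entry)
```

A term search then becomes a lookup of a single row instead of a full table scan, followed by one read per returned raw tweet key.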
Geotemporal Index for OSTMap
Geo index
• Geo data is partitioned by geohash (Z-curve)
• The geohash is a function from the 2D coordinate space to the 1D key space
• Geohashes are used as row keys in Accumulo
[Figure: geohash grid over the map]
9z db dc df dg
9x d8 d9 dd de
9r d2 d3 d6 d7
9p d0 d1 d4 d5
3z 6b 6c 6f 6g
[Figure: sorted row keys …, 3z, 6b, 6c, 6f, 6g, 9p, 9r, 9x, 9z, d0, d1, d2, d3, d4, d5, d6, …, dg]
Row      CF           CQ
geohash  RawTweetKey  -
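To make the mapping from 2D coordinates to a 1D key concrete, here is a small sketch of the standard geohash encoding. It interleaves longitude and latitude bits (which produces the Z-curve) and writes every 5 bits as one base-32 character. The function name and the default precision are illustrative; the precision used in OSTMap is not stated in the slides.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 2) -> str:
    """Encode a coordinate as a geohash with `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(40.7, -74.0))      # a point near New York
    print(geohash_encode(57.64911, 10.40744, 4))
```

Points that share a prefix lie in the same cell, so nearby tweets tend to end up in neighbouring rows of a sorted table.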
Geotemporal Index for OSTMap
Geo index – querying?
• Partitioned by geohash
• Calculate the coverage of the bounding box
• Calculate scan ranges from the coverage
  • range: [9p]
  • range: [9r]
  • range: [d0, d1, d2, d3]
• Scan the ranges with Accumulo iterators to obtain the result
[Figure: bounding box drawn on the geohash grid, the resulting ranges marked in the sorted row keys, and the result grid]
Row      CF           CQ
geohash  RawTweetKey  lat/lon
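The step from coverage to scan ranges can be sketched as merging cells that are direct neighbours in key order. The sketch takes the coverage as given (computing it from a bounding box is left out), assumes all cells have the same length, and uses hypothetical function names.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def geohash_to_int(cell: str) -> int:
    """Position of a geohash cell on the Z-curve, read as a base-32 number."""
    value = 0
    for ch in cell:
        value = value * 32 + BASE32.index(ch)
    return value


def scan_ranges(coverage):
    """Merge covered cells into inclusive (first, last) row-key ranges."""
    ordered = sorted(set(coverage), key=geohash_to_int)
    ranges = []
    for cell in ordered:
        if ranges and geohash_to_int(cell) == geohash_to_int(ranges[-1][1]) + 1:
            # Cell directly follows the end of the current range: extend it.
            ranges[-1] = (ranges[-1][0], cell)
        else:
            ranges.append((cell, cell))
    return ranges


if __name__ == "__main__":
    # Coverage of the bounding box shown on the slide.
    print(scan_ranges(["9p", "9r", "d0", "d1", "d2", "d3"]))
```

For the coverage on the slide this yields the three ranges [9p], [9r] and [d0 to d3]: 9p and 9r stay separate because 9q lies between them in key order.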
Geotemporal Index for OSTMap
Add some time!
• Partitioned by geohash, with time buckets
[Figure: one geohash grid per day (day 1, day 2, day i, …) along the axes lat, lon and day; sorted row keys …, 13z, 16b, 16c, 16f, 16g, 19p, 19r, 19x, 19z, 1d0, 1d1, 1d2, 1d3, 1d4, 1d5, 1d6, …, 1dg, …, 23z, 26b, 26c, 26f, 26g, 29p, 29r, 29x, 29z, 2d0, 2d1, 2d2, 2d3, 2d4, 2d5, 2d6, …, 2dg, …]
Row           CF           CQ
day, geohash  RawTweetKey  lat/lon
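A sketch of how a day bucket in front of the geohash changes keys and queries. The bucket encoding (days since the Unix epoch, zero-padded to five digits) and all function names are assumptions; the slides only show a single digit for the day.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def day_bucket(ts: datetime) -> int:
    """Days since the Unix epoch; expects a timezone-aware timestamp."""
    return int(ts.timestamp()) // SECONDS_PER_DAY


def row_key(ts: datetime, geohash: str) -> str:
    """Row key = day bucket followed by the geohash."""
    return f"{day_bucket(ts):05d}{geohash}"


def scan_ranges(start: datetime, end: datetime, geo_ranges):
    """Repeat every geohash range once per day in [start, end]."""
    ranges = []
    for day in range(day_bucket(start), day_bucket(end) + 1):
        for first, last in geo_ranges:
            ranges.append((f"{day:05d}{first}", f"{day:05d}{last}"))
    return ranges


if __name__ == "__main__":
    utc = timezone.utc
    start = datetime(1970, 1, 2, tzinfo=utc)
    end = datetime(1970, 1, 3, 23, 59, tzinfo=utc)
    print(row_key(start, "d0"))
    print(scan_ranges(start, end, [("9p", "9p"), ("d0", "d3")]))
```

A query over a time span therefore needs one set of geohash ranges per day, while rows of the same day stay next to each other.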
Geotemporal Index for OSTMap
What about Hotspots?
• Partitioned by geohash, with time buckets
[Figure: the same per-day geohash grids and sorted row keys (…, 13z, …, 1dg, …, 23z, …, 2dg, …) along the axes lat, lon and day]
Row           CF           CQ
day, geohash  RawTweetKey  lat/lon
Geotemporal Index for OSTMap
What about Hotspots?
• Partitioned by geohash, with time buckets
• A spreading byte (sb) is placed in front of the row key
• spreading byte = hash(tweet) % 255
• Reproducible
• Pre-split tables in Accumulo
[Figure: row keys grouped by spreading byte and distributed over node 0, node 1, node 2, …, node n: 01d2, 01d3, 01d4, …, 02d2, 02d3, 02d4, …, 11d2, 11d3, 11d4, …, 12d2, 12d3, 12d4, …, 21d2, 21d3, 21d4, …, 22d2, 22d3, 22d4, …; axes lat, lon and day]
Row               CF           CQ
sb, day, geohash  RawTweetKey  lat/lon
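The spreading byte can be sketched as follows. The modulus 255 is taken from the slide; the CRC32 hash, the five-digit day encoding and the function names are assumptions. A stable hash is used because the value has to be reproducible across processes.

```python
import zlib

SPREAD = 255  # modulus as given on the slide


def spreading_byte(raw_json: str) -> int:
    """Reproducible value in 0..254 derived from the tweet content."""
    return zlib.crc32(raw_json.encode("utf-8")) % SPREAD


def row_key(raw_json: str, day: int, geohash: str) -> bytes:
    """Row key = spreading byte, then day bucket, then geohash."""
    suffix = f"{day:05d}{geohash}".encode("ascii")
    return bytes([spreading_byte(raw_json)]) + suffix


def query_keys(day: int, geohash: str):
    """A reader does not know the spreading byte, so it asks for all of them."""
    suffix = f"{day:05d}{geohash}".encode("ascii")
    return [bytes([sb]) + suffix for sb in range(SPREAD)]


if __name__ == "__main__":
    key = row_key('{"id": 1, "text": "hello"}', 1, "d2")
    print(key[0], key[1:])
    print(len(query_keys(1, "d2")))
```

Writes for the same day and cell now spread over up to 255 key prefixes, which pre-split tablets can place on different nodes. The price is that every query fans out over all prefixes.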
demo
Thank you
Martin Grimmer grimmer[at]informatik.uni-leipzig.de
Matthias Kricke kricke[at]informatik.uni-leipzig.de
Michael Schmeißer michael.schmeisser[at]mgm-tp.com
www.mgm-tp.com
www.scads.de

Building a real time Tweet map with Flink in six weeks

  • 1. Building a real time Tweet map with Flink in six weeks • OSTMap: fast PoC development with Flink
  • 2. Proof of concept: an important tool in the industry • A PoC is often necessary to show feasibility to customers • It touches several topics: • Scalability • Stream processing • Batch processing • Storage and querying of data • OSTMap as an example PoC
  • 3. Goals for OSTMap • Increase trust in big data technologies on the customer side • Show that it is easy to build an application with current technologies • With almost no experience • Teach students big data technologies • Recruiting • Bring big data to the university • Build a real time application to view recent geotagged tweets on a map • Search for terms and users, show these tweets on a map • Analytics: • First data science jobs • …
  • 4. Industry in practice: IT-Ringvorlesung 2016 • A course at the University of Leipzig • Students work on projects of local companies • Six students • Over a period of 6 weeks, not full time • Weekly meetings • GitHub project: github.com/IIDP/OSTMap • Names shown: Nico Graebling, Vincent Märkl, Hans Dieter Pogrzeba, Christopher Schott, Christopher Rost, Kevin Shrestha, Michael Schmeißer, Martin Grimmer, Matthias Kricke
  • 5. mgm technology partners: We bring applications into production! • Innovative software solution provider with application responsibility • Specialist for highly scalable, transactional online applications • Central lines of business: Insurance, E-Commerce, E-Government • Founded in 1994 • 347 employees, 9 offices (2014) • Revenue: 43.7 million € (2014) • Part of Allgeier SE
  • 6. ScaDS: Competence center for scalable data services and solutions Dresden/Leipzig • Bundles the Big Data research expertise of TU Dresden and Leipzig University • Drive Big Data innovations • Bring industry and science together • Knowledge exchange and transfer
  • 7. Walking skeleton • “A Walking Skeleton is a tiny implementation of the system that performs a small end-to-end function. It need not use the final architecture, but it should link together the main architectural components. The architecture and the functionality can then evolve in parallel.” - Alistair Cockburn • gif from http://blog.codeclimate.com/blog/2014/03/20/kickstart-your-next-project-with-a-walking-skeleton
  • 8. Milestone 1: read stream, store data as JSON files, show tweets, read data from JSON files
  • 9. Milestone 2: write to and read from Accumulo, show tweets on map, full table scans, slow visualization
  • 10. Milestone 3: term index, geotemporal index, UI improvements, clustering, …
  • 11. OSTMap: stream, batch, storage and querying • Diagram labels: geotagged tweets, web service • a) stream processing • b) batch processing • c) querying data
  • 12. Stream processing of incoming data: first version • Operators shown: GeoTweetSource, DateExtraction, KeyGeneration, RawTweetSink • Diagram labels: time index, data for • This enabled us to build a slow term search and a slow map search via full table scans.
  • 13. Stream processing of incoming data: final version • Operators shown: GeoTweetSource, DateExtraction, KeyGeneration, RawTweetSink, TermExtraction (tokenizing), UserExtraction, TermIndexSink, LanguageExtraction, LanguageFrequencySink, GeoTemporalIndexCreation, GeoTemporalIndexSink • Diagram labels: time index, term index, language statistics, geotemporal index, data for, 1 minute window, sum by language • Now we were able to build a faster term and map search and a language frequency visualization.
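
The "1 minute window, sum by language" step on slide 13 can be sketched in plain Python. This is not Flink code and not taken from the OSTMap repository; it only illustrates the aggregation. The field names `timestamp_ms` and `lang`, the use of UTC, and the helper names are assumptions. The bucket format YYYYMMDDhhmm is the one listed for the LanguageFrequency table on slide 15.

```python
from collections import Counter
from datetime import datetime, timezone


def time_bucket(epoch_millis):
    """Format a timestamp as a minute bucket YYYYMMDDhhmm (UTC is an assumption)."""
    dt = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return dt.strftime("%Y%m%d%H%M")


def language_counts(tweets):
    """Count tweets per (minute bucket, language tag).

    Each tweet is a dict with the hypothetical keys 'timestamp_ms' and 'lang'.
    """
    counts = Counter()
    for tweet in tweets:
        counts[(time_bucket(tweet["timestamp_ms"]), tweet["lang"])] += 1
    return dict(counts)


sample = [
    {"timestamp_ms": 0, "lang": "de"},
    {"timestamp_ms": 59_999, "lang": "de"},
    {"timestamp_ms": 60_000, "lang": "de"},
    {"timestamp_ms": 60_001, "lang": "en"},
]
result = language_counts(sample)
```

Each resulting entry maps naturally onto one LanguageFrequency cell: the bucket as row, the language tag as column family, the count as value.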
  • 14. Batch processing • Initial creation of the term index and geotemporal index for already processed tweets • Data export • Other statistics, such as: • Area or distance a user covers with their tweets
  • 15. Storage: Accumulo table design
      Table | Row | Column Family | Column Qualifier | Value
      RawTweetData (TimeIndex) | timestamp, hash (8b + 4b) | - | - | raw tweet json
      TermIndex | term | field (user, text) | RawTweetData key (12b) | -
      LanguageFrequency | time bucket YYYYMMDDhhmm | language-tag | - | tweet count (4b)
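
A minimal sketch of how the RawTweetData row key from slide 15 could be assembled, assuming that "8b + 4b" means an 8-byte timestamp followed by a 4-byte hash. The big-endian layout, the choice of CRC32 as the hash, and the function name are assumptions for illustration, not the project's confirmed implementation.

```python
import struct
import zlib


def raw_tweet_key(timestamp_ms, tweet_json):
    """Build a 12-byte key: 8-byte timestamp followed by a 4-byte hash.

    Big-endian packing keeps byte-wise key order equal to time order,
    which is what a sorted key-value store needs for time range scans.
    CRC32 is a stand-in hash chosen for this sketch.
    """
    tweet_hash = zlib.crc32(tweet_json.encode("utf-8"))  # unsigned 32-bit value
    return struct.pack(">QI", timestamp_ms, tweet_hash)


key_early = raw_tweet_key(1_460_000_000_000, '{"id": 1}')
key_late = raw_tweet_key(1_460_000_000_001, '{"id": 2}')
```

The hash suffix keeps two tweets with the same timestamp from colliding on one row, and the same 12 bytes can serve as the column qualifier in the TermIndex table.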
  • 16. Geotemporal Index for OSTMap: geo index • Geohashes of the geo data are used as row keys in Accumulo • A geohash (Z-curve) is a function from the 2D coordinate space to a 1D key space • The table is partitioned by geohash • Sorted row keys: … 3z 6b 6c 6f 6q 9p 9r 9x 9z d0 d1 d2 d3 d4 d5 d6 … dg • Grid figure, rows top to bottom: 9z db dc df dg / 9x d8 d9 dd de / 9r d2 d3 d6 d7 / 9p d0 d1 d4 d5 / 3z 6b 6c 6f 6g • Table layout: Row = geohash, CF = RawTweetKey, CQ = -
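
The mapping from 2D coordinates to a 1D key on slide 16 can be sketched with the standard geohash encoding: longitude and latitude intervals are halved alternately, and every five bits become one base-32 character. The function name and the default precision of 2 (matching the two-character cells on the slide) are choices made for this sketch.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=2):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


cell = geohash_encode(4.71, -74.07)  # a point inside cell "d2" of the grid
```

Nearby points share a prefix, so a sorted store keeps most of a region's tweets in neighbouring rows.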
  • 17. Geotemporal Index for OSTMap: geo index, querying? • Steps: bounding box • calculate coverage of bounding box • calculate scan ranges from coverage: range [9p], range [9r], range [d0, d1, d2, d3] • Accumulo iterators • result • Sorted row keys: … 3z 6b 6c 6f 6q 9p 9r 9x 9z d0 d1 d2 d3 d4 d5 d6 … dg • Grid figure, rows top to bottom: 9z db dc df dg / 9x d8 d9 dd de / 9r d2 d3 d6 d7 / 9p d0 d1 d4 d5 / 3z 6b 6c 6f 6g • Table layout: Row = geohash, CF = RawTweetKey, CQ = lat/lon
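
The step "calculate scan ranges from coverage" on slide 17 can be illustrated as follows. The sketch takes the covered cells as given (computing the coverage from a bounding box is not shown) and merges cells that are direct neighbours in key order into one range. It assumes all cells have the same length; the function names are hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_to_int(geohash):
    """Interpret a geohash as a base-32 number to compare key positions."""
    value = 0
    for ch in geohash:
        value = value * 32 + BASE32.index(ch)
    return value


def scan_ranges(coverage):
    """Merge equally long geohash cells into inclusive (start, end) ranges."""
    cells = sorted(set(coverage), key=geohash_to_int)
    ranges = []
    for cell in cells:
        if ranges and geohash_to_int(cell) == geohash_to_int(ranges[-1][1]) + 1:
            ranges[-1] = (ranges[-1][0], cell)  # extend the current range
        else:
            ranges.append((cell, cell))  # start a new range
    return ranges


ranges = scan_ranges(["d2", "9p", "d0", "9r", "d3", "d1"])
```

For the six cells from the slide this yields the same three ranges: 9p alone, 9r alone (9q lies between them in key order), and d0 through d3. Since the column qualifier holds lat/lon, a server-side filter can then drop entries that fall in a covered cell but outside the exact bounding box.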
  • 18. Geotemporal Index for OSTMap: add some time! • Partitioned by geohash, with time buckets • Figure axes: day, lon, lat • Sorted row keys, day 1: … 13z 16b 16c 16f 16q 19p 19r 19x 19z 1d0 1d1 1d2 1d3 1d4 1d5 1d6 … 1dg • Sorted row keys, day 2: … 23z 26b 26c 26f 26q 29p 29r 29x 29z 2d0 2d1 2d2 2d3 2d4 2d5 2d6 … 2dg • … day i … • Grid figure, rows top to bottom: 9z db dc df dg / 9x d8 d9 dd de / 9r d2 d3 d6 d7 / 9p d0 d1 d4 d5 / 3z 6b 6c 6f 6g • Table layout: Row = day, geohash; CF = RawTweetKey; CQ = lat/lon
  • 19. Geotemporal Index for OSTMap: what about hotspots? • Partitioned by geohash, with time buckets • Figure axes: day, lon, lat • Sorted row keys, day 1: … 13z 16b 16c 16f 16q 19p 19r 19x 19z 1d0 1d1 1d2 1d3 1d4 1d5 1d6 … 1dg • Sorted row keys, day 2: … 23z 26b 26c 26f 26q 29p 29r 29x 29z 2d0 2d1 2d2 2d3 2d4 2d5 2d6 … 2dg • Table layout: Row = day, geohash; CF = RawTweetKey; CQ = lat/lon
  • 20. Geotemporal Index for OSTMap: what about hotspots? • Partitioned by geohash, with time buckets and a spreading byte • Figure axes: day, lon, lat • Row keys per node: node 0: … 01d2 01d3 01d4 … 02d2 02d3 02d4 … • node 1: … 11d2 11d3 11d4 … 12d2 12d3 12d4 … • node 2: … 21d2 21d3 21d4 … 22d2 22d3 22d4 … • node n • spreading byte = hash(tweet) % 255 • reproducible • tables pre-split in Accumulo • Table layout: Row = sb, day, geohash; CF = RawTweetKey; CQ = lat/lon
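
A sketch of the spreading byte from slide 20, under stated assumptions. The slide gives only the formula `hash(tweet) % 255` and the key order (sb, day, geohash). The CRC32 hash, the 2-byte day number, and all function names here are hypothetical. A deterministic hash is used because the slide calls the value reproducible; Python's built-in `hash()` on strings is randomized per process and would not be.

```python
import struct
import zlib

SPREAD_MODULUS = 255  # from the slide: spreading byte = hash(tweet) % 255


def spreading_byte(tweet_json):
    """Derive a reproducible value in 0..254 from the tweet content."""
    return zlib.crc32(tweet_json.encode("utf-8")) % SPREAD_MODULUS


def geotemporal_row_key(tweet_json, day, geohash):
    """Row key: 1 spreading byte, 2-byte day number (assumed), then the geohash."""
    prefix = struct.pack(">BH", spreading_byte(tweet_json), day)
    return prefix + geohash.encode("ascii")


def query_row_keys(day, geohash):
    """All keys a query has to visit for one (day, geohash) cell."""
    cell = geohash.encode("ascii")
    return [struct.pack(">BH", sb, day) + cell for sb in range(SPREAD_MODULUS)]


key = geotemporal_row_key('{"id": 1}', 17000, "d2")
```

The trade-off is visible in `query_row_keys`: writes for a busy cell are spread over many key prefixes, and with pre-split tables over several nodes, while each query fans out over every spreading byte value.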
  • 21. demo
  • 22. Thank you • Martin Grimmer, grimmer[at]informatik.uni-leipzig.de • Matthias Kricke, kricke[at]informatik.uni-leipzig.de • Michael Schmeißer, michael.schmeisser[at]mgm-tp.com • www.mgm-tp.com • www.scads.de
