SlideShare uma empresa Scribd logo
1 de 74
Optimizing Geospatial Operations with Server-side
Programming in HBase and Accumulo
James Hughes, CCRi
James Hughes
● CCRi’s Director of Open Source Programs
● Working in geospatial software on the JVM
for the last 7 years
● GeoMesa core committer / product owner
● SFCurve project lead
● JTS committer
● Contributor to GeoTools and GeoServer
● Background / Warm-up / What we are talking about
○ What is GeoMesa?
○ Quick Demo
● General Implementation Details
○ Indexing on Accumulo/HBase with Space Filling Curves
○ Filtering/transforming
■ Applying secondary filters
■ Changing output (projections / format changes)
○ Aggregations
■ Heatmaps
■ Stats
● Database specifics
○ Accumulo Implementation details
○ HBase Implementation details
Talk outline
Motivation ● What is geospatial?
● IoT based data examples?
Spatial Data Types
Points
Locations
Events
Instantaneous
Positions
Lines
Road networks
Voyages
Trips
Trajectories
Polygons
Administrative
Regions
Airspaces
Spatial Data Relationships
equals
disjoint
intersects
touches
crosses
within
contains
overlaps
Topology Operations
7
Algorithms
● Convex Hull
● Buffer
● Validation
● Dissolve
● Polygonization
● Simplification
● Triangulation
● Voronoi
● Linear Referencing
● and more...
GeoMesa ● GeoMesa Overview
What is GeoMesa?
A suite of tools for streaming, persisting, managing, and analyzing spatio-
temporal data at scale
What is GeoMesa?
A suite of tools for streaming, persisting, managing, and analyzing spatio-
temporal data at scale
What is GeoMesa?
A suite of tools for streaming, persisting, managing, and analyzing spatio-temporal
data at scale
What is GeoMesa?
A suite of tools for streaming, persisting, managing, and analyzing spatio-
temporal data at scale
What is GeoMesa?
A suite of tools for streaming, persisting, managing, and analyzing spatio-
temporal data at scale
Proposed Reference Architecture
Live Demo!
● Filtering by spatio-temporal
constraints
● Filtering by attributes
● Aggregations
● Transformations
Indexing
Geospatial
Data ● Key Design using Space Filling
Curves
● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
Space Filling Curves (in one slide!)
● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
Space Filling Curves (in one slide!)
● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order that the curve
visits the them.
○ Associate the data in that grid cell with a byte
representation of the label.
Space Filling Curves (in one slide!)
● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order that the curve
visits the them.
○ Associate the data in that grid cell with a byte
representation of the label.
● We prefer “good” space filling curves:
○ Want recursive curves and locality.
Space Filling Curves (in one slide!)
● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order that the curve
visits the them.
○ Associate the data in that grid cell with a byte
representation of the label.
● We prefer “good” space filling curves:
○ Want recursive curves and locality.
● Space filling curves have higher
dimensional analogs.
Space Filling Curves (in one slide!)
To query for points in the grey rectangle, the
query planner enumerates a collection of index
ranges which cover the area.
Note: Most queries won’t line up perfectly with the
gridding strategy.
Further filtering can be run on the
tablet/region servers (next section)
or we can return ‘loose’ bounding box results
(likely more quickly).
Query planning with Space Filling Curves
Server-Side
Optimizations
Filtering and transforming records
● Pushing down data filters
○ Z2/Z3 filter
○ CQL Filters
● Projections
Filtering and transforming records overview
Using Accumulo iterators and HBase filters, it is possible to filter and map over
the key-values pairs scanned.
This will let us apply fine-grained spatial filtering, filter by secondary predicates,
and implement projections.
Pushing down filters
Let’s consider a query for tankers which are inside a bounding box for a given
time period.
GeoMesa’s Z3 index is designed to provide a set of key ranges to scan which will
cover the spatio-temporal range.
Additional information such as the vessel type is part of the value.
Using server-side programming, we can teach Accumulo and HBase how to
understand the records and filter out undesirable records.
This reduces network traffic and distributes the work.
Projection
To handle projections in a query, Accumulo Iterators and HBase Filters can
change the returned key-value pairs.
Changing the key is a bad idea.
Changing the value allows for GeoMesa to return a subset of the columns that a
user is requesting.
GeoMesa Server-Side Filters
● Z2/Z3 filter
○ Scan ranges are not decomposed enough to be very accurate - fast bit-wise comparisons on
the row key to filter out-of-bounds data
● CQL/Transform filter
○ If a predicate is not handled by the scan ranges or Z filters,
then slower GeoTools CQL filters are applied to the serialized SimpleFeature in the row value
○ Relational projections (transforms) applied to reduce the amount of data sent back
● Other specialized filters
○ Age-off for expiring rows based on a SimpleFeature attribute
○ Attribute-key-value for populating a partial SimpleFeature with an attribute value from the
row
○ Visibility filter for merging columns into a SimpleFeature when using attribute-level
visibilities
Server-Side
Optimizations
Aggregations
● Generating heatmaps
● Descriptive Stats
● Arrow format
Aggregations
Using Accumulo Iterators and HBase coprocessors, it is possible to aggregate
multiple key-value pairs into one response. Effectively, this lets one implement
map and reduce algorithms.
These aggregations include computing heatmaps, stats, and custom data
formats.
The ability to aggregate data can be composed with filtering and projections.
GeoMesa Aggregation Abstractions
Aggregation logic is implemented in a shared module, based on a lifecycle of
1. Initialization
2. observing some number of features
3. aggregating a result.
This paradigm is easily adapted to the specific implementations required by
Accumulo and HBase.
Notably, all the algorithms we describe work in a single pass over the data.
GeoMesa Aggregation Abstractions
The main logic is contained in the AggregatingScan class:
Visualization Example: Heatmaps
Without powerful visualization options, big data is big nonsense.
Consider this view of shipping in the Mediterranean sea
Visualization Example: Heatmaps
Without powerful visualization options, big data is big nonsense.
Consider this view of shipping in the Mediterranean sea
Generating Heatmaps
Heatmaps are implemented in DensityScan.
For the scan, we set up a 2D grid array representing the pixels to be displayed. On
the region/tablet servers, each feature increments the count of any cells
intersecting its geometry. The resulting grid is returned as a serialized array of
64-bit integers, minimizing the data transfer back to the client.
The client process merges the grids from each scan range, then normalizes the
data to produce an image.
Since less data is transmitted, heatmaps are generally faster.
Statistical Queries
We support a flexible stats API that includes counts, min/max values,
enumerations, top-k (StreamSummary), frequency (CountMinSketch),
histograms and descriptive statistics. We use well-known streaming algorithms
backed by data structures that can be serialized and merged together.
Statistical queries are implemented in StatsScan.
On the region/tablet servers, we set up the data structure and then add each
feature as we scan. The client receives the serialized stats, merges them
together, and displays them as either JSON or a Stat instance that can be
accessed programmatically.
Arrow Format
Apache Arrow is a columnar, in-memory data format that GeoMesa supports as
an output type. In particular, it can be used to drive complex in-browser
visualizations. Arrow scans are implemented in ArrowScan.
With Arrow, the data returned from the region/tablet servers is similar in size to a
normal query. However, the processing required to generate Arrow files can be
distributed across the cluster instead of being done in the client.
As we scan, each feature is added to an in-memory Arrow vector. When we hit the
configured batch size, the current vector is serialized into the Arrow IPC format
and sent back to the client. All the client needs to do is to create a header and
then concatenate the batches into a single response.
Server-Side
Optimizations
Data
● Row Values
● Tables/compactions
Row Values
Our first approach was to store each SimpleFeature attribute in a separate
column. However, this proved to be slow to scan.
Even when skipping columns for projections, they are still loaded off disk.
Column groups seem promising, but they kill performance if you query more than
one.
Row Values
Our second (and current) approach is to store the entire serialized SimpleFeature
in one column.
Further optimizations:
● Lazy deserialization - SimpleFeature implementation that wraps the row
value and only deserializes each attribute as needed
● Feature ID is already stored in the row key to prevent row collisions, so don’t
also store it in the row value
● Use BSON for JSON serialization, along with JsonPath extractors
● Support for TWKB geometry serialization to save space
Tables/Compactions
When dealing with streaming data sources, continuously writing data to a table
will cause a lot of compactions.
Table partitioning can mitigate this by creating a new table per time period (e.g.
day/week), extracted from the SimpleFeature. Generally only the most recent
table(s) will be compacted.
For frequent updates to existing features, the GeoMesa Lambda store uses Kafka
as a medium-term cache before persisting to the key-value store. This reduces
the cluster load significantly.
Accumulo Server
Side Programming ● Accumulo Iterator Review
● GeoMesa’s Accumulo iteraors
“Iterators provide a modular mechanism for adding functionality to be executed
by TabletServers when scanning or compacting data. This allows users to
efficiently summarize, filter, and aggregate data.” -- Accumulo 1.7
documentation
Part of the modularity is that the iterators can be stacked:
t the output of one can be wired into the next.
Example: The first iterator might read from disk, the second could filter with
Authorizations, and a final iterator could filter by column family.
Other notes:
● Iterators provided a sorted view of the key/values.
Accumulo Iterators
A request to GeoMesa consists of two broad pieces:
1. A filter restricting the data to act on, e.g.:
a. Records in Maryland with ‘Accumulo’ in the text field.
b. Records during the first week of 2016.
2. A request for ‘how’ to return the data, e.g.:
a. Return the full records
b. Return a subset of the record (either a projection or ‘bin’ file format)
c. Return a histogram
d. Return a heatmap / kernel density
Generally, a filter can be handled partially by selecting which ranges to scan; the remainder
can be handled by an Iterator.
Modifications to selected data can also be handled by a GeoMesa Iterator.
GeoMesa Data Requests
The first pass of GeoMesa iterators separated concerns into separate iterators.
The GeoMesa query planner assembled a stack of iterators to achieve the desired
result.
Initial GeoMesa Iterator design
Image from “Spatio-temporal Indexing in Non-relational Distributed Databases” by
Anthony Fox, Chris Eichelberger, James Hughes, Skylar Lyon
The key benefit to having decomposed iterators is that they are easier to
understand and re-mix.
In terms of performance, each one needs to understand the bytes in the Key and
Value. In many cases, this will lead to additional serialization/deserialization.
Now, we prefer to write Iterators which handle transforming the underlying data
into what the client code is expecting in one go.
Second GeoMesa Iterator design
1. Using fewer iterators in the stack can be beneficial
2. Using lazy evaluation / deserialization for filtering Values can power speed
improvements.
3. Iterators take in Sorted Keys + Values and *must* produce Sorted Keys and
Values.
Lessons learned about Iterators
HBase Server Side
Programming
● HBase Filter and Coprocessor
Review
● GeoMesa HBase Filter
● GeoMesa HBase Coprocessor
HBase Filter Info
HBase filters are restricted to the ability to skip/include rows, and to transform a
row before returning it. Anything more complicated requires a Coprocessor.
In contrast to Accumulo, where iterators are configured with a map of options,
HBase requires custom serialization code for each filter implementation.
HBase Filter Info
The main GeoMesa filters are:
● org.locationtech.geomesa.hbase.filters.CqlTransformFilter
○ Filters rows based on GeoTools CQL
○ Transforms rows based on relational projections
● org.locationtech.geomesa.hbase.filters.Z2HBaseFilter
○ Compares Z-values against the row key
● org.locationtech.geomesa.hbase.filters.Z3HBaseFilter
○ Compares Z-values against the row key
HBase Coprocessor Info
Coprocessors are not trivial to implement or invoke, and can starve your cluster if
it is not configured correctly.
GeoMesa implements a harness to invoke a coprocessor, receive the results, and
handle any errors:
● org.locationtech.geomesa.hbase.coprocessor.GeoMesaCoprocessor
An adapter layer links the common aggregating code to the coprocessor API:
● org.locationtech.geomesa.hbase.coprocessor.aggregators.HBaseAggregato
r
HBase Coprocessor Info
GeoMesa defines a single Protobuf coprocessor endpoint, modeled around the
Accumulo iterator lifecycle. The aggregator class and a map of options are
passed to the endpoint.
Each aggregating scan requires a trivial adapter implementation:
● HBaseDensityAggregator
● HBaseStatsAggregator
● HBaseArrowAggregator
Thanks!
James Hughes
● jhughes@ccri.com
● http://geomesa.org
● http://gitter.im/locationtech/geomesa
Backup Slides
Integration with MapReduce / Spark
● GeoMesa + Spark Setup
● GeoMesa + Spark Analytics
● GeoMesa powered notebooks
(Jupyter and Zeppelin)
Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing
more infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on
each tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
GeoMesa MapReduce and Spark Support
Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo Implements the MapReduce InputFormat interface.
GeoMesa MapReduce and Spark Support
Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo Implements the MapReduce InputFormat interface.
Spark provides a way to change InputFormats into RDDs.
GeoMesa MapReduce and Spark Support
Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo Implements the MapReduce InputFormat
interface.
GeoMesa MapReduce and Spark Support
GeoMesa Spark Example 1: Time SeriesStep 1: Get an RDD[SimpleFeature]
Step 2: Calculate the time series
Step 3: Plot the time series in R.
Using one dataset (country boundaries)
to group another (here, GDELT) is
effectively a join.
Our summer intern, Atallah, worked out
the details of doing this analysis in
Spark and created a tutorial and blog
post.
This picture shows ‘stability’ of a region
from GDELT Goldstein values
GeoMesa Spark Example 2: Aggregating by Regions
http://www.ccri.com/2016/08/17/new-geomesa-tutorial-aggregating-visualizing-data/
http://www.geomesa.org/documentation/tutorials/shallow-join.html
GeoMesa Spark Example 3: Aggregating Tweets about #traffic
Virginia Polygon CQL
GeoMesa RDD
Aggregate by County
Calculate ratio of #traffic
Store back to GeoMesa
GeoMesa Spark Example 3: Aggregating Tweets about #traffic
#traffic by Virginia county
Darker blue has a higher count
Interactive Data Discovery at Scale in GeoMesa Notebooks
Writing (and debugging!) MapReduce / Spark jobs is slow and requires expertise.
A long development cycle for an analytic saps energy and creativity.
The answer to both is interactive ‘notebook’ servers like Apache Zeppelin and
Jupyter (formerly iPython Notebook).
Interactive Data Discovery at Scale in GeoMesa Notebooks
Writing (and debugging!) MapReduce / Spark jobs is slow and requires expertise.
A long development cycle for an
analytic saps energy and creativity.
The answer to both is interactive ‘notebook’
servers like Apache Zeppelin and Jupyter
There are two big things to work out:
1. Getting the right libraries on the classpath.
2. Wiring up visualizations.
Interactive Data Discovery at Scale in GeoMesa Notebooks
GeoMesa Notebook Roadmap:
● Improved JavaScript integration
● D3.js and other visualization
libraries
● OpenLayers and Leaflet
● Python Bindings
Backup Slides
Indexing non-point geometries
Most approaches to indexing non-point geometries involve covering the
geometry with a number of grid cells and storing a copy with each index.
This means that the client has to deduplicate results which is expensive.
Indexing non-point geometries: XZ Index
Most approaches to indexing non-point
geometries involve covering the
geometry with a number of grid cells
and storing a copy with each index.
This means that the client has to
deduplicate results which is expensive.
Böhm, Klump, and Kriegel describe an indexing
strategy allows such geometries to be stored once.
GeoMesa has implemented this strategy in XZ2
(spatial-only) and XZ3 (spatio-temporal) tables.
The key is to store data by resolution, separate
geometries by size, and then index them by their
lower left corner.
This does require consideration on the query
planning side, but avoiding deduplication is worth
the trade-off.
Indexing non-point geometries: XZ Index
For more details, see Böhm, Klump, and Kriegel. “XZ-ordering: a space-filling curve for objects with spatial
extension.” 6th. Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China.
(http://www.dbs.ifi.lmu.de/Publikationen/Boehm/Ordering_99.pdf)
Demo
Backup Slides
Here the viewport is used as
the spatial bounds for the
query.
The time range is a 12 hour
period on Monday.
Query by bounding box
Query by polygon
Here we further restrict the
query region by an arbitrary
polygon
Query by polygon and vessel type
Here, we have added a clause
to restrict to cargo vessels
Query by polygon and vessel type (heatmap)
Heatmaps can be generated
Query by polygon and vessel type (Apache Arrow format)
Data can be returned in a
number of formats.
The Apache Arrow format
allows for rapid access in
JavaScript.
Here, points are colored by
callsign.
Query by polygon and vessel type (Apache Arrow format)
Apache Arrow allows for in
browser data exploration.
This histogram shows
callsigns grouped by country.
Selections in the histogram
can influence the map.

Mais conteúdo relacionado

Mais procurados

Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeDatabricks
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...Xiao Qin
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingSantosh Sahoo
 
Spark streaming
Spark streamingSpark streaming
Spark streamingWhiteklay
 
Scaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitterScaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitterlohitvijayarenu
 
Managing 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in CloudManaging 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in Cloudlohitvijayarenu
 
Real-Time Attribution with Structured Streaming and Databricks Delta with Car...
Real-Time Attribution with Structured Streaming and Databricks Delta with Car...Real-Time Attribution with Structured Streaming and Databricks Delta with Car...
Real-Time Attribution with Structured Streaming and Databricks Delta with Car...Databricks
 
How @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudlohitvijayarenu
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit
 
Case Study Real Time Olap Cubes
Case Study Real Time Olap CubesCase Study Real Time Olap Cubes
Case Study Real Time Olap Cubesmister_zed
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on ReadDatabricks
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil AmbagadeSigmoid
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data AppTrieu Nguyen
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...newmooxx
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for HadoopHBaseCon
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Databricks
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionDatabricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoopColin Su
 

Mais procurados (20)

Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta Lake
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
 
Bellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark StreamingBellevue Big Data meetup: Dive Deep into Spark Streaming
Bellevue Big Data meetup: Dive Deep into Spark Streaming
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Scaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitterScaling HDFS for Exabyte Storage@twitter
Scaling HDFS for Exabyte Storage@twitter
 
Managing 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in CloudManaging 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in Cloud
 
Real-Time Attribution with Structured Streaming and Databricks Delta with Car...
Real-Time Attribution with Structured Streaming and Databricks Delta with Car...Real-Time Attribution with Structured Streaming and Databricks Delta with Car...
Real-Time Attribution with Structured Streaming and Databricks Delta with Car...
 
How @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloud
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
 
Case Study Real Time Olap Cubes
Case Study Real Time Olap CubesCase Study Real Time Olap Cubes
Case Study Real Time Olap Cubes
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Operating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in ProductionOperating and Supporting Delta Lake in Production
Operating and Supporting Delta Lake in Production
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoop
 

Semelhante a Optimizing Geospatial Operations with Server-side Programming in HBase and Accumulo

Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit
 
2017 PLSC Track: Using a Standard Version of ArcMap with External VRS Recieve...
2017 PLSC Track: Using a Standard Version of ArcMap with External VRS Recieve...2017 PLSC Track: Using a Standard Version of ArcMap with External VRS Recieve...
2017 PLSC Track: Using a Standard Version of ArcMap with External VRS Recieve...GIS in the Rockies
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorSubhas Kumar Ghosh
 
Spatialware_2_Sql08
Spatialware_2_Sql08Spatialware_2_Sql08
Spatialware_2_Sql08Mike Osbourn
 
Spatial Data Integrator - Software Presentation and Use Cases
Spatial Data Integrator - Software Presentation and Use CasesSpatial Data Integrator - Software Presentation and Use Cases
Spatial Data Integrator - Software Presentation and Use Casesmathieuraj
 
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...Dave Stokes
 
Skyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed EnvironmentSkyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed EnvironmentIJMER
 
user_defined_functions_forinterpolation
user_defined_functions_forinterpolationuser_defined_functions_forinterpolation
user_defined_functions_forinterpolationsushanth tiruvaipati
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperDerek Diamond
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010Cloudera, Inc.
 
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTES
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTEScuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTES
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTESSubhajit Sahu
 
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)Subhajit Sahu
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Gis capabilities on Big Data Systems
Gis capabilities on Big Data SystemsGis capabilities on Big Data Systems
Gis capabilities on Big Data SystemsAhmad Jawwad
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduceShrihari Rathod
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...IOSR Journals
 
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf pointsocporacledba
 
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf pointsdba3003
 

Semelhante a Optimizing Geospatial Operations with Server-side Programming in HBase and Accumulo (20)

Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P...
 
Druid
DruidDruid
Druid
 
2017 PLSC Track: Using a Standard Version of ArcMap with External VRS Recieve...
2017 PLSC Track: Using a Standard Version of ArcMap with External VRS Recieve...2017 PLSC Track: Using a Standard Version of ArcMap with External VRS Recieve...
2017 PLSC Track: Using a Standard Version of ArcMap with External VRS Recieve...
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
Spatialware_2_Sql08
Spatialware_2_Sql08Spatialware_2_Sql08
Spatialware_2_Sql08
 
Spatial Data Integrator - Software Presentation and Use Cases
Spatial Data Integrator - Software Presentation and Use CasesSpatial Data Integrator - Software Presentation and Use Cases
Spatial Data Integrator - Software Presentation and Use Cases
 
19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE
 
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
 
Skyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed EnvironmentSkyline Query Processing using Filtering in Distributed Environment
Skyline Query Processing using Filtering in Distributed Environment
 
user_defined_functions_forinterpolation
user_defined_functions_forinterpolationuser_defined_functions_forinterpolation
user_defined_functions_forinterpolation
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White Paper
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
 
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTES
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTEScuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTES
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs : NOTES
 
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)
cuSTINGER: Supporting Dynamic Graph Aigorithms for GPUs (NOTES)
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Gis capabilities on Big Data Systems
Gis capabilities on Big Data SystemsGis capabilities on Big Data Systems
Gis capabilities on Big Data Systems
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
 
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf points
 
Informatica perf points
Informatica perf pointsInformatica perf points
Informatica perf points
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart Cities
 

Último

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dashnarutouzumaki53779
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your QueriesExploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your QueriesSanjay Willie
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Último (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dash
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your QueriesExploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
Exploring ChatGPT Prompt Hacks To Maximally Optimise Your Queries
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Optimizing Geospatial Operations with Server-side Programming in HBase and Accumulo

  • 1. Optimizing Geospatial Operations with Server-side Programming in HBase and Accumulo James Hughes, CCRi
  • 2. James Hughes ● CCRi’s Director of Open Source Programs ● Working in geospatial software on the JVM for the last 7 years ● GeoMesa core committer / product owner ● SFCurve project lead ● JTS committer ● Contributor to GeoTools and GeoServer
  • 3. ● Background / Warm-up / What we are talking about ○ What is GeoMesa? ○ Quick Demo ● General Implementation Details ○ Indexing on Accumulo/HBase with Space Filling Curves ○ Filtering/transforming ■ Applying secondary filters ■ Changing output (projections / format changes) ○ Aggregations ■ Heatmaps ■ Stats ● Database specifics ○ Accumulo Implementation details ○ HBase Implementation details Talk outline
  • 4. Motivation ● What is geospatial? ● IoT based data examples?
  • 5. Spatial Data Types Points Locations Events Instantaneous Positions Lines Road networks Voyages Trips Trajectories Polygons Administrative Regions Airspaces
  • 7. Topology Operations 7 Algorithms ● Convex Hull ● Buffer ● Validation ● Dissolve ● Polygonization ● Simplification ● Triangulation ● Voronoi ● Linear Referencing ● and more...
  • 9. What is GeoMesa? A suite of tools for streaming, persisting, managing, and analyzing spatio- temporal data at scale
  • 10. What is GeoMesa? A suite of tools for streaming, persisting, managing, and analyzing spatio- temporal data at scale
  • 11. What is GeoMesa? A suite of tools for streaming, persisting, managing, and analyzing spatio-temporal data at scale
  • 12. What is GeoMesa? A suite of tools for streaming, persisting, managing, and analyzing spatio- temporal data at scale
  • 13. What is GeoMesa? A suite of tools for streaming, persisting, managing, and analyzing spatio- temporal data at scale
  • 15. Live Demo! ● Filtering by spatio-temporal constraints ● Filtering by attributes ● Aggregations ● Transformations
  • 16. Indexing Geospatial Data ● Key Design using Space Filling Curves
  • 17. ● Goal: Index 2+ dimensional data ● Approach: Use Space Filling Curves Space Filling Curves (in one slide!)
  • 18. ● Goal: Index 2+ dimensional data ● Approach: Use Space Filling Curves ● First, ‘grid’ the data space into bins. Space Filling Curves (in one slide!)
  • 19. ● Goal: Index 2+ dimensional data ● Approach: Use Space Filling Curves ● First, ‘grid’ the data space into bins. ● Next, order the grid cells with a space filling curve. ○ Label the grid cells by the order that the curve visits the them. ○ Associate the data in that grid cell with a byte representation of the label. Space Filling Curves (in one slide!)
  • 20. ● Goal: Index 2+ dimensional data ● Approach: Use Space Filling Curves ● First, ‘grid’ the data space into bins. ● Next, order the grid cells with a space filling curve. ○ Label the grid cells by the order that the curve visits the them. ○ Associate the data in that grid cell with a byte representation of the label. ● We prefer “good” space filling curves: ○ Want recursive curves and locality. Space Filling Curves (in one slide!)
  • 21. ● Goal: Index 2+ dimensional data ● Approach: Use Space Filling Curves ● First, ‘grid’ the data space into bins. ● Next, order the grid cells with a space filling curve. ○ Label the grid cells by the order that the curve visits the them. ○ Associate the data in that grid cell with a byte representation of the label. ● We prefer “good” space filling curves: ○ Want recursive curves and locality. ● Space filling curves have higher dimensional analogs. Space Filling Curves (in one slide!)
  • 22. To query for points in the grey rectangle, the query planner enumerates a collection of index ranges which cover the area. Note: Most queries won’t line up perfectly with the gridding strategy. Further filtering can be run on the tablet/region servers (next section) or we can return ‘loose’ bounding box results (likely more quickly). Query planning with Space Filling Curves
  • 23. Server-Side Optimizations Filtering and transforming records ● Pushing down data filters ○ Z2/Z3 filter ○ CQL Filters ● Projections
  • 24. Filtering and transforming records overview Using Accumulo iterators and HBase filters, it is possible to filter and map over the key-values pairs scanned. This will let us apply fine-grained spatial filtering, filter by secondary predicates, and implement projections.
  • 25. Pushing down filters Let’s consider a query for tankers which are inside a bounding box for a given time period. GeoMesa’s Z3 index is designed to provide a set of key ranges to scan which will cover the spatio-temporal range. Additional information such as the vessel type is part of the value. Using server-side programming, we can teach Accumulo and HBase how to understand the records and filter out undesirable records. This reduces network traffic and distributes the work.
  • 26. Projection To handle projections in a query, Accumulo Iterators and HBase Filters can change the returned key-value pairs. Changing the key is a bad idea. Changing the value allows for GeoMesa to return a subset of the columns that a user is requesting.
  • 27. GeoMesa Server-Side Filters ● Z2/Z3 filter ○ Scan ranges are not decomposed enough to be very accurate - fast bit-wise comparisons on the row key to filter out-of-bounds data ● CQL/Transform filter ○ If a predicate is not handled by the scan ranges or Z filters, then slower GeoTools CQL filters are applied to the serialized SimpleFeature in the row value ○ Relational projections (transforms) applied to reduce the amount of data sent back ● Other specialized filters ○ Age-off for expiring rows based on a SimpleFeature attribute ○ Attribute-key-value for populating a partial SimpleFeature with an attribute value from the row ○ Visibility filter for merging columns into a SimpleFeature when using attribute-level visibilities
  • 29. Aggregations Using Accumulo Iterators and HBase coprocessors, it is possible to aggregate multiple key-value pairs into one response. Effectively, this lets one implement map and reduce algorithms. These aggregations include computing heatmaps, stats, and custom data formats. The ability to aggregate data can be composed with filtering and projections.
  • 30. GeoMesa Aggregation Abstractions Aggregation logic is implemented in a shared module, based on a lifecycle of 1. Initialization 2. observing some number of features 3. aggregating a result. This paradigm is easily adapted to the specific implementations required by Accumulo and HBase. Notably, all the algorithms we describe work in a single pass over the data.
  • 31. GeoMesa Aggregation Abstractions The main logic is contained in the AggregatingScan class:
  • 32. Visualization Example: Heatmaps Without powerful visualization options, big data is big nonsense. Consider this view of shipping in the Mediterranean sea
  • 33. Visualization Example: Heatmaps Without powerful visualization options, big data is big nonsense. Consider this view of shipping in the Mediterranean sea
  • 34. Generating Heatmaps Heatmaps are implemented in DensityScan. For the scan, we set up a 2D grid array representing the pixels to be displayed. On the region/tablet servers, each feature increments the count of any cells intersecting its geometry. The resulting grid is returned as a serialized array of 64-bit integers, minimizing the data transfer back to the client. The client process merges the grids from each scan range, then normalizes the data to produce an image. Since less data is transmitted, heatmaps are generally faster.
  • 35. Statistical Queries We support a flexible stats API that includes counts, min/max values, enumerations, top-k (StreamSummary), frequency (CountMinSketch), histograms and descriptive statistics. We use well-known streaming algorithms backed by data structures that can be serialized and merged together. Statistical queries are implemented in StatsScan. On the region/tablet servers, we set up the data structure and then add each feature as we scan. The client receives the serialized stats, merges them together, and displays them as either JSON or a Stat instance that can be accessed programmatically.
  • 36. Arrow Format Apache Arrow is a columnar, in-memory data format that GeoMesa supports as an output type. In particular, it can be used to drive complex in-browser visualizations. Arrow scans are implemented in ArrowScan. With Arrow, the data returned from the region/tablet servers is similar in size to a normal query. However, the processing required to generate Arrow files can be distributed across the cluster instead of being done in the client. As we scan, each feature is added to an in-memory Arrow vector. When we hit the configured batch size, the current vector is serialized into the Arrow IPC format and sent back to the client. All the client needs to do is to create a header and then concatenate the batches into a single response.
  • 38. Row Values Our first approach was to store each SimpleFeature attribute in a separate column. However, this proved to be slow to scan. Even when skipping columns for projections, they are still loaded off disk. Column groups seem promising, but they kill performance if you query more than one.
  • 39. Row Values Our second (and current) approach is to store the entire serialized SimpleFeature in one column. Further optimizations: ● Lazy deserialization - SimpleFeature implementation that wraps the row value and only deserializes each attribute as needed ● Feature ID is already stored in the row key to prevent row collisions, so don’t also store it in the row value ● Use BSON for JSON serialization, along with JsonPath extractors ● Support for TWKB geometry serialization to save space
  • 40. Tables/Compactions When dealing with streaming data sources, continuously writing data to a table will cause a lot of compactions. Table partitioning can mitigate this by creating a new table per time period (e.g. day/week), extracted from the SimpleFeature. Generally only the most recent table(s) will be compacted. For frequent updates to existing features, the GeoMesa Lambda store uses Kafka as a medium-term cache before persisting to the key-value store. This reduces the cluster load significantly.
  • 41. Accumulo Server Side Programming ● Accumulo Iterator Review ● GeoMesa’s Accumulo iteraors
  • 42. “Iterators provide a modular mechanism for adding functionality to be executed by TabletServers when scanning or compacting data. This allows users to efficiently summarize, filter, and aggregate data.” -- Accumulo 1.7 documentation Part of the modularity is that the iterators can be stacked: t the output of one can be wired into the next. Example: The first iterator might read from disk, the second could filter with Authorizations, and a final iterator could filter by column family. Other notes: ● Iterators provided a sorted view of the key/values. Accumulo Iterators
  • 43. A request to GeoMesa consists of two broad pieces: 1. A filter restricting the data to act on, e.g.: a. Records in Maryland with ‘Accumulo’ in the text field. b. Records during the first week of 2016. 2. A request for ‘how’ to return the data, e.g.: a. Return the full records b. Return a subset of the record (either a projection or ‘bin’ file format) c. Return a histogram d. Return a heatmap / kernel density Generally, a filter can be handled partially by selecting which ranges to scan; the remainder can be handled by an Iterator. Modifications to selected data can also be handled by a GeoMesa Iterator. GeoMesa Data Requests
  • 44. The first pass of GeoMesa iterators separated concerns into separate iterators. The GeoMesa query planner assembled a stack of iterators to achieve the desired result. Initial GeoMesa Iterator design Image from “Spatio-temporal Indexing in Non-relational Distributed Databases” by Anthony Fox, Chris Eichelberger, James Hughes, Skylar Lyon
  • 45. The key benefit to having decomposed iterators is that they are easier to understand and re-mix. In terms of performance, each one needs to understand the bytes in the Key and Value. In many cases, this will lead to additional serialization/deserialization. Now, we prefer to write Iterators which handle transforming the underlying data into what the client code is expecting in one go. Second GeoMesa Iterator design
  • 46. 1. Using fewer iterators in the stack can be beneficial 2. Using lazy evaluation / deserialization for filtering Values can power speed improvements. 3. Iterators take in Sorted Keys + Values and *must* produce Sorted Keys and Values. Lessons learned about Iterators
  • 47. HBase Server Side Programming ● HBase Filter and Coprocessor Review ● GeoMesa HBase Filter ● GeoMesa HBase Coprocessor
  • 48. HBase Filter Info HBase filters are restricted to the ability to skip/include rows, and to transform a row before returning it. Anything more complicated requires a Coprocessor. In contrast to Accumulo, where iterators are configured with a map of options, HBase requires custom serialization code for each filter implementation.
  • 49. HBase Filter Info The main GeoMesa filters are: ● org.locationtech.geomesa.hbase.filters.CqlTransformFilter ○ Filters rows based on GeoTools CQL ○ Transforms rows based on relational projections ● org.locationtech.geomesa.hbase.filters.Z2HBaseFilter ○ Compares Z-values against the row key ● org.locationtech.geomesa.hbase.filters.Z3HBaseFilter ○ Compares Z-values against the row key
  • 50. HBase Coprocessor Info Coprocessors are not trivial to implement or invoke, and can starve your cluster if it is not configured correctly. GeoMesa implements a harness to invoke a coprocessor, receive the results, and handle any errors: ● org.locationtech.geomesa.hbase.coprocessor.GeoMesaCoprocessor An adapter layer links the common aggregating code to the coprocessor API: ● org.locationtech.geomesa.hbase.coprocessor.aggregators.HBaseAggregato r
  • 51. HBase Coprocessor Info GeoMesa defines a single Protobuf coprocessor endpoint, modeled around the Accumulo iterator lifecycle. The aggregator class and a map of options are passed to the endpoint. Each aggregating scan requires a trivial adapter implementation: ● HBaseDensityAggregator ● HBaseStatsAggregator ● HBaseArrowAggregator
  • 52. Thanks! James Hughes ● jhughes@ccri.com ● http://geomesa.org ● http://gitter.im/locationtech/geomesa
  • 53. Backup Slides Integration with MapReduce / Spark ● GeoMesa + Spark Setup ● GeoMesa + Spark Analytics ● GeoMesa powered notebooks (Jupyter and Zeppelin)
  • 54. Using Accumulo Iterators, we’ve seen how one can easily perform simple ‘MapReduce’ style jobs without needing more infrastructure. NB: Those tasks are limited. One can filter inputs, transform/map records and aggregate partial results on each tablet server. To implement more complex processes, we look to MapReduce and Spark. GeoMesa MapReduce and Spark Support
  • 55. Using Accumulo Iterators, we’ve seen how one can easily perform simple ‘MapReduce’ style jobs without needing more infrastructure. NB: Those tasks are limited. One can filter inputs, transform/map records and aggregate partial results on each tablet server. To implement more complex processes, we look to MapReduce and Spark. Accumulo Implements the MapReduce InputFormat interface. GeoMesa MapReduce and Spark Support
  • 56. Using Accumulo Iterators, we’ve seen how one can easily perform simple ‘MapReduce’ style jobs without needing more infrastructure. NB: Those tasks are limited. One can filter inputs, transform/map records and aggregate partial results on each tablet server. To implement more complex processes, we look to MapReduce and Spark. Accumulo Implements the MapReduce InputFormat interface. Spark provides a way to change InputFormats into RDDs. GeoMesa MapReduce and Spark Support
  • 57. Using Accumulo Iterators, we’ve seen how one can easily perform simple ‘MapReduce’ style jobs without needing more infrastructure. NB: Those tasks are limited. One can filter inputs, transform/map records and aggregate partial results on each tablet server. To implement more complex processes, we look to MapReduce and Spark. Accumulo Implements the MapReduce InputFormat interface. GeoMesa MapReduce and Spark Support
  • 58. GeoMesa Spark Example 1: Time SeriesStep 1: Get an RDD[SimpleFeature] Step 2: Calculate the time series Step 3: Plot the time series in R.
  • 59. Using one dataset (country boundaries) to group another (here, GDELT) is effectively a join. Our summer intern, Atallah, worked out the details of doing this analysis in Spark and created a tutorial and blog post. This picture shows ‘stability’ of a region from GDELT Goldstein values GeoMesa Spark Example 2: Aggregating by Regions http://www.ccri.com/2016/08/17/new-geomesa-tutorial-aggregating-visualizing-data/ http://www.geomesa.org/documentation/tutorials/shallow-join.html
  • 60. GeoMesa Spark Example 3: Aggregating Tweets about #traffic Virginia Polygon CQL GeoMesa RDD Aggregate by County Calculate ratio of #traffic Store back to GeoMesa
  • 61. GeoMesa Spark Example 3: Aggregating Tweets about #traffic #traffic by Virginia county Darker blue has a higher count
  • 62. Interactive Data Discovery at Scale in GeoMesa Notebooks Writing (and debugging!) MapReduce / Spark jobs is slow and requires expertise. A long development cycle for an analytic saps energy and creativity. The answer to both is interactive ‘notebook’ servers like Apache Zeppelin and Jupyter (formerly iPython Notebook).
  • 63. Interactive Data Discovery at Scale in GeoMesa Notebooks Writing (and debugging!) MapReduce / Spark jobs is slow and requires expertise. A long development cycle for an analytic saps energy and creativity. The answer to both is interactive ‘notebook’ servers like Apache Zeppelin and Jupyter There are two big things to work out: 1. Getting the right libraries on the classpath. 2. Wiring up visualizations.
  • 64. Interactive Data Discovery at Scale in GeoMesa Notebooks GeoMesa Notebook Roadmap: ● Improved JavaScript integration ● D3.js and other visualization libraries ● OpenLayers and Leaflet ● Python Bindings
  • 66. Most approaches to indexing non-point geometries involve covering the geometry with a number of grid cells and storing a copy with each index. This means that the client has to deduplicate results which is expensive. Indexing non-point geometries: XZ Index
  • 67. Most approaches to indexing non-point geometries involve covering the geometry with a number of grid cells and storing a copy with each index. This means that the client has to deduplicate results which is expensive. Böhm, Klump, and Kriegel describe an indexing strategy allows such geometries to be stored once. GeoMesa has implemented this strategy in XZ2 (spatial-only) and XZ3 (spatio-temporal) tables. The key is to store data by resolution, separate geometries by size, and then index them by their lower left corner. This does require consideration on the query planning side, but avoiding deduplication is worth the trade-off. Indexing non-point geometries: XZ Index For more details, see Böhm, Klump, and Kriegel. “XZ-ordering: a space-filling curve for objects with spatial extension.” 6th. Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China. (http://www.dbs.ifi.lmu.de/Publikationen/Boehm/Ordering_99.pdf)
  • 69. Here the viewport is used as the spatial bounds for the query. The time range is a 12 hour period on Monday. Query by bounding box
  • 70. Query by polygon Here we further restrict the query region by an arbitrary polygon
  • 71. Query by polygon and vessel type Here, we have added a clause to restrict to cargo vessels
  • 72. Query by polygon and vessel type (heatmap) Heatmaps can be generated
  • 73. Query by polygon and vessel type (Apache Arrow format) Data can be returned in a number of formats. The Apache Arrow format allows for rapid access in JavaScript. Here, points are colored by callsign.
  • 74. Query by polygon and vessel type (Apache Arrow format) Apache Arrow allows for in browser data exploration. This histogram shows callsigns grouped by country. Selections in the histogram can influence the map.