OLAP stands for online analytical processing. It focuses on reporting and, more broadly, on answering schema-oriented queries quickly. Example queries are “how many distinct infections were seen for a threat in a given month” or “what is the maximum duration for which a particular infection was seen in my enterprise last month”.
Contrast this with OLTP (online transaction processing), where storing a fast stream of transactional elements is more important.
When we talk about OLAP, the star schema is the first thing that comes to mind; it is a central concept in relational OLAP. Modeling OLAP data as a star schema means separating the data into fact and dimension tables. The central fact table references a couple of dimensions, which together identify a fact, and holds one or more measures that we want to calculate. A measure is often a derived field and can be computed with SQL queries that use GROUP BY and aggregate functions.
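A minimal sketch of a star schema, using Python's built-in sqlite3 module rather than the Spark and HBase stack the talk describes. The table names (fact_infection, dim_threat, dim_month), the column names, and the sample rows are all hypothetical and exist only to show a fact table joined to its dimensions and measures derived with GROUP BY.

```python
import sqlite3


def build_star_schema():
    """Create a tiny in-memory star schema with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_threat (threat_id INTEGER PRIMARY KEY, threat_name TEXT);
        CREATE TABLE dim_month  (month_id  INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE fact_infection (
            threat_id    INTEGER REFERENCES dim_threat(threat_id),
            month_id     INTEGER REFERENCES dim_month(month_id),
            host         TEXT,
            duration_sec INTEGER
        );
    """)
    conn.executemany("INSERT INTO dim_threat VALUES (?, ?)",
                     [(1, "trojan"), (2, "worm")])
    conn.executemany("INSERT INTO dim_month VALUES (?, ?)",
                     [(201601, "2016-01"), (201602, "2016-02")])
    conn.executemany("INSERT INTO fact_infection VALUES (?, ?, ?, ?)", [
        (1, 201601, "host-a", 120),
        (1, 201601, "host-b", 300),
        (1, 201601, "host-a", 60),
        (2, 201601, "host-c", 45),
        (1, 201602, "host-a", 500),
    ])
    return conn


def measures_by_threat_and_month(conn):
    """Derive two measures per (threat, month): distinct hosts and max duration."""
    return conn.execute("""
        SELECT t.threat_name, m.label,
               COUNT(DISTINCT f.host) AS distinct_hosts,
               MAX(f.duration_sec)    AS max_duration
        FROM fact_infection f
        JOIN dim_threat t ON t.threat_id = f.threat_id
        JOIN dim_month  m ON m.month_id  = f.month_id
        GROUP BY t.threat_name, m.label
        ORDER BY t.threat_name, m.label
    """).fetchall()


if __name__ == "__main__":
    for row in measures_by_threat_and_month(build_star_schema()):
        print(row)
```

The fact table stores only keys and raw values; the measures are computed at query time, which is the ROLAP side of the picture.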
We use Spark and HBase to implement a hybrid OLAP system. We call it hybrid because we store data in both relational (ROLAP) and multi-dimensional (MOLAP) formats.
MOLAP materialization is best visualized as a lattice. Each circular point in the diagram is called a tile or cuboid. Each cuboid is equivalent to a GROUP BY clause in SQL; aggregates such as SUM or COUNT are implicit and not shown in the diagram. Reading the lattice from bottom to top, each step drops one of the three fields (Infection_type, country, monthId): the 2-D cuboids are obtained by dropping one field at a time from the 3-D base cuboid. This is called roll-up. Conversely, starting from the top, i.e. the 0-D cuboid, and moving downwards, each step adds one field to the grouping; this is called drill-down. There is a sizable literature on how to do roll-up and drill-down efficiently and on which cuboids to materialize. I strongly recommend Han and Kamber's Data Mining book and the lattice paper by Harinarayan et al. for a deeper understanding of this domain.
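The lattice can be sketched in a few lines of plain Python: every subset of the three fields is one cuboid, and materializing a cuboid amounts to a GROUP BY on those fields. The sample rows and the helper names (enumerate_cuboids, materialize) are hypothetical, and COUNT is used as the implicit aggregate; this is an illustration of the structure, not the Spark job used in the talk.

```python
from collections import Counter
from itertools import combinations

DIMENSIONS = ("Infection_type", "country", "monthId")


def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions, from the 0-D apex () to the base cuboid."""
    return [combo
            for k in range(len(dimensions) + 1)
            for combo in combinations(dimensions, k)]


def materialize(rows, cuboid):
    """COUNT(*) grouped by the cuboid's fields; the apex cuboid has a single cell."""
    return Counter(tuple(row[d] for d in cuboid) for row in rows)


# Hypothetical fact rows.
SAMPLE_ROWS = [
    {"Infection_type": "trojan", "country": "IN", "monthId": 201601},
    {"Infection_type": "trojan", "country": "US", "monthId": 201601},
    {"Infection_type": "worm",   "country": "IN", "monthId": 201602},
    {"Infection_type": "trojan", "country": "IN", "monthId": 201602},
]


def build_lattice(rows, dimensions=DIMENSIONS):
    """Materialize every cuboid of the lattice (2^n of them for n fields)."""
    return {c: materialize(rows, c) for c in enumerate_cuboids(dimensions)}


if __name__ == "__main__":
    for cuboid, cells in build_lattice(SAMPLE_ROWS).items():
        print(cuboid, dict(cells))
```

Materializing everything is only feasible for a handful of fields; choosing which cuboids to precompute is the problem the lattice literature addresses.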
2. Bangladesh Bank Chief Resigns After Cyber Theft of $81 million
New York Times (Mar 15, 2016)
Cybercrime is a key fraud risk in India
ey.com (Jan 20, 2016)
Target settles for $39 million over data breach
CNN.com (Dec 2, 2015)
Anthem is warning consumers about its huge data breach
Los Angeles Times (Mar 2015)
Ashley Madison
Anyone Here!!
Why Should You Care!
3. Incident Response
Identify root cause and fix vulnerabilities
Intrusion Detection
Monitor network and systems for malicious activities
Alert Prioritization
Reduce false positives to stop the threat with highest impact
Predicting Compromises
Predict attacks based on vulnerability, command & control activity and past infections
Access Analytics
Isolate unusual user behavior e.g. concurrent geographical login
Simulation
Simulate various attacks by doing internal pen testing and take precautions based on log mining
Simulate insider attack on data loss prevention software and take precautions based on its logs
What is Security Analytics
4. No real time query on Petabytes
Reduce data in stages like a funnel
Web Scale - Dealing with Petabytes
[Diagram labels: Streaming Logs, Kafka, Kafka Client, Log Parser, Hive, Semi Aggregates, HBase, MOLAP Cubes]
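The funnel idea on this slide can be illustrated with a small sketch: raw log lines are parsed, malformed lines are dropped, and the parsed records are reduced to semi-aggregates keyed by the cube fields. The log format (timestamp,host,infection_type,country), the function names, and the sample lines are assumptions made for this example; in the pipeline on the slide these stages would run on Kafka, Hive and HBase rather than in one Python process.

```python
from collections import Counter


def parse_line(line):
    """Parse 'timestamp,host,infection_type,country'; return None if malformed."""
    parts = line.strip().split(",")
    if len(parts) != 4:
        return None
    timestamp, host, infection_type, country = parts
    return {
        # '2016-01-03T10:00:00' -> '201601'
        "monthId": timestamp[:7].replace("-", ""),
        "host": host,
        "Infection_type": infection_type,
        "country": country,
    }


def semi_aggregate(records):
    """Stage 2 of the funnel: one counter per (Infection_type, country, monthId)."""
    return Counter(
        (r["Infection_type"], r["country"], r["monthId"]) for r in records
    )


# Hypothetical raw log lines, including one that fails to parse.
RAW_LINES = [
    "2016-01-03T10:00:00,host-a,trojan,IN",
    "2016-01-03T10:05:00,host-b,trojan,IN",
    "garbage line",
    "2016-01-09T08:00:00,host-a,worm,US",
    "2016-02-01T00:00:00,host-c,trojan,IN",
]


def run_funnel(lines):
    """Return the parsed records and the semi-aggregates built from them."""
    records = [r for r in map(parse_line, lines) if r is not None]
    return records, semi_aggregate(records)


if __name__ == "__main__":
    records, semi = run_funnel(RAW_LINES)
    print(len(RAW_LINES), "raw lines ->", len(records), "records ->", len(semi), "cells")
```

Each stage emits less data than it reads, so later stages such as cube building never have to touch the raw logs.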
5. Relational OLAP (ROLAP)
SQL-style queries from client front-end tools against a relational back-end database
ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services
ROLAP technology tends to have greater scalability than MOLAP technology
Multi-dimensional OLAP (MOLAP)
Query materialized views; think of Partially Ordered Sets (POSET)
The advantage of using a data cube is that it allows fast indexing to pre-computed summarized data, which is usually much faster than ROLAP
Difficult to scale because of the “curse of dimensionality”
Hybrid OLAP
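To put a number on the “curse of dimensionality”: a cube over n flat dimensions has 2^n cuboids, and when dimension i has L_i hierarchy levels (not counting the virtual "all" level) the count becomes the product of (L_i + 1). The short sketch below only evaluates that counting formula; the function name and the example dimension counts are illustrative assumptions.

```python
from math import prod


def cuboid_count(levels_per_dimension):
    """Number of cuboids when dimension i has levels_per_dimension[i] hierarchy levels.

    Each dimension contributes its own levels plus one virtual 'all' level,
    so the total is the product of (levels + 1) over all dimensions.
    """
    return prod(levels + 1 for levels in levels_per_dimension)


if __name__ == "__main__":
    print(cuboid_count([1, 1, 1]))   # three flat dimensions
    print(cuboid_count([1] * 20))    # twenty flat dimensions
    print(cuboid_count([4] * 10))    # ten dimensions, four levels each
```

Growth is exponential in the number of dimensions, which is why a full MOLAP materialization does not scale and a hybrid design keeps a relational fallback.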
6. Visualization of MOLAP as Lattice
0-D (apex) cuboid
1-D cuboids
2-D cuboids
3-D (base) cuboid
Infection_type
monthId
country
(Infection_type,monthId)
(country,monthId)
(Infection_type,country)
(Infection_type,country,monthId)
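Moving up the lattice does not require the raw data: a coarser cuboid can be derived from any finer one by summing over the dropped fields. The sketch below shows this for counts. The roll_up helper and the sample cell values are hypothetical, and the approach only holds for measures that can be combined by addition, such as SUM and COUNT; it does not hold for exact distinct counts.

```python
from collections import Counter

BASE_FIELDS = ("Infection_type", "country", "monthId")

# Hypothetical cells of the 3-D base cuboid: key -> COUNT(*).
BASE_CUBOID = Counter({
    ("trojan", "IN", 201601): 2,
    ("trojan", "US", 201601): 1,
    ("worm",   "IN", 201602): 1,
    ("trojan", "IN", 201602): 3,
})


def roll_up(cuboid_fields, cells, keep_fields):
    """Aggregate a finer cuboid into a coarser one by summing over dropped fields."""
    positions = [cuboid_fields.index(f) for f in keep_fields]
    coarser = Counter()
    for key, count in cells.items():
        coarser[tuple(key[p] for p in positions)] += count
    return coarser


if __name__ == "__main__":
    two_d = roll_up(BASE_FIELDS, BASE_CUBOID, ("Infection_type", "monthId"))
    one_d = roll_up(("Infection_type", "monthId"), two_d, ("Infection_type",))
    apex = roll_up(("Infection_type",), one_d, ())
    print(dict(two_d), dict(one_d), dict(apex))
```

Rolling up in two steps gives the same cells as rolling up directly from the base cuboid, which is what lets a system materialize only some cuboids and derive the rest.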
8. HyperLogLog
Used for approximate count distinct queries
Store HLL hash in 5 bytes in HBase columns
Apply monoid SUM pattern to rollup
Bloom Filter
Used for checking whether an incoming stream element is “not” a member of a set
False negatives never happen, i.e. a “definitely not in set” answer is always correct
Also used by HBase to ascertain whether an input row key is part of an HFile
Count-Min Sketch
Used for counting frequencies of specific elements in sub-linear space
Twitter’s Algebird library with Spark for HLL and CMS implementation
Probabilistic Data Structures
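Two of the structures on this slide are simple enough to sketch in plain Python. This is not the Algebird implementation mentioned above: the sizes, the SHA-256 based hashing, and the class and method names are assumptions chosen for readability. HyperLogLog is not implemented here; the merge method on the Count-Min Sketch is included to illustrate the kind of associative combine that lets sketches be rolled up.

```python
import hashlib


def _hashes(item, count, modulus):
    """Derive `count` deterministic positions in [0, modulus) for an item."""
    positions = []
    for seed in range(count):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % modulus)
    return positions


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def add(self, item):
        for pos in _hashes(item, self.num_hashes, self.num_bits):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not in the set"; True may be a false positive.
        return all(self.bits[pos]
                   for pos in _hashes(item, self.num_hashes, self.num_bits))


class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row, col in enumerate(_hashes(item, self.depth, self.width)):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the row minimum never undercounts.
        return min(self.table[row][col]
                   for row, col in enumerate(_hashes(item, self.depth, self.width)))

    def merge(self, other):
        """Cell-wise sum: the sketch of both input streams combined."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.table[r] = [a + b for a, b in zip(self.table[r], other.table[r])]
        return merged


if __name__ == "__main__":
    jan, feb = CountMinSketch(), CountMinSketch()
    jan.add("trojan", 3)
    feb.add("trojan", 2)
    feb.add("worm")
    print(jan.merge(feb).estimate("trojan"))
```

Because merging is a cell-wise sum, per-month sketches can be combined into a per-year sketch without revisiting the underlying events, which is the property that makes these structures fit the roll-up model.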
9. Real-Time Query Response Server
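[Flow diagram: an incoming query reaches the Query Controller, which checks "Is cuboid found?". Yes: HBaseQuery through the Calcite HBase adapter. No: SparkSQLQuery through the Spark driver on Jetty, over HDFS/Hive/HBase. The query response is returned to the caller.]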
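The routing decision on this slide can be sketched as follows. The QueryController class here is a hypothetical stand-in, and the two backends are plain callables rather than the Calcite HBase adapter or a Spark driver; the sketch only shows the "is the cuboid materialized?" branch, under the assumption that a query is identified by its set of GROUP BY fields.

```python
class QueryController:
    """Route a query to the precomputed cube if its cuboid exists, else fall back."""

    def __init__(self, materialized_cuboids, query_cube, query_raw):
        # Field order does not matter for a GROUP BY, so cuboids are stored as sets.
        self.materialized = {frozenset(c) for c in materialized_cuboids}
        self.query_cube = query_cube   # stand-in for the HBase (MOLAP) path
        self.query_raw = query_raw     # stand-in for the SparkSQL (ROLAP) path

    def execute(self, group_by_fields):
        cuboid = frozenset(group_by_fields)
        if cuboid in self.materialized:
            return "HBaseQuery", self.query_cube(cuboid)
        return "SparkSQLQuery", self.query_raw(cuboid)


if __name__ == "__main__":
    controller = QueryController(
        materialized_cuboids=[("Infection_type", "monthId"), ("country",)],
        query_cube=lambda c: f"cube lookup on {sorted(c)}",
        query_raw=lambda c: f"full scan grouped by {sorted(c)}",
    )
    print(controller.execute(["monthId", "Infection_type"]))
    print(controller.execute(["Infection_type", "country", "monthId"]))
```

Keeping the check in one place means new cuboids can be materialized later without changing how clients submit queries.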