© 2015 MapR Technologies 1
Follow me at @joebluems for link to code © 2015 MapR Technologies
Breach Detection with Apache Drill
Breach Happens!
Finding the Source of Compromise*
[Figure: customer transactions (M-F) across merchant locations, with each account's Sat. status marked ✔ or ✖; millions of customers, millions of merchant locations]
* The source of the compromise may not be where the fraudsters use the accounts
Apache Drill
linux> head -5 sample.json
{acct:"0",merchant:"6998",fraud:"0"}
{acct:"0",merchant:"1269",fraud:"0"}
{acct:"0",merchant:"4286",fraud:"0"}
{acct:"0",merchant:"2371",fraud:"0"}
{acct:"0",merchant:"4545",fraud:"0"}
<drill home>/bin/drill-embedded
drill> select * from `dfs`.`sample.json` limit 5;
+-------+-----------+--------+
| acct | merchant | fraud |
+-------+-----------+--------+
| 0 | 6998 | 0 |
| 0 | 1269 | 0 |
| 0 | 4286 | 0 |
| 0 | 2371 | 0 |
| 0 | 4545 | 0 |
+-------+-----------+--------+
• https://drill.apache.org
• “Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage”
• Write SQL queries to access distributed files without specifying a schema
• Note: use the backtick in the SQL (not a single quote)
Scoring Merchants with Log Likelihood
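$$LL = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} y_{ij} \log\left(\frac{y_{ij}}{m_{ij}}\right)$$
[Figure: two example 2×2 tables of FRAUD / NOT FRAUD counts, (1) M1 vs. NOT M1 and (2) M2 vs. NOT M2; counts shown include 10, 0, 0, 10,000 and 1,000, 1,000, 100,000, with scores 14.3 and 0.9013]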
• Measures how much fraud we observed beyond what should happen randomly
• Fraud counts alone do not account for the popularity of common merchants
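The slide does not spell out $m_{ij}$. Assuming it is the usual expected count under independence (row total times column total, divided by the grand total, which is also what the Java UDF later in this deck computes), the pieces can be written as below. The dot-subscript totals are notation introduced here for illustration, not taken from the slide.

```latex
m_{ij} = \frac{y_{i\cdot}\, y_{\cdot j}}{y_{\cdot\cdot}}, \qquad
y_{i\cdot} = \sum_{j=1}^{2} y_{ij}, \quad
y_{\cdot j} = \sum_{i=1}^{2} y_{ij}, \quad
y_{\cdot\cdot} = \sum_{i=1}^{2} \sum_{j=1}^{2} y_{ij}
```

Cells with $y_{ij} = 0$ contribute nothing to the sum. The UDF shown later accumulates the sum without the leading factor of 2, which rescales the scores but does not change how merchants rank.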
Drill – Count All Frauds / Non-Frauds
select sum(totalFraud) as `countFraud`,
sum(totalNonFraud) as `countNonFraud` from
( select
case when fraud='1' then 1 else 0 end as `totalFraud`,
case when fraud='0' then 1 else 0 end as `totalNonFraud`
from ( select distinct acct,fraud from `dfs`.`sample.json`)
);
+-------------+----------------+
| countFraud | countNonFraud |
+-------------+----------------+
| 5000 | 95000 |
+-------------+----------------+
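The inner `select distinct acct,fraud` collapses repeated transactions so that each account is counted once per fraud flag. A minimal plain-Java sketch of that counting logic follows, assuming the records are already in memory as `{acct, merchant, fraud}` string triples. The class name, the `countByFlag` helper and the sample rows are hypothetical and are not part of Drill.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctAccountCounts {

    /** Returns {fraudAccounts, nonFraudAccounts}; each row is {acct, merchant, fraud}. */
    public static int[] countByFlag(List<String[]> rows) {
        // Mirror "select distinct acct, fraud": keep one entry per (acct, fraud) pair.
        Set<String> distinctPairs = new HashSet<>();
        for (String[] row : rows) {
            distinctPairs.add(row[0] + "|" + row[2]);
        }
        int fraud = 0;
        int nonFraud = 0;
        for (String pair : distinctPairs) {
            String flag = pair.substring(pair.lastIndexOf('|') + 1);
            if (flag.equals("1")) {
                fraud++;        // case when fraud='1' then 1
            } else if (flag.equals("0")) {
                nonFraud++;     // case when fraud='0' then 1
            }
        }
        return new int[] {fraud, nonFraud};
    }

    public static void main(String[] args) {
        // Hypothetical rows: account 0 shops three times, account 1 twice, account 2 once.
        List<String[]> rows = Arrays.asList(
            new String[] {"0", "6998", "0"},
            new String[] {"0", "1269", "0"},
            new String[] {"0", "4286", "0"},
            new String[] {"1", "6998", "1"},
            new String[] {"1", "2371", "1"},
            new String[] {"2", "4545", "0"});
        int[] counts = countByFlag(rows);
        System.out.println("countFraud=" + counts[0] + ", countNonFraud=" + counts[1]);
    }
}
```

Six transactions reduce to three accounts here, which is why the totals in the query are account counts rather than transaction counts.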
Drill – Count Frauds at Each Merchant
select merchant, sum(merchFraud) as `merchCountFraud`,
sum(merchNonFraud) as `merchCountNonFraud` from
(select merchant,
case when fraud='1' then 1 else 0 end as `merchFraud`,
case when fraud='0' then 1 else 0 end as `merchNonFraud`
from `dfs`.`sample.json`)
group by merchant
limit 5;
+-----------+------------------+---------------------+
| merchant | merchCountFraud | merchCountNonFraud |
+-----------+------------------+---------------------+
| 6998 | 11 | 121 |
| 1269 | 8 | 130 |
| 4286 | 1 | 116 |
| 2371 | 7 | 124 |
| 4545 | 4 | 133 |
+-----------+------------------+---------------------+
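Unlike the overall totals, the per-merchant query counts every transaction row and groups by merchant. The same grouping can be sketched in plain Java, again assuming in-memory `{acct, merchant, fraud}` triples. `MerchantCounts`, `countPerMerchant` and the sample rows are made-up names and data for illustration only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MerchantCounts {

    /** Maps merchant id to {fraudCount, nonFraudCount}; each row is {acct, merchant, fraud}. */
    public static Map<String, int[]> countPerMerchant(List<String[]> rows) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (String[] row : rows) {
            // "group by merchant": one accumulator pair per merchant id.
            int[] c = counts.computeIfAbsent(row[1], k -> new int[2]);
            if ("1".equals(row[2])) {
                c[0]++;     // sum(merchFraud)
            } else if ("0".equals(row[2])) {
                c[1]++;     // sum(merchNonFraud)
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical transactions at two merchants.
        List<String[]> rows = Arrays.asList(
            new String[] {"0", "6998", "0"},
            new String[] {"1", "6998", "1"},
            new String[] {"2", "6998", "1"},
            new String[] {"3", "1269", "0"});
        for (Map.Entry<String, int[]> e : countPerMerchant(rows).entrySet()) {
            System.out.println(e.getKey() + " fraud=" + e.getValue()[0]
                + " nonFraud=" + e.getValue()[1]);
        }
    }
}
```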
Drill UDF (Java) to calculate Log-Likelihood
public void eval() {
  // Inputs: n11 = frauds at this merchant, n21 = non-frauds at this merchant,
  //         n1t = total frauds, n2t = total non-frauds
  float ll = 0.0f;
  int n12 = n1t.value - n11.value;  // frauds at all other merchants
  int n22 = n2t.value - n21.value;  // non-frauds at all other merchants
  int nt1 = n11.value + n21.value;  // everything at this merchant
  int nt2 = n12 + n22;              // everything elsewhere
  int nt = nt1 + nt2;               // grand total
  // calculate LL for non-zero elements: observed * log(observed / expected)
  if (n11.value > 0) {
    ll += n11.value * Math.log(n11.value / ((float) n1t.value * nt1 / nt)); }
  if (n21.value > 0) {
    ll += n21.value * Math.log(n21.value / ((float) n2t.value * nt1 / nt)); }
  if (n12 > 0) {
    ll += (float) n12 * Math.log(n12 / ((float) n1t.value * nt2 / nt)); }
  if (n22 > 0) {
    ll += (float) n22 * Math.log(n22 / ((float) n2t.value * nt2 / nt)); }
  // if the fraud rate at this merchant is less than the fraud rate elsewhere, set LL to zero
  if (n11.value / (float) (n11.value + n21.value) < (n12 / (float) (n12 + n22))) ll = 0;
  out.value = ll;
}
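To try the scoring logic without a Drill install, the body of `eval()` can be lifted into an ordinary static method. The sketch below is an assumption-laden restatement, not the author's UDF class: `LogLikelihoodSketch`, `score` and `term` are hypothetical names, plain `int` arguments replace the `.value` holders, and it accumulates in `double` rather than `float`.

```java
public class LogLikelihoodSketch {

    /** One term of the sum: observed * ln(observed / expected); empty cells add nothing. */
    private static double term(long observed, double expected) {
        return observed > 0 ? observed * Math.log(observed / expected) : 0.0;
    }

    /**
     * n11 = frauds at the merchant, n21 = non-frauds at the merchant,
     * n1t = total frauds, n2t = total non-frauds (same argument order as the deck's query).
     */
    public static double score(int n11, int n21, int n1t, int n2t) {
        long n12 = (long) n1t - n11;    // frauds elsewhere
        long n22 = (long) n2t - n21;    // non-frauds elsewhere
        long nt1 = (long) n11 + n21;    // everything at this merchant
        long nt2 = n12 + n22;           // everything elsewhere
        double nt = nt1 + nt2;          // grand total

        // Expected count for each cell = row total * column total / grand total.
        double ll = term(n11, n1t * (nt1 / nt))
                  + term(n21, n2t * (nt1 / nt))
                  + term(n12, n1t * (nt2 / nt))
                  + term(n22, n2t * (nt2 / nt));

        // Same guard as the UDF: a merchant with a below-average fraud rate scores zero.
        double rateHere = n11 / (double) nt1;
        double rateElsewhere = n12 / (double) nt2;
        return rateHere < rateElsewhere ? 0.0 : ll;
    }

    public static void main(String[] args) {
        // Counts for merchant 5902 from the deck's output table.
        System.out.println(score(16, 95, 5000, 95000));
    }
}
```

Called with the counts for merchant 5902 (16, 95, 5000, 95000), this should land close to the 7.0296 reported in the deck's output table; a small difference is expected because the UDF works in `float`.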
Putting it all together
select MERCH.merchant, MERCH.merchCountFraud as `n11`, MERCH.merchCountNonFraud as `n21`,
COUNTS.countFraud as `n1dot`, COUNTS.countNonFraud as `n2dot`,
loglikelihood(cast(MERCH.merchCountFraud as INT),
cast(MERCH.merchCountNonFraud as INT),
cast(COUNTS.countFraud as INT),
cast(COUNTS.countNonFraud as INT)) as `logLike` from (
select 1 as `dummy`,merchant,
sum(merchFraud) as `merchCountFraud`, sum(merchNonFraud) as `merchCountNonFraud`
from (select merchant, case when fraud='1' then 1 else 0 end as `merchFraud`,
case when fraud='0' then 1 else 0 end as `merchNonFraud` from `dfs`.`sample.json`
) group by merchant) `MERCH`
JOIN ( select 1 as `dummy`, sum(totalFraud) as `countFraud`,
sum(totalNonFraud) as `countNonFraud` from
( select case when fraud='1' then 1 else 0 end as `totalFraud`,
case when fraud='0' then 1 else 0 end as `totalNonFraud`
from ( select distinct acct,fraud from `dfs`.`sample.json`)
)) `COUNTS`
on MERCH.dummy=COUNTS.dummy
ORDER by loglike desc
limit 10;
Output from Previous Query…
+-----------+------+------+--------+--------+---------------------+
| merchant | n11 | n21 | n1dot | n2dot | logLike |
+-----------+------+------+--------+--------+---------------------+
| 5902 | 16 | 95 | 5000 | 95000 | 7.0296311378479 |
| 4666 | 17 | 118 | 5000 | 95000 | 5.880885601043701 |
| 3486 | 16 | 107 | 5000 | 95000 | 5.8762335777282715 |
| 7961 | 16 | 108 | 5000 | 95000 | 5.793434143066406 |
| 9182 | 16 | 110 | 5000 | 95000 | 5.631403923034668 |
| 7114 | 13 | 81 | 5000 | 95000 | 5.324999809265137 |
| 2127 | 16 | 115 | 5000 | 95000 | 5.222985744476318 |
| 1462 | 16 | 115 | 5000 | 95000 | 5.222985744476318 |
| 2994 | 14 | 94 | 5000 | 95000 | 5.113578796386719 |
| 5770 | 16 | 117 | 5000 | 95000 | 5.064565181732178 |
+-----------+------+------+--------+--------+---------------------+
Breaking Breaches
• Real-life example
• SQL output is processed into histogram
• Tableau chart shows number of merchants per Breach score
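The slide does not show how the SQL output was binned before charting. One simple way to turn scores into a "merchants per score" histogram is sketched below. The unit bin width, the `ScoreHistogram` class and the `histogram` helper are assumptions for illustration; the sample scores are the ten values from the output table earlier in the deck, rounded to two decimals.

```java
import java.util.Map;
import java.util.TreeMap;

public class ScoreHistogram {

    /** Counts scores per bin; the key is floor(score / binWidth), i.e. the bin index. */
    public static Map<Integer, Integer> histogram(double[] scores, double binWidth) {
        Map<Integer, Integer> bins = new TreeMap<>();
        for (double s : scores) {
            int bin = (int) Math.floor(s / binWidth);
            bins.merge(bin, 1, Integer::sum);
        }
        return bins;
    }

    public static void main(String[] args) {
        // Top-10 logLike values from the deck's output table, rounded.
        double[] scores = {7.03, 5.88, 5.88, 5.79, 5.63, 5.32, 5.22, 5.22, 5.11, 5.06};
        for (Map.Entry<Integer, Integer> e : histogram(scores, 1.0).entrySet()) {
            System.out.println("score " + e.getKey() + "-" + (e.getKey() + 1)
                + ": " + e.getValue() + " merchants");
        }
    }
}
```

A real run would feed in every merchant's score rather than only the top ten, so most of the mass would sit in the lowest bins and a breached merchant would stand out in the tail.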
Appendix
Additional Info
• Location of Code/Data Repository
– https://github.com/joebluems/BreachDetection
• Link to Blog on Breach Detection
– https://www.mapr.com/blog/identify-your-data-breach-apache-drill
• A little more on Log-Likelihood
– http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
• Drill
– Documentation: http://drill.apache.org/docs/
– UDFs: https://drill.apache.org/docs/deploying-and-using-a-hive-udf/
– Code for sample UDF: https://github.com/viadea/HiveUDF

Free Code Friday - Identify Your Data Breach with Apache Drill

  • 1. © 2015 MapR Technologies 1 Follow me at @joebluems for link to code © 2015 MapR Technologies Breach Detection with Apache Drill
  • 2. © 2015 MapR Technologies 2 Follow me at @joebluems for link to code Breach Happens!
  • 3. © 2015 MapR Technologies 3 Follow me at @joebluems for link to code Customer transactions – M-F Sat. Status ✔ ✔ ✖ ✔ ✖ Finding the Source of Compromise* * The source of the compromise may not be where the fraudsters use the accounts millions of customers millions of merchant locations
  • 4. © 2015 MapR Technologies 4 Follow me at @joebluems for link to code Apache Drill linux> head -5 sample.json {acct:"0",merchant:"6998",fraud:"0"} {acct:"0",merchant:"1269",fraud:"0"} {acct:"0",merchant:"4286",fraud:"0"} {acct:"0",merchant:"2371",fraud:"0"} {acct:"0",merchant:"4545",fraud:"0"} <drill home>/bin/drill-embedded drill> select * from `dfs`.`sample.json` limit 5; +-------+-----------+--------+ | acct | merchant | fraud | +-------+-----------+--------+ | 0 | 6998 | 0 | | 0 | 1269 | 0 | | 0 | 4286 | 0 | | 0 | 2371 | 0 | | 0 | 4545 | 0 | +-------+-----------+--------+ • https://drill.apache.org • “Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage” • Write SQL queries to access distributed files without specifying a schema • Note: use the backtick in the SQL (not a single quote)
  • 5. © 2015 MapR Technologies 5 Follow me at @joebluems for link to code Scoring Merchants with Log Likelihood LL = 2* yij log j=1 2 å i=1 2 å yij mij æ è çç ö ø ÷÷ 14.3 10 0 0 10,000 1 1 0.9013 1,000 1,000 100,000 2 2 NO T M2 NO T M1 FRAUD NOT FRAUD FRAUD NOT FRAUD • Measures how much fraud we observed beyond what should happen randomly • Fraud counts alone do not account for the popularity of common merchants
  • 6. © 2015 MapR Technologies 6 Follow me at @joebluems for link to code Drill – Count All Frauds / Non-Frauds select sum(totalFraud) as `countFraud`, sum(totalNonFraud) as `countNonFraud` from ( select case when fraud='1' then 1 else 0 end as `totalFraud`, case when fraud='0' then 1 else 0 end as `totalNonFraud` from ( select distinct acct,fraud from `dfs`.`sample.json`) ); +-------------+----------------+ | countFraud | countNonFraud | +-------------+----------------+ | 5000 | 95000 | +-------------+----------------+
  • 7. © 2015 MapR Technologies 7 Follow me at @joebluems for link to code Drill – Count Frauds at Each Merchant select merchant, sum(merchFraud) as `merchCountFraud`, sum(merchNonFraud) as `merchCountNonFraud` from (select merchant, case when fraud='1' then 1 else 0 end as `merchFraud`, case when fraud='0' then 1 else 0 end as `merchNonFraud` from `dfs`.`sample.json`) group by merchant limit 5; +-----------+------------------+---------------------+ | merchant | merchCountFraud | merchCountNonFraud | +-----------+------------------+---------------------+ | 6998 | 11 | 121 | | 1269 | 8 | 130 | | 4286 | 1 | 116 | | 2371 | 7 | 124 | | 4545 | 4 | 133 | +-----------+------------------+---------------------+
  • 8. Drill UDF (Java) to calculate Log-Likelihood
      public void eval() {
        // Inputs, in the order passed on slide 9:
        //   n11 / n21 = fraud / non-fraud counts at this merchant
        //   n1t / n2t = fraud / non-fraud counts overall
        float ll = (float) 0.0;
        int n12 = n1t.value - n11.value;   // fraud away from this merchant
        int n22 = n2t.value - n21.value;   // non-fraud away from this merchant
        int nt1 = n11.value + n21.value;   // total at this merchant
        int nt2 = n12 + n22;               // total elsewhere
        int nt = nt1 + nt2;                // grand total

        // calculate LL for non-zero elements: observed * log(observed / expected)
        // (the leading factor of 2 from the LL formula is not applied here)
        if (n11.value > 0) { ll += n11.value * Math.log(n11.value / ((float) n1t.value * nt1 / nt)); }
        if (n21.value > 0) { ll += n21.value * Math.log(n21.value / ((float) n2t.value * nt1 / nt)); }
        if (n12 > 0) { ll += (float) n12 * Math.log(n12 / ((float) n1t.value * nt2 / nt)); }
        if (n22 > 0) { ll += (float) n22 * Math.log(n22 / ((float) n2t.value * nt2 / nt)); }

        // if the merchant's fraud rate is below the fraud rate elsewhere, set LL to zero
        if (n11.value / (float) (n11.value + n21.value) < (n12 / (float) (n12 + n22))) ll = 0;
        out.value = ll;
      }
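
The eval() body above depends on Drill's holder types, so it cannot be run on its own. As a rough check of the arithmetic, here is a standalone sketch of the same calculation in plain Java. The class and method names are hypothetical, and it uses double precision throughout, which is an assumption that differs from the float accumulation in the UDF, so small differences in the last digits are expected.

```java
/** Hypothetical standalone version of the slide-8 calculation, outside Drill. */
final class LogLikelihoodSketch {

    /** One term observed * ln(observed / expected); an empty cell contributes 0. */
    static double term(double observed, double expected) {
        return observed > 0 ? observed * Math.log(observed / expected) : 0.0;
    }

    /**
     * n11 / n21: fraud / non-fraud counts at the merchant.
     * n1t / n2t: fraud / non-fraud counts overall.
     */
    static double score(int n11, int n21, int n1t, int n2t) {
        int n12 = n1t - n11;        // fraud away from this merchant
        int n22 = n2t - n21;        // non-fraud away from this merchant
        double nt1 = n11 + n21;     // total at this merchant
        double nt2 = n12 + n22;     // total elsewhere
        double nt = nt1 + nt2;      // grand total

        // Same floor as the UDF: no score when the merchant's fraud rate
        // is below the fraud rate elsewhere.
        if (n11 / nt1 < n12 / nt2) {
            return 0.0;
        }
        // As in the UDF, the factor of 2 from the LL formula is not applied.
        return term(n11, n1t * nt1 / nt)
             + term(n21, n2t * nt1 / nt)
             + term(n12, n1t * nt2 / nt)
             + term(n22, n2t * nt2 / nt);
    }

    public static void main(String[] args) {
        // Counts for merchant 5902 as listed on slide 10.
        System.out.println(score(16, 95, 5000, 95000));
        // Counts for merchant 4286 from slide 7: fraud rate below average.
        System.out.println(score(1, 116, 5000, 95000));
    }
}
```

For merchant 5902 the result should land close to the 7.03 reported on slide 10, and merchant 4286 is floored to zero because only 1 of its 117 transactions is fraudulent.
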
  • 9. Putting it all together
      select MERCH.merchant,
             MERCH.merchCountFraud as `n11`,
             MERCH.merchCountNonFraud as `n21`,
             COUNTS.countFraud as `n1dot`,
             COUNTS.countNonFraud as `n2dot`,
             loglikelihood(cast(MERCH.merchCountFraud as INT),
                           cast(MERCH.merchCountNonFraud as INT),
                           cast(COUNTS.countFraud as INT),
                           cast(COUNTS.countNonFraud as INT)) as `logLike`
      from (
        select 1 as `dummy`, merchant,
               sum(merchFraud) as `merchCountFraud`,
               sum(merchNonFraud) as `merchCountNonFraud`
        from (select merchant,
                     case when fraud='1' then 1 else 0 end as `merchFraud`,
                     case when fraud='0' then 1 else 0 end as `merchNonFraud`
              from `dfs`.`sample.json`)
        group by merchant) `MERCH`
      JOIN (
        select 1 as `dummy`,
               sum(totalFraud) as `countFraud`,
               sum(totalNonFraud) as `countNonFraud`
        from (select case when fraud='1' then 1 else 0 end as `totalFraud`,
                     case when fraud='0' then 1 else 0 end as `totalNonFraud`
              from (select distinct acct,fraud from `dfs`.`sample.json`))) `COUNTS`
      on MERCH.dummy=COUNTS.dummy
      ORDER by loglike desc limit 10;
  • 10. Output from Previous Query…
      +-----------+------+------+--------+--------+---------------------+
      | merchant  | n11  | n21  | n1dot  | n2dot  | logLike             |
      +-----------+------+------+--------+--------+---------------------+
      | 5902      | 16   | 95   | 5000   | 95000  | 7.0296311378479     |
      | 4666      | 17   | 118  | 5000   | 95000  | 5.880885601043701   |
      | 3486      | 16   | 107  | 5000   | 95000  | 5.8762335777282715  |
      | 7961      | 16   | 108  | 5000   | 95000  | 5.793434143066406   |
      | 9182      | 16   | 110  | 5000   | 95000  | 5.631403923034668   |
      | 7114      | 13   | 81   | 5000   | 95000  | 5.324999809265137   |
      | 2127      | 16   | 115  | 5000   | 95000  | 5.222985744476318   |
      | 1462      | 16   | 115  | 5000   | 95000  | 5.222985744476318   |
      | 2994      | 14   | 94   | 5000   | 95000  | 5.113578796386719   |
      | 5770      | 16   | 117  | 5000   | 95000  | 5.064565181732178   |
      +-----------+------+------+--------+--------+---------------------+
  • 11. Breaking Breaches
      • Real-life example
      • SQL output is processed into histogram
      • Tableau chart shows number of merchants per Breach score
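
The slides do not show how the SQL output is turned into a histogram. One simple way to do it, sketched here in plain Java, is to count merchants per fixed-width score bin. The class name, method name, and the bin width of 1.0 are assumptions for illustration; the input values are the ten logLike scores listed on slide 10.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical binning step: number of merchants per breach-score bin. */
final class ScoreHistogram {

    /** Returns bin index -> merchant count, where bin k covers [k*w, (k+1)*w). */
    static Map<Integer, Integer> histogram(double[] scores, double binWidth) {
        Map<Integer, Integer> bins = new TreeMap<>();
        for (double score : scores) {
            int bin = (int) Math.floor(score / binWidth);
            bins.merge(bin, 1, Integer::sum);
        }
        return bins;
    }

    public static void main(String[] args) {
        // logLike column from slide 10 (top ten merchants only).
        double[] scores = {
            7.0296311378479, 5.880885601043701, 5.8762335777282715,
            5.793434143066406, 5.631403923034668, 5.324999809265137,
            5.222985744476318, 5.222985744476318, 5.113578796386719,
            5.064565181732178,
        };
        histogram(scores, 1.0).forEach((bin, count) ->
                System.out.println("score in [" + bin + ", " + (bin + 1) + "): "
                        + count + " merchant(s)"));
    }
}
```

With a width of 1.0, nine of the ten merchants fall in the bin starting at 5 and one falls in the bin starting at 7. A real chart would be built from all merchants rather than only the top ten.
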
  • 12. Appendix
  • 13. Additional Info
      • Location of Code/Data Repository
          – https://github.com/joebluems/BreachDetection
      • Link to Blog on Breach Detection
          – https://www.mapr.com/blog/identify-your-data-breach-apache-drill
      • A little more on Log-Likelihood
          – http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
      • Drill
          – Documentation: http://drill.apache.org/docs/
          – UDFs: https://drill.apache.org/docs/deploying-and-using-a-hive-udf/
          – Code for sample UDF: https://github.com/viadea/HiveUDF

Editor's Notes

  1. Depends on size and overlap: significance is measured as overlap beyond what is expected. 1 vs. 2: both are rare items, so we wouldn't expect much overlap, yet the overlap is total (drawn slightly askew to show both circles). 3 vs. 4: popular items, so we expect a higher amount of overlap. These calculations can be distributed (MapReduce, Mahout, Spark, etc.).