
Free Code Friday - Identify Your Data Breach with Apache Drill

Despite numerous big data methods, fraud has not been completely eradicated. It’s important to score customer transactions to prevent account takeover, but crucial information about where the accounts were compromised may be lurking in plain sight, completely overlooked.

In just a few simple steps, you can analyze your data to find the source of compromise. Join this session of Free Code Friday to hear from Joe Blue, Data Scientist at MapR. You'll learn how to use Apache Drill to analyze massive amounts of semi-structured transactions in seconds with plain SQL, and shut down a breach before it does real damage.


  1. Breach Detection with Apache Drill. © 2015 MapR Technologies. Follow me at @joebluems for link to code.
  2. Breach Happens!
  3. Finding the Source of Compromise*
     [Figure: customer transactions (M-F, Sat.) with a Status column of ✔/✖ marks, across millions of customers and millions of merchant locations]
     * The source of the compromise may not be where the fraudsters use the accounts
  4. Apache Drill
     • https://drill.apache.org
     • “Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage”
     • Write SQL queries to access distributed files without specifying a schema
     • Note: use the backtick in the SQL (not a single quote)

         linux> head -5 sample.json
         {acct:"0",merchant:"6998",fraud:"0"}
         {acct:"0",merchant:"1269",fraud:"0"}
         {acct:"0",merchant:"4286",fraud:"0"}
         {acct:"0",merchant:"2371",fraud:"0"}
         {acct:"0",merchant:"4545",fraud:"0"}

         <drill home>/bin/drill-embedded
         drill> select * from `dfs`.`sample.json` limit 5;
         +-------+-----------+--------+
         | acct  | merchant  | fraud  |
         +-------+-----------+--------+
         | 0     | 6998      | 0      |
         | 0     | 1269      | 0      |
         | 0     | 4286      | 0      |
         | 0     | 2371      | 0      |
         | 0     | 4545      | 0      |
         +-------+-----------+--------+
  5. Scoring Merchants with Log Likelihood

         $$ LL = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} y_{ij} \log\left( \frac{y_{ij}}{m_{ij}} \right) $$

     [Figure: two example 2×2 tables, M1 / NOT M1 and M2 / NOT M2 against FRAUD / NOT FRAUD; values recoverable from the slide: 14.3, 10, 0, 0, 10,000 and 0.9013, 1,000, 1,000, 100,000]
     • Measures how much fraud we observed beyond what should happen randomly
     • Fraud counts alone do not account for the popularity of common merchants
  6. Drill – Count All Frauds / Non-Frauds

         select sum(totalFraud) as `countFraud`,
                sum(totalNonFraud) as `countNonFraud`
         from (
           select case when fraud='1' then 1 else 0 end as `totalFraud`,
                  case when fraud='0' then 1 else 0 end as `totalNonFraud`
           from (
             select distinct acct,fraud from `dfs`.`sample.json`)
         );

         +-------------+----------------+
         | countFraud  | countNonFraud  |
         +-------------+----------------+
         | 5000        | 95000          |
         +-------------+----------------+
  7. Drill – Count Frauds at Each Merchant

         select merchant,
                sum(merchFraud) as `merchCountFraud`,
                sum(merchNonFraud) as `merchCountNonFraud`
         from (select merchant,
                 case when fraud='1' then 1 else 0 end as `merchFraud`,
                 case when fraud='0' then 1 else 0 end as `merchNonFraud`
               from `dfs`.`sample.json`)
         group by merchant
         limit 5;

         +-----------+------------------+---------------------+
         | merchant  | merchCountFraud  | merchCountNonFraud  |
         +-----------+------------------+---------------------+
         | 6998      | 11               | 121                 |
         | 1269      | 8                | 130                 |
         | 4286      | 1                | 116                 |
         | 2371      | 7                | 124                 |
         | 4545      | 4                | 133                 |
         +-----------+------------------+---------------------+
  8. Drill UDF (Java) to calculate Log-Likelihood

         public void eval() {
           float ll = (float) 0.0;
           // n11/n21: fraud/non-fraud counts at this merchant; n1t/n2t: overall totals.
           // Derive the counts away from this merchant and the margins.
           int n12 = n1t.value - n11.value;
           int n22 = n2t.value - n21.value;
           int nt1 = n11.value + n21.value;
           int nt2 = n12 + n22;
           int nt = nt1 + nt2;
           // calculate LL for non-zero elements
           // (the leading factor of 2 from the LL formula is not applied here;
           // it does not change the ranking of merchants)
           if (n11.value > 0) { ll += n11.value * Math.log(n11.value / ((float) n1t.value * nt1 / nt)); }
           if (n21.value > 0) { ll += n21.value * Math.log(n21.value / ((float) n2t.value * nt1 / nt)); }
           if (n12 > 0) { ll += (float) n12 * Math.log(n12 / ((float) n1t.value * nt2 / nt)); }
           if (n22 > 0) { ll += (float) n22 * Math.log(n22 / ((float) n2t.value * nt2 / nt)); }
           // if the fraud rate at this merchant is less than the fraud rate elsewhere, set LL to zero
           if (n11.value / (float) (n11.value + n21.value) < (n12 / (float) (n12 + n22))) ll = 0;
           out.value = ll;
         }
  9. Putting it all together

         select MERCH.merchant,
                MERCH.merchCountFraud as `n11`,
                MERCH.merchCountNonFraud as `n21`,
                COUNTS.countFraud as `n1dot`,
                COUNTS.countNonFraud as `n2dot`,
                loglikelihood(cast(MERCH.merchCountFraud as INT),
                              cast(MERCH.merchCountNonFraud as INT),
                              cast(COUNTS.countFraud as INT),
                              cast(COUNTS.countNonFraud as INT)) as `logLike`
         from (
           select 1 as `dummy`, merchant,
                  sum(merchFraud) as `merchCountFraud`,
                  sum(merchNonFraud) as `merchCountNonFraud`
           from (select merchant,
                   case when fraud='1' then 1 else 0 end as `merchFraud`,
                   case when fraud='0' then 1 else 0 end as `merchNonFraud`
                 from `dfs`.`sample.json` )
           group by merchant) `MERCH`
         JOIN (
           select 1 as `dummy`,
                  sum(totalFraud) as `countFraud`,
                  sum(totalNonFraud) as `countNonFraud`
           from (
             select case when fraud='1' then 1 else 0 end as `totalFraud`,
                    case when fraud='0' then 1 else 0 end as `totalNonFraud`
             from (
               select distinct acct,fraud from `dfs`.`sample.json`)
           )) `COUNTS`
         on MERCH.dummy=COUNTS.dummy
         ORDER by loglike desc
         limit 10;
  10. Output from Previous Query…

         +-----------+------+------+--------+--------+---------------------+
         | merchant  | n11  | n21  | n1dot  | n2dot  | logLike             |
         +-----------+------+------+--------+--------+---------------------+
         | 5902      | 16   | 95   | 5000   | 95000  | 7.0296311378479     |
         | 4666      | 17   | 118  | 5000   | 95000  | 5.880885601043701   |
         | 3486      | 16   | 107  | 5000   | 95000  | 5.8762335777282715  |
         | 7961      | 16   | 108  | 5000   | 95000  | 5.793434143066406   |
         | 9182      | 16   | 110  | 5000   | 95000  | 5.631403923034668   |
         | 7114      | 13   | 81   | 5000   | 95000  | 5.324999809265137   |
         | 2127      | 16   | 115  | 5000   | 95000  | 5.222985744476318   |
         | 1462      | 16   | 115  | 5000   | 95000  | 5.222985744476318   |
         | 2994      | 14   | 94   | 5000   | 95000  | 5.113578796386719   |
         | 5770      | 16   | 117  | 5000   | 95000  | 5.064565181732178   |
         +-----------+------+------+--------+--------+---------------------+
  11. Breaking Breaches
      • Real-life example
      • SQL output is processed into histogram
      • Tableau chart shows number of merchants per Breach score
  12. Appendix
  13. Additional Info
      • Location of Code/Data Repository
        – https://github.com/joebluems/BreachDetection
      • Link to Blog on Breach Detection
        – https://www.mapr.com/blog/identify-your-data-breach-apache-drill
      • A little more on Log-Likelihood
        – http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
      • Drill
        – Documentation: http://drill.apache.org/docs/
        – UDFs: https://drill.apache.org/docs/deploying-and-using-a-hive-udf/
        – Code for sample UDF: https://github.com/viadea/HiveUDF
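
The merchant score from the slides can be tried outside Drill with a small standalone program. The following is a minimal sketch, not the author's UDF: the class name `BreachScore` and the method names `term`, `unscaled`, and `logLikelihood` are hypothetical, and it assumes (consistent with the arithmetic in the UDF slide) that the expected count for each cell of the 2x2 table is row total × column total / grand total. It uses double precision, whereas the UDF accumulates in float.

```java
public class BreachScore {

    // One term of the sum: y * ln(y / m). An empty cell contributes zero.
    public static double term(long observed, double expected) {
        return observed > 0 ? observed * Math.log(observed / expected) : 0.0;
    }

    // Score without the leading factor of 2, mirroring the UDF on the slides.
    // Arguments: fraud and non-fraud counts at the merchant, then overall totals.
    public static double unscaled(long merchFraud, long merchNonFraud,
                                  long totalFraud, long totalNonFraud) {
        long restFraud = totalFraud - merchFraud;          // fraud seen elsewhere
        long restNonFraud = totalNonFraud - merchNonFraud; // non-fraud seen elsewhere
        long atMerchant = merchFraud + merchNonFraud;
        long elsewhere = restFraud + restNonFraud;
        double total = atMerchant + elsewhere;

        // A merchant whose fraud rate is below the rate elsewhere scores zero.
        double merchRate = merchFraud / (double) atMerchant;
        double restRate = restFraud / (double) elsewhere;
        if (merchRate < restRate) {
            return 0.0;
        }

        // Expected count per cell (assumption): row total * column total / grand total.
        return term(merchFraud, totalFraud * atMerchant / total)
             + term(merchNonFraud, totalNonFraud * atMerchant / total)
             + term(restFraud, totalFraud * elsewhere / total)
             + term(restNonFraud, totalNonFraud * elsewhere / total);
    }

    // Score with the factor of 2 from the LL formula on the scoring slide.
    public static double logLikelihood(long merchFraud, long merchNonFraud,
                                       long totalFraud, long totalNonFraud) {
        return 2.0 * unscaled(merchFraud, merchNonFraud, totalFraud, totalNonFraud);
    }

    public static void main(String[] args) {
        // Merchant 5902 from the output slide: 16 fraud, 95 non-fraud.
        System.out.println(unscaled(16, 95, 5000, 95000));
        System.out.println(logLikelihood(16, 95, 5000, 95000));
        // Merchant 4286 (1 fraud, 116 non-fraud) is below the overall fraud rate.
        System.out.println(unscaled(1, 116, 5000, 95000));
    }
}
```

For merchant 5902 the unscaled value should land close to the 7.03 shown in the query output; a small difference is expected because of float versus double arithmetic. Doubling every score leaves the ordering of merchants unchanged, so either variant gives the same top-10 list.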
