Low-Latency, Web-scale Fraud
Prevention with Samza and Friends
Edi Bice
ebice@ebay.com
Senior Data Scientist at eBay Enterprise leading
R&D efforts in applying machine learning to
fraud prevention and elsewhere.
Commerce is getting more convenient and more complex, and so is fraud. To
keep up, fraud prevention solutions need to process a lot more data:
• Older Data
• Looking back a lot further in time
• Older data is not effective excuse – home for the holidays?
• Wider Data
• Using all available data sources
• How wide can customer name possibly be?
• Richer Data
• Social/unstructured data – people, places, interests
• Connected data – who shipped to whom, where; email, devices, IP addresses
• Faster Data
• Clickstream data – website click patterns
Modern Fraud Prevention Architecture Requirements
• Web scale capable (horizontal scaling using commodity hardware)
• handle more actions and data for each user
• handle more users and more volume from each user
• handle more customers of all sizes (lowest processing cost)
• Low latency (milliseconds not hours)
• card present, digital goods, gift cards, store pickup (in-store online shopping!?)
• e-commerce physical goods? – no teleporting yet so speed up what we can
• process customer interactions in real time (personalization, loyalty, shopping experience)
• dynamic order process (identification, authentication, tender presentation)
• Fault tolerance
• Commodity hardware is not without faults
• Expect and design for routine failures – more like shift changes, or relay races
Preventing fraud is all about detecting abnormal behavior.
Normal behavior is not normal – we are all normal in our
own abnormal special ways.
• Typical customer profiling calculations
• Transaction velocity (#txns_day) and change (#txns_day_1days/#txns_day_10days)
• Amount velocity ($txns_day) and change ($txns_day_1days/$txns_day_10days)
• Typical implementation and technologies
1. Define sliding window interval (7 days, a month, 6 months?)
2. For each live txn pull matching txns (card, ...) from single SQL DB within that sliding window
3. Loop over pulled transactions filtering based on timestamp to calculate change over sub-windows
• Issues, Problems, Solutions?!
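The typical sliding-window calculation described in steps 1 to 3 can be sketched roughly as follows. This is only an illustration: the function name `velocity_change`, the in-memory list standing in for the SQL query result, and the reading of "change" as a ratio of per-day rates over a 1-day and a 10-day sub-window are assumptions, not the deck's actual implementation.

```python
from datetime import datetime, timedelta


def velocity_change(txn_times, now, short_days=1, long_days=10):
    """Ratio of txns/day in the short sub-window to txns/day in the long one.

    txn_times stands in for the timestamps pulled from the database for one
    card (step 2); the loop below is the timestamp filtering of step 3.
    Returns None when the long window holds no transactions.
    """
    short_start = now - timedelta(days=short_days)
    long_start = now - timedelta(days=long_days)

    short_count = 0
    long_count = 0
    for t in txn_times:
        if long_start < t <= now:
            long_count += 1
            if short_start < t:
                short_count += 1

    if long_count == 0:
        return None
    return (short_count / short_days) / (long_count / long_days)


now = datetime(2015, 8, 5, 12, 0)
pulled = [
    now - timedelta(hours=1),
    now - timedelta(hours=5),
    now - timedelta(hours=20),
    now - timedelta(days=3),
    now - timedelta(days=8),
    now - timedelta(days=30),  # outside the 10-day window, ignored
]
change = velocity_change(pulled, now)
```

Every live transaction repeats the pull and the loop, and anything older than the window is invisible to the profile, which is the cost the "Issues, Problems" bullet points at.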
Streaming Analytics
CName    | Date   | $    | ShipAddr    | … | CTxns | CustAvgAmt                 | TxAmt_AvgAmt_Ratio | Shipping Addr Txns | Shipping Addr Avg Amount
Edi Bice | 8/3/15 | 50   | 123 Main St | … | 1     | 50 = (50 + 0) / 1          | NA                 | 1                  | 50
Edi Bice | 8/3/15 | 100  | 123 Main St | … | 2     | 75 = (100 + 50*1) / 2      | 2.0 = 100 / 50     | 2                  | 75
Edi Bice | 8/4/15 | 150  | 123 Main St | … | 3     | 100 = (150 + 75*2) / 3     | 2.0 = 150 / 75     | 3                  | 100
Edi Bice | 8/5/15 | 1500 | 999 Wall St | … | 4     | 450 = (1500 + 100*3) / 4   | 15.0 = 1500 / 100  | 1                  | 1500
New Avg Amt = (New Txn Amt + Curr Avg Amt * Curr Num Txns) / (Curr Num Txns + 1)
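A minimal sketch of the running-average update above, replaying the four example transactions. The names `CustomerProfile`, `apply_txn`, `process`, and the `profiles` dictionary are hypothetical; the dictionary only stands in for whatever keyed store a streaming job would use.

```python
from dataclasses import dataclass


@dataclass
class CustomerProfile:
    """Summary statistics kept per customer instead of raw transaction history."""
    num_txns: int = 0
    avg_amt: float = 0.0


def apply_txn(profile, amount):
    """Fold one transaction into the profile.

    Returns TxAmt_AvgAmt_Ratio, the new amount divided by the average
    *before* this transaction, or None when no prior average exists.
    """
    ratio = amount / profile.avg_amt if profile.avg_amt else None
    # New Avg Amt = (New Txn Amt + Curr Avg Amt * Curr Num Txns) / (Curr Num Txns + 1)
    profile.avg_amt = (amount + profile.avg_amt * profile.num_txns) / (profile.num_txns + 1)
    profile.num_txns += 1
    return ratio


profiles = {}  # hypothetical in-memory stand-in for a keyed state store


def process(customer, amount):
    """Look up (or create) the customer's profile and update it in place."""
    profile = profiles.setdefault(customer, CustomerProfile())
    return apply_txn(profile, amount)
```

Only two numbers per customer are stored, so the cost of each update stays constant however long the customer's history grows.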
job pipelines
Kafka, Samza, and the Unix philosophy of distributed data by Martin Kleppmann
Apache Kafka
• Distributed, scalable, publish-subscribe messaging system
• Persistent, high-throughput messaging
• Designed for real time activity stream data processing
PreCog Samza Job Pipeline
Manifold (1-in-N-out) jobs
Risk-by-Y calc jobs
X-by-Y calc jobs
Assembly jobs
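As a rough illustration of what a manifold (1-in-N-out) job does, the sketch below re-emits each incoming order once per key field, choosing the partition from a hash of the key so that all orders sharing a key reach the same downstream task. The topic naming scheme, the field names, the partition count, and the use of CRC32 as the hash are all assumptions made for the example; they are not taken from the PreCog pipeline or from Kafka's own partitioner.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for every keyed topic


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def manifold(order, key_fields):
    """Emit one (topic, partition, key, order) tuple per key field."""
    out = []
    for field in key_fields:
        key = str(order[field])
        topic = "orders_by_" + field  # hypothetical topic naming scheme
        out.append((topic, partition_for(key), key, order))
    return out


first = manifold(
    {"bill_email": "a@example.com", "ship_addr": "123 Main St", "amount": 50},
    ["bill_email", "ship_addr"],
)
second = manifold(
    {"bill_email": "a@example.com", "ship_addr": "999 Wall St", "amount": 1500},
    ["bill_email", "ship_addr"],
)
```

Both orders share a billing email, so their `orders_by_bill_email` copies carry the same partition number, while the copies keyed by shipping address are routed independently.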
Fault-Tolerant Local State
[Diagram: Samza job partition 0 and Samza job partition 1, each with a local RocksDB store, replicate writes to a durable changelog in Kafka]
• Embedded key-value: very fast
• Machine dies ⇒ local key-value store is lost
• Solution: replicate all writes to Kafka!
• Machine dies ⇒ restart on another machine
• Restore key-value store from changelog
• Changelog compaction in the background
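The changelog idea can be illustrated with a toy model. In this sketch a Python dict stands in for the local RocksDB store and a plain list stands in for the Kafka changelog topic; the class and function names are hypothetical and none of this is Samza's or Kafka's API.

```python
class ChangelogStore:
    """Local key-value store that appends every write to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # list of (key, value), stand-in for a Kafka topic
        self.local = {}             # stand-in for the embedded local store

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))  # replicate the write

    def get(self, key, default=None):
        return self.local.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state on a fresh machine by replaying the changelog."""
        store = cls(changelog)
        for key, value in changelog:
            store.local[key] = value  # later entries overwrite earlier ones
        return store


def compact(changelog):
    """Keep only the latest value per key, ordered by each key's last write."""
    latest = {}
    for key, value in changelog:
        latest.pop(key, None)  # move the key to the end on every rewrite
        latest[key] = value
    return list(latest.items())


log = []
store = ChangelogStore(log)
store.put("edi", 1)
store.put("edi", 2)
store.put("bob", 5)
store.put("edi", 3)

recovered = ChangelogStore.restore(log)             # "machine dies, restart elsewhere"
recovered_small = ChangelogStore.restore(compact(log))
```

Restoring from the full log and from the compacted log yields the same state, which is why compaction can run in the background without affecting recovery.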
Samza Jobs on Hadoop 2.0 (YARN)
[Diagram: Machines 1, 2, and 3, each running a YARN Node Manager and a Kafka Broker; a Samza App Master; and Samza TaskRunners for Partitions 1, 2, and 3, each calling aStreamTask:process()]
Monitoring Samza: Metrics and More
Samza JMX metrics → jmxtrans → OpenTSDB/HBase → Grafana
Questions?
http://www.ebayenterprise.com/
ebice@ebay.com
@edi_bice
https://www.linkedin.com/in/ebice


Editor's Notes

  1. Hi everyone, this is your last chance to get out in case you have the wrong room and session. Okay, relax. Don’t be afraid if “fraud,” “prevention,” and “friends” are the only words you’re familiar with. My goal today is to make this accessible to all of you.
  2. My name is Edi Bice and I’m a senior data scientist at eBay Enterprise, a former division of eBay and different from eBay Marketplaces. Together with Innotrac, we are the world’s largest omnichannel commerce provider and a partner to the world’s most iconic brands. We provide everything from retail order management, payments, fraud, and tax to fulfillment and transportation, store fulfillment, customer service, and more.
  3. Older Data – Looking back a lot further in time. Older data is not effective excuse – home for the holidays? Profiling infrequent behavior requires longer timeframes for sufficient data points.
     Wider Data – Exploiting all available data. Still, how wide can that be!? How wide is Edi Bice? 7 characters, 14 bytes worth? Infinite! Compressed: my purchase history, web log, FB posts I liked, tweets I retweeted, things mentioned there, and so on, not very compressed.
     Richer Data – Social/public data: people, places, interests graph – who is friends/related with/to whom, lived/studied/travelled where. Connected order data: who shipped to whom, where; tender, email, devices, IP addresses. Unstructured data: FB posts, tweets, Pinterest pins.
     Faster Data – Clickstream data: browsing, researching specific product, reading reviews, typing characteristics, etc.
  4. Web scale capable – what is web-scale? Google, Facebook, … But it’s more about the ability to scale up, or down, than the size itself. “By 2017, Web-scale IT … in 50 percent of global enterprises … according to Gartner, Inc.” Low Latency – Fault Tolerance – expected failure, redundancy, distribution of responsibilities
  5. … we are all normal in our own abnormal special ways. So we can’t define one average and apply it to all customers; we need one for each customer. But not just that: we need averages over periods of time, at the same or similar merchants, and so on, using transaction number/amount velocity averages to detect likely aberrant behavior. Typical implementation: Why a sliding window? We can’t store or pull an entire, unlimited history in or from the DB, so we restrict to recent history. Why a single SQL DB? Who can afford Oracle RAC?! The sliding window restricts calculation of the normal customer profile. How often do you shop at X? Using card Y? Shipping to your friend Z? Oops, we let it slide out of the sliding window!! How can we profile properly, efficiently, and quickly? At large scale?
  6. Assembly line processing! The Venetian Arsenal, dating to about 1104, operated similarly to a production line. Ships moved down a canal and were fitted by the various shops they passed. Ransom Olds patented the assembly line concept, which he put to work in his Olds Motor Vehicle Company factory in 1901. In 1913, Henry Ford perfected the assembly line by installing driven conveyor belts that could produce a Model T in 93 minutes.
  7. Streaming Analytics is assembly line processing! Order comes through, we look up CTxns for Edi, find it’s 0, increment by 1 and store back. Look up CustAvgAmt for Edi, find it’s 0, calculate the new average and store. Another order comes through, we look up … and so on. We don’t store and access all of Edi’s transactions in order to compute total number of transactions and average transaction amount. “summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible.” Average, standard deviation, skewness etc.
  8. From a 1964 internal Bell Labs memo by Doug McIlroy: “We should have some ways of connecting programs like [a] garden hose – screw in another segment when it becomes necessary to massage data in another way.” Unix philosophy (1978): 1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features.” 2. Expect the output of every program to become the input to another, as yet unknown, program. A Samza job is like an assembly line worker/robot, except that it can pull work items from many (input) conveyor belts (streams) and place completed work on many (output) conveyor belts (streams). Samza tasks are like worker clones: if one task per job is enough, good; otherwise add some more (up to the total number of stream partitions).
  9. Kafka – read conveyor belts. The Kafka ecosystem at LinkedIn is sent over 800 billion messages per day, which amounts to over 175 terabytes of data. To handle all these messages, LinkedIn runs over 1100 Kafka brokers/nodes. How does Kafka do this?
     • Partitioned topic/stream (how could Henry Ford produce > 1/93 Model T per min?)
     • Sequential disk IO (append only, no updates as in most databases)
     • Multiple producers, multiple consumers per topic
     • Distributed (team work – different Kafka nodes handling different partitions)
     • Track your own work (no manager bottleneck)
  10. Live orders are streamed to the “orders” Kafka topic/stream (conveyor belt). The OrderManifold task partitions and routes orders to several Kafka topics, each keyed by a specific order field, ensuring that a worker sees all orders for a given email, for example. The OrderByBillEmail task receives, calculates, and stores all respective email statistics. JoinBillEmailFeats receives input from two streams, “orders” and “order_by_billemail_feats”, joins them, and sends them down the assembly line. ReviewedOrderRouter reads from the “reviewed_orders” stream and sends fraudulent ones to the “reviewed_fraud” stream and all others to the “reviewed_notfraud” stream. FraudOrderManifold is similar to OrderManifold. FraudByBillEmailRisk calculates risk statistics for email and email domain, and so on. At the end of the assembly line the car looks like this.
  11. Graphical view of the sample JSON output. Each + node is expandable and looks similar to billaddr_feats. Each + node is the product of one Samza job/task (assembly line worker) – specialization of labor principle. Each + node is appended to the work-in-progress car by a “join”er worker. Time spent by each task (worker) is tracked to identify slow tasks (workers) – join_billaddr_feats_beg/end in ms (2!)
  12. Local state – instead of a central DB, each Samza job (task) uses a local (Rocks)DB. RocksDB, now open source, was created at Facebook based on LevelDB, which was created at Google. “RocksDB can be used by applications that need low latency database accesses.” Kafka changelog – update RocksDB, write the change to a Kafka stream, and Kafka automatically compacts the changelog stream (throws away all but the latest value for a given variable). Machine dies, restart on another machine. Who, how?
  13. Enter Hadoop 2.0 YARN (Yet Another Resource Negotiator). Launching a Samza job: The Samza client talks to the YARN RM when it wants to start a new Samza job. The YARN RM talks to a YARN NM to allocate space on the cluster for Samza’s App Master. Once the NM allocates space, it starts the Samza AM. The AM asks the YARN RM for 1+ YARN containers to run SamzaContainers (tasks). Once resources for 4 have been allocated, the NMs start the Samza containers. If any fail, YARN starts a new one, on the same machine if possible, or on a different one. Great, so it’s fault tolerant. Is it fail/fool proof?!
  14. It’s a complex stack! Lots can go wrong, so monitoring and alerting are key to proactive maintenance and quick troubleshooting when things do go wrong. There are metrics that let you see how many messages have been processed and sent, the current offset in the input stream partition, and other details; metrics about the JVM (heap size, garbage collection information, threads, etc.); internal metrics of the Kafka producers and consumers; and custom metrics. Low-latency, web-scale metrics monitoring ;-) Samza JMX metrics → jmxtrans → OpenTSDB/HBase → Grafana
  15. Questions? Now, if we have time. Right after the session. Over drinks later on. Shoot me an email.