SlideShare uma empresa Scribd logo
1 de 47
Fuzzy Table  Distributed Fuzzy Matching Database Ed Kohlwey [email_address]
Session Agenda ,[object Object],[object Object],[object Object],[object Object],[object Object]
Fuzzy Matching?
What is Fuzzy Matching? *Euclidean Distance in this example These images are very similar, but obliviously not the same. To find image #2 given image #1, some sort of fuzzy matching technique needs to be used Images from Flickr; Licensed under Creative Commons  http://www.flickr.com/photos/mdpettitt/455527136/sizes/l/in/photostream/ http://www.flickr.com/photos/mdpettitt/455539917/sizes/l/in/photostream/ Distance Function* 31.46 Feature Extraction  & Normalization Feature Extraction  & Normalization 1 2 Start with some multimedia image/voice/audio/video/etc Create a Vector or Matrix of doubles
How Is Fuzzy Matching Being Used Today?
Why Do We Care? ,[object Object],[object Object],[object Object]
Biometrics – A Fuzzy Matching Problem Same Person? Lifted From A Crime Scene Law Enforcement Database
Biometrics – Example *Euclidean Distance in this example Distance Function* 2.41 Feature Extraction  & Normalization Feature Extraction  & Normalization 1 2 Query Biometrics Database Create a Vector or Matrix of doubles
The Big Data Problem
Growth of Multimedia Databases ,[object Object],[object Object],http://techcrunch.com/2009/04/07/who-has-the-most-photos-of-them-all-hint-it-is-not-facebook/ http://ksudigg.wetpaint.com/page/YouTube+Statistics http://techcrunch.com/2009/04/28/as-youtube-passes-a-billion-unique-us-viewers-hulu-rushes-into-third-place/ ,[object Object],[object Object]
Growth of Biometric Databases ,[object Object],[object Object],[object Object],*  US-VISIT: The world’s largest biometric application.  William Graves. **  http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm ***  http://www.business-standard.com/india/news/national-population-register-to-start-biometrics-data-collectiondec/399135/ ****  http://www.findbiometrics.com/articles/i/5220/ ***** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx ,[object Object],[object Object],[object Object],US-VISIT
Biometric Databases are a Big (Data) Problem ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],= 1 – 2 PB ,[object Object],= 2 – 27 TB
A Scalable Solution
Hadoop and Multimedia Databases ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Fuzzy Table: Large-scale, Low Latency, Fuzzy Matching Database ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Fuzzy Table Architecture
Bulk Binning and Real-time Classification * Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot
Fuzzy Table: Bulk Data Processing Component ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],*Efficient Search and Retrieval in Biometric Databases by Amit Mhatre, Srinivas Palla, Sharat Chikkerur and Venu Govindaraju * Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot
Procedure
Fuzzy Table: Data Storage and Bins ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Low Latency Component ,[object Object],[object Object],[object Object],[object Object],[object Object]
Fuzzy Table Query
Fuzzy Table Query
Fuzzy Table Query
Fuzzy Table Query
Fuzzy Table Query
Fuzzy Table Query
Fuzzy Table Query
Fuzzy Table Query
Fuzzy Table: Optimizations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Performance
Performance and Scalability Testing On EC2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Average Query Times # Of Data Servers Time To Respond (ms)
Average Query Times # Of Data Servers Time To Respond (ms) Linear Scalability to ~ 7 Nodes Lower limit due to I/O latencies
Longest Query Times # Of Data Servers Time To Respond (ms) Frequent Namenode access + large  number of DFS clients begins to erode performance
Shortest Query Times # Of Data Servers Time To Respond (ms) ~500 ms
EC2 Results Discussion ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Performance and Scalability (Local) ,[object Object],[object Object],[object Object]
Caching Performance # Threads Polling The Master Server Average Response Time (ns) Major discrepency, grows with load
Conclusion & Future Work ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Contact Information – Cloud Computing Team Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)617-3523  [email_address] Jesse Yates Consultant @jason_trost @ekohlwey @jesse_yates @mikeridley Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)543-4611 [email_address] Michael Ridley Associate Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)543-4400 [email_address] Jason Trost Associate Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)821-8000  [email_address] Edmund Kohlwey Senior Consultant Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)821-8000  [email_address] Robert Gordon Associate
Thanks ,[object Object],[object Object]
Questions
Questions?
Appendix
Technologies Used ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Fuzzy Table: Low Latency Fuzzy Matching Component Details ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Mais conteúdo relacionado

Mais procurados

Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
yhadoop
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
Alluxio, Inc.
 

Mais procurados (20)

Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
How to Build a Cloud Native Stack for Analytics with Spark, Hive, and Alluxio...
How to Build a Cloud Native Stack for Analytics with Spark, Hive, and Alluxio...How to Build a Cloud Native Stack for Analytics with Spark, Hive, and Alluxio...
How to Build a Cloud Native Stack for Analytics with Spark, Hive, and Alluxio...
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
HDFS Issues
HDFS IssuesHDFS Issues
HDFS Issues
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
Mumak
MumakMumak
Mumak
 
Powering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big DataPowering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big Data
 
Hadoop Architecture in Depth
Hadoop Architecture in DepthHadoop Architecture in Depth
Hadoop Architecture in Depth
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
 
RuG Guest Lecture
RuG Guest LectureRuG Guest Lecture
RuG Guest Lecture
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 

Semelhante a Nov 2010 HUG: Fuzzy Table - B.A.H

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
nzhang
 

Semelhante a Nov 2010 HUG: Fuzzy Table - B.A.H (20)

Hadoop World 2010 - BAH - Fuzzy Table
Hadoop World 2010 - BAH - Fuzzy TableHadoop World 2010 - BAH - Fuzzy Table
Hadoop World 2010 - BAH - Fuzzy Table
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Bigdata
BigdataBigdata
Bigdata
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Bigdata overview
Bigdata overviewBigdata overview
Bigdata overview
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Analytics with unified file and object
Analytics with unified file and object Analytics with unified file and object
Analytics with unified file and object
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBench
 
Data Deduplication: Venti and its improvements
Data Deduplication: Venti and its improvementsData Deduplication: Venti and its improvements
Data Deduplication: Venti and its improvements
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Big data
Big dataBig data
Big data
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
An Efficient Approach to Manage Small Files in Distributed File Systems
An Efficient Approach to Manage Small Files in Distributed File SystemsAn Efficient Approach to Manage Small Files in Distributed File Systems
An Efficient Approach to Manage Small Files in Distributed File Systems
 
big data and hadoop
 big data and hadoop big data and hadoop
big data and hadoop
 

Mais de Yahoo Developer Network

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
Yahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
Yahoo Developer Network
 

Mais de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Último

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Nov 2010 HUG: Fuzzy Table - B.A.H

  • 1. Fuzzy Table Distributed Fuzzy Matching Database Ed Kohlwey [email_address]
  • 2.
  • 4. What is Fuzzy Matching? *Euclidean Distance in this example These images are very similar, but obliviously not the same. To find image #2 given image #1, some sort of fuzzy matching technique needs to be used Images from Flickr; Licensed under Creative Commons http://www.flickr.com/photos/mdpettitt/455527136/sizes/l/in/photostream/ http://www.flickr.com/photos/mdpettitt/455539917/sizes/l/in/photostream/ Distance Function* 31.46 Feature Extraction & Normalization Feature Extraction & Normalization 1 2 Start with some multimedia image/voice/audio/video/etc Create a Vector or Matrix of doubles
  • 5. How Is Fuzzy Matching Being Used Today?
  • 6.
  • 7. Biometrics – A Fuzzy Matching Problem Same Person? Lifted From A Crime Scene Law Enforcement Database
  • 8. Biometrics – Example *Euclidean Distance in this example Distance Function* 2.41 Feature Extraction & Normalization Feature Extraction & Normalization 1 2 Query Biometrics Database Create a Vector or Matrix of doubles
  • 9. The Big Data Problem
  • 10.
  • 11.
  • 12.
  • 14.
  • 15.
  • 17. Bulk Binning and Real-time Classification * Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot
  • 18.
  • 20.
  • 21.
  • 30.
  • 32.
  • 33. Average Query Times # Of Data Servers Time To Respond (ms)
  • 34. Average Query Times # Of Data Servers Time To Respond (ms) Linear Scalability to ~ 7 Nodes Lower limit due to I/O latencies
  • 35. Longest Query Times # Of Data Servers Time To Respond (ms) Frequent Namenode access + large number of DFS clients begins to erode performance
  • 36. Shortest Query Times # Of Data Servers Time To Respond (ms) ~500 ms
  • 37.
  • 38.
  • 39. Caching Performance # Threads Polling The Master Server Average Response Time (ns) Major discrepency, grows with load
  • 40.
  • 41. Contact Information – Cloud Computing Team Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)617-3523 [email_address] Jesse Yates Consultant @jason_trost @ekohlwey @jesse_yates @mikeridley Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)543-4611 [email_address] Michael Ridley Associate Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)543-4400 [email_address] Jason Trost Associate Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)821-8000 [email_address] Edmund Kohlwey Senior Consultant Booz Allen Hamilton Inc. 134 National Business Parkway. Annapolis Junction, Maryland 20701 (301)821-8000 [email_address] Robert Gordon Associate
  • 42.
  • 46.
  • 47.

Notas do Editor

  1. Fuzzytable is a distributed, real-time database for biometrics and multimedia
  2. What is fuzzy matching How does it relate to hadoop and big data Our solution and how it works Performance testing and results And finally take questions
  3. Pause
  4. An operation that determines how similar two objects are to each other Lots of distance measures MPEG 7 Standard Color histograms Edge histograms Most frequent color How are colors distributed
  5. Shazam – Music searching service Google Goggles – Image search service from Google Face.com – Automatic image tagging for facebook
  6. We’re strategy + technology consulting Biggest client is US government Government has a lot of fuzzy data
  7. Example from the security market Evidence from a crime scene probably won’t perfectly match the record in your database
  8. It turns out you can perform the same type of analysis on biometric data
  9. Fuzzy data is growing in the private sector A few data points Assuming an image is around 300 k, Facebook will have about 8 Exabytes of images
  10. Fuzzy data is also growing in the public sector Governments are applying biometric databases everywhere Social services Border security Visas Criminal investigations
  11. These databases are big Must be fast Must support complex online operations First figure is estimate of raw data storage Second is estimate of metadata and template storage
  12. HDFS Opens the doors to storing more and more raw images and at higher resolutions MapReduce Easy to test and deploy new algorithms against all data at scale Map Reduce can be used for batched searching where latency doesn’t matter, but what about low latency searching…?
  13. Things that differentiate this solution Scales linearly Real-time retrieval Highly parallel Cheap
  14. Overall architecture architecture Built on hadoop core components DON’T break down beyond two top level components
  15. Bulk processing organizes data, constraints search space Real time retrieval queries database, presents response
  16. Clustering Produces bins Produces bin metadata Records are stored in HDFS Map/Reduce used for bulk processing tasks
  17. Entire pipeline Shows complexity of whole procedure Blue boxes in bulk processing area are all implemented in map/reduce
  18. Use HDFS structure to express database organization Focus on simplicity in implementation; Chunks are limited to block size, makes determining data locality easy Reliance on HDFS load balancing to distribute data Preserving data local execution
  19. Draw audience attention to arrows The low latency component consists of three main parts Client – submits queries for Keys and get back {Key, Value} pairs Master Server – serve metadata about which Data Servers host which bins Data Servers – Actually perform fuzzy matching searches
  20. First, “query record” is submitted to master
  21. Master determines which bins contain similar records
  22. Master determines which servers host the relevant bins
  23. Master returns bin/server metadata
  24. Client queries servers which host relevant data (in this case, data in the red bin)
  25. Data servers search their chunks
  26. Data servers return results in real time. NEXT: Optimizations
  27. Optimizations Metadata caching; db structure is expressed in HDFS; this is a bottleneck Replication and speculative execution Data locality
  28. EC2 Used for performance testing 1 tb of input data Ran series of tests over low-latency component
  29. This shows results Pause before next slide
  30. Application performance scales linearly to a point I/O inefficiencies place lower bound on scalability
  31. More evidence of namenode issues
  32. Very short query times are achievable
  33. Summary: application scales well Query 1TB of images in 500 ms possible Simple I/O optimizations can make this system faster + more robust
  34. This is a difficult problem We presented a scalable solution Provides look at innovative real-time applications for Hadoop ecosystem
  35. This is everyone who worked on the project
  36. Special thanks Lalit, former team member Brandyn White – UMD computer vision researcher