Search data store for the world's largest
                            biometric identity system


                    Regunath Balasubramanian         Shashikant Soni
                      regunathb@gmail.com      soni.shashikant@gmail.com
                       twitter @regunathb




CONFIDENTIAL: For limited circulation only                                 Slide 1
India
● 1.2 billion residents
   ● 640,000 villages, ~60% live under $2/day
   ● ~75% literacy, <3% pay income tax, <20% banking
   ● ~800 million mobile, ~200-300 mn migrant workers

● Govt. spends about $25-40B on direct subsidies
   ● Residents have no standard identity document
   ● Most programs plagued with ghost and multiple identities causing
     leakage of 30-40%




                                                                        Slide 2
Aadhaar
● Create a common ‘national identity’ for every ‘resident’
   ●Biometric backed identity to eliminate duplicates
   ●‘Verifiable online identity’ for portability
● Applications ecosystem using open APIs
   ●Aadhaar enabled bank account and payment platform
   ●Aadhaar enabled electronics, paperless KYC (Know Your
     Customer)




                                                             Slide 3
Search Requirements
● Multi-attribute query like:
   name contains ‘regunath’ AND city = ‘bangalore’ AND
   address contains ‘J P Nagar’ AND YearOfBirth = ……


● Search 1.2B resident data with photo, history
   ●35 KB - average record size
● Response times in milliseconds
● Open scale out
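
A multi-attribute query like the one above could be expressed in Lucene/Solr query syntax roughly as follows. This is only a sketch: the field names (`name`, `city`, `address`) and the helpers `escape_term` and `build_query` are assumptions for illustration, and "contains" is treated as "every word matches on a tokenized text field", which may not be how the real schema was analyzed.

```python
from urllib.parse import urlencode

# Characters with special meaning in the Lucene/Solr query syntax.
_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(value: str) -> str:
    """Backslash-escape query-syntax characters in a user-supplied value."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in value)


def build_query(contains=None, equals=None) -> str:
    """Combine 'contains' and '=' criteria into one AND-ed query string."""
    clauses = []
    for field, text in (contains or {}).items():
        # 'contains': every whitespace-separated word must match the field.
        words = " AND ".join(escape_term(w) for w in text.split())
        clauses.append(f"{field}:({words})")
    for field, value in (equals or {}).items():
        # '=': exact phrase match on the field.
        clauses.append(f'{field}:"{escape_term(str(value))}"')
    return " AND ".join(clauses)


q = build_query(
    contains={"name": "regunath", "address": "J P Nagar"},
    equals={"city": "bangalore"},
)
# Standard Solr request parameters: q (query), fl (fields to return), rows.
params = urlencode({"q": q, "fl": "id", "rows": 10})
```

Asking the index for only the `id` field keeps the search response small; the full ~35 KB record can then be fetched separately by key.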


                                                         Slide 4
Why MongoDB
● Auto-sharding
● Replication
● Failover
   … Essentially an AP (slaveOk) data store in CAP parlance

● Evolving schema
● Map-Reduce for analysis
● Full text search
   ●Compound (or) multi-keys


                                                              Slide 5
Design

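[Diagram: the Client App sends a query (Name=‘abcde’, Address=‘some place’, Year=1980) to the Search API.
 Step 1: the Search API looks up the Solr indexes (Name: ‘abcde’, Address: ‘some place’, year: 1980).
 Step 2: it reads the matching document from MongoDB: { _id:123456789, name: ‘abcde’, year:1980, ….. }]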



● Read/Search
   ●Sharded Solr indexes for search
   ●Keyed document read from MongoDB
● Write
   ●Eventual consistency (across data sources) driven by
    application
   ●Composite MongoDB-Solr app persistence handler
                                                                         Slide 6
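
The read and write paths on this slide can be sketched with in-memory stand-ins. Everything here is hypothetical: `SearchStore`, its dictionaries and the `flush_index` step are illustrative substitutes for the MongoDB collection, the Solr index and the application-driven synchronization, not the authors' persistence handler.

```python
class SearchStore:
    """Composite handler: a keyed document store plus a separate search index."""

    INDEXED_FIELDS = ("name", "address", "year")

    def __init__(self):
        self.documents = {}  # stand-in for MongoDB: _id -> full document
        self.index = {}      # stand-in for Solr: (field, token) -> set of _id
        self.pending = []    # ids stored but not yet indexed

    def save(self, doc):
        # Write path: persist the full document first and index it later.
        self.documents[doc["_id"]] = doc
        self.pending.append(doc["_id"])

    def flush_index(self):
        # Application-driven eventual consistency: bring the index up to date.
        # Simplification: tokens of an older version of a document are not removed.
        while self.pending:
            doc_id = self.pending.pop(0)
            doc = self.documents[doc_id]
            for field in self.INDEXED_FIELDS:
                if field in doc:
                    for token in str(doc[field]).lower().split():
                        self.index.setdefault((field, token), set()).add(doc_id)

    def search(self, **criteria):
        # Read path, step 1: resolve the criteria to ids using the index only.
        id_sets = []
        for field, value in criteria.items():
            for token in str(value).lower().split():
                id_sets.append(self.index.get((field, token), set()))
        ids = set.intersection(*id_sets) if id_sets else set()
        # Read path, step 2: keyed reads of the full documents.
        return [self.documents[i] for i in sorted(ids)]


store = SearchStore()
store.save({"_id": 123456789, "name": "abcde",
            "address": "some place", "year": 1980})
before = store.search(name="abcde")            # index not yet updated
store.flush_index()
after = store.search(name="abcde", year=1980)  # now visible to search
```

Between `save` and `flush_index` the document can be read by key but not found by search, which is the window of inconsistency the application has to tolerate.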
Implementation and Deployment
   ● Start - 4M records in 2 shards
   ● Current - 250M records in 8 shards (8 x ~2 TB x 3 replicas)
   ● Performance, Reliability & Durability
      ●SlaveOk
      ●getLastError, Write Concern: availability vs durability
           j = journaling
           w = nodes-to-write
   ● Replica-sets / Shards – how?
     [Diagram: three columns, each showing members of replica sets RS 1 and RS 2,
      one config server (Config 1, Config 2, Config 3) and one router.
      Legend: Primary, Secondary, Arbiter.]
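
The availability-versus-durability trade-off behind `j` and `w` can be made concrete by looking at the `getLastError` command document a client sends after a write. This is a minimal sketch: the helper `get_last_error_command` and the two profiles at the end are assumptions for illustration, not settings taken from the deployment.

```python
def get_last_error_command(w=1, j=False, wtimeout_ms=None):
    """Build a getLastError command document.

    w: number of replica-set members that must acknowledge the write,
       or the string "majority".
    j: if True, wait until the write has reached the on-disk journal.
    wtimeout_ms: optional limit on how long to wait for w acknowledgements.
    """
    if isinstance(w, int) and w < 1:
        raise ValueError("w must be >= 1 or 'majority' in this sketch")
    # The command name must be the first key of the document.
    command = {"getLastError": 1, "w": w, "j": bool(j)}
    if wtimeout_ms is not None:
        command["wtimeout"] = int(wtimeout_ms)
    return command


# Hypothetical profiles at the two ends of the trade-off.
favour_availability = get_last_error_command(w=1, j=False)
favour_durability = get_last_error_command(w="majority", j=True,
                                           wtimeout_ms=5000)
```

Raising `w` or setting `j` makes an acknowledged write harder to lose, at the cost of latency and of writes failing or timing out when too few members are reachable.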
                                                                           Slide 7
Monitoring and Troubleshooting
● Monitoring tools evaluated
   ●MMS
   ●munin
● Manual approach - daily ritual
   ●RS, DB, config, router - health and stats
● Problem analysis stats
   ●mongostat, iostat, currentOp, logs
   ●Client connections
● Stats for storage, shards addition
   ●Data file size
   ●Shard data distribution
   ●Replication
                                                Slide 8
Key Learnings on MongoDB
● Indexing 32 fields
   ●Compound indexes
   ●Multi-keys indexes
        {…"indexes" : [{ "email":"john.doe@email.com", "phone":"123456789" }] }
        db.coll.find ({ "indexes.email" : "john.doe@email.com" })
   ●Indexes use b-tree
   ●Many fields to index
   ●Performs well up to 1-2M documents
   ●Best if index fits in memory
● Data replication, RS failover
   ●Rollback when RS goes out of sync
        Manual restore (physical data copy)
        Restarting a very stale node
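
The multi-key pattern above gathers the searchable attributes of a record into one `indexes` array, so that queries address them through a dotted path such as `indexes.email`. The sketch below shows the document shape and emulates the matching rule in plain Python; `with_index_array` and `matches` are hypothetical helpers, and the `name` field is made up for the example.

```python
def with_index_array(record, searchable):
    """Return a copy of record with its searchable fields gathered
    into an 'indexes' array of sub-documents."""
    entry = {key: record[key] for key in searchable if key in record}
    doc = dict(record)
    doc["indexes"] = [entry]
    return doc


def matches(doc, dotted_field, value):
    """Emulate a query such as {"indexes.email": value}: the document
    matches if any element of the array has that key equal to value."""
    array_name, key = dotted_field.split(".", 1)
    return any(
        isinstance(element, dict) and element.get(key) == value
        for element in doc.get(array_name, [])
    )


record = {"name": "John Doe",  # hypothetical field
          "email": "john.doe@email.com", "phone": "123456789"}
doc = with_index_array(record, searchable=("email", "phone"))

found = matches(doc, "indexes.email", "john.doe@email.com")
missed = matches(doc, "indexes.email", "jane.doe@email.com")
```

Because the indexed path runs through an array, MongoDB treats an index on it as a multikey index with one entry per array element, which is part of why index size and memory fit matter as the number of indexed fields grows.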
                                                                            Slide 9
Questions?







CONFIDENTIAL: For limited circulation only                                       Slide 10
