Agenda
Challenges
Why a Platform?
Information Extraction: Need, Impact; Research / Evaluations; Approach / Implementation
Information Retrieval: Need, Impact; Research / Evaluations; Approach / Implementation
Challenges
Job Alerts: Over 13 million searches, 3 times a week; Complex Matching (Multiple Filters, Boosts, Sorts)
Resdex: 130K active users daily; 470K searches daily; Over 220 million resumes and growing
Job Search: High QPS 112, 760K searches a day; Near Real-time Indexing; Jobs Refreshed 92 times daily
Product Demands: > 99.99% uptime, Stability, Scalability → User Experience; Varied Functional Requirements (Complexity): NIRM, FN Suggestors, etc.
Turnaround Time: Over 17 applications and growing; About a week to deploy / configure a new one
Why a Platform?
Technical Challenges: Code / Bug Duplication, Reusability; Agility; Product Requirements Drive Platform-Wide Features; SOA, Integration, Business Logic Separation; Comprehensive Documentation; Scalability; Development and QA Time/Cost Reduction
Product Challenges: Turnaround Time; Business Logic Implementation = Configuration
Miscellaneous: Maintenance Cost Reduction; Resource Optimization/Integration (...Cloud); Standardized Reporting / Health Monitoring
Information Extraction
Data/Information Acquisition: Structure Raw Information
Training-based Models for Class Inference: Functional Area Detection
Rule-based Extraction: Nested Funnels/Filter Layers; Regular Expressions
Feedback Loop: Wisdom of Crowd/Collective Intelligence; SAP/SimCV: Capture User Response for Recommendations; Continuous Quality Improvement
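The "nested funnels / filter layers" idea can be sketched as a chain of small functions, each narrowing the input before a final regular expression pulls out a structured value. This is only an illustrative sketch, not the platform's actual extractor: the field being extracted (years of experience), the patterns, and all function names are assumptions.

```python
import re

# Hypothetical patterns, chosen only for illustration.
EXPERIENCE_HINT = re.compile(r"\bexperience\b", re.IGNORECASE)
YEARS_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*\+?\s*(?:years?|yrs?)\b", re.IGNORECASE
)


def split_lines(raw_text):
    """Layer 1: break raw text into non-empty, stripped lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def keep_experience_lines(lines):
    """Layer 2: keep only lines that mention experience."""
    return [line for line in lines if EXPERIENCE_HINT.search(line)]


def extract_years(lines):
    """Layer 3: return the first numeric year count found, or None."""
    for line in lines:
        match = YEARS_PATTERN.search(line)
        if match:
            return float(match.group(1))
    return None


def extract_experience_years(raw_text):
    """Run the layers in order; each one sees only what the previous kept."""
    return extract_years(keep_experience_lines(split_lines(raw_text)))
```

Because layer 2 discards lines without the word "experience", a phrase such as "Graduated 4 years ago" never reaches the final pattern, which is the point of stacking cheap filters ahead of the extraction rule.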
Information Retrieval
Custom/Controllable Relevance/Matching
Scalability of Search: Large Volumes; High Churn; QPS
Extraction/Acquisition Pipeline Pluggability
Results Post Processing
IR: Use Cases/Impact NSE on Resdex India Relevance
IR: Use Cases/Impact
NSE on Resdex
Error Count: week before: 91, week after: 1
Availability: Before: 97.71% - 99.44%, After: 99.99%
Performance: Slow Queries (> 10 secs): < 0.2%; Average Search Time: 0.55 secs
QA Quote: "There is an overall decrease in the page download time for Resdex Search Results page. In case the cache is cleared the page download time has decreased by 34% to 35%, while the page download time has drastically decreased, more than 73%, when checked without clearing cache."
FirstNaukri
PM Quote: "Hardly any bugs considering the complexity of project. Search results are also coming @ speed of thought."
IR: Platform Features
High Availability, Stability, Performance
Caching: Adaptive Caching of Hit Attributes; Caching of Expression Evaluations; Pre-configurable Caching Query Filters
Distributed Search: Search over Sharded Indexes; Auto Failover; Auto Healing
Search/Sort/Group: Millions of results; Complex expressions
Miscellaneous: Status Reports, Performance Analytics; Suggestive Garbage Collection; Preload Indexes into RAM; Ease of Deployment
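Search over sharded indexes with automatic failover can be sketched as follows. This is a simplified sketch under stated assumptions, not the platform's code: each replica is modelled as a plain callable returning `(score, doc_id)` pairs, and `ShardUnavailable`, `search_shard`, and `distributed_search` are hypothetical names.

```python
import heapq


class ShardUnavailable(Exception):
    """Raised by a replica that cannot serve the query (hypothetical)."""


def search_shard(replicas, query, k):
    """Try each replica of one shard in order; return the first success."""
    for replica in replicas:
        try:
            return replica(query, k)
        except ShardUnavailable:
            continue  # fail over to the next replica of the same shard
    return []  # every replica failed: degrade to partial results


def distributed_search(shards, query, k):
    """Collect per-shard top-k hits and merge them into a global top-k."""
    hits = []
    for replicas in shards:
        hits.extend(search_shard(replicas, query, k))
    # Each hit is a (score, doc_id) pair; keep the k highest scores.
    return heapq.nlargest(k, hits, key=lambda hit: hit[0])
```

Asking every shard for its own top `k` is sufficient for a correct global top `k`, since no document outside a shard's local top `k` can rank higher globally than those inside it.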
IR: Platform Features
Text Transformations
Tokenization/Transformation/Tagging
Controlled, Combinable Stemming: Plural, Tenses, Noun-Forms, etc. [Relevance]
Inversion of Stem-roots: Highlighting/Did You Mean/Query Expansion
Phonetic Token Mapping/Augmentation
Custom Word Mapping/Synonyms (iMatch)
Linguistic Tagging: PoS, Entity Extraction; Match/Boost on Tags; Sentence Detection
Apply different analytics to different fields
Context Sensitive Spelling Correction
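Combinable transformations applied differently per field can be sketched as a list of steps per field name. This is a toy sketch, assuming a deliberately naive plural stemmer and a made-up synonym table; the field names and `FIELD_ANALYZERS` are hypothetical and do not reflect the platform's configuration format.

```python
import re

# Hypothetical custom word map; a real one would be much larger.
SYNONYMS = {"js": "javascript", "dev": "developer"}


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (keeps + and #)."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def strip_plural(tokens):
    """Naive plural stemming: drop a trailing 's' from longer tokens."""
    out = []
    for token in tokens:
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            out.append(token[:-1])
        else:
            out.append(token)
    return out


def map_synonyms(tokens):
    """Replace tokens that have an entry in the custom word map."""
    return [SYNONYMS.get(token, token) for token in tokens]


# Each field gets its own chain of steps, applied left to right.
FIELD_ANALYZERS = {
    "title": [tokenize, strip_plural, map_synonyms],
    "company": [tokenize],  # company names are left unstemmed
}


def analyze(field, text):
    """Run the configured chain for a field over the raw text."""
    value = text
    for step in FIELD_ANALYZERS[field]:
        value = step(value)
    return value
```

For instance, `analyze("title", "Senior JS Developers")` runs all three steps, while `analyze("company", "Global Systems")` only tokenizes, so "systems" keeps its final "s".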
IR: Platform Features
Indexing: Dynamic Rule-Based Sharding, Distributed Search; Multiple Data Source Type Support; (Near-)Real Time Indexing, Search; Generic Auxiliary Index Format (Fast Updates/Retrieval; Realtime Per-User Filtering/Sorting)
Matching/Filtering: Lucene Query Functionality (Phrase, Proximity, Fuzzy, Wildcard); FirstNaukri Suggestor Implementation
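Rule-based sharding can be sketched as an ordered list of (predicate, shard) rules, with the first matching rule deciding where a document is indexed. The rules, field names, and shard names below are invented for illustration and are assumptions, not the platform's actual sharding rules.

```python
from collections import defaultdict

# Hypothetical rules: first match wins, checked in order.
RULES = [
    (lambda doc: doc.get("active_days", 9999) <= 30, "recent"),
    (lambda doc: doc.get("country") == "IN", "india"),
]


def route(document, rules, default_shard):
    """Return the shard name of the first rule whose predicate matches."""
    for predicate, shard in rules:
        if predicate(document):
            return shard
    return default_shard


def build_shards(documents, rules, default_shard="archive"):
    """Group documents by the shard each one routes to."""
    shards = defaultdict(list)
    for document in documents:
        shards[route(document, rules, default_shard)].append(document)
    return dict(shards)
```

Keeping the rules as data rather than code paths means the sharding layout can be changed by editing the rule list, which fits the "business logic = configuration" goal stated earlier in the deck.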
IR: Platform Features
Scoring: Fully Controlled, Customizable Relevance Scores; More controllable/testable than Solr/Default Lucene Scoring; Named Query Parts usable in Expressions; Custom Scorer Variables (Vector Space, Query Boost, LCS, Numwords)
Configurability, API: SQL-like client wrapper (Engine-App interactions look like SQL); XML configurability
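Custom scorer variables feeding a relevance expression can be sketched as below. The variable names `lcs`, `numwords`, and `boost` echo the slide, but the expression itself is a made-up example and the function names are hypothetical; the platform's real scoring formula is not given in the source.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def scorer_variables(query_tokens, doc_tokens, boost=1.0):
    """Compute the per-hit variables an expression may refer to."""
    return {
        "lcs": lcs_length(query_tokens, doc_tokens),
        "numwords": len(doc_tokens),
        "boost": boost,
    }


def score(variables):
    """Hypothetical expression: reward ordered overlap, damp long fields."""
    return variables["boost"] * variables["lcs"] / (1 + variables["numwords"]) ** 0.5
```

Because the score is an explicit function of named variables, each variable can be unit-tested on its own, which is one way to read the slide's "more controllable/testable" claim.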
Road Ahead
"If you don't know where you are going, any road will get you there." - The Cheshire Cat, Alice in Wonderland.