13. Pacman
- Started in 2006 in Bangalore
- Processes large feeds: millions of records in a few hours
- Multi-tenant
- Reliability, operability
- Uses Hadoop M/R; one record is the unit of processing
- Workflow semantics over Hadoop
  - Workflow defined by a DAG
  - Each node's result is stored in HDFS ‘Channels’
- Feed-processing-oriented API, abstracting M/R
- High availability, cross-colo replication
  - HDFS data
Yahoo! Inc
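
The DAG-plus-channels idea above can be sketched in a few lines. This is only an illustration, not Pacman's API: `run_workflow`, the `nodes`/`deps` dictionaries and the in-memory `channels` dict (standing in for HDFS channels) are all hypothetical names, and everything runs in one process rather than as Hadoop jobs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(nodes, deps, feed_records):
    """Run every workflow node after its upstream nodes.

    nodes: node name -> function applied to ONE record
           (returns the new record, or None to drop it)
    deps:  node name -> set of upstream node names (empty set for a source)
    Returns the channels: node name -> list of output records.
    """
    channels = {}  # assumption: a dict stands in for per-node HDFS channels
    for name in TopologicalSorter(deps).static_order():
        upstream = deps.get(name, set())
        if upstream:
            # A node reads the channels written by its upstream nodes.
            inputs = [r for u in sorted(upstream) for r in channels[u]]
        else:
            # Source nodes read the raw feed.
            inputs = feed_records
        out = []
        for record in inputs:  # one record is the unit of processing
            result = nodes[name](record)
            if result is not None:
                out.append(result)
        channels[name] = out
    return channels


# Hypothetical two-node workflow: validate, then normalize.
nodes = {
    "validate": lambda r: r if "id" in r else None,
    "normalize": lambda r: {**r, "title": r.get("title", "").strip()},
}
deps = {"validate": set(), "normalize": {"validate"}}
feed = [{"id": 1, "title": " Car "}, {"title": "no id"}]
channels = run_workflow(nodes, deps, feed)
```

Keeping each node's output in its own channel means any node can be inspected or re-run from its inputs, which is presumably part of what storing results in HDFS buys the real system.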
14. Design
- Notification
- Asynchronous processing
- One job for each workflow node
- State in DB
- Feed copied onto the Grid
- Reporting service exposes metrics and logs
16. The small feeds problem
- More and more small feeds onboarded (NPC, OMG, Green…)
- Overhead of Pacman is high (Hadoop, DB…)
- Too many small files on HDFS
- Solution: process the workflow nodes in a web server farm
  - Lack of isolation between executions
  - Native libraries management
  - Operability issues (provisioning,…)
17. Pepper requirements
- Be able to support all properties: News, Finance, Travel, …
- Scalable (millions of feeds a day), elastic
- Isolation, multiple native library versions
- Low overhead (<5s)
- Compatible with the Pacman API
- Reuse Pacman code/infrastructure as much as possible
18. Pepper
- Servlet model
- Synchronous in-memory execution of the workflow (very fast)
  - No use of HDFS
- Shares Pacman API and infrastructure
  - Hadoop
  - Reporting, Deployment…
- Cloud-like qualities
  - Elastic, scalable
  - Isolation
19. Design
- Embedded Jetty server runs in a Map task and registers with ZooKeeper
- 1 Hadoop job = 1 Map task = 1 Web Server = 1 WebApp = 1 Workflow
- Proxy Router receives incoming requests, looks up ZooKeeper and redirects to the appropriate Web Server
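
The register / look up / redirect flow above can be sketched as follows. This is a toy model under stated assumptions: `Registry` is an in-memory stand-in for ZooKeeper (no real ZooKeeper client is used), `route` is a hypothetical name for the Proxy Router's decision, each workflow is assumed to have a single web server, and the use of HTTP 307 and 503 status codes is a guess, since the slides do not say how the redirect is performed.

```python
class Registry:
    """In-memory stand-in for ZooKeeper: workflow name -> web server address."""

    def __init__(self):
        self._servers = {}

    def register(self, workflow, address):
        # Called by the embedded web server once its Map task has started.
        self._servers[workflow] = address

    def unregister(self, workflow):
        # Called when the Map task (and so the web server) goes away.
        self._servers.pop(workflow, None)

    def lookup(self, workflow):
        return self._servers.get(workflow)


def route(registry, workflow, path):
    """Return (status, location) for an incoming request.

    Assumption: 307 redirect when a server is registered, 503 otherwise.
    """
    address = registry.lookup(workflow)
    if address is None:
        return (503, None)
    return (307, f"http://{address}{path}")


# Hypothetical host name and workflow name, for illustration only.
registry = Registry()
registry.register("news", "grid-node-17:8080")
hit = route(registry, "news", "/process")
miss = route(registry, "finance", "/process")
```

Because servers announce themselves, the router needs no static configuration: a workflow becomes reachable as soon as its Map task registers and disappears when it unregisters.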
21. Production numbers
- Pacman: 20+ solutions (Autos, Real Estate, Deals…)
  - 150,000 feeds
  - 250 requests/h
  - 200 million listings processed/week
- Pepper: News, Finance, NPC
  - 600,000 feeds
  - 10,000 requests/h… for now
- Cluster of 20 Hadoop slaves (x2 colos)
22. Cover the whole spectrum
- Clever switch between the 2 systems
- Choice can be made upfront
  - ‘Sticky’ feeds go to Pacman
  - Feeds larger than 2MB go to Pacman
- Feeds that fail in Pepper are redirected to Pacman
  - OutOfMemory
  - TimeOut
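
The switching rules above amount to an upfront check plus a fallback. A minimal sketch, assuming hypothetical names (`choose_system`, `process`, `run_pepper`, `run_pacman`), a binary reading of "2MB", and Python's `MemoryError`/`TimeoutError` as stand-ins for the OutOfMemory and TimeOut failures named on the slide:

```python
SIZE_LIMIT_BYTES = 2 * 1024 * 1024  # assumption: "2MB" read as 2 MiB


def choose_system(feed_size, sticky):
    """Upfront choice: sticky or large feeds go to Pacman, the rest to Pepper."""
    if sticky or feed_size > SIZE_LIMIT_BYTES:
        return "pacman"
    return "pepper"


def process(feed_size, sticky, run_pepper, run_pacman):
    """Return (system that produced the result, result)."""
    if choose_system(feed_size, sticky) == "pacman":
        return "pacman", run_pacman()
    try:
        return "pepper", run_pepper()
    except (MemoryError, TimeoutError):
        # Feeds that fail in Pepper are redirected to Pacman.
        return "pacman", run_pacman()


def pepper_times_out():
    # Simulates a Pepper execution that exceeds its time budget.
    raise TimeoutError("too slow")


small_ok = process(10_000, False, lambda: "fast result", lambda: "batch result")
small_failed = process(10_000, False, pepper_times_out, lambda: "batch result")
large = process(5 * 1024 * 1024, False, lambda: "fast result", lambda: "batch result")
```

The fallback makes the upfront rules a performance hint rather than a correctness requirement: a feed misrouted to Pepper still gets processed, only later.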
23. Examples of processing
- Validation against schema
- Filtering (security)
- Image resizing
- Send images to edge serving
- Reformat to common model
- Simple (in-line) enrichments
  - Categorization
  - Geocoding
  - Entity recognition
  - Clustering
24. Conclusion
- One common platform (Deployment, Reporting…)
  - Covers the whole spectrum of feeds
  - Shares the same Hadoop cluster
- Very generic concepts
  - Pacman: workflow engine
  - Pepper: serving cloud on top of Hadoop
25. Pepper future work
- On-demand allocation of servers
- Async NIO between Proxy Router and Map Web Engine to increase scalability
- Improve distribution of requests across web servers
- Follow the Hadoop roadmap