DISQL is a distributed programming framework used at Baidu to perform statistical analysis and data extraction from logs to generate reports. It provides SQL-like operators that can be combined and encapsulates distributed algorithms with automatic code generation. DISQL includes a web-based log statistical platform (LSP), programming interfaces (DQuery) in languages like PHP, and different editing modes from simple to complex. It is driven by use cases rather than completeness and takes advantage of host languages while remaining open for extension of algorithms.
7. Problems: statistical analysis of features (features of web pages, web sites, ads, user preferences, etc.) in order to provide data for data mining and machine learning
10. A Platform named Log Statistical Platform, a.k.a. LSP: web-based; convenient for secondary development; convenient for task/data/rights management
11. A Programming Framework named DIstributed SQL, a.k.a. DISQL: provides SQL-like operators which can be combined arbitrarily; encapsulates distributed algorithms; automatic code generation
12. Application Programming Interfaces named Distributed Query, a.k.a. DQuery: DSL-style APIs; embedded in well-known programming languages (PHP so far; C++/Python, … in the future); uses the method chaining technique to provide a fluent interface; data flow in the form of a DAG composed of chains of methods
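The method chaining idea on this slide can be sketched in a few lines. The slides do not show DQuery's actual API, so everything below is hypothetical: the `Node` class, the `source` helper, and the operator names `select`, `filter`, `group` and `join` are illustrative names only, and Python is used here simply as a stand-in host language. The sketch only shows how chained calls can build a data-flow DAG instead of executing anything.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """One operator in a data-flow graph (hypothetical, not DQuery's real class)."""
    op: str
    args: Tuple[str, ...] = ()
    parents: Tuple["Node", ...] = ()

    # Each method returns a NEW node whose parent is self, so calls can be chained.
    def select(self, *fields: str) -> "Node":
        return Node("select", fields, (self,))

    def filter(self, condition: str) -> "Node":
        return Node("filter", (condition,), (self,))

    def group(self, *keys: str) -> "Node":
        return Node("group", keys, (self,))

    def join(self, other: "Node", key: str) -> "Node":
        # Two parents: this is what turns a set of chains into a DAG.
        return Node("join", (key,), (self, other))

    def plan(self) -> List[str]:
        """Return operator names in dependency order (parents before children)."""
        seen, order = set(), []

        def visit(node: "Node") -> None:
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node.op)

        visit(self)
        return order


def source(name: str) -> Node:
    """Hypothetical entry point naming an input log."""
    return Node("source", (name,))


# Two chains of methods, joined into one graph.
queries = source("query_log").select("site", "query").filter("site != ''")
ads = source("ad_log").select("site", "ad_id")
joined = queries.join(ads, key="site").group("site")

print(joined.plan())
```

Nothing runs when the chain is written; the chain only records operators and their dependencies. A framework in this style could then walk the graph and generate the distributed job from it, which is one plausible reading of "automatic code generation" on the previous slide.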
18. LSP Architecture: data presentation & monitoring; third party apps; data access layer; data management layer; computing layer; storage systems; computing systems
21. Example 2: given a log of query and ad shows: extract the site field from the url field; filter sites with a regex; calculate the number of query and ad shows per site; output in JSON format
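To make the four steps of this example concrete, here is a single-machine sketch in plain Python. It is not DISQL or DQuery code, and the record layout is an assumption: each log record is taken to be a dict with a `url` field and an `event` field that is either `"query"` or `"ad_show"`. The function name `site_stats` and the sample data are likewise made up for illustration.

```python
import json
import re
from collections import defaultdict
from urllib.parse import urlparse


def site_stats(records, site_pattern):
    """Count query and ad-show events per site and return a JSON string.

    Assumed record layout (not from the slides):
    {"url": "...", "event": "query" | "ad_show"}.
    """
    pattern = re.compile(site_pattern)
    counts = defaultdict(lambda: {"query": 0, "ad_show": 0})
    for rec in records:
        # Step 1: extract the site field from the url field.
        site = urlparse(rec["url"]).netloc
        # Step 2: filter sites with a regex.
        if not pattern.search(site):
            continue
        # Step 3: count query and ad shows per site.
        if rec["event"] in ("query", "ad_show"):
            counts[site][rec["event"]] += 1
    # Step 4: output in JSON format.
    return json.dumps(counts, sort_keys=True)


records = [
    {"url": "http://www.example.com/a", "event": "query"},
    {"url": "http://www.example.com/b", "event": "ad_show"},
    {"url": "http://news.example.org/x", "event": "query"},
    {"url": "http://www.example.com/c", "event": "query"},
]

print(site_stats(records, r"\.com$"))
```

In a distributed setting the per-site counting is the step that needs a shuffle across machines; the extraction and filtering steps are per-record and could run anywhere.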
24. Use Case Driven VS Completeness [diagram: "Our Solution" and four "Problem" labels]
25. Internal DSL VS External DSL: take advantage of the parsers, libraries and VMs of the host languages, their users and communities, and their language features; different from Pig, Hive, Sawzall, etc.
26. Open/Closed Principle: “open for extension, closed for modification”; open for single machine algorithms, closed for distributed algorithms; also different from Pig, Hive, Sawzall, …
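One way to read "open for single machine algorithms, closed for distributed algorithms" is that the framework owns the partitioning and grouping, while users plug in ordinary functions that run on one group at a time. The sketch below illustrates that split under this assumption. The function `grouped_apply` and its parameters are hypothetical and not taken from DISQL, and the in-memory partitioning only stands in for a real distributed shuffle.

```python
import statistics
from collections import defaultdict


def grouped_apply(records, key_fn, local_fn, num_partitions=4):
    """Group records by key, then apply a user-supplied function to each group."""
    # "Closed" part: partitioning and grouping are fixed by the framework.
    # Here they are simulated in memory instead of across machines.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for rec in records:
        key = key_fn(rec)
        partitions[hash(key) % num_partitions][key].append(rec)

    # "Open" part: local_fn is any single-machine algorithm the user provides.
    result = {}
    for part in partitions:
        for key, group in part.items():
            result[key] = local_fn(group)
    return result


def median_value(group):
    """User extension: median of the second field of each record in a group."""
    return statistics.median(value for _, value in group)


data = [("a", 1), ("a", 3), ("b", 10), ("a", 5)]
medians = grouped_apply(data, key_fn=lambda rec: rec[0], local_fn=median_value)
print(medians)
```

Swapping `median_value` for another function changes the analysis without touching `grouped_apply`, which is the extension point this slide describes.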