2. What Do You Do if Your Data is Too Big for a Database? Give up and invoke sampling. Buy a proprietary system and ask for a raise. Begin to build a custom system and explain why it is not yet done. Use Hadoop. Use an alternative large data cloud (e.g., Sector).
3. Basic Idea Turn it into a pleasantly parallel problem. Use a large data cloud to manage and prepare the data. Use a Map/Bucket function to split the job. Run R on each piece using Reduce/UDF or streams. Use PMML multiple models to glue the pieces together.
4. Why Listen? This approach allows you to scale R relatively easily to hundreds of TB to PB. The approach is easy. (A plus: it may look hard to your colleagues, boss or clients.) There is at least an order of magnitude of performance to be gained with the right design.
6. The Google Data Stack The Google File System (2003) MapReduce: Simplified Data Processing… (2004) BigTable: A Distributed Storage System… (2006)
7. Map-Reduce Example Input is a file with one document per record. The user specifies a map function: key = document URL, value = the terms that the document contains. For example:
map(“doc cdickens”, “it was the best of times”) →
(“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1)
8. Example (cont’d) The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase). The user-defined reduce function then combines all the values associated with the same key. For example:
key = “it”, values = 1, 1 → reduce → (“it”, 2)
key = “was”, values = 1, 1 → reduce → (“was”, 2)
key = “best”, values = 1 → reduce → (“best”, 1)
key = “worst”, values = 1 → reduce → (“worst”, 1)
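The word-count example above can be sketched in a few lines of Python. This is a minimal in-memory illustration (the function and variable names are ours, not from the slides); a real MapReduce job would run the map and reduce calls on many nodes, with the library doing the shuffle across the network.

```python
from collections import defaultdict

def map_doc(url, text):
    """Map function: emit a (term, 1) pair for every term in the document."""
    for term in text.split():
        yield (term, 1)

def shuffle(pairs):
    """Group all values by key, as the library's shuffle/sort phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce function: combine all values for one key by summing them."""
    return (key, sum(values))

docs = [("doc cdickens", "it was the best of times it was the worst of times")]
pairs = [p for url, text in docs for p in map_doc(url, text)]
counts = dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
print(counts["it"], counts["was"], counts["best"], counts["worst"])
```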
10. Google’s Large Data Cloud (Google’s Stack)
Applications
Compute Services: Google’s MapReduce
Data Services: Google’s BigTable
Storage Services: Google File System (GFS)
11. Hadoop’s Large Data Cloud (Hadoop’s Stack)
Applications
Compute Services: Hadoop’s MapReduce
Data Services: NoSQL Databases
Storage Services: Hadoop Distributed File System (HDFS)
13. Sector’s Large Data Cloud (Sector’s Stack)
Applications
Compute Services: Sphere’s UDFs
Data Services
Storage Services: Sector’s Distributed File System (SDFS)
Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
14. Apply User Defined Functions (UDFs) to Files in the Storage Cloud (diagram: UDFs take the place of the map and shuffle/reduce stages)
15. Folklore MapReduce is great. But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds. And often it is easier to use Hadoop streams, Sector streams, etc.
19. MalStone Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 500 million 100-byte records per node across 20 nodes.
21. Problems Deploying Models Models are deployed in proprietary formats. Models are application dependent, system dependent, and architecture dependent. The time required to deploy models and to integrate them with other applications can be long.
22. Predictive Model Markup Language (PMML) Based on XML. Benefits of PMML: an open standard for data mining and statistical models; not concerned with the process of creating a model; provides independence from application, platform, and operating system; simplifies the use of data mining models by other applications (the consumers of data mining models).
23. PMML Document Components Data dictionary, mining schema, transformation dictionary, multiple models (including segments and ensembles), model verification, …, univariate statistics (ModelStats), optional extensions.
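Since PMML is XML, the multiple-models component above has a concrete shape worth seeing. The following is a hedged, abbreviated sketch of a PMML 4.0 `MiningModel` with segmentation; the element names follow the PMML schema, but the field name `siteID`, the segment values, and the elided per-segment models are illustrative assumptions, not a complete valid document.

```xml
<MiningModel functionName="regression">
  <MiningSchema>
    <MiningField name="siteID"/>
    <MiningField name="visits"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="selectFirst">
    <Segment id="1">
      <SimplePredicate field="siteID" operator="equal" value="site-a"/>
      <!-- per-segment model built by R for site-a goes here -->
    </Segment>
    <Segment id="2">
      <SimplePredicate field="siteID" operator="equal" value="site-b"/>
      <!-- per-segment model built by R for site-b goes here -->
    </Segment>
  </Segmentation>
</MiningModel>
```

Each `Segment` pairs a predicate (which records it applies to) with one embedded model, which is what lets separately built models be glued into a single file.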
33. Step 1: Preprocess data using MapReduce or UDFs. Step 2: Invoke R on each segment/bucket and build a PMML model. Step 3: Gather the models together to form a multiple-model PMML file.
34. Building: Step 1: Preprocess data using MapReduce or UDFs. Step 2: Build a separate model in each segment using R. Scoring: Step 1: Preprocess data using MapReduce or UDFs. Step 2: Score the data in each segment using R.
35. Sawmill Summary Use Hadoop MapReduce or Sector UDFs to preprocess the data. Use Hadoop Map or Sector buckets to segment the data to gain parallelism. Build a separate statistical model for each segment using R and Hadoop / Sector streams. Use the multiple models specification in PMML version 4.0 to specify the segmentation. Example: use a Hadoop Map function to send all data for each web site to a different segment (on a different processor).
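The per-web-site segmentation example above can be sketched as follows. This is a single-process Python illustration under assumed (site, value) records; in the real pipeline each segment would land on a different node, and R (invoked via Hadoop or Sector streams) would fit the per-segment model rather than the stand-in mean used here.

```python
from collections import defaultdict

# Hypothetical input records: (web site, observed value).
records = [
    ("site-a", 1.0), ("site-b", 2.0), ("site-a", 3.0),
    ("site-b", 4.0), ("site-c", 5.0),
]

def map_to_segment(record):
    """Map function: the key (web site) determines the segment/bucket."""
    site, value = record
    return site, value

segments = defaultdict(list)
for record in records:
    key, value = map_to_segment(record)
    segments[key].append(value)  # each segment would sit on a different processor

# Stand-in for "build a separate model per segment with R":
# here each "model" is just the segment mean.
models = {site: sum(vals) / len(vals) for site, vals in segments.items()}
print(models)  # one model per web site, ready to be combined into one PMML file
```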
37. Using R to score 2 segments concatenated together: 60 minutes.