2. What Do You Do if Your Data is Too Big for a Database? Give up and invoke sampling. Buy a proprietary system and ask for a raise. Begin to build a custom system and explain why it is not yet done. Use Hadoop. Use an alternative large data cloud (e.g., Sector).
3. Basic Idea Turn it into a pleasantly parallel problem. Use a large data cloud to manage and prepare the data. Use a Map/Bucket function to split the job. Run R on each piece using Reduce/UDF or streams. Use PMML multiple models to glue the pieces together.
4. Why Listen? This approach allows you to scale R relatively easily to hundreds of TB to PB. The approach is easy. (A plus: it may look hard to your colleagues, boss or clients.) There is at least an order of magnitude of performance to be gained with the right design.
6. The Google Data Stack The Google File System (2003) MapReduce: Simplified Data Processing… (2004) BigTable: A Distributed Storage System… (2006)
7. Map-Reduce Example Input is a file with one document per record. The user specifies a map function: key = document URL, value = the terms that the document contains. For example:
map(“doc cdickens”, “it was the best of times”) →
(“it”, 1), (“was”, 1), (“the”, 1), (“best”, 1)
8. Example (cont’d) The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase). The user-defined reduce function then combines all the values associated with the same key. For example:
key = “it”, values = 1, 1 → reduce → (“it”, 2)
key = “was”, values = 1, 1 → reduce → (“was”, 2)
key = “best”, values = 1 → reduce → (“best”, 1)
key = “worst”, values = 1 → reduce → (“worst”, 1)
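The word-count example above can be sketched in a few lines of Python. This is a minimal in-memory illustration (the function and variable names are ours, not from the slides); a real MapReduce job would run the map and reduce calls on many nodes, with the library doing the shuffle across the network.

```python
from collections import defaultdict

def map_doc(url, text):
    """Map function: emit a (term, 1) pair for every term in the document."""
    for term in text.split():
        yield (term, 1)

def shuffle(pairs):
    """Group all values by key, as the library's shuffle/sort phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce function: combine all values for one key by summing them."""
    return (key, sum(values))

docs = [("doc cdickens", "it was the best of times it was the worst of times")]
pairs = [p for url, text in docs for p in map_doc(url, text)]
counts = dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
print(counts["it"], counts["was"], counts["best"], counts["worst"])
```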
10. Google’s Large Data Cloud (Google’s Stack)
Applications
Compute Services: Google’s MapReduce
Data Services: Google’s BigTable
Storage Services: Google File System (GFS)
11. Hadoop’s Large Data Cloud (Hadoop’s Stack)
Applications
Compute Services: Hadoop’s MapReduce
Data Services: NoSQL Databases
Storage Services: Hadoop Distributed File System (HDFS)
13. Sector’s Large Data Cloud (Sector’s Stack)
Applications
Compute Services: Sphere’s UDFs
Data Services
Storage Services: Sector’s Distributed File System (SDFS)
Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
14. Apply User Defined Functions (UDFs) to Files in the Storage Cloud (diagram: UDFs take the place of the map and shuffle/reduce stages)
15. Folklore MapReduce is great. But sometimes it is easier to use UDFs or other parallel programming frameworks for large data clouds. And often it is easier to use Hadoop streams, Sector streams, etc.
19. MalStone Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 500 million 100-byte records per node across 20 nodes.
21. Problems Deploying Models Models are deployed in proprietary formats. Models are application dependent, system dependent, and architecture dependent. The time required to deploy models and to integrate them with other applications can be long.
22. Predictive Model Markup Language (PMML) Based on XML. Benefits of PMML: an open standard for data mining and statistical models; not concerned with the process of creating a model; provides independence from application, platform, and operating system; simplifies the use of data mining models by other applications (the consumers of data mining models).
23. PMML Document Components Data dictionary, mining schema, transformation dictionary, multiple models (including segments and ensembles), model verification, …, univariate statistics (ModelStats), optional extensions.
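Since PMML is XML, the multiple-models component above has a concrete shape worth seeing. The following is a hedged, abbreviated sketch of a PMML 4.0 `MiningModel` with segmentation; the element names follow the PMML schema, but the field name `siteID`, the segment values, and the elided per-segment models are illustrative assumptions, not a complete valid document.

```xml
<MiningModel functionName="regression">
  <MiningSchema>
    <MiningField name="siteID"/>
    <MiningField name="visits"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="selectFirst">
    <Segment id="1">
      <SimplePredicate field="siteID" operator="equal" value="site-a"/>
      <!-- per-segment model built by R for site-a goes here -->
    </Segment>
    <Segment id="2">
      <SimplePredicate field="siteID" operator="equal" value="site-b"/>
      <!-- per-segment model built by R for site-b goes here -->
    </Segment>
  </Segmentation>
</MiningModel>
```

Each `Segment` pairs a predicate (which records it applies to) with one embedded model, which is what lets separately built models be glued into a single file.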
33. Step 1: Preprocess data using MapReduce or UDFs. Step 2: Invoke R on each segment/bucket and build a PMML model. Step 3: Gather the models together to form a multiple-model PMML file.
34. Building: Step 1: Preprocess data using MapReduce or UDFs. Step 2: Build a separate model in each segment using R. Scoring: Step 1: Preprocess data using MapReduce or UDFs. Step 2: Score the data in each segment using R.
35. Sawmill Summary Use Hadoop MapReduce or Sector UDFs to preprocess the data. Use Hadoop Map or Sector buckets to segment the data to gain parallelism. Build a separate statistical model for each segment using R and Hadoop / Sector streams. Use the multiple models specification in PMML version 4.0 to specify the segmentation. Example: use a Hadoop Map function to send all data for each web site to a different segment (on a different processor).
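The per-web-site segmentation example above can be sketched as follows. This is a single-process Python illustration under assumed (site, value) records; in the real pipeline each segment would land on a different node, and R (invoked via Hadoop or Sector streams) would fit the per-segment model rather than the stand-in mean used here.

```python
from collections import defaultdict

# Hypothetical input records: (web site, observed value).
records = [
    ("site-a", 1.0), ("site-b", 2.0), ("site-a", 3.0),
    ("site-b", 4.0), ("site-c", 5.0),
]

def map_to_segment(record):
    """Map function: the key (web site) determines the segment/bucket."""
    site, value = record
    return site, value

segments = defaultdict(list)
for record in records:
    key, value = map_to_segment(record)
    segments[key].append(value)  # each segment would sit on a different processor

# Stand-in for "build a separate model per segment with R":
# here each "model" is just the segment mean.
models = {site: sum(vals) / len(vals) for site, vals in segments.items()}
print(models)  # one model per web site, ready to be combined into one PMML file
```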
37. Using R to score 2 segments concatenated together: 60 minutes.