Fast lookups in R
Joseph Adler
April 13, 2010
About me
Relevant work
Tasks:
• Computer security research
• Credit risk modeling
• Pricing strategy
• Direct marketing
Places:
• American Express
• Johnson and Johnson
• DoubleClick
• VeriSign
• LinkedIn (now)
About me Books
Today's talk
What I wrote:
If you need to store a big lookup table, consider implementing the table using an environment. Environment objects are implemented using hash tables. Vectors and lists are not. This means that looking up a value in a list with n elements can take O(n) time. Looking up the value in an environment object takes O(1) time on average.
Today's talk
What I read after the book was printed:
Re: [R] beginner Q: hashtable or dictionary?
From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon 30 Jan 2006 - 18:37:00 EST
On Sun, 29 Jan 2006, hadley wickham wrote:
>> use a 'list':
>
> Is a list O(1) for setting and getting?
Can you elaborate? R is a vector language, and normally you create a list in one pass, and you can retrieve multiple elements at once.
Retrieving elements by name from a long vector (including a list) is very fast, as an internal hash table is used.
Does the following item from ONEWS answer your question?
Indexing a vector by a character vector was slow if both the vector and index were long (say 10,000). Now hashing is used and the time should be linear in the longer of the lengths (but more memory is used).
Indexing by number is O(1) except where replacement causes the list vector to be copied. There is always the option to use match() to convert to numeric indexing.
--
Brian D. Ripley, Professor of Applied Statistics, University of Oxford
Highlighted on the slide: "Retrieving elements by name from a long vector (including a list) is very fast, as an internal hash table is used." (Professor Brian D. Ripley)
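The match() option mentioned at the end of the message can be sketched as follows. This is only an illustration: the vector prices and the names in it are made up, while match() is the base R function the message refers to.

```r
# A small named vector used as a lookup table (hypothetical data).
prices <- c(apple = 0.5, pear = 0.75, fig = 2)

# Names we want to look up; "kiwi" is deliberately absent.
wanted <- c("fig", "apple", "kiwi")

# match() converts the names to positions once (NA for names not found)...
idx <- match(wanted, names(prices))

# ...and the positions can then be used for plain numeric indexing.
found <- prices[idx]
```

Converting names to positions once can be useful when the same set of names is looked up repeatedly, since every later access is an index lookup.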
Today's talk
• A short introduction to objects in R
• Looking up values in R
• How lookup tables are implemented in R
• Measuring lookup speed
• Optimizing lookup speed
Objects in R
Everything in R is an object. Here are some examples of objects.
Numeric vector:
> onehalf <- 1/2
> class(onehalf)
[1] "numeric"
Integer vector:
> four <- as.integer(4)
> four
[1] 4
> class(four)
[1] "integer"
Character vector:
> zero <- "zero"
> class(zero)
[1] "character"
Logical vector:
> this.is.interesting <- FALSE
> class(this.is.interesting)
[1] "logical"
Vectors can have multiple elements:
> one.to.five <- 1:5
> class(one.to.five)
[1] "integer"
> six.to.ten <- c(6, 7, 8, 9, 10)
> class(six.to.ten)
[1] "numeric"
Lists contain heterogeneous collections of objects:
> stuff <- list(3.14, "hat", FALSE)
> class(stuff)
[1] "list"
Functions are also objects in R:
> f <- function(x, y) {
+   x + y
+ }
> f
function(x, y) {
  x + y
}
> class(f)
[1] "function"
Environments map names to objects. They are used within R itself to map variable names to objects. You can access these environment objects, or create your own.
> one <- 1
> two <- 2
> three <- 3
> objects()
[1] "one"   "three" "two"
> e <- .GlobalEnv
> class(e)
[1] "environment"
> objects(e)
[1] "e"     "one"   "three" "two"
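The slide above inspects the global environment; creating your own environment might look like the sketch below. The name ages and the values stored in it are invented for illustration; new.env, assign, get and ls are base R functions.

```r
# Create a new, empty environment backed by a hash table.
ages <- new.env(hash = TRUE)

# Bind two names to values inside that environment.
assign("Joe", 41, envir = ages)
assign("Bob", 29, envir = ages)

# Read one value back by name.
joe.age <- get("Joe", envir = ages)

# ls() lists the names defined in the environment, in sorted order.
known <- ls(ages)
```

Because the bindings live in ages rather than in the global environment, they do not clutter (or collide with) ordinary variables.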
Lookups
You can look up an item in a vector, list, or array within R. Let's define a vector:
> a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> a
 [1]  1  2  3  4  5  6  7  8  9 10
You can refer to elements by index:
> a[3]
[1] 3
It's also possible to name elements in a vector, then refer to them by name:
> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob 
  2 
This can be very convenient: you can use every vector in R as a table. You can access the name vector through the names function:
> names(b)
[1] "Joe" "Bob" "Jim"
Named vectors in R are implemented using two different arrays: one holding the values (a.20) and one holding the names (names(a.20)). Here a.20 is the 20-element labeled vector built later in the talk.
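The two-array layout can be seen from R itself. The following is a small sketch using the vector b from the earlier slide; the variable names values, labels and rebuilt are introduced here only for illustration.

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

# The data array: the same numbers with the names stripped off.
values <- unname(b)

# The names array is stored as the "names" attribute of the vector.
labels <- attr(b, "names")

# Putting the two arrays back together reproduces the original vector.
rebuilt <- setNames(values, labels)
```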
Lookups
The name lookup algorithm works roughly like this:
function(vector, name) {
  # walk the names array from the first element to the last
  for (i in seq_along(vector)) {
    if (names(vector)[i] == name)
      return(vector[i])
  }
  # no element carries this name
  return(NA)
}
Lookups
Example: look up a.20["F"]. [Animation: the slides step through names(a.20) one element at a time, comparing each name with "F" until it matches.]
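The walk-through above can be imitated in a few lines of R. This sketch follows the talk's simplified model of a name lookup and is not a description of R's internals; the helper scan.for.name and the vector a.10 are hypothetical names introduced here.

```r
# Hypothetical helper: scan the names one at a time and count the comparisons.
scan.for.name <- function(v, name) {
  comparisons <- 0
  for (i in seq_along(v)) {
    comparisons <- comparisons + 1
    if (names(v)[i] == name) {
      return(list(value = v[[i]], comparisons = comparisons))
    }
  }
  # The name was never found: every element was inspected.
  list(value = NA, comparisons = comparisons)
}

# Ten values labeled "A" through "J".
a.10 <- setNames(1:10, LETTERS[1:10])

found  <- scan.for.name(a.10, "F")   # "F" is the sixth name
absent <- scan.for.name(a.10, "Z")   # "Z" is not present
```

In this model a successful lookup costs as many comparisons as the position of the name, and a failed lookup costs one comparison per element.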
Lookups
In vectors:
• Looking up a value by index takes a constant amount of time.
• Looking up a value by name (potentially) requires looking at every name in the names array. (This means that lookup times scale linearly with the number of items in the table.)
Environments store (and fetch) data using a different structure: they use hash tables. Hash tables rely on a hash function to map labels to indices.
Simple hash table implementation. Example: store 15 ¾ for "Joe":
1. Calculate h("Joe"); here h("Joe") = 4.
2. Store 15 ¾ in the table in slot h("Joe"), that is, slot 4.
If you carefully choose the size of the hash table and the hash function, you can store and look up values in constant time (on average) in hash tables.
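A toy version of such a table can be written directly in R. The sketch below is an illustration only: the hash function h (sum of character codes modulo the table size), the table size of 8, and the helpers store and fetch are all assumptions made here, and R's own environments use a different, internal implementation.

```r
# Toy hash function (an assumption for illustration, not R's internal one):
# add up the character codes and wrap around the table size.
h <- function(label, size) {
  sum(utf8ToInt(label)) %% size + 1
}

# The table: a fixed number of slots, all empty to begin with.
slots <- vector("list", 8)

# Store a value: compute the slot, then file the value under its label.
# Each slot holds a small named list so colliding labels can share a slot.
store <- function(slots, label, value) {
  k <- h(label, length(slots))
  if (is.null(slots[[k]])) slots[[k]] <- list()
  slots[[k]][[label]] <- value
  slots
}

# Fetch a value: compute the slot again and look only inside that slot.
fetch <- function(slots, label) {
  slots[[h(label, length(slots))]][[label]]
}

slots <- store(slots, "Joe", 15.75)
slots <- store(slots, "Bob", 2)
joe <- fetch(slots, "Joe")
```

The lookup never touches slots other than the one the hash function points to, which is why the cost stays roughly constant as long as few labels share a slot.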
Measuring Lookup Speed In theory, looking up values in environments should be faster than looking up values in vectors. In practice, how much difference does this make? Let’s measure how much time it takes to look up values in vectors and environments, using different lookup methods
Measuring Lookup Speed
Let's build a large, labeled vector for testing:
labeled.array <- function(n) {
  a <- 1:n
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    # label each element with its index, digits replaced by letters
    names(a)[i] <- chartr(from, to, i)
  }
  a
}
Here's an example of the output of this function:
> a.20 <- labeled.array(20)
> a.20
 A  B  C  D  E  F  G  H  I AJ AA AB AC AD AE AF AG AH AI BJ 
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
Let's also create environment objects for testing:
labeled.environment <- function(n) {
  e <- new.env(hash=TRUE, size=n)
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    assign(x=chartr(from, to, i), value=i, envir=e)
  }
  e
}
Here's an example of the output of this function:
> e.20 <- labeled.environment(20)
> e.20
<environment: 0x143756c>
You can fetch values from an environment object with the get function:
> get("A", envir=e.20)
[1] 1
> get("BJ", envir=e.20)
[1] 20
You can also fetch values from an environment with the double bracket operator:
> e.20[["A"]]
[1] 1
> e.20[["BJ"]]
[1] 20
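A few related base R calls are worth knowing when fetching from environments. The sketch below uses a small hand-built environment e rather than the talk's test objects, and the name "ZZ" is simply a name that was never assigned.

```r
e <- new.env(hash = TRUE)
assign("A", 1, envir = e)
assign("BJ", 20, envir = e)

# get() stops with an error for an unknown name, so check first.
# inherits = FALSE restricts the search to e itself, not its parents.
has.zz <- exists("ZZ", envir = e, inherits = FALSE)

# Double brackets return NULL for an unknown name instead of failing.
zz <- e[["ZZ"]]

# mget() fetches several names at once and returns a named list.
both <- mget(c("A", "BJ"), envir = e)
```

Note that get() and exists() search enclosing environments by default, so a name such as "c" would be found in base R unless inherits = FALSE is given.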
Measuring Lookup Speed
Creating examples for testing:
arrays <- list()
for (i in 10:15) {
  arrays[[as.character(2 ** i)]] <- labeled.array(2 ** i)
}
environments <- list()
for (i in 10:15) {
  environments[[as.character(2 ** i)]] <- labeled.environment(2 ** i)
}
Using the test function:
test_expressions("first element, by index:",
  function(d, l, r) {
    s <- 0
    for (v in 1:r) {
      s <- s + d[1]
    }
  },
  arrays, 1024)
Output:
first element, by index:
 1024  2048  4096  8192 16384 32768 
0.010 0.003 0.004 0.003 0.005 0.004 
Measuring Lookup Speed
Results for 1024 lookups: [results table not preserved in this transcript]
Notice that these values increase linearly with the number of elements in the array.
Let's focus on the results for the largest arrays (which are the most precise).
Results for 1024 lookups, 32768 elements: [results table not preserved in this transcript]
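Since the result tables did not survive, readers may want to take their own measurements. The following is a minimal, self-contained sketch rather than the talk's benchmark: the key names, the table size of 2^12 and the choice of looking up the last key are assumptions made here, and timings will differ between machines and R versions.

```r
n <- 2^12
keys <- paste0("k", seq_len(n))

# The same table twice: once as a named vector, once as an environment.
v <- setNames(seq_len(n), keys)
e <- list2env(as.list(v), envir = new.env(hash = TRUE, size = n))

last <- keys[n]   # the name stored at the very end of the vector
reps <- 1024      # number of lookups, as in the talk

# Time the repeated lookups; user.self is the CPU time spent in R code.
time.vector <- system.time(for (r in seq_len(reps)) v[[last]])[["user.self"]]
time.env    <- system.time(for (r in seq_len(reps)) e[[last]])[["user.self"]]
```

With so few repetitions the timings can be too small to compare reliably; raising reps or n makes differences easier to see.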
Optimizing Lookup Speed
How to write efficient code:
1. Write code for clarity, not speed.
2. Check to see if the code is fast enough. If it is fast enough, stop.
3. Test your code to find where time is being spent.
4. Fix the parts of your code that are taking the most time.
5. Go to step 2.
Optimizing Lookup Speed
How do you make lookups fast?
• Lookups by position are fastest.
• If you have to look up single values by name, write your code with double brackets.
• Double-bracket lookups are a little faster than single-bracket lookups.
• If you discover that your code is too slow, you can easily change from vectors to environments.
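Apart from speed, the two bracket forms also behave differently, which matters when switching between them. A short sketch using the vector b from the earlier slides; "Ted" is just a name that is not in the vector.

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

# Single brackets return a (named) vector of length one.
single <- b["Bob"]

# Double brackets return the bare value, without the name.
double <- b[["Bob"]]

# For a name that is not present, single brackets give NA...
absent <- b["Ted"]

# ...while double brackets stop with an error on a named vector.
failed <- inherits(try(b[["Ted"]], silent = TRUE), "try-error")
```

Double-bracket code also carries over unchanged to environments, since environments support the double bracket operator.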
Optimizing Lookup Speed
What if:
• Your code is too slow
• You need to look up values by name
• It would be hard to change your code to use double-bracket notation
Define a bracket operator for environments!
Optimizing Lookup Speed
Remember that everything in R is a function, even lookup operators. Example code:
> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob 
  2 
Translation of the example code:
> as.list(quote(b["Bob"]))
[[1]]
`[`

[[2]]
b

[[3]]
[1] "Bob"
Optimizing Lookup Speed R translates b["B"] to `[`(b, "B")
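Because the bracket operator is an ordinary function, it can be called by name or handed to other functions. A brief sketch; the variable names are introduced here for illustration.

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

# The operator form and the function-call form give the same result.
by.operator <- b["Bob"]
by.function <- `[`(b, "Bob")

# The operator can also be passed as an argument, here to pick the
# second element of every vector in a list.
seconds <- sapply(list(c(1, 2), c(3, 4)), `[`, 2)
```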
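Optimizing Lookup Speed
Here is the code for our new subset function:
`[` <- function(x, ...) {
  if (is.environment(x)) {
    # environments: treat the first subscript as a name and fetch its value
    get(x=..1, envir=x)
  } else {
    # everything else: forward all subscripts (and drop) unchanged
    # to the built-in operator
    .Primitive("[")(x, ...)
  }
}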
Optimizing Lookup Speed
Assignments through bracket notation are a little funny. For example, R evaluates x[3:5] <- 13:15 as if this code had been executed:
`*tmp*` <- x
x <- "[<-"(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)
Optimizing Lookup Speed
Here is the code for our new subset assignment function:
`[<-` <- function(x, ..., value) {
  if (is.environment(x)) {
    assign(x=..1, value=value, envir=x)
    # the assign statement returns value,
    # but we want to return the environment:
    x
  } else {
    # everything else: forward all subscripts to the built-in operator
    .Primitive("[<-")(x, ..., value=value)
  }
}
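A less invasive variation, which is not the approach taken in the talk, is to tag an environment with an S3 class and define bracket methods for that class only, leaving the global bracket operators untouched. In the sketch below the class name lookupenv, the constructor make.lookup and the data are all hypothetical.

```r
# Hypothetical constructor: an environment tagged with an S3 class.
make.lookup <- function() {
  e <- new.env(hash = TRUE)
  class(e) <- "lookupenv"
  e
}

# Single-bracket read for this class only: tbl["name"]
`[.lookupenv` <- function(x, i) {
  get(i, envir = x, inherits = FALSE)
}

# Single-bracket write for this class only: tbl["name"] <- value
`[<-.lookupenv` <- function(x, i, value) {
  assign(i, value, envir = x)
  x  # return the environment itself so the assignment form works
}

prices <- make.lookup()
prices["Joe"] <- 15.75
joe <- prices["Joe"]
```

The trade-off is that only objects created through the constructor get the bracket behavior; plain environments are unaffected.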
How to reach me
twitter: @jadler
http://www.linkedin.com/in/josephadler
baseballhacks@gmail.com
Backup Slides
A function to test the performance of a lookup function on an object:
test_expressions <- function(description, fun, data, reps) {
  cat(paste(description, "\n"))
  results <- vector()
  for (n in names(data)) {
    # each name in data is the number of elements in that data set;
    # record the user CPU time taken by fun on it
    results[[n]] <- system.time(
      fun(data[[n]], as.integer(n), reps)
    )[["user.self"]]
  }
  print(results)
}
To figure out the full argument list for the bracket operator, use the getGeneric function:
> getGeneric("[")
standardGeneric for "[" defined from package "base"

function (x, i, j, ..., drop = TRUE) 
standardGeneric("[", .Primitive("["))
<environment: 0x11a6828>
Methods may be defined for arguments: x, i, j, drop
Use  showMethods("[")  for currently available ones.
In general, you should set new methods with the setMethod function. Example:
setClass("myenv", representation(e="environment"))
setMethod("[",
  signature(x="myenv", i="character", j="missing"),
  function(x, i, j, ..., drop=TRUE) {
    get(x=i, envir=x@e)
  })
Unfortunately, R doesn't let you redefine these operators for environments, so we have to do something trickier.

  • 1. Fast lookups in R Joseph AdlerApril 13 2010
  • 2. About me Relevant work Tasks Computer security research Credit risk modeling Pricing strategy Direct marketing Places American Express Johnson and Johnson DoubleClick VeriSign LinkedIn (now)
  • 4. Today’s talk What I wrote If you need to store a big lookup table, consider implementing the table using an environment. Environment objects are implemented using hash tables. Vectors and lists are not. This means that looking up a value with n element in a list can take O(n) time. Looking up the value in an environment object takes O(1) time on average
  • 5. Today’s talk What I read after the book was printed Re: [R] beginner Q: hashtable or dictionary? From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk> Date: Mon 30 Jan 2006 - 18:37:00 ESTOn Sun, 29 Jan 2006, hadleywickham wrote:>> use a 'list': 
> 
> Is a list O(1) for setting and getting?Can you elaborate? R is a vector language, and normally you create a list in one pass, and you can retrieve multiple elements at once.Retrieving elements by name from a long vector (including a list) is very fast, as an internal hash table is used.Does the following item from ONEWS answer your question? Indexing a vector by a character vector was slow if both the vector and index were long (say 10,000). Now hashing is used and the time should be linear in the longer of the lengths (but more memory is used). Indexing by number is O(1) except where replacement causes the list vector to be copied. There is always the option to use match() to convert to numeric indexing. -- Brian D. Ripley, Professor of Applied Statistics, University of Oxford Retrieving elements by name from a long vector (including a list) is very fast, as an internal hash table is used. Professor Brian D. Ripley
  • 6. Today’s talk A short introduction to objects in R Looking up values in R How lookup tables are implemented in R Measuring lookup speed Optimizing lookup speed
  • 7. Objects in R Everything in R is an object. Here are some examples of objects. Numeric Vector: > onehalf <- 1/2 > class(onehalf) [1] "numeric”
  • 8. Objects in R Integer Vector: > four <- as.integer(4) > four [1] 4 > class(four) [1] "integer”
  • 9. Objects in R Character vector: > zero <- "zero" > class(zero) [1] "character”
  • 10. Objects in R Logical vector: > this.is.interesting <- FALSE > class(this.is.interesting) [1] "logical"
  • 11. Objects in R Vectors can have multiple elements > one.to.five <- 1:5 > class(one.to.five) [1] "integer" > six.to.ten <- c(6, 7, 8, 9, 10) > class(six.to.ten) [1] "numeric"
  • 12. Objects in R Lists contain heterogeneous collections of objects > stuff <- list(3.14, "hat", FALSE) > class(stuff) [1] "list"
  • 13. Objects in R Functions are also objects in R: > f <- function(x, y) {+ x + y+ }> ffunction(x, y) { x + y}> class(f)[1] "function"
  • 14. Objects in R Environments map names to objects. They are used within R itself to map variable names to objects. You can access these environment objects, or create your own. > one <- 1 > two <- 2 > three <- 3 > objects() [1] "one" "three" "two" > e <- .GlobalEnv > class(e) [1] "environment" > objects(e) [1] "e" "one" "three" "two"
  • 15. Lookups You can look up an item in a vector, list, or array within R Let’s define a vector:> a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)> a[1] 1 2 3 4 5 6 7 8 9 10 You can refer to elements by index:> a[3][1] 3
  • 16. Lookups It's also possible to name elements in a vector, then refer to them by name: > b <- c(Joe=1, Bob=2, Jim=3)> b["Bob"]Bob This can be very convenient: you can use every vector in R as a table. You can access the name vector through the names function: > names(b)[1] "Joe" "Bob" "Jim"
  • 17. Lookups Named vectors in R are implemented using two different arrays: a.20 names(a.20)
  • 18. Lookups The name lookup algorithm works roughly like this: function(vector, name) { for (i in 1:length(vector)) { if (names(vector)[i] == name) return vector[i] } return NA
  • 19. Lookups Example: Look up a.20[“F”] a.20 names(a.20)
  • 20. Lookups Example: Look up a.20[“F”] a.20 names(a.20) names(a.20)[1]
  • 21. Lookups Example: Look up a.20[“F”] a.20 names(a.20) names(a.20)[2]
  • 22. Lookups Example: Look up a.20[“F”] a.20 names(a.20) names(a.20)[4]
  • 23. Lookups Example: Look up a.20[“F”] a.20 names(a.20) names(a.20)[4]
  • 24. Lookups Example: Look up a.20[“F”] a.20 names(a.20) names(a.20)[5]
  • 25. Lookups Example: Look up a.20[“F”] a.20 names(a.20) names(a.20)[5]
  • 26. Lookups In vectors, Looking up a value by index takes a constant amount of time. Looking up a value by name (potentially) requires looking at every name in the names array. (This means that lookup times scale linearly with the number of items in the table.)
  • 27. Lookups Environments store (and fetch) data using a different structure. They use hash tables. Hash tables rely on a hash function to map labels to indices.
  • 28. Lookups Simple hash table implementation Example: store 15 ¾ for “Joe” Calculate h(“Joe”) Store 15 ¾ in thetable in slot h(“Joe”) h(“Joe”) = 4
  • 29. Lookups If you carefully choose the size of the hash table and the hash function, you can store and lookup values in constant time (on average) in hash tables.
  • 30. Measuring Lookup Speed In theory, looking up values in environments should be faster than looking up values in vectors. In practice, how much difference does this make? Let’s measure how much time it takes to look up values in vectors and environments, using different lookup methods
  • 31. Measuring Lookup Speed Let's build a large, labeled vector for testing: labeled.array<- function(n) {a <- 1:nfrom <- “1234567890"to <- "ABCDEFGHIJ"for (i in 1:n) {names(a)[i] <- chartr(from, to, i) }a } Here's an example of the output of this function:> a.20 <- labeled.array(20)> a.20A B C D E F G H I AJ AA AB AC AD AE AF AG AH AI BJ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
  • 32. Measuring Lookup Speed Let's also create environment objects for testing:labeled.environment <- function(n) {e <- new.env(hash=TRUE, size=n) from <- "1234567890” to <- "ABCDEFGHIJ” for (i in 1:n) {assign(x=chartr(from, to, i), value=i, envir=e) }e} Here’s an example of the output of this function: > e.20 <- labeled.environment(20) > e.20 <environment: 0x143756c>
  • 33. Measuring Lookup Speed You can fetch values from an environment object with the get function > get("A",envir=e.20)[1] 1> get("BA",envir=e.20)[1] 20 You can also fetch values from an environment with the double bracket operator > e.20[["A"]][1] 1> e.20[["BA"]][1] 20
  • 34. Measuring Lookup Speed Creating examples for testingarrays <- list()for (i in 10:15) {arrays[[as.character(2 ** i)]] <-labeled.array(2 ** i)}environments <- list()for (i in 10:15) {environments[[as.character(2 ** i)]] <-labeled.environment(2 ** i)}
  • 35. Measuring Lookup Speed Using the test function:test_expressions("first element, by index:",function(d,l,r) {s<- 0 for (v in 1:r) {s<- s + d[1] }},arrays, 1024) Output:first element, by index:1024 2048 4096 8192 16384 327680.010 0.003 0.004 0.003 0.005 0.004
  • 36. Measuring Lookup Speed Results for 1024 lookups:
  • 37. Measuring Lookup Speed Results for 1024 lookups: Notice that these values increase linearly with the number of elements in the array
  • 38. Measuring Lookup Speed Results for 1024 lookups: Let’s focus on the results for the largest arrays (which are the most precise)
  • 39. Measuring Lookup Speed Results for 1024 lookups, 32768 elements:
  • 40. Optimizing Lookup Speed
      How to write efficient code:
        1. Write code for clarity, not speed.
        2. Check to see if the code is fast enough. If it is fast enough, stop.
        3. Test your code to find where time is being spent.
        4. Fix the parts of your code that are taking the most time.
        5. Go to step 2.
  • 41. Optimizing Lookup Speed
      How do you make lookups fast?
        - Lookups by position are fastest.
        - If you have to look up single values by name, write your code with double brackets.
        - Double-bracket lookups are a little faster than single-bracket lookups.
        - If you discover that your code is too slow, you can easily change from vectors to environments.
  • 42. Optimizing Lookup Speed
      What if:
        - Your code is too slow
        - You need to look up values by name
        - It would be hard to change your code to use double-bracket notation
      Define a bracket operator for environments!
  • 43. Optimizing Lookup Speed
      Remember that every operation in R is a function call, even lookup operators. Example code:
        > b <- c(Joe=1, Bob=2, Jim=3)
        > b["Bob"]
        Bob
          2
  • 44. Optimizing Lookup Speed
      Translation of the example code:
        > b["Bob"]
        Bob
          2
        > as.list(quote(b["Bob"]))
        [[1]]
        `[`

        [[2]]
        b

        [[3]]
        [1] "Bob"
  • 45. Optimizing Lookup Speed R translates b["B"] to `[`(b, "B")
  • 46. Optimizing Lookup Speed
      Here is the code for our new subset function:
        `[` <- function(x, i, j, ..., drop=TRUE) {
          if (is.environment(x)) {
            # environments: look the name up in the hash table
            get(x=i, envir=x)
          } else {
            # everything else: fall through to the built-in operator
            .Primitive("[")(x, i, j, ..., drop=drop)
          }
        }
  • 47. Optimizing Lookup Speed
      Assignments through bracket notation are a little funny. For example, R evaluates x[3:5] <- 13:15 as if this code had been executed:
        `*tmp*` <- x
        x <- "[<-"(`*tmp*`, 3:5, value=13:15)
        rm(`*tmp*`)
  • 48. Optimizing Lookup Speed
      Here is the code for our new subset assignment function:
        `[<-` <- function(x, i, j, ..., value) {
          if (is.environment(x)) {
            assign(x=i, value=value, envir=x)
            # the assign statement returns value,
            # but we want to return the environment:
            x
          } else {
            .Primitive("[<-")(x, i, j, ..., value)
          }
        }
  • 49. How to reach me
      twitter: @jadler
      http://www.linkedin.com/in/josephadler
      baseballhacks@gmail.com
  • 51. A function to test the performance of a lookup function on an object:
        test_expressions <- function(description, fun, data, reps) {
          cat(paste(description, "\n"))
          results <- vector()
          for (n in names(data)) {
            # time `reps` lookups against the object with n elements
            results[[n]] <- system.time(
              fun(data[[n]], as.integer(n), reps)
            )[["user.self"]]
          }
          print(results)
        }
  • 52. To figure out the full argument list for the bracket operator, use the getGeneric function:
        > getGeneric("[")
        standardGeneric for "[" defined from package "base"

        function (x, i, j, ..., drop = TRUE)
        standardGeneric("[", .Primitive("["))
        <environment: 0x11a6828>
        Methods may be defined for arguments: x, i, j, drop
        Use showMethods("[") for currently available ones.
  • 53. In general, you should set new methods with the setMethod function. Example:
        setClass("myenv", representation(e="environment"))
        setMethod("[",
          signature(x="myenv", i="character", j="missing"),
          function(x, i, j, ..., drop=TRUE) {
            get(x=i, envir=x@e)
          })
      Unfortunately, R doesn’t let you redefine these operators for environments, so we have to do something trickier.
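
The wrapper-class idea from the last slide can be sketched end to end. The block below is only an illustration under stated assumptions: the class name `myenv` and the `[` method come from the slide, while the replacement method defined with `setReplaceMethod` and the sample keys are additions that do not appear in the talk.

```r
library(methods)

# Wrapper class from the slide: an S4 object holding an environment.
setClass("myenv", representation(e = "environment"))

# Lookup by name, as on the slide.
setMethod("[",
  signature(x = "myenv", i = "character", j = "missing"),
  function(x, i, j, ..., drop = TRUE) {
    get(x = i, envir = x@e)
  })

# Assumption (not in the slides): a matching replacement method,
# so that m["key"] <- value stores into the wrapped environment.
setReplaceMethod("[",
  signature(x = "myenv", i = "character", j = "missing"),
  function(x, i, j, ..., value) {
    assign(x = i, value = value, envir = x@e)
    x  # return the wrapper, not the assigned value
  })

m <- new("myenv", e = new.env(hash = TRUE))
m["Bob"] <- 2
m["Jim"] <- 3
bob <- m["Bob"]
```

Because the environment inside the wrapper is modified by reference, the object returned by the replacement method is the same wrapper that was passed in. Existing code that uses single-bracket lookups by name can then work against a `myenv` object without masking the base operators.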

Editor's Notes

  1. I have about fifteen years of experience in data mining and data analysis. I’ve worked in a variety of industries: financial services, pharmaceuticals, internet companies.
  2. And I’ve written a couple of books on data analysis. Today’s talk isn’t about a subject in either book, but it is inspired by a passage in the second book.
  3. Before I start today’s talk, I want to explain to you why I’m talking about this topic. In my book, one of my chapters is devoted to performance tips. One of my performance tips was about how to quickly look up a value in a table of values.
  4. Then, I was reading through some old comments on R mailing lists and ran into this message. How many people in the room own a copy of this book? <Pick up MASS book> (For those who don’t, how many have used the MASS library?) So, the guy who wrote this email is the guy who wrote this book (and the MASS package). This made me feel really nervous that I had written something incorrect, so I decided to take a closer look at how tables are implemented in R. Today, I’m going to tell you about how lookups in R work, how I tested their performance, and how you can use this information to help you write faster R code.
  5. Today, I’m going to tell you the story of how I tested the performance of different lookup methods in R. I’m going to give a short introduction to different types of objects in R, then explain to you how I tested performance (testing performance used some interesting features in R). Next, I will tell you about the results. And if you’re all still awake, I will tell you how to optimize your program’s performance.
  6. Everything in R is an object. We will start by looking at a few simple data types in R. The data type that you will probably encounter most frequently in R is the numeric vector. Numeric vectors represent numeric values. The class function tells you the class of an object; the class tells R what methods (or functions) can be applied to an object.
  7. Here is another example of a data type in R: integers. Notice that I use the function as.integer to explicitly request an integer. If you were to just type 4, R would return a numeric value.
  8. Here is another important example of an object. Character vectors represent text values. In many other languages, these are called strings.
  9. Another example data type is the logical vector. All of the examples so far have been vectors with one element. But of course, vectors can have multiple elements. Let’s look at a couple of examples.
  10. The colon operator is used to define a sequence of values. It always returns integers. (A trick to return a single integer is to just have a range from one value to itself.) The combine function (“c”) is used to combine a set of values together into a vector.
  11. If you need to represent a heterogeneous collection of objects, you can use a list. A very common type of list is a data frame. Data frames are like database tables (or tables in Excel); they contain multiple columns representing different variables in a data set.
  12. Everything in R is an object. Even functions.
  13. Let’s move on to another important type of object. If you work with R, you have probably used vectors and lists. You have also used environment objects, but you may not have realized it. At any time in R, there is a set of objects that you can access. You may have given these objects names. R represents these relationships as environments. In the example session that I show here, I created three objects, named “one”, “two” and “three”. R stored information mapping these names to these values in an environment called the global environment. I assigned the symbol “e” to point to the global environment (environments are just objects, like everything else in R). Then I showed the class of “e”. I also used the objects function to show the objects defined in this environment. Notice that the objects include one, two, three, and e.
  14. Now, let’s talk about how you look up a value in an object in R. To do this, we’ll define a simple example vector. Here, I defined a vector named “a” with ten values. You can use the bracket operator to refer to a specific location. In this example, I looked up the third item in a, which was the value 3.
  15. (next page shows algorithm)(then walks through example)
  16. As an example, we will show how R looks up the value with the label “F” in the array “a.20”. To do this, R iterates through each value in the names array to find the index of the correct value. Then R returns the correct value. <next slide>
  17. R looks up the first item in the names array, which does not match.
  18. Then, R looks up the second item and checks if it matches.
  19. R continues to iterate through the names array until it finds the match.
  20. Ah, found the matching value. The index for the match is 6.
  21. Here is a simple example of how hash tables work. I’m leaving out some important details here. Most importantly, I don’t explain what to do when two labels hash to the same value (this is called a hash collision). Nor do I talk about how you choose the hash table size, or the hash function. A full discussion of hash functions is beyond the scope of this talk. (It’s beyond the scope of most algorithms classes!)
  22. Notice that R doesn’t print out environment objects in a friendly way.
  23. For testing, I generated a set of different arrays and environments with between 1024 and 32768 elements. I generated one object for each power of two. <go to next page>
  24. To test the lookup speed, I wrote a function called “test expressions” that would print a message, then time how long it took to apply a function to a set of different sized data objects many times. You can specify the message, the function, the set of data objects, and the number of repetitions (for each object). Notice that this function takes another function as an argument! In the example here, I show how I tested the performance of looking up the first value in each object by index. (I calculated a sum rather than just returning values.)
  25. Here are the results from my tests. How many people think that I should use a chart to present this data? As a show of hands, how many people in this room have read Tufte’s books? How many people raised your hand for both? Seriously, I don’t think that this is enough data to bother plotting. It’s hard to read on the screen (because the type is small), but the trends are so clear that you can see them by just looking at numbers. Let me show you some interesting trends.
  26. First, let’s look at the array lookups by name. Notice that these values increase linearly with the number of elements in the array
  27. Now, let’s focus on the results for the biggest arrays. <change to next slide>
  28. There are two key takeaways. First, looking up a single value in an array (by index), or an environment (by symbol) is very fast, regardless of table size. Next, notice that lookups by name are much, much slower in arrays. The only exception is looking up the first value in an array by double bracket. Double bracket notation is a little faster. So, what does this mean? <turn to next page>
  29. You could always use environment objects instead of vectors to store tables of values. But I think that will lead you to write more code. You should use whatever method is simplest and easiest to implement your program. When you know that it runs correctly, then you can optimize it. Here is the process that I use to write efficient code.
  30. By the way, even R language expressions are objects in R. That’s how I can show how R parses this expression here.
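
The measurement approach described in notes 23 to 28 can be tried on a small scale with the sketch below. It is a rough illustration, not the speaker's benchmark: the helper names `make_keys` and `time_lookup`, the table size, and the repetition count are assumptions chosen for this example, and the labels are built in one vectorized call rather than in a loop. Timings vary by machine and R version, so no expected numbers are shown.

```r
# Assumed helper: same digit-to-letter labeling scheme as the slides,
# applied to all n labels at once.
make_keys <- function(n) {
  chartr("1234567890", "ABCDEFGHIJ", sprintf("%d", seq_len(n)))
}

n <- 4096
keys <- make_keys(n)

# A labeled vector and a hashed environment holding the same table.
vec <- setNames(seq_len(n), keys)
env <- list2env(as.list(vec), envir = new.env(hash = TRUE, size = n))

last <- keys[n]  # label of the last element

# The three lookup styles should return the same value.
by_position <- vec[[n]]
by_name     <- vec[[last]]
by_env      <- env[[last]]

# Assumed helper: user time for `reps` repetitions of one lookup.
reps <- 1000
time_lookup <- function(f) {
  system.time(for (k in seq_len(reps)) f())[["user.self"]]
}

timings <- c(
  position    = time_lookup(function() vec[[n]]),
  name        = time_lookup(function() vec[[last]]),
  environment = time_lookup(function() env[[last]])
)
print(timings)
```

Looking up the last label is used here so that a name lookup in the vector has the most names to get past; increasing `n` and `reps` makes any difference between the three styles easier to see.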