Streaming Data,
Concurrency And R

     Rory Winston

   rory@theresearchkitchen.com
About Me




      Independent Software Consultant
      M.Sc. Applied Computing, 2000
      M.Sc. Finance, 2008
      Apache Committer
      Working in the financial sector for the last 7 years or so
      Interested in practical applications of functional languages and
      machine learning
      Relatively recent convert to R ( ≈ 2 years)
R - Pros and Cons


   Pro
         Designed by statisticians
         Can be extremely elegant
         Comprehensive extension library
         Open-source
         Huge parallelization effort
         Fantastic reporting capabilities
         Incredibly Popular

   Con
         Designed by statisticians
         Can be clunky (S4)
         Bewildering array of overlapping extensions
         Inherently single-threaded
         Incredibly Popular
Parallelization vs. Concurrency



        R interpreter is single threaded
        Some historical context for this (BLAS implementations)
        Not necessarily a limitation in the general context
        Multithreading can be complex and problematic
        Instead a focus on parallelization:
             Distributed computation: gridR, nws, snow
             Multicore/multi-cpu scaling: Rmpi, Romp, pnmath/pnmath0
             Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc.
        Parallelization suits cpu-bound large data processing
        applications
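
A minimal sketch of the parallelization style described above, for a CPU-bound job split across worker processes. It uses the `parallel` package, which ships with current R and grew out of snow; that package postdates these slides, so treat it as a substitution rather than what the talk used. `simulate_mean` is a made-up stand-in for an expensive, independent task.

```r
library(parallel)

# Hypothetical stand-in for an expensive, independent computation.
simulate_mean <- function(seed) {
  set.seed(seed)
  mean(rnorm(1e5))
}

# Start two worker R processes on this machine (snow-style socket cluster).
cl <- makeCluster(2)

# Each seed is handled by whichever worker is free; results come back in order.
results <- parSapply(cl, 1:8, simulate_mean)
stopCluster(cl)

# The same work done serially in the single interpreter thread, for comparison.
serial <- sapply(1:8, simulate_mean)
```

Note that the workers are separate R processes, not threads: nothing is shared, and inputs and results are copied between them.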
Other Scalability and Performance Work




        JIT/bytecode compilation (Ra)
        Implicit vectorization a la Matlab (code analysis)
        Large (≥ RAM) dataset handling (bigmemory,ff)
        Many incremental performance improvements (e.g. less
        internal copying)
        Next: GPU/massive multicore...?
What Benefit Concurrency?




       Real-time (streaming to be more precise) data analysis
       Growing interest in using R for streaming data, not just offline
       analysis
       GUI toolkit integration
       Fine-grained control over independent task execution
       "I believe that explicit concurrency management tools (i.e. a
       threads toolkit) are what we really need in R at this point." -
       Luke Tierney, 2001
Will There Be A Multithreaded R?


        Short answer is: probably not
        At least not in its current incarnation
        Internal workings of the interpreter not particularly amenable
        to concurrency:
            Functions can manipulate caller state (<<- vs. <-)
            Lazy evaluation machinery (promises)
            Dynamic State, garbage collection, etc.
            Scoping: global environments
            Management of resources: streams, I/O, connections, sinks
        Implications for current code
        Possibly in the next language evolution (cf. Ihaka?)
        Large amount of work (but potentially do-able)
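
Two of the points above, caller-state manipulation and lazy evaluation, can be seen in a few lines of base R. This is only an illustration of the language semantics; the function and variable names are invented for the example.

```r
# `<-` binds in the function's own frame; `<<-` rebinds a variable in an
# enclosing environment, so a call can silently change state outside itself.
counter <- 0

bump_local <- function() {
  counter <- counter + 1     # local copy only; the outer counter is untouched
  counter
}

bump_enclosing <- function() {
  counter <<- counter + 1    # modifies the variable outside the function
  counter
}

local_result <- bump_local()
after_local <- counter           # still 0
enclosing_result <- bump_enclosing()
after_enclosing <- counter       # now 1

# Promises: an argument expression is evaluated only when first used, so its
# side effects happen at some later point inside the callee, or never.
events <- character()

lazy <- function(x, use_it) {
  events <<- c(events, "body started")
  if (use_it) x                  # forces the promise here
  invisible(NULL)
}

lazy({ events <- c(events, "argument evaluated"); 1 }, use_it = TRUE)
lazy({ events <- c(events, "argument evaluated"); 1 }, use_it = FALSE)

print(events)
```

With several threads evaluating code like this at once, the order of these hidden writes would no longer be predictable.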
Example Application




        Based on work I did last year and presented at UseR! 2008
        Wrote a real-time and historical market data service from
        Reuters/R
        The real-time interface used the Reuters C++ API
        R extension in C++ that spawned listening thread and
        handled updates
Simplified Architecture




                                R


                         extension (C++)



                           realtime bus
Example Usage



          rsub <- function(duration, items, callback)


   The call rsub will subscribe to the specified rate(s) for the duration
   of time specified by duration (ms). When a tick arrives, the
   callback function callback is invoked, with a data frame
   containing the fields specified in items.

   Multiple market data items may be subscribed to, and any
   combination of fields may be specified.

   Uses the underlying RFA API, which provides a C++ interface to
   real-time market updates.
Real-Time Example


   # Specify field names to retrieve
   fields <- c("BID","ASK","TIMCOR")

   # Subscribe to EUR/USD and GBP/USD ticks
   items <- list()
   items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
   items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)

   # Simple Callback Function
   callback <- function(df) { print(paste("Received",df)) }

   # Subscribe for 1 hour
   ONE_HOUR <- 1000*(60)^2
   rsub(ONE_HOUR, items, callback)
Issues With This Approach




        As R interpreter is single threaded, cannot spawn thread for
        callbacks
        Thus, interpreter thread is locked for the duration of
        subscription
        Not a great user experience
        Need to find alternative mechanism
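
The blocking behaviour can be reproduced without a market data feed. The sketch below is a pure-R mock, not the C++ extension from the talk: `mock_rsub` and its `interval` argument are hypothetical, and ticks are random numbers rather than RFA updates. It only mirrors the calling convention of `rsub(duration, items, callback)`.

```r
# Hypothetical stand-in for rsub(): simulates ticks instead of calling RFA.
mock_rsub <- function(duration, items, callback, interval = 100) {
  deadline <- Sys.time() + duration / 1000     # duration is in milliseconds
  n <- 0L
  while (Sys.time() < deadline) {
    for (item in items) {
      # item = c(service, instrument, field names...)
      fields <- item[-(1:2)]
      values <- setNames(round(runif(length(fields)), 4), fields)
      tick <- as.data.frame(as.list(values))   # one row, one column per field
      tick$ITEM <- item[2]
      callback(tick)
      n <- n + 1L
    }
    Sys.sleep(interval / 1000)                 # wait for the next simulated tick
  }
  invisible(n)                                 # number of callbacks made
}

# Collect ticks in a list; nothing else can run until mock_rsub() returns.
received <- list()
collect <- function(df) received[[length(received) + 1]] <<- df

started <- Sys.time()
mock_rsub(300, list(c("IDN_SELECTFEED", "EUR=", "BID", "ASK")), collect)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```

The call returns only after the full 300 ms has passed, which is the same lock-up the interpreter experiences for the length of a real subscription.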
Alternative Approach



        If we cannot run subscriber threads in-process, need to
        decouple
        Standard approach: add an extra layer and use some form of
        IPC
        For instance, we could:
            Subscribe in a dedicated R process (A)
            Push incoming data onto a socket
            R process (B) reads from a listening socket
        Sockets could also be another IPC primitive, e.g. pipes
        Also note that R supports asynchronous I/O (?isIncomplete)
        Look at the IBrokers package for examples of this
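
The three steps above can be sketched with base R socket connections. Everything specific here is an assumption made for illustration: the port number, the comma-separated tick format, and launching process A as a background `Rscript` from process B so the example fits in one file. In practice the two sessions would be started separately.

```r
port <- 6011L   # hypothetical free port on localhost

# Code for process A: connects to B's listening socket and pushes three ticks.
producer <- c(
  "port <- as.integer(commandArgs(trailingOnly = TRUE)[1])",
  "Sys.sleep(1)  # give process B time to start listening",
  "con <- socketConnection('localhost', port = port, blocking = TRUE, open = 'w')",
  "for (i in 1:3) writeLines(paste('EUR=', i, 1.4 + i / 1000, sep = ','), con)",
  "close(con)"
)
script <- tempfile(fileext = ".R")
writeLines(producer, script)

# Start process A in the background; this call returns immediately.
system2(file.path(R.home("bin"), "Rscript"), c(shQuote(script), port), wait = FALSE)

# Process B (this session): listen, accept A's connection, read the ticks.
con <- socketConnection(port = port, server = TRUE, blocking = TRUE, open = "r")
ticks <- readLines(con, n = 3)
close(con)
unlink(script)

print(ticks)
```

This reader blocks until data arrives. Opening the connection with `blocking = FALSE` and polling would let session B do other work between reads, which is where `isIncomplete` comes in.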
The bigmemoRy package



       From the description: "Use C++ to create, store,
       access, and manipulate massive matrices"
       Allows creation of large matrices
       These matrices can be mapped to files/shared memory
       It is the shared memory functionality that we will use
       The next version (3.0) will be unveiled at UseR! 2009

   big.matrix(nrow, ncol, type = "integer", ...)
   shared.big.matrix(nrow, ncol, type = "integer", ...)
   filebacked.big.matrix(nrow, ncol, type = "integer", ...)
Sample Usage




   > library(bigmemory) # Note: I'm using pre-release
   > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
   > X
   An object of class “big.matrix”
   Slot "address":
   <pointer: 0x7378a0>
Create Shared Memory Descriptor

   > desc <- describe(X)
   > desc
   $sharedType
   [1] "SharedMemory"

   $sharedName
   [1] "53f14925-dca1-42a8-a547-e1bccae999ce"

   $nrow
   [1] 1000

   $ncol
   [1] 1000

   $rowNames
   NULL
Export the Descriptor




    In R session 1:

    > dput(desc, file="~/matrix.desc")

    In R session 2:

    > library(bigmemory)
    > desc <- dget("~/matrix.desc")
    > X <- attach.big.matrix(desc)

    Now R sessions 1 and 2 share the same big.matrix instance
Share Data Between Sessions




   R session 1:

   > X[1,1] <- 1.2345

   R session 2:

   > X[1,1]
   [1] 1.2345

   Thus, streaming data can be continuously fed into session 1
   And concurrently processed in session 2
Summary




      Lack of threads not a barrier to concurrent analysis
      Packages like bigmemory, nws, etc. facilitate decoupling via
      IPC
      nws goes a step further, with a distributed workspace
      Many applications for streaming data:
          Data collection/monitoring
          Development of pricing/risk algorithms
          Low-frequency execution (??)
          ...
References




        http://cran.r-project.org/web/packages/bigmemory/
        http://www.cs.uiowa.edu/~luke/R/thrgui/
        http://www.milbo.users.sonic.net/ra/index.html
        http://www.cs.kent.ac.uk/projects/cxxr/
        http://www.theresearchkitchen.com/blog

Mais conteúdo relacionado

Destaque

Kat01 2012
Kat01 2012Kat01 2012
Kat01 2012hekama
 
conroling slides by sohar bakhsh
conroling slides by sohar bakhshconroling slides by sohar bakhsh
conroling slides by sohar bakhshSohar Bakhsh
 
Equine Emergencies Part 4
Equine Emergencies Part 4Equine Emergencies Part 4
Equine Emergencies Part 4Ernie Martinez
 
7. susret 17.11.2011. konkretno lice boga oca
7. susret 17.11.2011.   konkretno lice boga oca7. susret 17.11.2011.   konkretno lice boga oca
7. susret 17.11.2011. konkretno lice boga ocaMeri-Lucijeta
 
Figurative Painter - Vicente Romero Redondo
Figurative Painter - Vicente Romero RedondoFigurative Painter - Vicente Romero Redondo
Figurative Painter - Vicente Romero RedondoMakala (D)
 
Best kitchen knives
Best kitchen knivesBest kitchen knives
Best kitchen knivesbestkit3
 
5 Worst States for Identity Theft
5 Worst States for Identity Theft5 Worst States for Identity Theft
5 Worst States for Identity TheftIDT911
 
Food Combining For Beginners.
Food Combining For Beginners.Food Combining For Beginners.
Food Combining For Beginners.mikefouse
 
Market advertizing
Market advertizingMarket advertizing
Market advertizingSohar Bakhsh
 
Unit 1-vocab jarod f
Unit 1-vocab jarod fUnit 1-vocab jarod f
Unit 1-vocab jarod fjarodf2238
 
Safety Meeting Starter (SMS) Aug 2012
Safety Meeting Starter (SMS) Aug 2012Safety Meeting Starter (SMS) Aug 2012
Safety Meeting Starter (SMS) Aug 2012safestrat
 
Unit fourteen will future
Unit fourteen will futureUnit fourteen will future
Unit fourteen will futurewedaa23
 

Destaque (17)

Pti finish
Pti finishPti finish
Pti finish
 
Yoleo
YoleoYoleo
Yoleo
 
Kat01 2012
Kat01 2012Kat01 2012
Kat01 2012
 
Inventario
InventarioInventario
Inventario
 
conroling slides by sohar bakhsh
conroling slides by sohar bakhshconroling slides by sohar bakhsh
conroling slides by sohar bakhsh
 
Equine Emergencies Part 4
Equine Emergencies Part 4Equine Emergencies Part 4
Equine Emergencies Part 4
 
7. susret 17.11.2011. konkretno lice boga oca
7. susret 17.11.2011.   konkretno lice boga oca7. susret 17.11.2011.   konkretno lice boga oca
7. susret 17.11.2011. konkretno lice boga oca
 
Figurative Painter - Vicente Romero Redondo
Figurative Painter - Vicente Romero RedondoFigurative Painter - Vicente Romero Redondo
Figurative Painter - Vicente Romero Redondo
 
slideshow_funerals
slideshow_funeralsslideshow_funerals
slideshow_funerals
 
Best kitchen knives
Best kitchen knivesBest kitchen knives
Best kitchen knives
 
5 Worst States for Identity Theft
5 Worst States for Identity Theft5 Worst States for Identity Theft
5 Worst States for Identity Theft
 
Food Combining For Beginners.
Food Combining For Beginners.Food Combining For Beginners.
Food Combining For Beginners.
 
Market advertizing
Market advertizingMarket advertizing
Market advertizing
 
Unit 1-vocab jarod f
Unit 1-vocab jarod fUnit 1-vocab jarod f
Unit 1-vocab jarod f
 
Safety Meeting Starter (SMS) Aug 2012
Safety Meeting Starter (SMS) Aug 2012Safety Meeting Starter (SMS) Aug 2012
Safety Meeting Starter (SMS) Aug 2012
 
Why Is Tympanometry Performed?
Why Is Tympanometry Performed?Why Is Tympanometry Performed?
Why Is Tympanometry Performed?
 
Unit fourteen will future
Unit fourteen will futureUnit fourteen will future
Unit fourteen will future
 

Semelhante a Streaming Data and Concurrency in R

Building Europeana - The Rivers
Building Europeana - The RiversBuilding Europeana - The Rivers
Building Europeana - The RiversEuropeana
 
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...Flexsin
 
IIIF: International Image Interoperability Framework @ DLF2012
IIIF: International Image Interoperability Framework @ DLF2012IIIF: International Image Interoperability Framework @ DLF2012
IIIF: International Image Interoperability Framework @ DLF2012Tom-Cramer
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Workflows in the Virtual Observatory
Workflows in the Virtual ObservatoryWorkflows in the Virtual Observatory
Workflows in the Virtual ObservatoryJose Enrique Ruiz
 
Python for Data Engineering: Why Do Data Engineers Use Python?
Python for Data Engineering: Why Do Data Engineers Use Python?Python for Data Engineering: Why Do Data Engineers Use Python?
Python for Data Engineering: Why Do Data Engineers Use Python?hemayadav41
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseDATAVERSITY
 
Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...Gautier Poupeau
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFiHortonworks
 
(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel ArchitecturesJoel Falcou
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
End-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics ZooEnd-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics ZooJason Dai
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Sigmoid
 
Open Compute and the History of the Open Source Data Center
Open Compute and the History of the Open Source Data CenterOpen Compute and the History of the Open Source Data Center
Open Compute and the History of the Open Source Data CenterCole Crawford
 
Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)lennartkats
 
Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution Analytics
 

Streaming Data and Concurrency in R

  • 1. Streaming Data, Concurrency And R Rory Winston rory@theresearchkitchen.com
  • 2. About Me
      Independent Software Consultant
      M.Sc. Applied Computing, 2000
      M.Sc. Finance, 2008
      Apache Committer
      Working in the financial sector for the last 7 years or so
      Interested in practical applications of functional languages and machine learning
      Relatively recent convert to R (≈ 2 years)
  • 3. R - Pros and Cons
      Pro
        Designed by statisticians
        Can be extremely elegant
        Comprehensive extension library
        Open-source
        Huge parallelization effort
        Fantastic reporting capabilities
        Incredibly Popular
      Con
        Designed by statisticians
        Can be clunky (S4)
        Bewildering array of overlapping extensions
        Inherently single-threaded
        Incredibly Popular
  • 18. Parallelization vs. Concurrency
      R interpreter is single threaded
      Some historical context for this (BLAS implementations)
      Not necessarily a limitation in the general context
      Multithreading can be complex and problematic
      Instead a focus on parallelization:
        Distributed computation: gridR, nws, snow
        Multicore/multi-CPU scaling: Rmpi, Romp, pnmath/pnmath0
        Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc.
      Parallelization suits CPU-bound large data processing applications
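As a rough illustration of the CPU-bound parallelization described on this slide, the sketch below uses the `parallel` package bundled with current R releases. That choice is an assumption: the slides name gridR, nws, snow and Rmpi, not `parallel`. The `estimate_pi` function and the worker count are invented for the example.

```r
library(parallel)

# Toy CPU-bound task: Monte Carlo estimate of pi with a fixed seed per chunk,
# so serial and parallel runs produce identical numbers.
estimate_pi <- function(seed, n = 1e5) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

seeds <- 1:4

# mclapply() forks worker processes; forking is unavailable on Windows,
# so fall back to a single worker there.
workers <- if (.Platform$OS.type == "windows") 1L else 2L

serial_result   <- vapply(seeds, estimate_pi, numeric(1))
parallel_result <- unlist(mclapply(seeds, estimate_pi, mc.cores = workers))
```

Each chunk runs in a separate process with no shared state, which is why this style fits batch computation better than reacting to a live data feed.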
  • 19. Other Scalability and Performance Work
      JIT/bytecode compilation (Ra)
      Implicit vectorization a la Matlab (code analysis)
      Large (≥ RAM) dataset handling (bigmemory, ff)
      Many incremental performance improvements (e.g. less internal copying)
      Next: GPU/massive multicore...?
  • 20. What Benefit Concurrency?
      Real-time (streaming, to be more precise) data analysis
      Growing interest in using R for streaming data, not just offline analysis
      GUI toolkit integration
      Fine-grained control over independent task execution
      "I believe that explicit concurrency management tools (i.e. a threads toolkit) are what we really need in R at this point." - Luke Tierney, 2001
  • 21. Will There Be A Multithreaded R?
      Short answer is: probably not
      At least not in its current incarnation
      Internal workings of the interpreter not particularly amenable to concurrency:
        Functions can manipulate caller state (<<- vs. <-)
        Lazy evaluation machinery (promises)
        Dynamic State, garbage collection, etc.
        Scoping: global environments
        Management of resources: streams, I/O, connections, sinks
      Implications for current code
      Possibly in the next language evolution (cf. Ihaka?)
      Large amount of work (but potentially do-able)
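Two of the obstacles listed on this slide can be seen in a few lines of base R. The sketch below is illustrative only; `make_counter`, `ignore_arg` and `force_arg` are names invented here. It shows `<<-` rewriting state outside the calling function, and an argument (a promise) whose side effect happens only if and when it is forced.

```r
# `<-` binds in the current function; `<<-` rebinds the variable found in an
# enclosing environment, so a call can mutate state it does not own.
make_counter <- function() {
  count <- 0
  list(
    increment = function() { count <<- count + 1; invisible(count) },
    shadow    = function() { count <- 100; invisible(count) },  # local only
    value     = function() count
  )
}

counter <- make_counter()
counter$increment()
counter$shadow()      # does not disturb the shared count
counter$increment()
final_count <- counter$value()

# Lazy evaluation: the argument expression is wrapped in a promise and only
# evaluated (in the caller's environment) when the function touches it.
touched <- FALSE
ignore_arg <- function(x) "argument never used"
force_arg  <- function(x) { force(x); "argument forced" }

ignore_arg({ touched <- TRUE; 1 })
touched_after_ignore <- touched

force_arg({ touched <- TRUE; 1 })
touched_after_force <- touched
```

With several threads, both the shared `count` and the timing of the promise's side effect would need synchronization inside the interpreter, which hints at why retrofitting threads is a large job.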
  • 33. Example Application
      Based on work I did last year and presented at UseR! 2008
      Wrote a real-time and historical market data service from Reuters/R
      The real-time interface used the Reuters C++ API
      R extension in C++ that spawned listening thread and handled updates
  • 34. Simplified Architecture
      [Diagram labels: R extension (C++), realtime bus]
  • 35. Example Usage
      rsub <- function(duration, items, callback)
      The call rsub will subscribe to the specified rate(s) for the duration of time specified by duration (ms). When a tick arrives, the callback function callback is invoked with a data frame containing the fields specified in items.
      Multiple market data items may be subscribed to, and any combination of fields may be specified.
      Uses the underlying RFA API, which provides a C++ interface to real-time market updates.
  • 36. Real-Time Example
      # Specify field names to retrieve
      fields <- c("BID","ASK","TIMCOR")
      # Subscribe to EUR/USD and GBP/USD ticks
      items <- list()
      items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
      items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)
      # Simple Callback Function
      callback <- function(df) { print(paste("Received",df)) }
      # Subscribe for 1 hour
      ONE_HOUR <- 1000*(60)^2
      rsub(ONE_HOUR, items, callback)
  • 37. Issues With This Approach
      As R interpreter is single threaded, cannot spawn thread for callbacks
      Thus, interpreter thread is locked for the duration of subscription
      Not a great user experience
      Need to find alternative mechanism
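The blocking behaviour described here can be imitated without any market data connection. The sketch below is a stand-in only: `mock_rsub` is a hypothetical function that fabricates random ticks in a loop and is not the RFA-backed `rsub` from the talk. What it shares with the real thing is the relevant property, namely that the call does not return until the subscription period is over.

```r
# Hypothetical stand-in for rsub(): fabricates ticks instead of calling RFA.
mock_rsub <- function(duration_ms, items, callback, tick_interval_ms = 10) {
  deadline <- Sys.time() + duration_ms / 1000
  while (Sys.time() < deadline) {
    for (item in items) {
      # Each item is c(service, instrument, field names...), as in the slides.
      tick <- data.frame(ITEM = item[2], BID = runif(1), ASK = runif(1))
      callback(tick)
    }
    Sys.sleep(tick_interval_ms / 1000)
  }
  invisible(NULL)
}

ticks_seen <- 0L
count_ticks <- function(df) ticks_seen <<- ticks_seen + 1L

items <- list(c("IDN_SELECTFEED", "EUR=", "BID", "ASK"),
              c("IDN_SELECTFEED", "GBP=", "BID", "ASK"))

started <- Sys.time()
mock_rsub(200, items, count_ticks)   # nothing else can run in this session meanwhile
elapsed_s <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```

Scaling the 200 ms up to the one-hour subscription in the earlier example gives an R prompt that is unusable for an hour.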
  • 38. Alternative Approach
      If we cannot run subscriber threads in-process, need to decouple
      Standard approach: add an extra layer and use some form of IPC
      For instance, we could:
        Subscribe in a dedicated R process (A)
        Push incoming data onto a socket
        R process (B) reads from a listening socket
      Sockets could also be another IPC primitive, e.g. pipes
      Also note that R supports asynchronous I/O (?isIncomplete)
      Look at the IBrokers package for examples of this
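A minimal sketch of the decoupling idea, using a pipe in place of a socket (the slide mentions pipes as an alternative primitive). Everything specific here is an assumption made for illustration: the producer is a throwaway `Rscript` child that prints three fake "EUR=" ticks, and the comma-separated line format is invented. It assumes a Unix-like shell and that `Rscript` lives in `R.home("bin")`.

```r
# Process A (producer): a separate R process that writes one tick per line.
# The tick values are fabricated; a real producer would hold the subscription.
producer_code <- 'for (i in 1:3) cat(sprintf("EUR=,%d\\n", i))'

rscript <- file.path(R.home("bin"), "Rscript")
command <- paste(shQuote(rscript), "--vanilla", "-e", shQuote(producer_code))

# Process B (consumer): this session reads the producer's output via a pipe.
con <- pipe(command, open = "r")

received <- character(0)
repeat {
  line <- readLines(con, n = 1)     # blocks until a line arrives or the pipe closes
  if (length(line) == 0) break      # producer has exited
  received <- c(received, line)
}
close(con)
```

The consumer here still blocks on each read; the point is only that the subscription lives in another process, so a consumer could equally poll the connection between other tasks.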
  • 39. The bigmemoRy package
      From the description: "Use C++ to create, store, access, and manipulate massive matrices"
      Allows creation of large matrices
      These matrices can be mapped to files/shared memory
      It is the shared memory functionality that we will use
      The next version (3.0) will be unveiled at UseR! 2009
      big.matrix(nrow, ncol, type = "integer", ...)
      shared.big.matrix(nrow, ncol, type = "integer", ...)
      filebacked.big.matrix(nrow, ncol, type = "integer", ...)
  • 40. Sample Usage
      > library(bigmemory) # Note: I'm using pre-release
      > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
      > X
      An object of class “big.matrix”
      Slot "address":
      <pointer: 0x7378a0>
  • 41. Create Shared Memory Descriptor
      > desc <- describe(X)
      > desc
      $sharedType
      [1] "SharedMemory"
      $sharedName
      [1] "53f14925-dca1-42a8-a547-e1bccae999ce"
      $nrow
      [1] 1000
      $ncol
      [1] 1000
      $rowNames
      NULL
  • 42. Export the Descriptor
      In R session 1:
      > dput(desc, file="~/matrix.desc")
      In R session 2:
      > library(bigmemory)
      > desc <- dget("~/matrix.desc")
      > X <- attach.big.matrix(desc)
      Now R sessions 1 and 2 share the same big.matrix instance
  • 43. Share Data Between Sessions
      R session 1:
      > X[1,1] <- 1.2345
      R session 2:
      > X[1,1]
      [1] 1.2345
      Thus, streaming data can be continuously fed into session 1
      And concurrently processed in session 2
  • 44. Summary
      Lack of threads not a barrier to concurrent analysis
      Packages like bigmemory, nws, etc. facilitate decoupling via IPC
      nws goes a step further, with a distributed workspace
      Many applications for streaming data:
        Data collection/monitoring
        Development of pricing/risk algorithms
        Low-frequency execution (??)
        ...
  • 45. References
      http://cran.r-project.org/web/packages/bigmemory/
      http://www.cs.uiowa.edu/~luke/R/thrgui/
      http://www.milbo.users.sonic.net/ra/index.html
      http://www.cs.kent.ac.uk/projects/cxxr/
      http://www.theresearchkitchen.com/blog