This document discusses R and Julia for data analysis and advanced analytics. It provides an overview of R's history, how it works, performance improvements, and use in production. Julia is introduced as a new high-performance dynamic language with similarities to R but faster performance due to its just-in-time compiler and type information. Examples are given comparing the performance of Julia to other languages. The document recommends Julia for those already using C/Fortran and suggests it will be useful for R users once fully developed.
6. Major R Versions
Version | Year | Description
0 | 1996 | Initial release: University of Auckland, New Zealand
1 | 2000 | Completeness and stability high enough to characterize a full statistical system, which could be put to production use
2 | 2004 | Strong enhancements of the memory management subsystem as well as several major features, including Sweave (into LaTeX or LyX)
3 | 2013 | The inclusion of long vectors (containing more than 2^31-1 elements). Also, we now have 64-bit support on all platforms, support for parallel processing, the Matrix package
http://www.r-project.org/
7. How R Works
As with an automobile, you can use R without worrying very much about how it works. But computing with data is more complicated than driving a car (fortunately for highway safety)
John Chambers, Software for Data Analysis, page 453
8. R works in a shell
Cross-platform, including 32-bit and 64-bit Windows
Interactive graphical user interface (GUI) to interpret commands
Read – accept user input
Parse – interpret input using expected syntax
Evaluate – execute commands
Everything is an object
Data are stored in data frames, named lists
R implements S language grammar, with a few extensions
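The read, parse, evaluate cycle on this slide can be sketched in a few lines. This is a minimal sketch that uses Python as a stand-in for R. The function name read_parse_evaluate and the sample expression are hypothetical, and R's own parser and evaluator are considerably more involved.

```python
import ast


def read_parse_evaluate(source, env):
    """Run one read-parse-evaluate cycle on a single expression string."""
    # Parse: check the input against the expected syntax and build a tree.
    tree = ast.parse(source, mode="eval")
    # Evaluate: execute the parsed expression, looking up names in env only.
    code = compile(tree, "<input>", "eval")
    return eval(code, {"__builtins__": {}}, env)


# Read: in an interactive shell this string would come from the user.
env = {"x": 3}
print(read_parse_evaluate("x * 2 + 1", env))  # prints 7
```

Input that does not follow the expected syntax is rejected at the parse step, before anything is evaluated.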
14. Estimated R Usage
Estimated 250,000 people use it regularly (as of 2009)
http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=2&_r=0
20. R is Memory-Bound
$\dfrac{\text{Memory Size}}{4} = \text{Amount of R Data}$
Source: Joseph B. Rickert, February 14, 2013
64-bit: $\text{Memory Size} = \text{RAM}$
32-bit: $\text{Memory Size} = \text{User Virtual Memory} - 0.5\,\text{GB} \approx 2\,\text{GB}$
Source: http://cran.r-project.org/bin/windows/base/rw-FAQ.html retrieved March 1, 2013
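As a worked example of the rule on this slide (amount of R data = memory size / 4), the sketch below applies it to two machines. The machine sizes and the function name r_data_budget_gb are illustrative assumptions, not figures from the sources cited above, and Python is used only as a calculator.

```python
def r_data_budget_gb(memory_size_gb):
    """Apply the slide's rule: amount of R data = memory size / 4."""
    return memory_size_gb / 4


# Hypothetical machines, chosen only to illustrate the arithmetic.
machines = [("64-bit, 16 GB RAM", 16.0), ("32-bit, about 2 GB", 2.0)]
for label, memory_gb in machines:
    print(f"{label}: about {r_data_budget_gb(memory_gb):.1f} GB of R data")
```

Under these assumptions the 64-bit machine can work with about 4 GB of data and the 32-bit machine with about 0.5 GB.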
21. R is Memory-Bound
All objects in an R session are stored in memory
R places a limit of 2^31 − 1 bytes on all object sizes, independent of RAM
The Art of R Programming, Norman Matloff
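To put the quoted limit of 2^31 − 1 bytes in perspective, the sketch below converts it into a count of numeric values. It assumes 8-byte double-precision values and ignores any per-object overhead. Both are simplifying assumptions, and Python is used only for the arithmetic.

```python
LIMIT_BYTES = 2**31 - 1   # per-object limit quoted on the slide
BYTES_PER_DOUBLE = 8      # assumption: 8-byte double-precision values


def max_doubles(limit_bytes=LIMIT_BYTES):
    """Largest number of 8-byte values that fits under the byte limit."""
    return limit_bytes // BYTES_PER_DOUBLE


print(LIMIT_BYTES)     # 2147483647, just under 2 GiB
print(max_doubles())   # 268435455, roughly 268 million values
```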
22. R Memory Management
Automatic, including garbage collection
rm() removes the object assignment, but does not immediately free the memory
gc() forces garbage collection, at substantial computational cost
23. Improving Performance
The Art of R Programming, Chapter 14, Norman Matloff
[Figure: performance techniques arranged by power and simplicity: Vectorization, Byte-Code Compilation, Parallel R, C/C++]
24. Improving Performance
Method | Description
C/C++ | Call C programs from R
Vectorization | Recode for vectorization replacing slower functions
Byte-code compilation | cmpfun()
Parallel R | parallel package
http://cran.r-project.org/web/views/HighPerformanceComputing.html
25. Improving Performance
Rprof() – profiles R code to show where execution time is spent
ff – memory-efficient storage of large data on disk and fast access functions
bigmemory – manage massive matrices with shared memory and memory-mapped files
27. Derivative Projects
RStudio – Integrated Development Environment (IDE)
Rattle – Data Mining Package
RExcel – (Statconn) Connection between R and Excel
Weka – Java-based data mining, with statistical analysis by R
RapidMiner – Java-based data mining (with Weka), with statistical analysis by R
Revolution Analytics – Scaling R for the Enterprise
Oracle R Enterprise – Integrated into Oracle
28. About Statconn (as of March 2013)
Produces RAndFriends under noncommercial and commercial licenses
All the statconn tools work ONLY with 32-bit R
statconnDCOM
rcom (GPL2, but requires statconnDCOM)
RExcel 3.2.9 (ONLY 32-bit Office: 2003, 2007, 2010)
http://rcom.univie.ac.at/
29. Sample Projects Using R
The Heritage Health Prize, Thomas Nguyen
A Direct Marketing In-flight Forecasting System, Shannon Terry & Ben Ogorek
Mining Twitter for Airline Consumer Sentiment, Jeffrey Breen
Alternative Data Sources for Measuring Market Sentiment and Events (Using R), Joe Rothermich
31. About Julia
High-level, high-performance dynamic open-source programming language for technical computing
Syntax similar to other technical computing environments
Features
Sophisticated compiler
Distributed parallel execution
Numerical accuracy
Extensive mathematical function library
Uses C, C++, Fortran libraries extensively
32. Why Julia: “Because we are greedy”
http://julialang.org/blog/2012/04/nyc-open-stats-meetup-announcement/
33. Julia Community
Hosted on GitHub
550 mailing list subscribers (Google Groups)
1,500 GitHub followers
190 forks
50 total contributors
As of September 2012, all contributors except the core developers had known of the language for six months or less
Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
38. Key Ingredients of Julia Performance
Rich type information, provided naturally by multiple dispatch
Aggressive code specialization against run-time types
Julia’s LLVM-based just-in-time (JIT) compiler
Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
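The idea behind multiple dispatch, choosing an implementation from the run-time types of all arguments, can be illustrated with a toy lookup table. This is a hedged sketch in Python, not Julia code and not how Julia's compiler works. The names register and combine are hypothetical, and no code specialization or JIT compilation takes place here.

```python
# Toy multiple-dispatch table: maps a tuple of argument types to a function.
_methods = {}


def register(*types):
    """Decorator that records an implementation for an exact type signature."""
    def decorator(func):
        _methods[types] = func
        return func
    return decorator


def combine(a, b):
    """Pick an implementation from the run-time types of BOTH arguments."""
    impl = _methods.get((type(a), type(b)))
    if impl is None:
        raise TypeError(f"no method for ({type(a).__name__}, {type(b).__name__})")
    return impl(a, b)


@register(int, int)
def _combine_ints(a, b):
    return a + b


@register(str, str)
def _combine_strs(a, b):
    return a + " " + b


@register(int, str)
def _repeat_str(a, b):
    return b * a


print(combine(2, 3))        # prints 5
print(combine(2, "ab"))     # prints abab
```

Because the selected method is tied to a concrete type signature, a compiler that knows those types can, in principle, generate code specialized for them, which is the connection the slide draws between type information and performance.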
41. Julia Recommendations
Julia is ready for people already using C or Fortran
Julia is expected to develop into a usable scripting language for R users
Wait until version 1.0 for production use