
Data analysis with R and Julia

R is a free, open-source environment for statistical analysis and graphing. In its almost 20 years of existence, R has remained popular in both academic and business environments. The newer Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. This session outlines functional and performance differences between these two software packages. You’ll see demonstrations of tips for integrating this software with Windows and walk away with guidelines for working with commercial software. A version of this presentation had 100 attendees at the PASS Business Analytics Conference in Chicago (April 2013) and 40 attendees at the PASS Virtual Business Analytics meeting (May 2013).

Published in: Business

  1. Data Analysis with R and Julia. Advanced Analytics and Insights. Mark Tabladillo, Ph.D., Data Mining Scientist, MarkTab Inc.
  2. Networking: Interactive
  3. About MarkTab: training and consulting with http://marktab.com; data mining resources and blog at http://marktab.net; Twitter @marktabnet
  4. Outline: R Language (Market Analysis, Performance, Production Use); Julia Language (Performance)
  5. The R Language: http://cran.r-project.org
  6. Major R Versions (http://www.r-project.org/): version 0 (1996), initial release at the University of Auckland, New Zealand; version 1 (2000), completeness and stability high enough to characterize a full statistical system that could be put to production use; version 2 (2004), strong enhancements of the memory management subsystem as well as several major features, including Sweave (into LaTeX or LyX); version 3 (2013), the inclusion of long vectors (containing more than 2^31-1 elements), 64-bit support on all platforms, support for parallel processing, and the Matrix package
  7. How R Works: "As with an automobile, you can use R without worrying very much about how it works. But computing with data is more complicated than driving a car (fortunately for highway safety)." John Chambers, Software for Data Analysis, page 453
  8. R works in a shell: cross-platform, including 32-bit or 64-bit Windows; interactive graphical user interface (GUI) to interpret commands (Read: accept user input; Parse: interpret input using expected syntax; Evaluate: execute commands); everything is an object; data are stored in data frames, named lists; R implements S language grammar, with a few extensions
  9. R GUI
  10. Read-Parse-Evaluate Loop: Read, Parse, Evaluate
  11. R and SQL Server: install.packages("RODBC"); library(RODBC); MDAC Downloads
  12. R Market Analysis
  13. Listserv Discussion: http://r4stats.com/articles/popularity/
  14. Estimated R Usage: an estimated 250,000 people use it regularly (as of 2009). http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=2&_r=0
  15. General Forum Postings: http://r4stats.com/articles/popularity/
  16. Stack Overflow Alone: http://r4stats.com/articles/popularity/
  17. Academic Publications: http://r4stats.com/articles/popularity/
  18. Comparison of R, Matlab, SAS, Stata, SPSS: http://www.analyticbridge.com/group/productreviews2/forum/topics/product-reviews-comparing-r-matlab-sas-stata-spss
  19. R Performance
  20. R is Memory-Bound: $\text{Memory Size} / 4 = \text{Amount of R Data}$ (source: Joseph B. Rickert, February 14, 2013); 64-bit: $\text{Memory Size} = \text{RAM}$; 32-bit: $\text{Memory Size} = \text{User Virtual Memory} - 0.5\,\text{GB} \cong 2\,\text{GB}$ (source: http://cran.r-project.org/bin/windows/base/rw-FAQ.html, retrieved March 1, 2013)
  21. R is Memory-Bound: all objects in an R session are stored in memory; R places a limit of 2^31 − 1 bytes on all object sizes, independent of RAM. The Art of R Programming, Norman Matloff
  22. R Memory Management: automatic, including garbage collection; rm() removes the object assignment but does not free the memory; gc() forces garbage collection, at substantial computational cost
  23. Improving Performance (The Art of R Programming, Chapter 14, Norman Matloff): Power; Simplicity; Vectorization; Byte-Code Compilation; Parallel R; C/C++
  24. Improving Performance (method: description): C/C++: call C programs from R; Vectorization: recode for vectorization, replacing slower functions; Byte-code compilation: cmpfun(); Parallel R: parallel package. http://cran.r-project.org/web/views/HighPerformanceComputing.html
  25. Improving Performance: Rprof() measures speed of functions; ff provides memory-efficient storage of large data on disk and fast access functions; bigmemory manages massive matrices with shared memory and memory-mapped files
  26. R for Production Use
  27. Derivative Projects: RStudio (Integrated Development Environment, IDE); Rattle (data mining package); RExcel (Statconn; connection between R and Excel); Weka (Java-based data mining, statistical analysis by R); RapidMiner (Java-based Weka data mining, statistical analysis by R); Revolution Analytics (scaling R for the enterprise); Oracle R Enterprise (integrated into Oracle)
  28. About Statconn (as of March 2013): produces RAndFriends under noncommercial and commercial licenses; all the statconn tools work ONLY with 32-bit R; statconnDCOM; rcom (GPL2, but requires statconnDCOM); RExcel 3.2.9 (ONLY 32-bit Office: 2003, 2007, 2010). http://rcom.univie.ac.at/
  29. Sample Projects Using R: The Heritage Health Prize, Thomas Nguyen; A Direct Marketing In-flight Forecasting System, Shannon Terry & Ben Ogorek; Mining Twitter for Airline Consumer Sentiment, Jeffrey Breen; Alternative Data Sources for Measuring Market Sentiment and Events (Using R), Joe Rothermich
  30. The Julia Language: http://julialang.org/
  31. About Julia: high-level, high-performance dynamic open-source programming language for technical computing; syntax similar to other technical computing environments; features: sophisticated compiler, distributed parallel execution, numerical accuracy, extensive mathematical function library; uses C, C++, Fortran libraries extensively
  32. Why Julia: “Because we are greedy”. http://julialang.org/blog/2012/04/nyc-open-stats-meetup-announcement/
  33. Julia Community: hosted on GitHub; 550 mailing list subscribers (Google Groups); 1,500 GitHub followers; 190 forks; 50 total contributors; as of September 2012, all contributors except the core developers had known of the language for six months or less. Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
  34. The Julia Manual: http://docs.julialang.org/en/latest/manual/
  35. Julia Mathematical Functions: http://docs.julialang.org/en/latest/manual/mathematical-operations/
  36. Julia Standard Library: http://docs.julialang.org/en/latest/stdlib/
  37. Julia Performance
  38. Key Ingredients of Julia Performance: rich type information, provided naturally by multiple dispatch; aggressive code specialization against run-time types; Julia’s LLVM-based just-in-time (JIT) compiler. Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
  39. Julia Performance Comparison: http://julialang.org/
  40. Julia Performance Comparison: Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
  41. Julia Recommendations: the software is ready for people already using C or Fortran; the software will develop into a usable scripting language for R users; wait until version one for production use
  42. Send me Your Questions: http://marktab.net
  43. Conclusion: R provides production-ready software for statistical analysis; Julia merits personal investment and promises high performance
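
Several of the R performance techniques named in the slides (vectorization, byte-code compilation with cmpfun(), and memory cleanup with rm() and gc()) can be sketched in a few lines of R. This is a minimal illustration and not code from the presentation: the function names and the sum-of-squares workload are hypothetical, chosen only to give the three variants something to compute. Timings depend on the machine and R version, so none are quoted here. Recent R releases byte-compile functions automatically, so the difference between the plain and the explicitly compiled loop may be small.

```r
# Minimal sketch: loop vs. vectorized vs. byte-compiled code in R.
# All function names below are hypothetical and exist only for illustration.
library(compiler)  # ships with R; provides cmpfun()

# Explicit loop: simple, but interpreted element by element.
sum_squares_loop <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value^2
  }
  total
}

# Vectorized version: the arithmetic runs inside R's built-in vector code.
sum_squares_vectorized <- function(x) sum(x^2)

# Byte-compiled version of the same loop.
sum_squares_compiled <- cmpfun(sum_squares_loop)

x <- as.numeric(1:100000)

# All three variants should compute the same value.
results <- c(
  loop       = sum_squares_loop(x),
  vectorized = sum_squares_vectorized(x),
  compiled   = sum_squares_compiled(x)
)

# Elapsed wall-clock time in seconds for each variant.
timings <- c(
  loop       = system.time(sum_squares_loop(x))[["elapsed"]],
  vectorized = system.time(sum_squares_vectorized(x))[["elapsed"]],
  compiled   = system.time(sum_squares_compiled(x))[["elapsed"]]
)
print(results)
print(timings)

# rm() drops the name binding; gc() then asks R to collect unused memory.
rm(x)
invisible(gc())
```

For finer-grained measurements than system.time(), the Rprof() function mentioned on slide 25 samples the call stack while code runs and reports where the time is spent.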