These are the slides I used in my presentation about Data Science in Ruby at the first RubyConf Thailand.
It was a really great event!
Feel free to send questions.
Data Science in Ruby: Is it possible? Is it fast? Should we use it?
1. Data Science in Ruby? Is it possible? Is it fast? Should we use it?
• Rodrigo Urubatan
• rodrigo@urubatan.dev
• http://urubatan.dev
• http://twitter.com/urubatan
2. Anyone here work with Data Science?
• Data Scientist?
• Data Engineer?
• Developers of applications that use data?
• Statisticians?
3. What exactly is Data Science?
The process of extracting meaning from data and interpreting it
The usage of statistics and machine learning to clean and manipulate data
The usage of computer software to collect, clean, manipulate and interpret data
A cool name for the combination of Data Mining and Business Intelligence (other buzzwords that were used for a long time for exactly what we call Data Science today, but with more expensive tool sets)
5. Can Ruby do Data Science? (Long Answer)
• Integration with other tools
• Data manipulation
• Distributed computing
• Data structures
• Data sets
• Statistics
• Visualization
• Interactive computing
7. Standing on the shoulders of giants (integration)
pycall: Bridge into the Python world.
rserve-client: Ruby connector for Rserve, R's binary server.
8. Data manipulation
kiba: lightweight Ruby ETL (Extract-Transform-Load) framework.
jongleur: Workflow manager using DAG definitions to execute ETL tasks.
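To make the Extract-Transform-Load idea concrete, here is a minimal sketch in plain Ruby. It does not use kiba's API: the class names (ArraySource, CleanAge, MemoryDestination) and the run_pipeline helper are hypothetical and exist only for illustration, as is the sample data.

```ruby
# Plain-Ruby sketch of an Extract-Transform-Load pipeline.
# NOTE: this is NOT kiba's API; every class and method name is hypothetical.

# Extract: yields one row (a Hash) at a time.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

# Transform: returns the changed row, or nil to drop the row.
class CleanAge
  def process(row)
    return nil if row[:age].to_s.strip.empty? # drop rows with a missing age

    row.merge(age: Integer(row[:age]))        # normalize age to an Integer
  end
end

# Load: collects rows in memory (a real job would write to a file or database).
class MemoryDestination
  attr_reader :rows

  def initialize
    @rows = []
  end

  def write(row)
    @rows << row
  end
end

# Pushes every source row through each transform in order, then loads it.
def run_pipeline(source, transforms, destination)
  source.each do |row|
    result = transforms.reduce(row) do |current, transform|
      break nil if current.nil? # a previous transform dropped the row
      transform.process(current)
    end
    destination.write(result) unless result.nil?
  end
  destination
end

raw = [
  { name: "Ana", age: "31" },
  { name: "Bob", age: "" },
  { name: "Caio", age: "42" }
]
dest = run_pipeline(ArraySource.new(raw), [CleanAge.new], MemoryDestination.new)
```

Keeping each step in its own small object means a step can be tested or swapped in isolation, which is the main appeal of ETL frameworks.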
10. Data Structures
daru: Data Frame and Vector structures with comprehensive manipulation and visualization methods.
numo-narray: n-dimensional Numerical Array for Ruby.
nmatrix: dense and sparse linear algebra library for Ruby via SciRuby.
11. Data Sets
rdatasets: Data sets available in R via Rdatasets.
red-datasets: Growing collection of publicly available data sets such as CIFAR-10, Iris, MNIST, etc.
12. Statistics
rb-gsl: Ruby interface to the GNU Scientific Library. [dep: GSL]
simple_stats: Enumerable patches for descriptive statistics.
enumerable-statistics: fast implementation of descriptive statistics for the Enumerable module.
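As a rough illustration of what "descriptive statistics for Enumerable" means, here is a plain-Ruby sketch. The SimpleDescriptive module and its method names are hypothetical; they are not the API of any of the gems listed above, which are the better choice for real work.

```ruby
# Plain-Ruby sketch of descriptive statistics over any Enumerable.
# NOTE: the module and method names are hypothetical, not a gem's API.
module SimpleDescriptive
  def self.mean(values)
    list = values.to_a
    raise ArgumentError, "empty collection" if list.empty?

    list.sum.to_f / list.size
  end

  # Sample variance (divides by n - 1).
  def self.variance(values)
    list = values.to_a
    raise ArgumentError, "need at least two values" if list.size < 2

    m = mean(list)
    list.sum { |v| (v - m)**2 } / (list.size - 1)
  end

  def self.stdev(values)
    Math.sqrt(variance(values))
  end

  def self.median(values)
    sorted = values.to_a.sort
    raise ArgumentError, "empty collection" if sorted.empty?

    mid = sorted.size / 2
    sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = SimpleDescriptive.mean(data)
variance = SimpleDescriptive.variance(data)
median = SimpleDescriptive.median(data)
```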
13. Visualization
• matplotlib: Ruby-based wrapper around matplotlib. [dep: matplotlib]
• mathematical: PNG and MathML renderings for your equations.
• daru-view: interactive plotting gem for web applications (any Ruby web application framework, such as Rails, Sinatra, Nanoc, or Hanami) and IRuby notebooks. It is a plugin gem for daru.
• daru-plotly: Plotly-based visualization for Daru.
21. Ruby and Ruby on Rails are way better for writing business web applications!
22. We can even do really good Machine Learning with Ruby (but that is a subject for another presentation)
23. And my objective is to help Ruby developers use the best tool for each job, so they can solve hard problems with fewer bugs and have more free time.
24. pycall to the rescue
pycall lets you use Python libraries from your Ruby code very naturally, as if you were calling a Ruby library
pycall consists of a Ruby binding library for libpython.so and an object-oriented protocol for communication between Ruby and Python
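A small sketch of what calling Python from Ruby can look like. It assumes the pycall gem is installed and that a Python installation with a shared libpython is available. The python_sqrt helper is a hypothetical name used only for this example, and the require sits inside the method so that nothing is loaded until the method is actually called.

```ruby
# Sketch: calling Python's standard math module from Ruby through pycall.
# Assumes the pycall gem and a Python install with libpython are available.
# python_sqrt is a hypothetical helper name, used only for illustration.
def python_sqrt(x)
  require 'pycall'                    # loaded lazily, on first call
  math = PyCall.import_module('math') # import Python's math module
  math.sqrt(x)                        # looks like an ordinary Ruby method call
end
```

With the gem in place, python_sqrt(16) should hand back a plain Ruby number, since pycall converts simple Python values to their Ruby equivalents.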
26. Ok, so what are the best work patterns?
Python is way better than Ruby for Data Science
Ruby is better for web business applications
Best patterns for integration are (IMHO)
• Pointing both applications to the same database
• Exchanging data through JSON or some similar serialization
• Calling Python directly through pycall
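The JSON exchange pattern can be sketched from the Ruby side with only the standard library. The payload fields (customer_id, monthly_totals, forecast) and the reply string are invented for illustration, and the transport (HTTP, a queue, a file) is left out on purpose.

```ruby
require 'json'

# Hypothetical payload a Ruby web application could hand to a Python service.
payload = { "customer_id" => 42, "monthly_totals" => [120.5, 98.0, 143.25] }

# Serialize: this string is what would travel over HTTP, a queue, or a file.
message = JSON.generate(payload)

# Hypothetical reply produced by the Python side (e.g. with json.dumps).
reply = '{"customer_id": 42, "forecast": 150.75}'

# Deserialize the reply back into plain Ruby objects.
result = JSON.parse(reply)
```

Because both sides only agree on the JSON shape, the Ruby application and the Python code can be deployed and upgraded independently, at the cost of serializing the data on every exchange.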
27. References
• Ruby Conf 2017 – Using Ruby in Data Science by Kenta Murata (@mrkn)
• Big Data analysis in Ruby
• Lets do some (Data) Science in Ruby by Dan Carpenter (@dan_alyst)
• Progress of Ruby/Numo: Numerical Computing for Ruby
• SciRuby
• Ruby::Numo
• Ruby Machine Learning resources
• Ruby Data Science Resources
• PyCall
28. Any questions? Talk to me!
• @urubatan
• https://urubatan.dev
• rodrigo@urubatan.dev
29. Other Data Structure Libraries
• spreadsheet: manipulation library for MS Excel spreadsheets
• mdarray: Array structure for JRuby
• cumo: CUDA-aware numerical Array library with an NArray-like interface.
30. Other statistics libraries
statsample: basic and advanced statistics for Ruby. [dep: GSL]
statsample-glm: extension of statsample by Generalized Linear Models.
statsample-bivariate-extension: extension of statsample by Bivariate Correlations.
statsample-timeseries: extension of statsample by Time Series estimators.
pca: Principal Component Analysis (PCA) in Ruby.
descriptive-statistics: descriptive extensions for the Enumerable module or standalone usage.
distribution: probabilistic distributions and descriptive measures for them.
statistics2: Normal, Chi-square, t- and F- probability distributions for Ruby.
Quick comment on what data science is.
Quick answer: yes. But let's dive a little into that: you can do everything, but whether you should depends on what you want to do.
There are lots of data science libraries for Ruby: for statistics, data manipulation, data visualization, integration with Python and R, distributed computing, and machine learning. It appears we have everything we need! But not everything is as great as it seems, so let's check some of the options in depth.
SciRuby
Drawbacks:
- NMatrix is slow for large amounts of data (there is a bug open for that)
- Daru has less functionality than Pandas for practical DS work
- There is a lot less documentation
Benefits:
- You only need Ruby
- NMatrix supports in-memory sparse matrices
- You can use data frames with Daru
Data frames are the basic data structure to manipulate and visualize living data in data science: a 2D table data structure, like a SQL table.
Ruby Numo
Benefits:
- You only need Ruby
- Numo::NArray is faster than NMatrix and pure Ruby
Drawbacks:
- No sparse matrix support
- No data frame support
- Even less documented
In summary: for data science, SciRuby is better because it has Daru; for scientific computing, Ruby/Numo is better because NMatrix is too slow.
But I didn't forget about Red Data Tools. It supports Apache Arrow, and its core developer, Kouhei Sutou, is also a member of the Apache Arrow PMC. But it is too young to use in production, and right now it only supports data I/O; manipulation is not supported.
The most used libraries for data cleaning and transformation in Python are Pandas and NumPy, and we have the corresponding Daru and NMatrix/NArray. But there are some problems. For starters, the documentation of the Ruby versions is ages behind that of the Python libraries, mainly because there are far fewer users.
Also:
- Daru has fewer features than Pandas
- NMatrix gets slow for big amounts of data
- NArray is a lot faster but not compatible with Daru
But things are improving.
pycall can work with most Python libraries, but to make our lives easier, it already has wrappers for numpy, pandas, matplotlib, seaborn, scikit-learn, and tensorflow. And even when wrapping Python libraries, it is a lot faster than using the native Ruby libraries (thanks, Kenta Murata, for this great project).