Integrating R and the JVM Platform - Alpine Data Labs' R Execute Operator

Integrating Non-Reactive Legacy Code - The Case of R

Marek Kolodziej
Machine Learning Engineer

SF Scala Meetup, Sep. 10, 2014

Reactive Recap
Event-driven
- Asynchronous
- Non-blocking
- Optimized around Amdahl’s Law
Scalable
- Location transparency (up and out)
- Factor in unreliable network
Resilient
- Failure isolation (bulkhead pattern, etc.)
- Clean service and failure handling separation (supervision)
Responsive
- Minimize latency
- Deal with bursty traffic
- Gracefully handle congestion (backpressure/active pull by subscriber)

The tough reality
Not everything’s under your control
Not everything’s an actor
- Legacy Java/Scala code
- Third-party libraries
Blocking calls
- Database queries
- Calls to services
- Non-threaded runtimes (R)
Long-running jobs
- Resource clean-up in case network partition occurs way before the time-out is reached
- Timeouts vs. heartbeats
Not all failures are within the JVM
- Can we revive them from within the JVM?

Alpine’s R Operator

The cases for and against R
For
- 5,000+ statistical and machine learning libraries
- “[Numeric] gold standard” implementations
- Operator would allow arbitrary processing in a “canned” application
- Data scientists already know the language
- Support for client’s existing code base (100s of scripts)
- Very rapid prototyping
- Focus on science instead of coding
Against
- Slow runtime (even with JIT)
- Memory hogging (by-copy semantics)
- Very slow garbage collection
- Single-threaded runtime (even worse than Python and Ruby)
- Native libraries written by people without much CS/engineering background (segfaults, etc.)
- Buggy libraries (infinite loops, etc.)
- Runtime crashes
- Terrible handling of big datasets
Challenges
Licensing Issues
- R is GPL
- Rserve is (L)GPL
- Shipped software (GPL SaaS loophole doesn’t apply)
Distributed computing
- Need a cluster of R workers (multi-user, multi-operator concurrency given a single-threaded R runtime)
- REST is good for data but pretty bad for control (some structure would be nice)
- Sessions or backpressure
Fault tolerance
- R runtime failures
- Network partitions (R session clean-up)

Solutions
Licensing Issues
- Akka is Apache 2.0
- Rserve is (L)GPL
- Can open-source the R-Java server bridge
- Communication to Alpine backend via (open-source) message case classes
Distributed computing
- Akka’s location transparency is ideal for distributing work
- Cluster API would have been preferred but Alpine uses Akka 2.2.3 due to Spark dependency
- Structure and semantics due to message case classes
- Rx streams would have been nice for backpressure, but we have an old Akka version (so sessions)
Fault tolerance
- Rserve forks R processes. Exception handling of the Connection object lets you restart processes.
- Akka’s heartbeat allows session clean-up in case of network failure before time-out (important if time-out is ~1 day).
- Event bus lets you observe failure to connect to remote actor system.
- No need for exactly-once semantics (the user can re-run the flow), but you have to know that the failure occurred.
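
The message case classes and heartbeat-driven session clean-up listed above could look roughly like the sketch below. This is not Alpine's actual protocol: every name (RMessage, StartSession, Heartbeat, SessionRegistry, the 30-second timeout) is hypothetical, and the sketch uses plain Scala without Akka so that it stays self-contained. In an Akka system the `handle` logic would sit inside an actor's receive block.

```scala
object RProtocolSketch {
  // Hypothetical control messages. A sealed trait gives the protocol
  // structure: the compiler warns when a match misses a message type.
  sealed trait RMessage
  final case class StartSession(sessionId: String) extends RMessage
  final case class Heartbeat(sessionId: String) extends RMessage
  final case class ExecuteScript(sessionId: String, script: String) extends RMessage
  final case class EndSession(sessionId: String) extends RMessage

  // Tracks when each session was last heard from. Time is passed in
  // explicitly (milliseconds) so the logic is deterministic.
  final class SessionRegistry(heartbeatTimeoutMs: Long) {
    private var lastSeen = Map.empty[String, Long]

    // Refreshes the timestamp of a known session; unknown ids are ignored.
    private def touch(id: String, nowMs: Long): Unit =
      if (lastSeen.contains(id)) lastSeen += (id -> nowMs)

    def handle(msg: RMessage, nowMs: Long): Unit = msg match {
      case StartSession(id)     => lastSeen += (id -> nowMs)
      case Heartbeat(id)        => touch(id, nowMs)
      case ExecuteScript(id, _) => touch(id, nowMs)
      case EndSession(id)       => lastSeen -= id
    }

    // Drops sessions whose client has gone silent and returns their ids,
    // so the caller can shut down the corresponding R processes.
    def reapExpired(nowMs: Long): Set[String] = {
      val expired = lastSeen.collect {
        case (id, seen) if nowMs - seen > heartbeatTimeoutMs => id
      }.toSet
      lastSeen --= expired
      expired
    }

    def activeSessions: Set[String] = lastSeen.keySet
  }
}
```

The point of the heartbeat is that clean-up latency is bounded by the heartbeat timeout rather than by the job time-out: with a ~1 day job time-out alone, an R session orphaned by a network failure could hold its worker for most of a day.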

Solutions
Sessions
- Arguably the ugliest part of the solution (can be replaced with alternatives)
- Worker actors blocked for long periods (hours).
- Large data blocks are sent to the Akka R server (~128 MB).
- No backpressure via Rx streams since it’s Akka 2.2.3.
- Custom router - refuses requests if all workers are busy.
- Client needs to respond to request refusal by awaiting a free worker message (reactive but inelegant).
- Better solution - use reactive streams (we need to upgrade Akka)
- Improvement: use Akka for control but REST for data movement
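
The refusing router described above can be sketched in a few lines. This is an illustration of the idea only, not the router from the talk: the names (BusyAwareRouter, Assigned, AllWorkersBusy) are hypothetical, workers are represented by plain integer ids, and no Akka routing API is used.

```scala
object BusyRouterSketch {
  sealed trait RouteResult
  final case class Assigned(workerId: Int) extends RouteResult
  case object AllWorkersBusy extends RouteResult

  // Hands each request to an idle worker, or refuses outright when none is
  // free. Nothing is queued, so large payloads never pile up in a mailbox.
  final class BusyAwareRouter(workerCount: Int) {
    private var idle: List[Int] = (0 until workerCount).toList

    def route(): RouteResult = idle match {
      case worker :: rest =>
        idle = rest
        Assigned(worker)
      case Nil =>
        AllWorkersBusy
    }

    // Called when a worker finishes its job. This is the point at which a
    // "free worker" message would go out to waiting clients (assumption:
    // the slides do not say which component sends it).
    def release(workerId: Int): Unit = {
      val valid = workerId >= 0 && workerId < workerCount
      if (valid && !idle.contains(workerId)) idle = idle :+ workerId
    }

    def idleCount: Int = idle.size
  }
}
```

A client that receives `AllWorkersBusy` holds on to its data and retries only after it is told a worker is free. That gives a crude form of backpressure without reactive streams, at the cost of the extra session bookkeeping the slide calls inelegant.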

Future Improvements
- Data movement via REST
- Replacement of sessions via reactive streams (Akka upgrade!)
- Kamon test drive for distributed actors (released ~2 weeks ago)

Conclusions
- Akka makes even non-reactive distributed programming easier and more reliable
- If you can, use the latest Akka version because a lot of the earlier pain can be avoided:
  - clustering
  - persistence
  - reactive streams
- Large data movement via Akka is probably not an ideal use of the framework:
  - use REST (including Spray, Play, etc.) and HTTP chunking
  - move the data directly using Netty, etc.
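
The reasoning behind the chunking advice is that a large block should travel as a sequence of bounded pieces instead of one huge message. The sketch below shows only that splitting step on an in-memory byte array. It is not an HTTP client or server, and `ChunkingSketch`, `chunks`, and the chunk size are hypothetical names and values chosen for illustration.

```scala
object ChunkingSketch {
  // Lazily splits a payload into pieces of at most chunkSize bytes. The
  // receiver can process or write out each piece as it arrives instead of
  // buffering the whole block.
  def chunks(data: Array[Byte], chunkSize: Int): Iterator[Array[Byte]] = {
    require(chunkSize > 0, "chunkSize must be positive")
    data.grouped(chunkSize)
  }
}
```

With a real HTTP stack (Spray, Play, or Netty, as listed above) each piece would be written to the response as one chunk of a chunked transfer, while Akka messages carry only small control information.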

Miscellaneous
- Alpine is hiring
  - machine learning engineers (Scala/Java)
  - data scientists (R/Python)
  - front end developers (Ruby on Rails)
- ScalaCourses.com is looking for reviewers:
  - Scala (beginner/intermediate)
  - Akka
  - Play
  - Java Interop.
  - contact Michael Slinn: mslinn@scalacourses.com
