How do you combine comprehensive analysis running on large amounts of data with the responsiveness demanded of today's API services?
This talk illustrates one of the recipes we currently use at ING to tackle this problem. Our analytical stack combines machine learning algorithms running on a Hadoop cluster with API services executed by an Akka cluster.
Cassandra is used as a 'latency adapter' between the fast and the slow path. Our API services are executed by the Akka/Spray layer. Those services consume both live data sources and intermediate results promoted to Cassandra by the Hadoop layer. This approach allows us to provide internal API services that are both complete and responsive.
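The fast/slow-path split above can be sketched in a few lines of Python. This is only an illustration of the pattern, with a plain dict standing in for the Cassandra table and toy functions standing in for the Hadoop batch job and the API service; none of these names come from the actual stack.

```python
precomputed = {}  # stand-in for a Cassandra table keyed by customer id

def slow_path_batch(events):
    """Slow path (batch job on Hadoop): aggregate historical events per customer
    and promote the totals to the store. Runs offline, takes as long as it takes."""
    for customer, value in events:
        precomputed[customer] = precomputed.get(customer, 0) + value

def fast_path_api(customer, live_value):
    """Fast path (API service): combine the promoted batch result with a live
    data point. Only a key lookup plus an addition, so it stays responsive."""
    return precomputed.get(customer, 0) + live_value

slow_path_batch([("123", 10), ("123", 5), ("567", 7)])
print(fast_path_api("123", 2))  # 17: promoted batch total 15 plus live value 2
```

The store decouples the two paths: the batch layer can rewrite its results at its own pace while the API layer keeps answering from whatever was last promoted.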
10.
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(iris.data, iris.target)
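The fitted tree can then be used for prediction. A possible continuation of the session above (the first iris sample belongs to class 0, 'setosa'):

```python
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

# Predict the class of the first sample and map it back to a species name.
print(clf.predict(iris.data[:1]))                         # [0]
print(iris.target_names[clf.predict(iris.data[:1])][0])   # setosa
```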
● Flexible, concise language
● Quick to code and prototype
● Portable, visualization libraries
Machine learning libraries:
scipy, statsmodels, sklearn,
matplotlib, ipython
Web libraries:
flask, tornado, (no)SQL clients
11.
# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results
● Language for statistics
● Easy to Analyze and shape data
● Advanced statistical package
● Fueled by academia and professionals
● Very clean visualization packages
Packages for machine learning
time series forecasting, clustering, classification
decision trees, neural networks
Remote procedure calls (RPC)
From scala/java via RProcess and Rserve
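The `lm()` example above has a close Python analogue in the formula API of statsmodels, one of the libraries listed on the Python slide. A sketch with synthetic data (the column names `x1`..`x3` and `y` simply mirror the R snippet; the coefficients below are made up for the example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic dataset: y = 1 + 2*x1 - 0.5*x2 + small noise (x3 is irrelevant)
rng = np.random.default_rng(0)
mydata = pd.DataFrame({"x1": rng.normal(size=100),
                       "x2": rng.normal(size=100),
                       "x3": rng.normal(size=100)})
mydata["y"] = 1.0 + 2.0 * mydata["x1"] - 0.5 * mydata["x2"] \
              + rng.normal(scale=0.1, size=100)

# Same model formula as the R snippet: y ~ x1 + x2 + x3
fit = smf.ols("y ~ x1 + x2 + x3", data=mydata).fit()
print(fit.summary())  # show results, as in R's summary(fit)
```

The R-style formula syntax carries over almost verbatim, which makes it easy to move an analysis between the two languages.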
21. Sprayin’

trait ApiService extends HttpService {
  // Create an actor for analytics
  val actor = actorRefFactory.actorOf(Props[AnalyticsActor], "analytics-actor")

  // curl -vv -H "Content-Type: application/json" localhost:8888/api/v1/123/567
  val serviceRoute = {
    // Serve the API path
    pathPrefix("api" / "v1") {
      pathPrefix(Segment / Segment) { (aid, cid) =>
        get {
          complete {
            // Message is passed on to the analytics actor
            (actor ? (aid, cid)).mapTo[String]
          }
        }
      }
    }
  }
}
https://github.com/natalinobusa/wavr
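The endpoint shape served above (GET `/api/v1/<aid>/<cid>`) can be exercised end to end with a small Python sketch, using only the standard library's `http.server` as a stand-in for the Akka/Spray service; the JSON reply format here is invented for the example:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class ApiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect paths like /api/v1/123/567
        parts = self.path.strip("/").split("/")
        if len(parts) == 4 and parts[:2] == ["api", "v1"]:
            aid, cid = parts[2], parts[3]
            body = json.dumps({"aid": aid, "cid": cid}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an ephemeral port in a background thread, then call the endpoint.
server = HTTPServer(("localhost", 0), ApiHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
with urlopen(f"http://localhost:{port}/api/v1/123/567") as resp:
    reply = json.loads(resp.read())
print(reply)  # {'aid': '123', 'cid': '567'}
server.shutdown()
```

This mirrors the curl command in the slide: two path segments are extracted and echoed back, where the real service would forward them to the analytics actor instead.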
24. Science & Engineering

Statistics, Data Science:
● Python
● R
● Visualization

IT Infra, Big Data:
● Java
● Scala
● SQL

Hadoop: Big Data infrastructure, Data Science on large datasets
Big Data and Fast Data require different profiles to achieve the best results.
25. Some lessons learned
● Mixing and matching technologies is a good thing
● Harden the design as you go
● Define clear interfaces
● Ease integration among teams
● Hadoop, Cassandra, and Akka: they work!
● Plug in the Data Science!