Ruby on Big Data
                 Brian O’Neill
Lead Architect, Health Market Science (HMS)

                      email: bone@alumni.brown.edu
                      blog: http://brianoneill.blogspot.com/




     The views expressed herein are my own and do not necessarily reflect the views of HMS or other organizations mentioned.
Agenda
Big Data Orientation
  Cassandra
  Hadoop
  SOLR
  Storm

DEMO
Java/Ruby Interoperability
Advanced Ideas
  Rails Integration
  Combining Real-time w/ Batch Processing (The Final Frontier)
“Big” Data
Size doesn’t always matter; it may be what you’re doing with it
 e.g. Natural-Language Processing

Flexibility was our major motivator
 Data sources with disparate schema
Decomposing the Problem
Data          Processing
 Storage       Distributed
 Indexing      Batch
 Querying      Real-time
NoSQL Storage
BASE
 Basic Availability
 Soft-state
 Eventual consistency

Simple API
 REST + JSON
Heritage
Cassandra’s Data Model
Keyspaces
  Column Families
    Rows (sorted by KEY!)
      Columns (Name : Value)
Example
BeerGuys (Keyspace)
  Users (Column Families)
     bonedog (Row)
        firstName : Brian
        lastName : O’Neill
     lisa (Row)
        firstName : Lisa
        lastName : O’Neill
        maidenName : Kelley
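The example above can be pictured as nested Ruby hashes — a mental model of keyspace → column family → row → columns, not an actual client API:

```ruby
# Keyspace -> column family -> row key -> { column name => value }.
beer_guys = {
  "Users" => {
    "bonedog" => { "firstName" => "Brian", "lastName" => "O'Neill" },
    "lisa"    => { "firstName" => "Lisa", "lastName" => "O'Neill",
                   "maidenName" => "Kelley" }
  }
}

# Rows are sorted by key, columns by name.
puts beer_guys["Users"].keys.sort.inspect        # ["bonedog", "lisa"]
puts beer_guys["Users"]["lisa"]["maidenName"]    # Kelley
```

Note the sparse rows: lisa has a maidenName column, bonedog doesn’t — no schema change required.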
Cassandra Architecture
Ring Architecture
 Hash(key) -> Node
   e.g. md5(“Brian”) = “S”
 Written to Node A

 [Diagram: a client writes into a ring of nodes — A (N-Z), F (A-F), M (G-M); “S” falls in A’s range]
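The hash-to-node lookup can be sketched in a few lines — a toy ring keyed on the first hex digit of the MD5 hash (the node names and token ranges here are hypothetical; real partitioners use full 128-bit tokens):

```ruby
require 'digest/md5'

# Hypothetical token ranges: each node owns a slice of the hex digit space.
RING = { ("0".."5") => "A", ("6".."a") => "F", ("b".."f") => "M" }

def node_for(key)
  digit = Digest::MD5.hexdigest(key)[0]
  RING.each { |range, node| return node if range.cover?(digit) }
end

# The same key always hashes to the same node -- no lookup table needed.
puts node_for("Brian")
```

Because placement is pure arithmetic on the key, any node (or client) can route a request without consulting a master.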
Replication
Fault Tolerance
 Written to next N nodes in ring.
 Can be made datacenter aware.

 e.g. md5(“Brian”) -> “S”
 Written to Nodes A and L

 [Diagram: five-node ring — A (S-Z), F (A-F), L (G-L), M (L-M), S (N-S); the client’s write lands on the primary node and the next replicas around the ring]
Consistency Levels
 ONE      1st Response
 QUORUM   N / 2 + 1 Replicas
 ALL      All Replicas

          READ & WRITE
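The QUORUM rule above is plain integer arithmetic; a quick sketch of why quorum reads plus quorum writes always overlap (R + W > N), so a quorum read sees at least one replica holding the latest write:

```ruby
# Quorum size for N replicas, per the N / 2 + 1 rule (integer division).
def quorum(n)
  n / 2 + 1
end

# For any replication factor, a quorum write set and a quorum read set
# must intersect, because together they exceed N replicas.
(1..7).each do |n|
  r = w = quorum(n)
  puts "N=#{n} quorum=#{quorum(n)} overlap=#{r + w > n}"
end
```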
Time & Idempotency
Order   Operation                    Time
  1     INSERT “Brian” into Users    1/10/2012 @11:15:00 EST
  2     DELETE from Users “Brian”    1/10/2012 @11:11:00 EST

Every mutation is an insert!
Latest timestamp wins. (Buyer beware!)
Why NoSQL for us?
Flexibility
A new data processing paradigm
  Instead of bringing the data to the processing
    (in and out of a relational database),
  bring the processing to the data:

       Processing   ->   Data
Batch Processing
Distributable
Scalable
Data Locality

 [Diagram: the JOB is shipped to the DATA — an HDFS ring of nodes A (T-A), H (B-G), S (I-R)]
Map / Reduce
tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]
Word Count
The Code
def map(doc)
  doc.split.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values)
  sum = values.inject { |sum, x| sum + x }
  emit(key, sum)
end

The Run
doc1 = “boy meets girl”
doc2 = “girl likes boy”

map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
map(doc2) -> (girl, 1), (likes, 1), (boy, 1)

reduce(boy, [1, 1]) -> (boy, 2)
reduce(girl, [1, 1]) -> (girl, 2)
reduce(likes, [1]) -> (likes, 1)
reduce(meets, [1]) -> (meets, 1)
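The run above can be reproduced end to end in plain Ruby — grouping the emitted pairs by key stands in for the framework’s shuffle phase (method names here are mine, not a framework API):

```ruby
# map: split a document into (word, 1) pairs.
def map_doc(doc)
  doc.split.map { |word| [word, 1] }
end

# reduce: sum the counts emitted for one key.
def reduce_counts(key, values)
  [key, values.inject(:+)]
end

pairs = map_doc("boy meets girl") + map_doc("girl likes boy")

# "Shuffle": group pairs by word, then reduce each group.
counts = pairs.group_by(&:first)
              .map { |word, kvs| reduce_counts(word, kvs.map(&:last)) }
              .to_h

puts counts.inspect   # {"boy"=>2, "meets"=>1, "girl"=>2, "likes"=>1}
```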
Putting it Together

 [Diagram: Storm sits at the center of the ring — nodes A (T-A), H (B-G), S (I-R)]
But...
We love Ruby!
 and it’s all in Java. :(


That’s okay,
  because
We love REST!
Why REST?

 [Diagram: Client -> ???]
Why Ruby?
Java
cassandra/examples/hadoop_word_count-> find . -name '*.java'
./src/WordCount.java
./src/WordCountCounters.java
./src/WordCountSetup.java
cassandra/examples/hadoop_word_count-> wc -l
      495


Ruby
virgil-1.0.5.1-SNAPSHOT/example-> wc -l wordcount.rb
        22 wordcount.rb
virgil-1.0.5.1-SNAPSHOT/example-> wc -l demo.sh
        22 demo.sh

90% reduction in code!
Virgil: REST Layer
         CRUD via HTTP
         Map/Reduce via HTTP

 [Diagram: Client -> Virgil -> Cassandra ring (nodes A, S, H) + Storm]
DEMO
Java Interoperability
Conventional Interoperability
 I/O Streams between processes



Hadoop Streaming
Storm Multilang
Better?
                             Use JRuby
                                 Single Process
                                 Parse Once / Eval Many

JSR 223
    ScriptEngine ENGINE = new ScriptEngineManager().getEngineByName("jruby");
    ScriptContext context = new SimpleScriptContext();
    Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
    bindings.put("variable", "value");
    ENGINE.eval(script, context);



Redbridge
    this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
    this.rubyReceiver = rubyContainer.runScriptlet(script);
    rubyContainer.callMethod(rubyReceiver, "foo", "value");
CRUD via HTTP
http://virgil/data/{keyspace}/{columnFamily}/{column}/{row}
                    PUT : Replaces Content of Row/Column
                    GET : Retrieves Value of a Row/Column
                    DELETE : Removes Value of a Row/Column


 [Diagram: curl -> Virgil -> Cassandra ring (nodes A, S, H)]
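Following the URL template above, the three verbs map directly onto Ruby’s Net::HTTP request classes. A sketch — the host, port, and the BeerGuys/Users names are hypothetical, and nothing is actually sent here:

```ruby
require 'net/http'
require 'uri'

# {keyspace}/{columnFamily}/{column}/{row}, per the template above.
uri = URI("http://localhost:8080/virgil/data/BeerGuys/Users/firstName/bonedog")

put = Net::HTTP::Put.new(uri)          # replace the column's value
put.body = "Brian"
get = Net::HTTP::Get.new(uri)          # retrieve the value
del = Net::HTTP::Delete.new(uri)       # remove the value

# Against a running Virgil instance you would send one with, e.g.:
#   Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(put) }
puts put.method, put.path
```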
Map/Reduce over HTTP
       wordcount.rb
def map(rowKey, columns)
    result = []
    columns.each do |column_name, value|
        words = value.split
        words.each do |word|
            result << [word, "1"]
        end
    end
    return result
end

def reduce(key, values)
    rows = {}
    total = 0
    columns = {}
    values.each do |value|
        total += value.to_i
    end
    columns["count"] = total.to_s
    rows[key] = columns
    return rows
end

 [Diagram: curl submits the job; map reads a column family in, reduce writes a column family out]
hydra = Typhoeus::Hydra.new
while(line = file.gets)
  body = "{ \"sentence\" : \"#{line}\" }"
  request = Typhoeus::Request.new("http://localhost:8080/virgil/data/dump/#{id}",
                                :body          => body,
                                :method        => :patch,
                                :headers       => {},
                                :timeout       => 5000, # milliseconds
                                :cache_timeout => 60, # seconds
                                :params        => {})

  request.on_complete do |response|
    if response.success?
      $processed=$processed+1
      if ($processed % 1000 == 0) then
         puts("Processed #{$processed} records.")
      end
    elsif response.timed_out?
      $time_outs=$time_outs+1
    elsif response.code == 0
      $faults=$faults+1
    else
      $failures=$failures+1
    end
  end

  hydra.queue request
end
hydra.run
Rails Integration?

 [Diagram: Rails apps -> Load Balancer -> Cassandra ring (nodes A, S, H), labeled “Data Processing”]

   “REST is the new JDBC”
Advanced Topics
  REAL-TIME PROCESSING
Real-time Processing
Deals with data streams
                                    Storm
          tuple   Bolt   tuple

  Spout                          Bolt
          tuple          tuple



          tuple
                  Bolt
  Spout                          Bolt
          tuple          tuple

                  Bolt
Ratch Processing
  (Combining Real-time and Batch)


Data Flows as:
 Cascading Map/Reduce jobs
 Storm Topologies?

Can’t we have one framework to rule
them all?
Appendix
Relational Storage
ACID
 Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
 Consistent: A transaction cannot leave the database in an inconsistent state.
 Isolated: Transactions cannot interfere with each other.
 Durable: Completed transactions persist, even when servers restart, etc.
Relational Storage
Benefits          Limitations
 Data Integrity   Static Schemas

 Ubiquity         Scalability
Indexing
Real-time Answers
Full-text queries
 Fuzzy Searching

Nickname analysis
Geospatial and Temporal Search
Storage Options
Queries / Flows


      Hive
Pig          Cascading
Indexing Options


Ruby on Big Data @ Philly Ruby Group

