Ruby on Big Data @ Philly Ruby Group
1. Ruby on Big Data
Brian O’Neill
Lead Architect, Health Market Science (HMS)
email: bone@alumni.brown.edu
blog: http://brianoneill.blogspot.com/
The views expressed herein are my own and do not necessarily reflect the views of HMS or other
organizations mentioned.
2. Agenda
Big Data Orientation
Cassandra
Hadoop
SOLR
Storm
DEMO
Java/Ruby Interoperability
Advanced Ideas
Rails Integration
Combining Real-time w/ Batch Processing (The Final Frontier)
3. “Big” Data
Size doesn’t always matter; it may be
what you’re doing with it
e.g. Natural-Language Processing
Flexibility was our major motivator
Data sources with disparate schema
4. Decomposing the Problem
Data: Storage, Indexing, Querying
Processing: Distributed, Batch, Real-time
10. Replication
Fault Tolerance
Written to next N nodes in ring.
Can be made datacenter aware.
e.g. md5(“Brian”) -> “S”
Written to Nodes A and L
[Diagram: a Client writing to a ring of nodes, with token ranges labeled (A-F), (G-L), (L-M), (N-S), (S-Z)]
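The "written to next N nodes in ring" idea can be sketched in a few lines of Ruby. This is a simplified illustration, not Cassandra's partitioner: the node names, their tokens, and the use of the first MD5 byte as the token are all assumptions made for the example.

```ruby
require 'digest/md5'

# Hypothetical ring: node names and tokens are made up for illustration.
class Ring
  def initialize(nodes)
    # Sort by token so that walking the array is walking the ring clockwise.
    @nodes = nodes.sort_by { |_name, token| token }
  end

  # Map a row key onto the ring using the first byte of its MD5 digest (0..255).
  def token_for(key)
    Digest::MD5.digest(key).bytes.first
  end

  # The first node whose token is >= the key's token takes the first copy;
  # the following nodes around the ring take the remaining copies.
  def replicas_for(key, replication_factor)
    token = token_for(key)
    start = @nodes.index { |_name, node_token| node_token >= token } || 0
    (0...replication_factor).map { |i| @nodes[(start + i) % @nodes.size].first }
  end
end

ring = Ring.new({ "node1" => 63, "node2" => 127, "node3" => 191, "node4" => 255 })
replicas = ring.replicas_for("Brian", 2)
puts "Brian -> #{replicas.inspect}"
```

Because placement depends only on the key's hash and the ring layout, any client can compute where a row lives without asking a central coordinator.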
11. Consistency Levels
ONE: 1st Response
QUORUM: N / 2 + 1 Replicas
ALL: All Replicas
Applies to both READ & WRITE
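The arithmetic behind the levels is small enough to write down. The sketch below uses hypothetical helper names (`required_responses`, `overlapping?`); the second helper encodes the general rule that a read is guaranteed to see the latest write when the read and write replica counts together exceed N.

```ruby
# Number of replicas that must respond for each consistency level,
# given a replication factor of n.
def required_responses(level, n)
  case level
  when :one    then 1
  when :quorum then n / 2 + 1 # integer division, so N=3 -> 2 and N=5 -> 3
  when :all    then n
  else raise ArgumentError, "unknown consistency level: #{level}"
  end
end

# True when every read set must share at least one replica with every write set.
def overlapping?(read_level, write_level, n)
  required_responses(read_level, n) + required_responses(write_level, n) > n
end

[3, 5].each do |n|
  puts "N=#{n}: QUORUM needs #{required_responses(:quorum, n)} replicas"
end
puts "QUORUM/QUORUM overlaps at N=3: #{overlapping?(:quorum, :quorum, 3)}"
puts "ONE/ONE overlaps at N=3: #{overlapping?(:one, :one, 3)}"
```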
12. Time & Idempotency
Order | Operation | Time
1 | INSERT “Brian” into Users | 1/10/2012 @11:15:00 EST
2 | DELETE from Users “Brian” | 1/10/2012 @11:11:00 EST
Every mutation is an insert!
Latest timestamp wins.
Buyer beware!
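The INSERT/DELETE example above can be replayed with a toy store. This is a minimal sketch under stated assumptions: `LastWriteWinsStore` and its tombstone marker are hypothetical names, and ties are resolved in favor of the write already stored, which is a choice made for the example rather than a statement about Cassandra.

```ruby
require 'time'

# Toy store: every mutation, including a delete, is recorded as a
# timestamped write, and the write with the latest timestamp wins.
class LastWriteWinsStore
  TOMBSTONE = :deleted

  def initialize
    @cells = {} # key => [value, timestamp]
  end

  def insert(key, value, timestamp)
    apply(key, value, timestamp)
  end

  # A delete is just an insert of a tombstone marker.
  def delete(key, timestamp)
    apply(key, TOMBSTONE, timestamp)
  end

  def get(key)
    value, _timestamp = @cells[key]
    value == TOMBSTONE ? nil : value
  end

  private

  # Keep the incoming write only if it is newer than what is stored.
  def apply(key, value, timestamp)
    current = @cells[key]
    @cells[key] = [value, timestamp] if current.nil? || timestamp > current[1]
  end
end

store = LastWriteWinsStore.new
store.insert("Brian", "user record", Time.parse("2012-01-10 11:15:00 -0500"))
store.delete("Brian", Time.parse("2012-01-10 11:11:00 -0500"))
puts store.get("Brian").inspect
```

The delete arrives second but carries the earlier timestamp, so the row survives. Replaying either operation any number of times leaves the store in the same state, which is what makes the mutations idempotent.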
13. Why NoSQL for us?
Flexibility
A new data processing paradigm
Instead of bringing the data to the processing
(In and Out of a relational database)
Do this:
Processing -> Data (bring the processing to the data)
14. Batch Processing
Distributable
Scalable
Data Locality
[Diagram: DATA and a JOB, with HDFS across nodes A (T-A), H (B-G), and S (I-R)]
26. Map/Reduce over HTTP
[Diagram: curl, nodes A, S, and H, CF in -> CF out]
wordcount.rb
    # Emit a [word, "1"] pair for every word in every column of the row.
    def map(rowKey, columns)
      result = []
      columns.each do |column_name, value|
        words = value.split
        words.each do |word|
          result << [word, "1"]
        end
      end
      return result
    end

    # Sum the counts for one word and return a row with a single "count" column.
    def reduce(key, values)
      rows = {}
      total = 0
      columns = {}
      values.each do |value|
        total += value.to_i
      end
      columns["count"] = total.to_s
      rows[key] = columns
      return rows
    end
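To see what the map and reduce functions are expected to produce without a cluster, the contract can be exercised locally. The sketch below is an assumption-laden stand-in: `WordCount.run_local` is a hypothetical in-memory driver, the input hash plays the role of the input column family, and the functions are condensed versions of the ones on the slide.

```ruby
module WordCount
  # Emit a [word, "1"] pair for every word in every column of the row.
  def self.map(_row_key, columns)
    columns.values.flat_map { |value| value.split.map { |word| [word, "1"] } }
  end

  # Fold all counts for one word into { word => { "count" => total } }.
  def self.reduce(key, values)
    { key => { "count" => values.sum(&:to_i).to_s } }
  end

  # Hypothetical local driver: map every row, group the emitted pairs by key,
  # then reduce each group and merge the resulting rows.
  def self.run_local(input_rows)
    emitted = input_rows.flat_map { |row_key, columns| map(row_key, columns) }
    grouped = emitted.group_by(&:first)
    grouped.each_with_object({}) do |(key, pairs), output|
      output.merge!(reduce(key, pairs.map(&:last)))
    end
  end
end

input = {
  "row1" => { "sentence" => "big data is big" },
  "row2" => { "sentence" => "ruby on big data" }
}
output = WordCount.run_local(input)
output.each { |word, columns| puts "#{word}: #{columns['count']}" }
```

Keeping values as strings on both sides mirrors the slide's code, where counts are emitted as "1" and written back with `to_s`.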
27. Typhoeus
    require 'json'
    require 'typhoeus'

    # `file` and `id` are defined elsewhere; they are not shown on the slide.
    # Counters updated by the completion callback below.
    $processed = $time_outs = $faults = $failures = 0

    hydra = Typhoeus::Hydra.new
    while (line = file.gets)
      # Assumes the intent was to send the current line as the sentence.
      body = { "sentence" => line.chomp }.to_json
      request = Typhoeus::Request.new("http://localhost:8080/virgil/data/dump/#{id}",
        :body => body,
        :method => :patch,
        :headers => {},
        :timeout => 5000, # milliseconds
        :cache_timeout => 60, # seconds
        :params => {})
      request.on_complete do |response|
        if response.success?
          $processed += 1
          puts("Processed #{$processed} records.") if $processed % 1000 == 0
        elsif response.timed_out?
          $time_outs += 1
        elsif response.code == 0
          $faults += 1
        else
          $failures += 1
        end
      end
      hydra.queue request
    end
    hydra.run
28. Rails Integration?
[Diagram: a Load Balancer in front of Data Processing on nodes A, S, and H]
“REST is the new JDBC”
31. Ratch Processing
(Combining Real-time and Batch)
Data Flows as:
Cascading Map/Reduce jobs
Storm Topologies?
Can’t we have one framework to rule
them all?
33. Relational Storage
ACID
Atomic: Everything in a transaction succeeds or the
entire transaction is rolled back.
Consistent: A transaction cannot leave the
database in an inconsistent state.
Isolated: Transactions cannot interfere with each
other.
Durable: Completed transactions persist, even when
servers restart etc.