Ruby on Big Data
                 Brian O’Neill
Lead Architect, Health Market Science (HMS)

                      email: bone@alumni.brown.edu
                      blog: http://brianoneill.blogspot.com/




     The views expressed herein are my own and do not necessarily reflect the views of HMS or other organizations mentioned.
Agenda
Big Data Orientation
  Cassandra
  Hadoop
  SOLR
  Storm

DEMO
Java/Ruby Interoperability
Advanced Ideas
  Rails Integration
  Combining Real-time w/ Batch Processing (The Final Frontier)
“Big” Data
Size doesn’t always matter; it may be what you’re doing with it
 e.g. Natural-Language Processing

Flexibility was our major motivator
 Data sources with disparate schema
Decomposing the Problem
Data          Processing
 Storage       Distributed
 Indexing      Batch
 Querying      Real-time
NoSQL Storage
BASE
 Basic Availability
 Soft-state
 Eventual consistency

Simple API
 REST + JSON
Heritage
Cassandra’s Data Model
Keyspaces
  Column Families
    Rows (sorted by KEY!)
      Columns (Name : Value)
Example
BeerGuys (Keyspace)
  Users (Column Families)
     bonedog (Row)
        firstName : Brian
        lastName : O’Neill
     lisa (Row)
        firstName : Lisa
        lastName : O’Neill
        maidenName : Kelley
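The example above can be pictured as nested Ruby hashes — a mental model of keyspace → column family → row → columns, not an actual client API:

```ruby
# Keyspace -> column family -> row key -> { column name => value }.
beer_guys = {
  "Users" => {
    "bonedog" => { "firstName" => "Brian", "lastName" => "O'Neill" },
    "lisa"    => { "firstName" => "Lisa", "lastName" => "O'Neill",
                   "maidenName" => "Kelley" }
  }
}

# Rows are sorted by key, columns by name.
puts beer_guys["Users"].keys.sort.inspect        # ["bonedog", "lisa"]
puts beer_guys["Users"]["lisa"]["maidenName"]    # Kelley
```

Note the sparse rows: lisa has a maidenName column, bonedog doesn’t — no schema change required.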
Cassandra Architecture
Ring Architecture
 Hash(key) -> Node
   e.g. md5(“Brian”) = “S”
 Written to Node A

 [Diagram: a client writes into a ring of nodes — A (N-Z), F (A-F), M (G-M); “S” falls in A’s range]
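The hash-to-node lookup can be sketched in a few lines — a toy ring keyed on the first hex digit of the MD5 hash (the node names and token ranges here are hypothetical; real partitioners use full 128-bit tokens):

```ruby
require 'digest/md5'

# Hypothetical token ranges: each node owns a slice of the hex digit space.
RING = { ("0".."5") => "A", ("6".."a") => "F", ("b".."f") => "M" }

def node_for(key)
  digit = Digest::MD5.hexdigest(key)[0]
  RING.each { |range, node| return node if range.cover?(digit) }
end

# The same key always hashes to the same node -- no lookup table needed.
puts node_for("Brian")
```

Because placement is pure arithmetic on the key, any node (or client) can route a request without consulting a master.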
Replication
Fault Tolerance
 Written to next N nodes in ring.
 Can be made datacenter aware.

 e.g. md5(“Brian”) -> “S”
 Written to Nodes A and L

 [Diagram: five-node ring — A (S-Z), F (A-F), L (G-L), M (L-M), S (N-S); the client’s write lands on the primary node and the next replicas around the ring]
Consistency Levels
 ONE      1st Response
 QUORUM   N / 2 + 1 Replicas
 ALL      All Replicas

          READ & WRITE
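The QUORUM rule above is plain integer arithmetic; a quick sketch of why quorum reads plus quorum writes always overlap (R + W > N), so a quorum read sees at least one replica holding the latest write:

```ruby
# Quorum size for N replicas, per the N / 2 + 1 rule (integer division).
def quorum(n)
  n / 2 + 1
end

# For any replication factor, a quorum write set and a quorum read set
# must intersect, because together they exceed N replicas.
(1..7).each do |n|
  r = w = quorum(n)
  puts "N=#{n} quorum=#{quorum(n)} overlap=#{r + w > n}"
end
```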
Time & Idempotency
Order   Operation                    Time
  1     INSERT “Brian” into Users    1/10/2012 @11:15:00 EST
  2     DELETE from Users “Brian”    1/10/2012 @11:11:00 EST

Every mutation is an insert!
Latest timestamp wins. (Buyer beware!)
Why NoSQL for us?
Flexibility
A new data processing paradigm
  Instead of bringing the data to the processing
    (in and out of a relational database),
  bring the processing to the data:

       Processing   ->   Data
Batch Processing
Distributable
Scalable
Data Locality

 [Diagram: the JOB is shipped to the DATA — an HDFS ring of nodes A (T-A), H (B-G), S (I-R)]
Map / Reduce
tuple = (key, value)
map(x) -> tuple[]
reduce(key, value[]) -> tuple[]
Word Count
The Code
def map(doc)
  doc.split.each do |word|
    emit(word, 1)
  end
end

def reduce(key, values)
  sum = values.inject { |sum, x| sum + x }
  emit(key, sum)
end

The Run
doc1 = “boy meets girl”
doc2 = “girl likes boy”

map(doc1) -> (boy, 1), (meets, 1), (girl, 1)
map(doc2) -> (girl, 1), (likes, 1), (boy, 1)

reduce(boy, [1, 1]) -> (boy, 2)
reduce(girl, [1, 1]) -> (girl, 2)
reduce(likes, [1]) -> (likes, 1)
reduce(meets, [1]) -> (meets, 1)
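The run above can be reproduced end to end in plain Ruby — grouping the emitted pairs by key stands in for the framework’s shuffle phase (method names here are mine, not a framework API):

```ruby
# map: split a document into (word, 1) pairs.
def map_doc(doc)
  doc.split.map { |word| [word, 1] }
end

# reduce: sum the counts emitted for one key.
def reduce_counts(key, values)
  [key, values.inject(:+)]
end

pairs = map_doc("boy meets girl") + map_doc("girl likes boy")

# "Shuffle": group pairs by word, then reduce each group.
counts = pairs.group_by(&:first)
              .map { |word, kvs| reduce_counts(word, kvs.map(&:last)) }
              .to_h

puts counts.inspect   # {"boy"=>2, "meets"=>1, "girl"=>2, "likes"=>1}
```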
Putting it Together

 [Diagram: Storm sits at the center of the ring — nodes A (T-A), H (B-G), S (I-R)]
But...
We love Ruby!
 and it’s all in Java. :(


That’s okay,
  because
We love REST!
Why REST?

 [Diagram: Client -> ???]
Why Ruby?
Java
cassandra/examples/hadoop_word_count-> find . -name '*.java'
./src/WordCount.java
./src/WordCountCounters.java
./src/WordCountSetup.java
cassandra/examples/hadoop_word_count-> wc -l
      495


Ruby
virgil-1.0.5.1-SNAPSHOT/example-> wc -l wordcount.rb
        22 wordcount.rb
virgil-1.0.5.1-SNAPSHOT/example-> wc -l demo.sh
        22 demo.sh

90% reduction in code!
Virgil: REST Layer
         CRUD via HTTP
         Map/Reduce via HTTP

 [Diagram: Client -> Virgil -> Cassandra ring (nodes A, S, H) + Storm]
DEMO
Java Interoperability
Conventional Interoperability
 I/O Streams between processes



Hadoop Streaming
Storm Multilang
Better?
                             Use JRuby
                                 Single Process
                                 Parse Once / Eval Many

JSR 223
    ScriptEngine ENGINE = new ScriptEngineManager().getEngineByName("jruby");
    ScriptContext context = new SimpleScriptContext();
    Bindings bindings = context.getBindings(ScriptContext.ENGINE_SCOPE);
    bindings.put("variable", "value");
    ENGINE.eval(script, context);



Redbridge
    this.rubyContainer = new ScriptingContainer(LocalContextScope.CONCURRENT);
    this.rubyReceiver = rubyContainer.runScriptlet(script);
    rubyContainer.callMethod(rubyReceiver, "foo", "value");
CRUD via HTTP
http://virgil/data/{keyspace}/{columnFamily}/{column}/{row}
                    PUT : Replaces Content of Row/Column
                    GET : Retrieves Value of a Row/Column
                    DELETE : Removes Value of a Row/Column


 [Diagram: curl -> Virgil -> Cassandra ring (nodes A, S, H)]
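Following the URL template above, the three verbs map directly onto Ruby’s Net::HTTP request classes. A sketch — the host, port, and the BeerGuys/Users names are hypothetical, and nothing is actually sent here:

```ruby
require 'net/http'
require 'uri'

# {keyspace}/{columnFamily}/{column}/{row}, per the template above.
uri = URI("http://localhost:8080/virgil/data/BeerGuys/Users/firstName/bonedog")

put = Net::HTTP::Put.new(uri)          # replace the column's value
put.body = "Brian"
get = Net::HTTP::Get.new(uri)          # retrieve the value
del = Net::HTTP::Delete.new(uri)       # remove the value

# Against a running Virgil instance you would send one with, e.g.:
#   Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(put) }
puts put.method, put.path
```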
Map/Reduce over HTTP
       wordcount.rb
def map(rowKey, columns)
    result = []
    columns.each do |column_name, value|
        words = value.split
        words.each do |word|
            result << [word, "1"]
        end
    end
    return result
end

def reduce(key, values)
    rows = {}
    total = 0
    columns = {}
    values.each do |value|
        total += value.to_i
    end
    columns["count"] = total.to_s
    rows[key] = columns
    return rows
end

 [Diagram: curl submits the job; map reads a column family in, reduce writes a column family out]
hydra = Typhoeus::Hydra.new
while(line = file.gets)
  body = "{ \"sentence\" : \"#{line}\" }"
  request = Typhoeus::Request.new("http://localhost:8080/virgil/data/dump/#{id}",
                                :body          => body,
                                :method        => :patch,
                                :headers       => {},
                                :timeout       => 5000, # milliseconds
                                :cache_timeout => 60, # seconds
                                :params        => {})

  request.on_complete do |response|
    if response.success?
      $processed=$processed+1
      if ($processed % 1000 == 0) then
         puts("Processed #{$processed} records.")
      end
    elsif response.timed_out?
      $time_outs=$time_outs+1
    elsif response.code == 0
      $faults=$faults+1
    else
      $failures=$failures+1
    end
  end

  hydra.queue request
end
hydra.run
Rails Integration?

 [Diagram: Rails apps -> Load Balancer -> Cassandra ring (nodes A, S, H), labeled “Data Processing”]

   “REST is the new JDBC”
Advanced Topics
  REAL-TIME PROCESSING
Real-time Processing
Deals with data streams
                                    Storm
          tuple   Bolt   tuple

  Spout                          Bolt
          tuple          tuple



          tuple
                  Bolt
  Spout                          Bolt
          tuple          tuple

                  Bolt
Ratch Processing
  (Combining Real-time and Batch)


Data Flows as:
 Cascading Map/Reduce jobs
 Storm Topologies?

Can’t we have one framework to rule
them all?
Appendix
Relational Storage
ACID
 Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
 Consistent: A transaction cannot leave the database in an inconsistent state.
 Isolated: Transactions cannot interfere with each other.
 Durable: Completed transactions persist, even when servers restart, etc.
Relational Storage
Benefits          Limitations
 Data Integrity   Static Schemas

 Ubiquity         Scalability
Indexing
Real-time Answers
Full-text queries
 Fuzzy Searching

Nickname analysis
Geospatial and Temporal Search
Storage Options
Queries / Flows


      Hive
Pig          Cascading
Indexing Options


Ruby on Big Data @ Philly Ruby Group

