Introduction to Streaming Distributed Processing with Storm

Introduces streaming data concepts, Storm cluster architecture, and Storm topology architecture, and demonstrates a working example of a WordCount topology. Presented at the SIGKDD Seattle chapter meetup.

Presented by Brandon O'Brien
Code example: https://github.com/OpenDataMining/brandonobrien
Meetup: http://www.meetup.com/seattlesigkdd/events/222955114/

Published in: Data & Analytics

  1. Introduction to Streaming Distributed Processing with Storm
     Presenter: Brandon O’Brien, Data Engineer @ Expedia
  2. Outline
     • Distributed Systems & Batch Processing
     • Streaming Processing. Introducing Storm
     • WordCount Demo & Setup
     • Storm Cluster Architecture
     • Storm Topology Architecture
     • WordCount Deep Dive
     • Discussion and Q&A: Storm Use Cases & Patterns
  3. Distributed Systems
     • Distribute work across N nodes
     • Hadoop ecosystem
     • Batch processing
     • Massively parallel (horizontal scale-out)
     • Problems: data latency, 24-hour batching vs. a global client base
     • What’s next? Increasing need to move to real-time and streaming processing models
  4. Streaming Processing
     • Provides near-real-time views into analytical data sets and system status
     • Allows real-time intervention and response to events
     • Streaming frameworks: Spark, Azure Stream Analytics, AWS Kinesis + Lambda, Storm
     • Storm was created by Nathan Marz and first used at Twitter
     • Storm: “Doing for realtime processing what Hadoop did for batch processing”
     • Stream definition: “unbounded sequence of tuples”
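
The stream definition on this slide ("unbounded sequence of tuples") can be illustrated with a small plain-Java sketch. It does not use Storm's API: the class `UnboundedStreamSketch`, the event names, and the `consume` helper are all hypothetical and only model the idea that a consumer processes tuples one at a time from a source that never ends, rather than waiting for a complete batch.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Plain-Java model of a stream; this is NOT Storm's API.
class UnboundedStreamSketch {

    // Hypothetical source that never runs out: it cycles over a few events forever.
    static Iterator<String> endlessSource() {
        final String[] events = {"login", "search", "login", "booking"};
        return new Iterator<String>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return true; // unbounded: there is always another tuple
            }

            @Override
            public String next() {
                String event = events[i];
                i = (i + 1) % events.length;
                return event;
            }
        };
    }

    // Consume n tuples one at a time, updating running counts after each tuple.
    static Map<String, Integer> consume(Iterator<String> stream, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int k = 0; k < n && stream.hasNext(); k++) {
            counts.merge(stream.next(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Iterator<String> stream = endlessSource();
        // The caller decides when to look at results; the source itself never finishes.
        System.out.println(consume(stream, 4));
        System.out.println(consume(stream, 2));
    }
}
```

Because the source never signals completion, any result is a snapshot taken at some point in time, which is the practical difference from a batch job that runs over a finite input.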
  5. Storm WordCount Demo
     • WordCount Storm topology
       • Streams text blobs
       • Counts word occurrences
       • Reports results every 10 seconds
     • Getting it running
       https://github.com/OpenDataMining/brandonobrien
       mvn clean install exec:java -Dexec.mainClass="dataclub.storm.TokenCountingTopology"
  6. Storm Cluster Architecture
     • Core components:
       • ZooKeeper
       • Nimbus
       • Supervisors
       • Workers/JVM
       • Executor/thread
       • Component/task (bolts & spouts)
     • Scalability: supervisors can be added while topologies are running; no code change required
     • Supervisors run Worker JVMs
     • Workers run Executor threads
     • Executors run Tasks (instances of Spouts and Bolts)
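
The nesting described on this slide (workers run executors, executors run tasks) can be sketched in plain Java. This is only an illustration of the hierarchy: the round-robin `spread` helper and the sizes used in `main` are assumptions made for the example, and Storm's real scheduler is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the worker -> executor -> task nesting; not Storm's scheduler.
class ClusterLayoutSketch {

    // Spread item ids 0..items-1 over `slots` buckets in round-robin order.
    static List<List<Integer>> spread(int items, int slots) {
        if (slots <= 0) {
            throw new IllegalArgumentException("slots must be positive");
        }
        List<List<Integer>> buckets = new ArrayList<>();
        for (int s = 0; s < slots; s++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < items; i++) {
            buckets.get(i % slots).add(i);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Hypothetical sizes chosen for the example.
        int workers = 2;
        int executors = 4;
        int tasks = 8;

        List<List<Integer>> executorsPerWorker = spread(executors, workers);
        List<List<Integer>> tasksPerExecutor = spread(tasks, executors);

        for (int w = 0; w < workers; w++) {
            System.out.println("worker JVM " + w);
            for (int e : executorsPerWorker.get(w)) {
                System.out.println("  executor thread " + e
                        + " runs tasks " + tasksPerExecutor.get(e));
            }
        }
    }
}
```

Changing only the `workers` or `executors` numbers changes the layout without touching the task logic, which mirrors the slide's point that capacity can be added without code changes.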
  7. Storm Topology Architecture
     • DAG processing model (Directed Acyclic Graph)
     • Components: Spout & Bolt (benefit: decouple logic from scalability)
     • Tasks (instances of Spouts & Bolts)
     • Executors (run Tasks)
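
To make the DAG processing model concrete, here is a minimal plain-Java sketch that stores components as nodes, connections as directed edges, and uses Kahn's algorithm to produce a processing order or reject a cycle. The class `TopologyDagSketch` and the component names are hypothetical; this is not how Storm represents topologies internally.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a topology as a directed acyclic graph of named components.
class TopologyDagSketch {
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    void addComponent(String name) {
        edges.putIfAbsent(name, new ArrayList<>());
    }

    // Declare that tuples flow from one component to another.
    void connect(String from, String to) {
        addComponent(from);
        addComponent(to);
        edges.get(from).add(to);
    }

    // Kahn's algorithm: returns an order in which every component comes after
    // all of its upstream components, or throws if the graph contains a cycle.
    List<String> processingOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String node : edges.keySet()) {
            inDegree.put(node, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey()); // sources, i.e. spout-like components
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.poll();
            order.add(node);
            for (String target : edges.get(node)) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        TopologyDagSketch topology = new TopologyDagSketch();
        topology.connect("sentence-spout", "tokenizer-bolt");
        topology.connect("tokenizer-bolt", "counter-bolt");
        System.out.println(topology.processingOrder());
    }
}
```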
  8. Storm WordCount Deep Dive
     • Topology structure
     • Classes
       • Spout: SentenceProducer.java
       • Bolt: SentenceTokenizer.java
       • Bolt: TokenCounter.java
     • Putting it all together: TokenCountingTopology.java
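
The three stages named on this slide (produce sentences, tokenize, count) can be approximated without Storm as a simple in-process pipeline. The sketch below is not the code from the linked repository and does not use Storm's spout or bolt interfaces; the class `WordCountPipelineSketch`, its methods, and the sample sentences are assumptions that only mirror the roles of the stages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java approximation of the WordCount stages; NOT the repository's Storm code.
class WordCountPipelineSketch {

    // Stand-in for the spout role: emits sentences (finite here, for the demo).
    static List<String> produceSentences() {
        return Arrays.asList(
                "storm processes streams",
                "streams of tuples",
                "storm counts tuples");
    }

    // Stand-in for the tokenizer bolt role: one sentence in, many tokens out.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for the counter bolt role: keeps a running count per token.
    static class Counter {
        private final Map<String, Integer> counts = new TreeMap<>();

        void accept(String token) {
            counts.merge(token, 1, Integer::sum);
        }

        Map<String, Integer> snapshot() {
            return new TreeMap<>(counts);
        }
    }

    // Wire the stages together in the same order as the topology.
    static Map<String, Integer> run(List<String> sentences) {
        Counter counter = new Counter();
        for (String sentence : sentences) {
            for (String token : tokenize(sentence)) {
                counter.accept(token);
            }
        }
        return counter.snapshot();
    }

    public static void main(String[] args) {
        System.out.println(run(produceSentences()));
    }
}
```

In a real topology each stage would run as separate tasks connected by streams, so the nested loops here would be replaced by tuples emitted from one component and received by the next.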
  9. Storm Use Cases & Patterns
     • Consume data from Kafka, Kinesis, or another queue
     • Persist data to a datastore with high write performance, such as Cassandra
     • Streaming map reduce, multi-stage map reduce
     • Storm is stateless and fail-fast; externalize state using Redis or another cache for resiliency
     • Online learning / real-time model updates (using frameworks like WEKA or others)
     • Real-world use cases: real-time ad targeting, travel market analytics, user behavior analytics, system monitoring & SLA
     • Storm multi-lang API (Python, Ruby, Perl, JavaScript, Scala, and more)
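
The "externalize state" pattern from this slide can be sketched as a counter that holds no counts in its own fields and writes every update to a store that outlives it. Everything here is hypothetical: `CounterStore` is an invented interface, and `InMemoryStore` is an in-memory stand-in for an external cache such as Redis, not a client for any real service.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of keeping state outside a fail-fast component; not Storm or Redis API.
class ExternalStateSketch {

    // Hypothetical minimal interface to an external key/counter store.
    interface CounterStore {
        long increment(String key);

        long get(String key);
    }

    // In-memory stand-in for an external cache; in practice this would live
    // in a separate process so it survives a worker crash.
    static class InMemoryStore implements CounterStore {
        private final Map<String, Long> data = new HashMap<>();

        @Override
        public long increment(String key) {
            return data.merge(key, 1L, Long::sum);
        }

        @Override
        public long get(String key) {
            return data.getOrDefault(key, 0L);
        }
    }

    // Stateless counter: the only field is a handle to the external store.
    static class StatelessCounter {
        private final CounterStore store;

        StatelessCounter(CounterStore store) {
            this.store = store;
        }

        void onToken(String token) {
            store.increment(token);
        }
    }

    public static void main(String[] args) {
        CounterStore store = new InMemoryStore();

        StatelessCounter first = new StatelessCounter(store);
        first.onToken("storm");
        first.onToken("storm");

        // Simulate a fail-fast crash and restart: the old instance is discarded
        // and a fresh one is created against the same store.
        StatelessCounter restarted = new StatelessCounter(store);
        restarted.onToken("storm");

        System.out.println(store.get("storm"));
    }
}
```

Because the counter instance can be thrown away at any time, a restarted component resumes from the store's values instead of starting from zero.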
  10. Distributed Streaming Processing with Storm
      • Going further
        • https://storm.apache.org/
        • http://storm.apache.org/documentation/Common-patterns.html
        • Frameworks: Trident, Summingbird
        • Stand up a Storm cluster: http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
      • Contact: Brandon O’Brien, Data Engineer @ Expedia, https://www.linkedin.com/in/brandonjobrien
      • Q&A