O slideshow foi denunciado.
Seu SlideShare está sendo baixado. ×

Cassandra and Storm at Health Market Sceince

Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Próximos SlideShares
Storm
Storm
Carregando em…3
×

Confira estes a seguir

1 de 36 Anúncio

Mais Conteúdo rRelacionado

Diapositivos para si (20)

Quem viu também gostou (12)

Anúncio

Semelhante a Cassandra and Storm at Health Market Sceince (20)

Mais recentes (20)

Anúncio

Cassandra and Storm at Health Market Sceince

  1. 1. P. Taylor Goetz Development Lead, Health Market Science tgoetz@healthmarketscience.com @ptgoetz
  2. 2. Agenda  Cassandra @ HMS  Storm Overview  Storm-Cassandra  Examples / Demo  Future : Trident
  3. 3. Our Products  Master Data Management  Good, bad doctors?  Prescriber eligibility and remediation.
  4. 4. Cassandra to the Rescue 1000’s of Feeds C* Masterfile Big Data for us == Variety of Data Δt
  5. 5. But…  Search unstructured data  Real-time Analytics / Reporting  Transactional Processing  Changes reflected immediately.  Wide-row Indexes
  6. 6. What might that look like? wide-row index C* I’m happy RDBMS Provide for Polyglot Persistence
  7. 7. What we did wrong…  Could not react to transactional changes  Needed extra logic to track what changed  Took too long
  8. 8. C* at HMS  Load, Standardize, Match, Consolidate  Write results to C*  Track Changes over Time  Practitioner Data  Feed Quality
  9. 9. C* at HMS  Best Practices  Prefer Write over Read (esp. Read before Write)  Avoid Queries/Scans ○ Fetch by key whenever possible ○ Put Comparators to work ○ Pre-compute whenever possible
  10. 10. How?  Treat All Data as Immutable  Updates are inserts with new version/timestamp  Data Model  Heavy use of composites  Timestamps/Versions in Keys  Treat Feeds as Real-Time Streams
  11. 11. What Storm is to us… Crud Op ETL Dimensional Enrichment Counts Fuzzy SoR RDBMS Index A High Throughput Data Processing Pipeline
  12. 12. Storm Overview  Open-Sourced by Twitter in 2011  Distributed Realtime Computation System  Fault Tolerant  Highly Scalable  Guaranteed Processing  Operates on one or more streams of data
  13. 13. Anatomy of a Storm Cluster  Nimbus  Master Node  Zookeeper  Cluster Coordination  Supervisors  Worker Nodes
  14. 14. Storm Primatives  Streams  Unbounded sequence of tuples  Spouts  Stream Sources  Bolts  Unit of Computation  Topologies  Combination of n Spouts and n Bolts  Defines the overall “Computation”
  15. 15. Storm Spouts  Represents a source (stream) of data  Queues (JMS, Kafka, Kestrel, etc.)  Twitter Firehose  Sensor Data  Emits “Tuples” (Events) based on source  Primary Storm data structure  Set of Key-Value pairs
  16. 16. Storm Bolts  Receive Tuples from Spouts or other Bolts  Operate on, or React to Data  Functions/Filters/Joins/Aggregations  Database writes/lookups  Optionally emit additional Tuples
  17. 17. Storm Topologies  Data flow between spouts and bolts  Routing of Tuples between spouts/bolts  Stream “Groupings”  Parallelism of Components  Long-Lived
  18. 18. Storm Topologies
  19. 19. Storm and Cassandra  Use Cases:  Write Storm Tuple data to C* ○ Computation Results ○ Pre-computed indices  Read data from C* and emit Storm Tuples ○ Dynamic Lookups http://github.com/hmsonline/storm-cassandra
  20. 20. Storm Cassandra Bolt Types CassandraBolt Cassandra LookupBolt C*  CassandraBolt  Writes data to Cassandra  Available in Batching and Non-Batching  CassandraLookupBolt  Reads data from Cassandra http://github.com/hmsonline/storm-cassandra
  21. 21. Storm-Cassandra Project  Provides generic Bolts for writing/reading Storm Tuples to/from C* Tuple Tuple Mapper Rows Tuples Columns Mapper Columns C* http://github.com/hmsonline/storm-cassandra
  22. 22. Storm-Cassandra Project  TupleMapper Interface  Tells the CassandraBolt how to write a tuple to an arbitrary data model  Given a Storm Tuple:  Map to Column Family  Map to Row Key  Map to Columns http://github.com/hmsonline/storm-cassandra
  23. 23. Storm-Cassandra Project  ColumnsMapper Interface  Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple  Given a C* Row Key and list of Columns:  Return a list of Storm Tuples http://github.com/hmsonline/storm-cassandra
  24. 24. Storm-Cassandra Project  Current State:  Version 0.4.0  Uses Astyanax Client  Several out-of-the-box *Mapper Implementations: ○ Basic Key-Value Columns ○ Value-less Columns ○ Counter Columns ○ Lookup by row key ○ Lookup by range query  Composite Key/Column Support  Trident support http://github.com/hmsonline/storm-cassandra
  25. 25. Storm-Cassandra Project  Future Plans:  Switch to CQL  Enhanced Trident Support http://github.com/hmsonline/storm-cassandra
  26. 26. Word Count Demo http://github.com/hmsonline/storm-cassandra
  27. 27. DRPC
  28. 28. Reach Demo
  29. 29. Trident  Provides a higher-level abstraction for stream processing  Constructs for state management and Batching  Adds additional primitives that abstract away common topological patterns  Deprecates transactional topologies  Distributes with Storm
  30. 30. Sample Trident Operations  Partition Local  Functions ( execute(x)  x + y )  Filters ( isKeep(x)  0,x )  PartitionAggregate ○ Combiner ( pairwise combining ) ○ Reducer ( iterative accumulation ) ○ Aggregator ( byoa)
  31. 31. A sample topology TridentTopology topology = new TridentTopology(); TridentStatewordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate( MemcachedState.opaque(serverLocations), new Count(), new Fields("count")) .parallelismHint(6); https://github.com/nathanmarz/storm/wiki/Trident-state
  32. 32. Trident State Sequenced writes by batch/transaction id.  Spouts  Transactional ○ Batch contents never change  Opaque ○ Batch contents can change  State  Transactional ○ Store tx_id with counts to maintain sequencing of writes.  Opaque ○ Store previous value in order to overwrite the current value when contents of a batch change.
  33. 33. Shameless Shoutouts  HMS (https://github.com/hmsonline/)  storm-cassandra  storm-elastic-search  storm-jdbi (coming soon)  ptgoetz (https://github.com/ptgoetz)  storm-jms  storm-signals
  34. 34. P. Taylor Goetz Development Lead, Health Market Science tgoetz@healthmarketscience.com @ptgoetz

×