Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign Management for Globally Distributed Data Flows

  1. @helenaedelson #kafkasummit Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign Management for Globally Distributed Data Flows. Helena Edelson, @helenaedelson, Kafka Summit 2016
  2. VP of Engineering, Tuplejump. Previously: Sr Cloud / Big Data / Analytics Engineer: DataStax, CrowdStrike, VMware, SpringSource... Event-Driven systems, Analytics, Machine Learning, Scala. Committer: Kafka Connect Cassandra, Spark Cassandra Connector. Contributor: Akka, previously: Spring Integration. Speaker: Kafka Summit, Spark Summit, Strata, QCon, Scala Days, Scala World, Philly ETE. twitter.com/helenaedelson github.com/helena slideshare.net/helenaedelson
  3. The Real Topic: http://www.slideshare.net/palvaro/ricon-keynote-outwards-from-the-middle-of-the-maze/42
  4. Chaos Of Distribution: One of the more fascinating problems is that of solving the chaos of distributed systems, regardless of the domain.
  5. Approaching this within the use case of: High-Level Landscape; Platform & Infrastructure; Strategies and Patterns; Four-Letter Acronyms Can't Touch This Architecture
  6. The Landscape
  7. The Digital Ad Industry
  8. An RTB Drive-By: Real time auction for ad spaces, all devices; High throughput, low latency (similar to FinTech but not quite); OpenRTB API Spec, but not everyone uses it: an open protocol for automated trading of digital media across platforms, devices, and advertising solutions
  9. Ad Delivered to User In A Nutshell: User hits a Publisher's page; Advertisers send Bid Requests; Highest Bid Accepted
  10. RTB Auction for Impressions. Site: Ad supported content. Real Time Exchange & Auction (SSP): OpenRTB Server used to bid. Bidder Service (DSP): OpenRTB client. Advertiser: Buyer wants ad impressions, uses bidders to bid on its behalf. Publisher: Seller has ad spaces to sell to highest bidders. User Devices. Diagram flow labels: ad request, winning ad, bid request, win notice & settlement price, insert orders, bid response, winning ad
  11. Time Is Money: RTB has a maximum response latency of 100 ms
  12. Time Is Money: Assume some network latency!
  13. Sampling of RTB Events: Ad Request; Bid Request (JSON, 100 bytes); Compute optimal bid for advertiser; Bid Response (JSON, 1000 bytes, may include ad metadata); Win Notification (may or may not exist) with settlement price; Ad Impression (when the ad is viewed); Ad Click; Ad Conversion
  14. Event Streams. Auctions: auction data + bid requests. Ad Impressions: which ad ids were shown. Ad Clicks: which auction ids resulted in a click. Ad Conversions: streams joined on auction id. Analytics Aggregations & ML to derive hundreds of metrics and dimensions
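The "streams joined on auction id" idea can be sketched in a few lines of plain Scala. This is only an in-memory illustration: the `Impression` and `Click` shapes, their field names, and the `AuctionJoin` helper are assumptions made for the example, not types from the talk, and a production system would perform the join in a stream processor rather than over finite collections.

```scala
// Hypothetical event shapes; field names are assumptions, not from the talk.
final case class Impression(auctionId: String, adId: String)
final case class Click(auctionId: String)

object AuctionJoin {
  /** Pairs each impression with whether its auction id also appears in the click stream. */
  def joinOnAuctionId(impressions: Seq[Impression], clicks: Seq[Click]): Seq[(Impression, Boolean)] = {
    val clickedAuctions: Set[String] = clicks.map(_.auctionId).toSet
    impressions.map(imp => (imp, clickedAuctions.contains(imp.auctionId)))
  }

  /** Click-through rate per ad id: clicked impressions divided by total impressions. */
  def ctrByAd(impressions: Seq[Impression], clicks: Seq[Click]): Map[String, Double] =
    joinOnAuctionId(impressions, clicks)
      .groupBy { case (imp, _) => imp.adId }
      .map { case (adId, rows) => adId -> rows.count(_._2).toDouble / rows.size }
}
```

Keying both streams by auction id is what makes derived metrics such as click-through rate a simple lookup plus a group-by.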
  15. Real Time: Just means event-driven, or processing events as they arrive. Does not automatically equal sub-second latency requirements. Seen / Ingestion Time: when an event is ingested into the system. Event Time: when an event is created, e.g. on a device.
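To make the distinction between the two timestamps concrete, here is a minimal sketch. The `TimedEvent` envelope and its field names are assumptions for illustration only; the slide names the concepts but not a data model.

```scala
import java.time.{Duration, Instant}

// Hypothetical event envelope; the field names are assumptions for illustration.
final case class TimedEvent(id: String, eventTime: Instant, ingestionTime: Instant) {
  /** Delay between creation on the device and arrival in the system. */
  def ingestionLag: Duration = Duration.between(eventTime, ingestionTime)
}

object EventTimeDemo {
  /** Orders events by when they happened rather than when they were seen. */
  def byEventTime(events: Seq[TimedEvent]): Seq[TimedEvent] =
    events.sortBy(_.eventTime.toEpochMilli)
}
```

An event created first can still arrive last, so any aggregation that cares about what actually happened on the device would sort or window by `eventTime`, not by arrival order.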
  16. The Platform
  17. Platform Requirements: 24 / 7 Uptime (Brokerage model: DSPs only make $ on successful ad deliveries, so uptime is critical); Security; Enable service across the globe; Handle thousands of concurrent requests per second; Scale to traffic of 700TB per day; Manage 700TB per day of data; Derive Metrics
  18. Business Requirements: Support SLAs for bid transactions; Legal constraints (user data crossing borders); The critical path must be fast to win; No data loss on ingestion path; Bid & Campaign Optimization; Frequency Capping; Management UI for Publishers & Advertisers
  19. Questions To Answer: % Writes on ingestion, analytics pre-aggregation, etc.; % Reads of raw data by analytics, aggregated views by customer management UI; How much in memory on RTB app nodes?; Dimensions of data in analytics queries; Optimization Algos; What needs real time feedback loops, what does not; Which data flows are low-latency/high frequency, which not; Where are potential bottlenecks
  20. Constraints: Resources (I need to build highly functioning teams that are psyched about the work and working together); Budget; Cloud Resources; JDK Version (What?!); Existing infrastructure & technologies that will be replaced later but you have to deal with now :( Pro Tip: Pay well, allow people to grow & be creative
  21. Strategies To Avoid
  22. Beware of the C word: Consistency? Convergence?
  23. http://www.slideshare.net/palvaro/ricon-keynote-outwards-from-the-middle-of-the-maze/39 (he went there, @palvaro)
  24. Complexity: Can't Ops your way out of that
  25. Occam's razor: Simpler theories are preferable to more complex ones
  26. Strategies
  27. Approaches: Eventual/Tunable consistency; Time & Clocks in globally-distributed systems; Location Transparency; Asynchrony; Pub-Sub; Design for scale; Design for Failure
  28. Kafka as Platform Fabric
  29. From MVP to Scalable with Kafka. Scalpel... Separate The Monolith: Microservices (Does One Thing, Knows One Thing); Separate low-latency hot path; Separate deploy artifacts; Separate data mgmt clusters by concern (analytics, timeseries, etc.); CQRS: Separate Read and Write paths
  30. Services communicate indirectly via Kafka: Immutable events stream to Kafka, partitioned by event type, time, etc. Subscribers & Publishers: RTB microservices - receives raw, receives; Analytics cluster - receives raw, publishes aggregates; Management / Reporting nodes
  31. CQRS: Command Query Responsibility Segregation. Decouple Write streams from Read streams; Different schemas / data structures; Writers (Publishers) publish without awareness of who needs to receive it or how to reach them (location, protocol...); Readers (Subscribers) should be able to subscribe and asynchronously receive from topics of interest
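The write/read split described on the CQRS slide can be sketched as follows. Everything here is hypothetical: an in-memory `EventLog` stands in for a Kafka topic, and `BidEvent` and `SpendView` are names invented for the example. The point is only that the writer appends immutable events without knowing its readers, while each reader keeps its own offset and its own query-shaped data structure.

```scala
// All names here are hypothetical; an in-memory log stands in for a Kafka topic.
final case class BidEvent(campaignId: String, priceMicros: Long)

/** Write side: append-only, knows nothing about who reads. */
final class EventLog {
  private var events = Vector.empty[BidEvent]
  def publish(e: BidEvent): Unit = events = events :+ e
  def readFrom(offset: Int): Vector[BidEvent] = events.drop(offset)
}

/** Read side: tracks its own offset and keeps a schema optimized for its queries. */
final class SpendView(log: EventLog) {
  private var offset = 0
  private var spend = Map.empty[String, Long]

  /** Pulls events published since the last call and folds them into the aggregate. */
  def catchUp(): Unit = {
    val fresh = log.readFrom(offset)
    fresh.foreach { e =>
      spend = spend.updated(e.campaignId, spend.getOrElse(e.campaignId, 0L) + e.priceMicros)
    }
    offset += fresh.size
  }

  def spendFor(campaignId: String): Long = spend.getOrElse(campaignId, 0L)
}
```

Because the view only advances its own offset, it is eventually consistent with the log: reads may lag writes, but replaying never double-counts.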
  32. [Architecture diagram] Eventually Consistent Across DCs: US-East-1 and EU-west-1, each with a Kafka Cluster Per Region (with ZK), linked by MirrorMaker; RTB micro services and Mgmt micro services as Publishers and Subscribers; Query Layer; Compute Layer with an Analytics & ML Cluster and a Timeseries Cluster (Spark Streaming & ML on Cassandra, Cross DC Replication, Topology Aware)
  33. [Architecture diagram] Eventually Consistent Across DCs: the same topology with C* added: US-East-1 and EU-west-1, Kafka Cluster Per Region linked by MirrorMaker; RTB micro services and Mgmt micro services as Publishers and Subscribers; Query Layer; Compute Layer with an Analytics & ML Cluster and a Timeseries Cluster (Spark Streaming & ML on Cassandra, Cross DC Replication, Topology Aware)
  34. Kafka Cross Datacenter Mirroring: Publish messages from various datacenters around the world
      bin/kafka-run-class.sh kafka.tools.MirrorMaker \
        --consumer.config config/consumer_source_cluster.properties \
        --producer.config config/producer_target_cluster.properties \
        --whitelist bidrequests \
        --num.producers 2 \
        --num.streams 4
  35. Cassandra Cross DC Replication: It's out of the box. Multi-region live backups for free: [ NetworkTopologyStrategy ]. Users in the US and UK connect to DCs in their geo region for lower latency; Both DCs are part of the same cluster for X-DC Replication; Configure LB policies to prefer local DC; LOCAL_QUORUM reads; Data is available cluster-wide for backup, analytics, and to account for user travel across regions
  36. Cassandra Cross DC Replication: Keep EU User Data in the EU
      CREATE KEYSPACE rtb WITH REPLICATION = {
        'class': 'NetworkTopologyStrategy',
        'eu-east-dc': '3',
        'eu-west-dc': '3'
      };
  37. Cassandra Time Windowed Buckets with TTL
      CREATE TABLE rtb.fu_events (
        id int,
        seen_time timeuuid,
        event_time timestamp,
        -- the clustering column must be the one named in CLUSTERING ORDER BY
        PRIMARY KEY (id, event_time)
      ) WITH CLUSTERING ORDER BY (event_time DESC)
      AND compaction = {
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '3',
        'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy'
      }
      AND compression = {
        'crc_check_chance': '0.5',
        'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'
      }
      AND bloom_filter_fp_chance = 0.01
      AND caching = '{"keys":"ALL", "rows_per_partition":"100"}'
      AND dclocal_read_repair_chance = 0.0
      AND default_time_to_live = 60
      AND gc_grace_seconds = 0
      AND max_index_interval = 2048
      AND memtable_flush_period_in_ms = 0
      AND min_index_interval = 128
      AND read_repair_chance = 0.0
      AND speculative_retry = '99.0PERCENTILE';
      Slide annotations: 3 DAY buckets - larger SSTables on disk minimize bootstrapping issues when adding nodes to a cluster; 3 MINUTE buckets; 1 HOUR buckets; 1 DAY buckets; MICROSECOND resolution:
  38. My Nerdy Chart v2.0 (columns: Want | Can Or Currently Use | Status | But)
      Kafka Security | Kafka Security: TLS, Kerberos, SASL, Auth, Encryption, Authentication | v0.9.0, Thanks Jun! |
      Integrated Streaming | Kafka Streams: processing inside Kafka, no alternate cluster setup or ops | v0.10, Thanks Guozhang! | It's java :(
      Cassandra CDC | Cassandra CDC. Triggers? | Triggers are a pre-commit hook :( The Epic JIRA: https://issues.apache.org/jira/browse/CASSANDRA-8844 | no comment
      And... | Kafka Streams & Kafka Connect Integration | ..wait for it.. | no comment
      Always on, X-DC Replication, Flexible Topologies | Kafka, Cassandra | OOTB |
      Fault Tolerance | Kafka, Spark, Mesos, Cassandra, Akka | Baked In |
      Location Transparency | Kafka, Cassandra, Akka | Check! |
      Asynchrony | Kafka, Cassandra, Akka | Check! |
      Decoupling | Kafka, Akka | Check! |
      Pub-Sub | Kafka, Cassandra, Akka | Check! |
      Immutability | Kafka, Akka, Scala | Check! |
  39. Kafka Streams in v 0.10
      // Slide sketch: ser and des stand in for serializer / deserializer arguments.
      val builder = new KStreamBuilder()
      val stream: KStream[K,V] = builder.stream(des, des, "raw.data.topic")
        .flatMapValues(value => Arrays.asList(value.toLowerCase.split(" ")))
        .map((k,v) => new KeyValue(k,v))
        .countByKey(ser, ser, des, des, "kTable")
        .toStream
      stream.to("results.topic", ...)
      val streams = new KafkaStreams(builder, props)
      streams.start()
  40. Kafka Streams & Kafka Connect? YES
      val builder = new KStreamBuilder()
      val stream1: KStream[K,V] = builder.stream(new CassandraConnect(configs))
        .flatMapValues(..)
        .map((k,v) => new KeyValue(k,v))
        .countByKey(ser, ser, des, des, "kTable")
        .toStream
      stream1.to("results.topic", ...)
      val streams = new KafkaStreams(builder, props)
      streams.start()
  41. https://github.com/tuplejump/kafka-connect-cassandra
      /** Writes records from Kafka to Cassandra asynchronously and non-blocking. */
      override def put(records: JCollection[SinkRecord]): Unit

      /** Returns a list of records when available by polling for new records. */
      override def poll: JList[SourceRecord]
  42. Frequency Capping: 1. Count the number of times user X has seen ad Y from Advertiser A's Campaign C; 2. Limit the max number of impressions of an ad within T1...T2. Use Case: Continuously count impressions grouped by campaign across DCs; low-latency reads & writes; Must scale; Cross DC Counters. Translation: Distributed Counters
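The two capping rules (count impressions per user, ad and campaign; limit them within a time window) can be sketched on a single node as below. This deliberately ignores the hard part the slide points at, which is keeping the count consistent across datacenters. `CapKey`, `FrequencyCap`, and the millisecond window are assumptions introduced for the example.

```scala
// Hypothetical single-node sketch; the talk's real requirement is a cross-DC counter.
final case class CapKey(userId: String, adId: String, campaignId: String)

final class FrequencyCap(maxImpressions: Int, windowMillis: Long) {
  private var seen = Map.empty[CapKey, Vector[Long]]

  /** Timestamps for `key` that still fall inside the window ending at `now`. */
  private def live(key: CapKey, now: Long): Vector[Long] =
    seen.getOrElse(key, Vector.empty).filter(t => now - t < windowMillis)

  /** True if another impression may be served without exceeding the cap. */
  def allowed(key: CapKey, now: Long): Boolean = live(key, now).size < maxImpressions

  /** Records an impression if the cap allows it; returns whether it was recorded. */
  def record(key: CapKey, now: Long): Boolean = {
    val current = live(key, now)
    val ok = current.size < maxImpressions
    seen = seen.updated(key, if (ok) current :+ now else current)
    ok
  }
}
```

Passing `now` in explicitly keeps the sketch deterministic and testable. In a distributed setting the `seen` map is the state that would have to be replicated.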
  43. Frequency Capping: Redis? Broke under the load; Aerospike? Great candidate; Eventuate? Interesting, much lighter; Kafka Streams when it's out? Interesting, already in the infra; Flink? Very interesting but...; Cassandra Counters: not applicable for this
  44. Aerospike: As a distributed counting microservice; As a key-value store for in-memory caching; Fast reads, very read heavy; 99% of reads are < 1 ms latency (sweet); 30,000 writes per second; 350,000 reads per second on 7 nodes; Replication factor 2: Cross datacenter replication (XDC), SSD-backed; Excellent few posts by Dag, Tapad's CTO, on in-memory infrastructure + Ad Tech (see resources slide)
  45. CRDT: Conflict Free Replicated Data Type. State-based: objects require only eventual communication between pairs of replicas; Operation-based: replication requires reliable broadcast communication with delivery in a well-defined delivery order; Both guaranteed to converge towards common, correct state; Keeping replicas available for writes during a network partition requires resolution of conflicting writes when the partition heals
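As a concrete instance of the state-based flavor, here is a minimal grow-only counter (G-Counter), a standard CRDT. It is a sketch in plain Scala, not Eventuate's implementation (Eventuate, discussed next in the deck, uses operation-based CRDTs), and the replica names in the usage note are made up.

```scala
/** State-based grow-only counter (G-Counter): one slot per replica. */
final case class GCounter(slots: Map[String, Long] = Map.empty) {
  /** Local increment by `replicaId`; non-positive deltas are ignored. */
  def increment(replicaId: String, delta: Long = 1L): GCounter =
    if (delta <= 0) this
    else GCounter(slots.updated(replicaId, slots.getOrElse(replicaId, 0L) + delta))

  /** The counter value is the sum over all replica slots. */
  def value: Long = slots.values.sum

  /** Merge takes the per-replica maximum: commutative, associative, idempotent. */
  def merge(other: GCounter): GCounter =
    GCounter((slots.keySet ++ other.slots.keySet).map { id =>
      id -> math.max(slots.getOrElse(id, 0L), other.slots.getOrElse(id, 0L))
    }.toMap)
}
```

For example, `GCounter().increment("us-east", 3).merge(GCounter().increment("eu-west", 2)).value` yields 5 regardless of merge order, which is the convergence property the slide describes.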
  46. Eventuate: A toolkit for building distributed, HA & partition-tolerant event-sourced applications. Developed by Martin Krasser (@mrt1nz) for Red Bull Media (open source); Interactive, automated conflict resolution (via op-based CRDTs); Separates command side of an app from its query side (CQRS); Primary Goals: preserving causality, idempotency & event ordering guarantees even under chaotic conditions; AP of CAP: conflicts cannot be prevented & must be resolved; Causality: tracked with Vector Clocks; Adapters provide connectivity to other stream processing solutions; Can currently choose Cassandra if desired; Kafka coming soon!
  47. Eventuate as Distributed CRDT Microservice: Replication of application state through async event replication across locations; Locations consume replicated events to reconstruct application state locally; Multiple locations concurrently update as multi-master
  48. Applications can continue writing to a local replica during a network partition. Pass To Pipeline: -> To Cassandra; -> To Kafka (soon)
  49. import scala.concurrent.Future
      import akka.actor.{ActorRef, ActorSystem}
      import com.rbmhtechnology.eventuate.crdt.{CRDTServiceOps, Counter, CounterService}

      class CappingService(val id: String, override val log: ActorRef)
          (implicit val system: ActorSystem,
           val integral: Integral[Int],
           override val ops: CRDTServiceOps[Counter[Int], Int])
        extends CounterService[Int](id, log) {

        /** Increment-only op: adds `delta` to the counter identified by `id`
          * and returns the updated counter value. */
        def increment(id: String, delta: Int): Future[Int] =
          value(id) flatMap {
            case v if v >= 0 && (delta > 0 || delta > v) =>
              update(id, delta)
            case v =>
              Future.successful(v) // otherwise leave the count unchanged
          }

        start()
      }

      // Usage
      val a = new CappingService(id1, eventLog)
      a.increment(id1, 3)  // Future(3): 3 impressions
      a.value(id1)         // Future(3): 3 impressions
      a.increment(id1, -2) // increments only, idempotent
      val b = new CappingService(id2, eventLog)
      b.value(id1)         // Future(a.value(id1)): knows the same count over n instances, all geo-locations, for the same id

      // CounterService API used above
      class CounterService[A : Integral](val replicaId: String, val log: ActorRef) {
        def value(id: String): Future[A] = { ... }
        def update(id: String, delta: A): Future[A] = { ... }
      }
  50. Eventuate
  51. Eventuate Takeaway: It's just a jar!; OOTB async internal component messaging and fault tolerance; Integrate with relevant microservices; No store/cache cluster to deploy, just keep monitoring your apps; Written in Scala; Built on Akka, a toolkit for building highly concurrent, distributed, and resilient event-driven applications on the JVM
  52. Analytics & ML
  53. Refresher: Sampling of RTB Events: Ad Request; Bid Request (JSON, 100 bytes); Compute optimal bid for advertiser; Bid Response (JSON, 1000 bytes, may include ad metadata); Win Notification (may or may not exist) with settlement price; Ad Impression (when the ad is viewed); Ad Click; Ad Conversion
  54. OpenRTB: objects in the Bid Request model
  55. Streaming Analytics: Top-K highest-performing campaigns; Number of views served in the last 7 days, by country, by city; What determined successful ad conversions; Age distribution per campaign
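A Top-K query of the kind listed above reduces to a count per key followed by a sort. The sketch below works on a finite batch of campaign ids. The `TopK` object and the tie-breaking rule (alphabetical by id) are assumptions for the example, and a streaming job would maintain the counts incrementally instead.

```scala
object TopK {
  /** Returns the `k` campaign ids with the most events, highest count first; ties broken by id. */
  def topCampaigns(campaignIds: Seq[String], k: Int): Seq[(String, Int)] =
    campaignIds
      .groupBy(identity)
      .map { case (id, hits) => (id, hits.size) }
      .toSeq
      .sortBy { case (id, n) => (-n, id) }
      .take(k)
}
```

Breaking ties deterministically matters in practice: without it, two runs over the same data can report different "top" campaigns.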
  56. Spark Streaming Kafka: Can write to Cassandra, FiloDB...
      class KafkaStreamingActor(ssc: StreamingContext) extends MyAggregationActor {
        val stream = KafkaUtils.createDirectStream(...).map(RawData(_))

        stream.foreachRDD(_.toDF.write.format("filodb.spark")
          .option("dataset", "rawdata")
          .save())

        /* Pre-aggregate data in the stream for fast querying and aggregation later */
        stream.map(hour =>
          (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip)
        ).saveToCassandra(timeseriesKeyspace, dailyPrecipTable)
      }
  57. Machine Learning: Train on 1+ week of data for: Recommendations; Bid Optimization; Campaign Optimization; Consumer Profiling; ...and much more
  58. Machine Learning: The probability of an ad, from a specific ISP, OS, website, demographic, etc., resulting in a conversion; Which attributes of impressions are good predictors of better ad performance?
  59. Bid Optimization & Predictive Models: Which impressions should an Advertiser bid for? Per campaign, per country it may run in..? What is the best bid for each impression?
  60. Machine Learning: Compute optimal bid price; Train the model; Score bid requests; Determine value of bid request; Train on every bid req attribute; Based on Campaign Objectives; Against Budget; Send bid decision to bidder
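One simple way to turn a model score into a bid decision is an expected-value rule: bid the predicted conversion probability times the value of a conversion, capped by a maximum bid and by the remaining budget. This is a generic heuristic offered only as a sketch. The talk does not say which pricing rule was used, and the `Campaign` fields and `BidPricing` helper are hypothetical.

```scala
// Hypothetical types and pricing rule; none of this is taken from the talk.
final case class Campaign(valuePerConversion: Double, remainingBudget: Double, maxBid: Double)

object BidPricing {
  /** Expected-value bid: P(conversion) * value, capped by max bid and remaining budget.
    * Returns None when there is nothing worth bidding. */
  def bidFor(pConversion: Double, c: Campaign): Option[Double] = {
    require(pConversion >= 0.0 && pConversion <= 1.0, "probability must be in [0, 1]")
    val bid = math.min(pConversion * c.valuePerConversion, math.min(c.maxBid, c.remainingBudget))
    if (bid > 0.0) Some(bid) else None
  }
}
```

Returning `Option` makes "do not bid" an explicit outcome, which matches the final step on the slide: a bid decision, not just a price, is sent to the bidder.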
  61. Spark Streaming, MLLib & FiloDB
      val ssc = new StreamingContext(sparkConf, Seconds(5))
      val kafkaStream = KafkaUtils.createDirectStream[..](..)
        .map(transformFunc)
        .map(LabeledPoint.parse)

      kafkaStream.foreachRDD(_.toDF.write.format("filodb.spark")
        .option("dataset", "training").save())

      val model = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(Vectors.dense(weights))

      // trainOn returns Unit, so it is called on the model rather than chained into the val.
      // dataStream and historicalEvents are not defined on the slide.
      model.trainOn(dataStream.join(historicalEvents))
      model.predictOnValues(dataStream.map(lp => (lp.label, lp.features)))
        .insertIntoFilo("predictions")
  62. 700 Queries Per Second: Spark Streaming & FiloDB. Even for datasets with 15 million rows!; Using FiloDB's InMemoryColumnStore; Single host / MBP, 5GB RAM; SQL to DataFrame caching; https://github.com/tuplejump/FiloDB; Evan Chan's (@velvia) blog post: NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics
  63. [Architecture diagram, repeat of slide 32] Eventually Consistent Across DCs: US-East-1 and EU-west-1, each with a Kafka Cluster Per Region (with ZK), linked by MirrorMaker; RTB micro services and Mgmt micro services as Publishers and Subscribers; Query Layer; Compute Layer with an Analytics & ML Cluster and a Timeseries Cluster (Spark Streaming & ML on Cassandra, Cross DC Replication, Topology Aware)
  64. Self-Healing Systems: Massive event spikes & bursty traffic; Fast producers / slow consumers; Network partitioning & out of sync systems; DC down; Not DDOS'ing ourselves from fast streams; No data loss when auto-scaling down
  65. Byzantine Fault Tolerance? Looks like I'll miss standup
  66. Everything fails, all the time. Monitor Everything
  67. Resources: Non-Monotonic Snapshot Isolation: scalable and strong consistency for geo-replicated transactional systems; Conflict-free Replicated Data Types; Implementing operation-based CRDTs; http://codebetter.com/gregyoung/2010/02/16/cqrs-task-based-uis-event-sourcing-agh; http://martinfowler.com/bliki/CQRS.html; http://github.com/openrtb/OpenRTB; http://akka.io; http://rbmhtechnology.github.io/eventuate; https://github.com/RBMHTechnology/eventuate; http://rbmhtechnology.github.io/eventuate/user-guide.html#commutative-replicated-data-types; http://www.planetcassandra.org/data-replication-in-nosql-databases-explained; http://wikibon.org/wiki/v/Optimizing_Infrastructure_for_Analytics-Driven_Real-Time_Decision_Making
  68. Thanks! twitter.com/helenaedelson github.com/helena slideshare.net/helenaedelson