
Couchbase Live Europe 2015: Big Data Analytics with Couchbase including Hadoop, Kafka, Spark and More



  1. 1. Big Data Analytics with Couchbase: Hadoop, Kafka, Spark and More. Matt Ingenthron, Sr. Director; Michael Nitschinger, Software Engineer
  2. 2. Agenda: • Define Problem Domain • How Couchbase Fits In • Demo • Q&A
  3. 3. Lambda Architecture (diagram): (1) DATA → (2) BATCH / (3) SPEED → (4) SERVE → (5) QUERY
  4. 4. Lambda Architecture for interactive and real-time applications (diagram): DATA (Kafka producers feeding a broker cluster with ordered subscriptions) → BATCH (Hadoop) / SPEED (Storm, with a spout per Kafka topic) → SERVE (Couchbase) → QUERY (Couchbase)
  5. 5. (Diagram) Batch track: repository, perpetual store, analytical DB, business intelligence. Real-time track: complex event processing, monitoring, chat/voice system, dashboard.
  6. 6. (Diagram) Tracking and collection; analysis and visualization; REST, filter, metrics.
  7. 7. Integration at Scale
  8. 8. Requirements for data streaming in modern systems: • Must support high throughput and low latency • Need to handle failures • Pick up where you left off • Be efficient about resource usage
  9. 9. Data Sync Is the Heart of Any Big Data System. A fundamental piece of the architecture: data sync maintains data redundancy for high availability (HA) and disaster recovery (DR), protecting against node, rack, and region failures. Data sync also maintains indexes (spatial, full-text); indexing is key to building faster access paths to query data. (Diagram: DCP and Couchbase Server architecture.)
  10. 10. What is DCP? DCP is an innovative protocol that drives data sync for Couchbase Server: • Increases data sync efficiency with massive data footprints • Removes slower disk I/O from the data sync path • Improves latencies, e.g. replication for data durability • In the future, will provide a programmable data sync protocol for stores outside Couchbase Server. DCP powers many critical components.
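
DCP had no public client API at this stage (the slide above lists a programmable external protocol only as a future direction), so the following Scala sketch is purely conceptual: hypothetical types of ours that illustrate the ordered, resumable change stream the slide describes, not any real Couchbase interface.

    // Conceptual model only: every type and name here is hypothetical.
    // DCP delivers an ordered stream of changes per partition; consumers
    // resume after a failure from the last sequence number they saw.
    sealed trait DcpEvent { def partition: Short; def seqno: Long }
    case class Mutation(partition: Short, seqno: Long, key: String, body: Array[Byte]) extends DcpEvent
    case class Deletion(partition: Short, seqno: Long, key: String) extends DcpEvent

    class DcpCheckpoint {
      private val lastSeen = scala.collection.mutable.Map[Short, Long]().withDefaultValue(0L)
      // Record progress as events arrive.
      def record(e: DcpEvent): Unit =
        lastSeen(e.partition) = math.max(lastSeen(e.partition), e.seqno)
      // "Pick up where you left off": reopen each stream from here.
      def resumeFrom(partition: Short): Long = lastSeen(partition)
    }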
  11. 11. Demo
  12. 12. Shopper tracking (click stream) and other data sources feed HDFS. Lightweight analytics: • Department shopped • Tech platform • Click tracks by income. Heavier analytics develop profiles.
  14. 14. And at scale: many producers feed Kafka, and a Kafka consumer (or Camus) moves the data into HDFS.
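
For a flavor of the producer side, a minimal Scala sketch against the Kafka 0.8.2-era producer API; the broker addresses, topic name, and payload are made up for illustration:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ClickstreamProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092,broker2:9092") // hypothetical brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        // Keying by member id keeps each member's events ordered within a partition.
        producer.send(new ProducerRecord("clickstream", "member-42",
          """{"page": "profile", "ts": 1425900000}"""))
        producer.close()
      }
    }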
  15. 15. Couchbase & Apache Spark Introduction & Integration
  16. 16. What is Spark? Apache Spark is a fast and general engine for large-scale data processing.
  17. 17. Spark Components Spark Core: RDDs, Clustering, Execution, Fault Management
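
A minimal sketch of the core abstraction, as a standalone app with names of our choosing (assumes Spark 1.3-era APIs):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))
        // An RDD is an immutable, partitioned collection; map is a lazy transformation.
        val squares = sc.parallelize(1 to 1000).map(x => x * x)
        // reduce is an action: it actually schedules and runs the computation.
        println(squares.reduce(_ + _))
        sc.stop()
      }
    }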
  18. 18. Spark Components Spark SQL: Work with structured data, distributed SQL querying
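
A small Spark SQL sketch, runnable in spark-shell (which predefines sc); the file path is hypothetical, and jsonFile is the Spark 1.3 spelling:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Each line of people.json is one JSON document; the schema is inferred.
    val people = sqlContext.jsonFile("people.json")
    people.registerTempTable("people")
    sqlContext.sql("SELECT name, age FROM people WHERE age >= 21").show()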
  19. 19. Spark Components Spark Streaming: Build fault-tolerant streaming applications
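
A word-count sketch of the streaming model (hypothetical socket source; Spark 1.3-era API):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    words.map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()            // nothing runs until start()
    ssc.awaitTermination()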
  20. 20. Spark Components MLlib: Machine Learning built in
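
For example, clustering with MLlib's k-means (spark-shell style; the input path and parameters are made up):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // One whitespace-separated feature vector per line.
    val points = sc.textFile("features.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()
    val model = KMeans.train(points, k = 5, maxIterations = 20)
    model.clusterCenters.foreach(println)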
  21. 21. Spark Components GraphX: Graph processing and graph-parallel computations
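
And a tiny GraphX sketch (spark-shell style; the edge-list path is hypothetical):

    import org.apache.spark.graphx.GraphLoader

    // edges.txt: one "srcId dstId" pair per line.
    val graph = GraphLoader.edgeListFile(sc, "edges.txt")
    // Run PageRank until convergence and show the top-ranked vertices.
    val ranks = graph.pageRank(tol = 0.001).vertices
    ranks.sortBy(-_._2).take(5).foreach(println)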
  22. 22. Spark Benefits • Linear Scalability • Ease of Use • Fault Tolerance • For developers and data scientists • Tight but not mandatory Hadoop integration
  23. 23. Spark Facts • Current Release: 1.3.0 • Over 450 contributors: the most active Apache Big Data project • Huge public interest. Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q
  24. 24. Daytona GraySort Performance:
                                 Hadoop MR Record       Spark Record
    Data Size                    102.5 TB               100 TB
    Elapsed Time                 72 mins                23 mins
    # Nodes                      2100                   206
    # Cores                      50400 physical         6592 virtual
    Cluster Disk Throughput      3150 GB/s              618 GB/s
    Network                      Dedicated DC, 10Gbps   EC2, 10Gbps
    Sort Rate                    1.42 TB/min            4.27 TB/min
    Sort Rate/Node               0.67 GB/min            20.7 GB/min
    Source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html Benchmark: http://sortbenchmark.org/
  25. 25. How does it work? Resilient Distributed Datasets paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf. Pipeline: RDD Creation → Scheduling (DAG) → Task Execution
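
The three steps above can be seen directly in a few lines (spark-shell style; the log path is hypothetical):

    // 1. RDD creation and transformations only record lineage; nothing executes yet.
    val logs   = sc.textFile("app.log")
    val errors = logs.filter(_.contains("ERROR")).map(_.toUpperCase)
    // 2. The recorded lineage is what the scheduler turns into a DAG of stages.
    println(errors.toDebugString)
    // 3. An action triggers task execution across the cluster.
    println(errors.count())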
  28. 28. Spark vs Hadoop • Spark is RAM-bound while Hadoop MapReduce is HDFS (disk) bound • The API is easier to reason about and develop against • Fully compatible with Hadoop input/output formats • Hadoop is more mature; the Spark ecosystem is growing fast
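
The RAM-versus-disk point is easiest to see with caching (spark-shell style; the HDFS path is made up):

    val events = sc.textFile("hdfs:///events/2015-03/").map(_.split("\t")).cache()
    events.count() // first action: reads from HDFS and materializes the cache
    events.count() // second action: served from memory, no re-read from disk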
  29. 29. Ecosystem Flexibility (diagram): RDBMS, streams, web APIs, DCP, KV, N1QL, views, batching, archived data, OLTP.
  30. 30. Infrastructure Consolidation (diagram): streams, web APIs, user interaction.
  31. 31. Couchbase Connector. Spark Core: • Automatic cluster and resource management • Creating and persisting RDDs • Java APIs in addition to Scala (planned before GA). Spark SQL: • Easy JSON handling and querying • Tight N1QL integration (dp2). Spark Streaming: • Persisting DStreams • DCP source (planned before GA).
  32. 32. Connector Facts • Current Version: 1.0.0-dp • DP2 upcoming • GA planned for Q3 • Code: https://github.com/couchbaselabs/couchbase-spark-connector • Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki
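
Putting the connector slides together, a sketch based on the 1.0.0-dp examples in the docs linked above; since this is a developer preview, the exact configuration keys and method names may differ in later releases:

    import com.couchbase.spark._
    import com.couchbase.client.java.document.JsonDocument
    import com.couchbase.client.java.document.json.JsonObject
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("couchbase-sketch")
      .setMaster("local[2]")
      .set("com.couchbase.bucket.default", "") // open the "default" bucket (empty password)
    val sc = new SparkContext(conf)

    // Persist an RDD of JSON documents into Couchbase...
    sc.parallelize(Seq(
        JsonDocument.create("user::1", JsonObject.create().put("name", "matt"))))
      .saveToCouchbase()

    // ...and fetch documents back by id as an RDD.
    sc.couchbaseGet[JsonDocument](Seq("user::1")).collect().foreach(println)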
  33. 33. Questions
  34. 34. Matt Ingenthron @ingenthr Michael Nitschinger @daschl Thanks
  35. 35. Additional Slides
  36. 36. Use Case at LinkedIn
  37. 37. • Site Reliability Engineer (SRE) at LinkedIn • SRE for Profile & Higher-Education • Member of LinkedIn’s CBVT • B.E. (Electrical Engineering) from the University of Queensland, Australia Michael Kehoe
  38. 38. Kafka @ LinkedIn • Kafka was created by LinkedIn • Kafka is a publish-subscribe system built as a distributed commit log • Processes 500+ TB/day (~500 billion messages) at LinkedIn
  39. 39. LinkedIn's uses of Kafka • Monitoring • InGraphs • Traditional messaging (pub-sub) • Analytics • Who Viewed My Profile • Experiment reports • Executive reports • Building block for distributed (log-based) applications • Pinot • Espresso
  40. 40. Use Case: Kafka to Hadoop (Analytics) • LinkedIn tracks data to better understand how members use our products • Information such as which page was viewed and which content was clicked on is sent into a Kafka cluster in each data center • These events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
  41. 41. Couchbase @ LinkedIn • About 25 separate services with one or more clusters in multiple data centers • Up to 100 servers in a cluster • Single and Multi-tenant clusters
  42. 42. Use Case: Jobs Cluster • Read scaling: Couchbase serves ~80K QPS from 24-server cluster(s) • Hadoop pre-builds the data by partition • Couchbase latencies tracked at the 99th percentile
  43. 43. Hadoop to Couchbase • Our primary use case for Hadoop → Couchbase is building (warming) / recovering Couchbase buckets • LinkedIn built its own in-house solution to work with our ETL processes, cache-invalidation procedures, etc.
