Rethinking metrics: metrics 2.0

Published in: Engineering

  1. 1. rethinking metrics: metrics 2.0
  2. 2. by niteroi @ panoramio.com
  3. 3. vimeo.com/43800150
  4. 4. problems Metrics 2.0 concepts implementations uses & ideas
  5. 5. terminology sync
  6. 6. (1234567890, 82) (1234567900, 123) (1234567910, 109) (1234567920, 77) db15.mysql.queries_running host=db15 mysql.queries_running
  7. 7. Problems
  8. 8. Vimeo.com pagerequests/s? server X write perf?
  9. 9. stats.hits.vimeo_com stats_counts.hits.vimeo_com stats.*.vimeo_requests collectd.db.disk.sda1.disk_time.write
  10. 10. Understanding metrics Terminology? Meaning? Prefix? Unit? Aggregation? Source?
  11. 11. Unclear, inconsistent terminology and format; tightly coupled; lacking information
  12. 12. http://litlquest.com/forest-trees/see-forest-trees-2
  13. 13. O(S*P*A*C) S = # Sources P = # People A = # Aggregators C = #Complexity
  14. 14. Graphs and dashboards are a huge time sink.
  15. 15. metrics 2.0 concepts
  16. 16. Self-describing Standardized Orthogonal dimensions
  17. 17. stats.timers.dfs5. proxy-server.object.GET.200. timing.upper_90
  18. 18. { server: dfvimeodfsproxy5, http_method: GET, http_code: 200, unit: ms, metric_type: gauge, stat: upper_90, swift_type: object }
  19. 19. allow more characters unit: Req/s, site: vimeo.com, ...
  20. 20. Metadata meta: { src: proxy.py:458, from: diamond }
  21. 21. Datamodel
  22. 22. Any protocol
  23. 23. Source format … service=foo instance=host unit=B 123 1234567890 {s}foo.{i}host.{u}B 123 1234567890 <uuid> 125 1234567890 #separate data …
  24. 24. metrics20.org
  25. 25. MB/s Err/d Req/h ... B Err Warn Conn Job File Req ... SI + IEC
  26. 26. Immediate understanding of metrics; minimize time to graphs, alerting rules, debugging; compatibility & flexibility in tooling
  27. 27. Implementations examples
  28. 28. Carbon-tagger … stats.gauges.host.foo 125 1234567890 service=foo instance=host target_type=gauge unit=B 123 1234567890 …
  29. 29. Statsdaemon unit=B unit=B ... unit=ms unit=ms ... unit=B/s unit=ms stat=mean unit=ms stat=upper_90 ...
  30. 30. Keep metric tags in sync with data
  31. 31. Graphing & dashboarding Visualization Alerting
  32. 32. Graphing & Dashboarding
  33. 33. Graph Explorer
  34. 34. Graph-Explorer queries 101 proxy-server swift server:regex unit=ms (AND)
  35. 35. upper_90 (or stat=upper_90) from <datetime> to <datetime> avg over <timespec> (5M, 1h, 3d, ...)
  36. 36. Compare object put/get stack … http_method:(PUT|GET) swift_type=object avg by http_code,server
  37. 37. Comparing servers http_method:(PUT|GET) group by unit,target_type avg by http_code, swift_type,http_method
  38. 38. transcode unit=Job/s avg over <time> from <datetime> to <datetime>
  39. 39. Note: data is obfuscated
  40. 40. Bucketing sum by zone:eu-west|us-east| ap-southeast|us-west| sa-east|vimeo-df|vimeo-lv group by state
  41. 41. Note: data is obfuscated
  42. 42. Compare job states per region (zones bucket) group by zone
  43. 43. Note: data is obfuscated
  44. 44. Unit conversion unit=Mb/s network server:regex sum by server
  45. 45. Integration Metric unit=B/s Query unit=TB
  46. 46. Deriving Metric unit=B Query unit=GB/d
  47. 47. Future work Facet-based suggestions Custom trees
  48. 48. Dashboard definition queries = [ 'cpu usage sum by core', 'mem unit=B !total group by type:swap', 'stack network unit=Mb/s', 'unit=B (free|used) group by =mountpoint' ]
  49. 49. Equivalence servers.host.cpu.total.iowait → “core” : “_sum_” servers.host.cpu.<core-number>.iowait servers.host.loadavg.15
  50. 50. Future Work
  51. 51. ● Storage aggregation rules ● Graphite API functions such as cumulative, summarize and smartSummarize ● consolidateBy & graph renderers
  52. 52. Self-describing & standardized stat=upper/lower/mean/... target_type=counter..
  53. 53. Visualizations
  54. 54. From: dygraphs.com
  55. 55. Select your view
  56. 56. bin=10 bin=20 bin=30 bin=40 bin=50 bin=100
  57. 57. Alerting
  58. 58. unit=Err/s
  59. 59. Automatic cause & effect
  60. 60. Different algos for different things
  61. 61. Alert criticality & routing based on tags
  62. 62. integrating logs & metrics
  63. 63. Algorithms leverage both logs and metrics
  64. 64. Changing software
  65. 65. Conclusion structured self-describing standardized metrics = enabler
  66. 66. Conclusion What are your concerns? Ideas? Let's make this better Ready for early adopters! Work with me on next-gen telemetry! Tips on coordinating spec development? How does FB/G/AMZ/MS/APL/... do this stuff?
  67. 67. Seen in this presentation: metrics20.org vimeo.github.io/graph-explorer github.com/vimeo/timeserieswidget github.com/vimeo/carbon-tagger github.com/vimeo/statsdaemon github.com/graphite-ng/carbon-relay-ng github.com/Dieterbe/anthracite
  68. 68. You might also like: github.com/vimeo/graphite-influxdb github.com/vimeo/graphite-api-influxdb-docker Github.com/vimeo/whisper-to-influxdb github.com/Dieterbe/influx-cli github.com/graphite-ng/graphite-ng Github.com/vimeo/smoketcp Github.com/vimeo/tailgate
  69. 69. Stay in touch! groups.google.com/forum/#!forum/metrics20 groups.google.com/forum/#!forum/it-telemetry twitter.com/Dieter_be dieter.plaetinck.be
  70. 70. Q&A
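
Slide 23 shows a tag-based source line of the form `service=foo instance=host unit=B 123 1234567890`. The following is a minimal Python sketch of how such a line could be split into tags, value and timestamp. The function names `parse_line` and `metric_id` are hypothetical, and the assumption that a metric's identity is its full, order-independent tag set is an illustration rather than a statement of the spec.

```python
def parse_line(line):
    """Split 'key=value ... value timestamp' into (tags, value, timestamp).

    Assumes whitespace-separated fields, with the last two fields being
    the numeric value and the Unix timestamp, as on slide 23.
    """
    *tag_parts, value, timestamp = line.split()
    tags = dict(part.split("=", 1) for part in tag_parts)
    return tags, float(value), int(timestamp)


def metric_id(tags):
    """Hypothetical identity for a metric: its sorted (key, value) pairs."""
    return tuple(sorted(tags.items()))


tags, value, ts = parse_line("service=foo instance=host unit=B 123 1234567890")
print(tags, value, ts)
```

Because the identity is built from sorted pairs, the same tags written in a different order map to the same metric.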
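
Slides 34 to 36 describe queries whose terms are combined with AND: `unit=ms` for an exact tag match, `server:regex` for a regular expression on a tag value, and a bare word such as `upper_90` as shorthand for a tag value. The sketch below shows one way such matching could work over a tag dictionary. It is not Graph Explorer's implementation; the function name `matches` and the rule that a bare word is a substring match on any tag value are assumptions made for illustration.

```python
import re


def matches(tags, query):
    """Return True if every whitespace-separated query term matches the tags."""
    for term in query.split():
        if "=" in term:
            # key=value: exact match on one tag
            key, wanted = term.split("=", 1)
            if tags.get(key) != wanted:
                return False
        elif ":" in term:
            # key:regex: regular expression searched in one tag's value
            key, pattern = term.split(":", 1)
            if key not in tags or not re.search(pattern, tags[key]):
                return False
        else:
            # bare word (assumption): substring of any tag value
            if not any(term in v for v in tags.values()):
                return False
    return True


# Tag set taken from slide 18.
metric = {
    "server": "dfvimeodfsproxy5", "http_method": "GET", "http_code": "200",
    "unit": "ms", "metric_type": "gauge", "stat": "upper_90",
    "swift_type": "object",
}
print(matches(metric, "unit=ms upper_90 http_method:(PUT|GET)"))
```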
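
Slides 45 and 46 pair a stored unit with a different queried unit: a `B/s` metric queried as `TB` (integration) and a `B` metric queried as `GB/d` (deriving). Knowing the unit is what makes these conversions possible. The sketch below illustrates the arithmetic under stated assumptions: points are `(timestamp_seconds, value)` pairs, integration holds each rate until the next timestamp, and SI prefixes are used (1 GB = 10^9 B). The function names are hypothetical and this is not taken from any of the tools listed on slide 67.

```python
SECONDS_PER_DAY = 86400
GB = 10**9  # SI gigabyte


def integrate(points):
    """Rate series (X/s) -> running total (X), holding each rate until the next point."""
    total = 0.0
    out = []
    for (t0, rate), (t1, _) in zip(points, points[1:]):
        total += rate * (t1 - t0)
        out.append((t1, total))
    return out


def derive(points):
    """Total series (X) -> rate series (X/s) between consecutive points."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]


def b_per_s_to_gb_per_d(rate):
    """Rescale a rate from B/s to GB/d."""
    return rate * SECONDS_PER_DAY / GB


rates = [(0, 100.0), (10, 200.0), (20, 50.0)]      # unit=B/s
totals = integrate(rates)                           # unit=B
print(totals)
print(derive([(0, 0.0)] + totals))                  # back to unit=B/s
```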
