Rethinking metrics: metrics 2.0 @ Lisa 2014

As the number of metrics, the software that produces and processes them, and the people who work with them keep growing, we need better ways to organize metrics and make them self-describing, and to do so consistently. With that in place, we can automatically build graphs and dashboards from a query that represents an information need, even for complicated cases. We can also build richer visualizations, alerting and fault detection. This talk introduces the concepts and related tools, demonstrates the possibilities using the Graph-Explorer interface, and lays the groundwork for future work.

Published in: Engineering

  1. 1. rethinking metrics: metrics 2.0
  2. 2. instagram.com/wrongrob
  3. 3. vimeo.com/43800150
  4. 4. problems Metrics 2.0 concepts implementations uses & ideas
  5. 5. Mostly graphite
  6. 6. terminology sync
  7. 7. (1234567890, 82) (1234567900, 123) (1234567910, 109) (1234567920, 77) db15.mysql.queries_running host=db15 mysql.queries_running
  8. 8. Problems
  9. 9. Vimeo.com pagerequests/s? server X disk write?
  10. 10. stats.hits.vimeo_com stats_counts.hits.vimeo_com stats.*.vimeo_requests collectd.db.disk.sda1.disk_time.write
  11. 11. Understanding metrics Terminology? Meaning? Prefix? Unit? Aggregation? Source?
  12. 12. Unclear, inconsistent terminology and format; tightly coupled; lack of information
  13. 13. http://litlquest.com/forest-trees/see-forest-trees-2
  14. 14. O(S*P*A*C) S = # Sources P = # People A = # Aggregators C = #Complexity
  15. 15. Graphs and dashboards are a huge time sink.
  16. 16. metrics 2.0 concepts
  17. 17. Self-describing Standardized Orthogonal dimensions
  18. 18. stats.timers.dfs5. proxy-server.object.GET.200. timing.upper_90
  19. 19. { server: dfvimeodfsproxy5, http_method: GET, http_code: 200, unit: ms, metric_type: gauge, stat: upper_90, swift_type: object }
  20. 20. MB/s Err/d Req/h ... B Err Warn Conn File Req … SI + IEC
  21. 21. allow more characters unit: Req/s, site: vimeo.com, ...
  22. 22. Metadata meta: { src: proxy.py:458, from: diamond }
  23. 23. metrics20.org
  24. 24. Immediate understanding of metrics Minimize time to graphs, alerting, troubleshooting compatibility & flexibility in tooling
  25. 25. Implementations getting the data
  26. 26. Source formats … service=foo instance=host unit=B 123 1234567890 {s}foo.{i}host.{u}B 123 1234567890 <uuid> 125 1234567890 #separate data …
  27. 27. Carbon-tagger … stats.gauges.host.foo 125 1234567890 service=foo instance=host target_type=gauge unit=B 123 1234567890 …
  28. 28. Statsdaemon unit=B unit=B ... unit=ms unit=ms ... unit=B/s unit=ms stat=mean unit=ms stat=upper_90 ...
  29. 29. Keep metric tags in sync with data
  30. 30. Implementations Graphing & dashboarding Visualization Alerting
  31. 31. Graphing & Dashboarding
  32. 32. Graph Explorer
  33. 33. Graph-Explorer queries 101 proxy-server swift server:regex unit=ms (AND)
  34. 34. upper_90 (or stat=upper_90) from <datetime> to <datetime> avg over <timespec> (5M, 1h, 3d, ...)
  35. 35. Compare object put/get stack … http_method:(PUT|GET) swift_type=object avg by http_code,server
  36. 36. Comparing servers http_method:(PUT|GET) group by unit,target_type avg by http_code, swift_type,http_method
  37. 37. transcode unit=Job/s avg over <time> from <datetime> to <datetime>
  38. 38. Note: data is obfuscated
  39. 39. Bucketing sum by zone:eu-west|us-east| ap-southeast|us-west| sa-east|vimeo-df|vimeo-lv group by state
  40. 40. Note: data is obfuscated
  41. 41. Compare job states per region group by zone
  42. 42. Note: data is obfuscated
  43. 43. Unit conversion unit=Mb/s network server:regex sum by server
  44. 44. Integration Metric unit=B/s Query unit=TB
  45. 45. Deriving Metric unit=B Query unit=GB/d
  46. 46. Highly extensible Equal rights for all tags → real world use drives spec
  47. 47. Err/s Anything/s MAnything/d B Err Conn File Req Anything SI + IEC
  48. 48. Minimize time from information need to insights.
  49. 49. Future Work
  50. 50. Facet-based suggestions Custom hierarchies
  51. 51. Tag insights
  52. 52. ● Storage aggregation rules ● Graphite API functions such as cumulative, summarize and smartSummarize ● consolidateBy & graph renderers
  53. 53. stat=upper/lower/mean/... (assume avg otherwise)
  54. 54. Visualizations
  55. 55. From: dygraphs.com
  56. 56. bin=10 bin=20 bin=30 bin=40 bin=50 bin=100
  57. 57. Alerting
  58. 58. unit=Err/s
  59. 59. Classifying clusters of cause & effect
  60. 60. Different algos for different metric categories
  61. 61. Alert criticality & routing based on tags
  62. 62. integrating logs & metrics
  63. 63. Algorithms leverage both logs and metrics
  64. 64. Conclusion structured self-describing standardized metrics = enabler
  65. 65. Conclusion Concerns? Ideas? Advice? Ready for early adopters! Work with me on next-gen telemetry!
  66. 66. Seen in this presentation: metrics20.org vimeo.github.io/graph-explorer github.com/vimeo/timeserieswidget github.com/vimeo/carbon-tagger github.com/vimeo/statsdaemon github.com/graphite-ng/carbon-relay-ng github.com/Dieterbe/anthracite
  67. 67. You might also like: github.com/vimeo/graphite-influxdb github.com/vimeo/graphite-api-influxdb-docker github.com/vimeo/whisper-to-influxdb github.com/Dieterbe/influx-cli github.com/graphite-ng/graphite-ng github.com/vimeo/smoketcp github.com/vimeo/tailgate
  68. 68. Stay in touch! Metrics20 google group it-telemetry google group twitter.com/Dieter_be dieter.plaetinck.be dieter@plaetinck.be dieter@vimeo.com Lisa labs office hours after lunch Q & A
  69. 69. Bonus round
  70. 70. Dashboard definition queries = [ 'cpu usage sum by core', 'mem unit=B !total group by type:swap', 'stack network unit=Mb/s', 'unit=B (free|used) group by =mountpoint' ]
  71. 71. Catchall plugins stats.dfvimeocliapp2.twitter.error { "n1": "dfvimeocliapp2", "n2": "twitter", "n3": "error", "plugin": "catchall_statsd", "source": "statsd", "target_type": "rate", "unit": "unknown/s" }
  72. 72. Equivalence servers.host.cpu.total.iowait → “core” : “_sum_” servers.host.cpu.<core-number>.iowait servers.host.loadavg.15
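
Slides 18, 19 and 33 to 35 can be tied together with a small sketch: turn the dotted Graphite name from slide 18 into the tag set shown on slide 19, then filter it with the two query forms the slides use (`key=value` for an exact match, `key:regex` for a pattern, all terms ANDed). This is only an illustration under stated assumptions. The function names `parse_swift_timer` and `matches` are hypothetical, the field positions in the dotted name are inferred from the slide, and the query handling covers a tiny subset of what Graph-Explorer accepts. It is not code from carbon-tagger or Graph-Explorer.

```python
import re


def parse_swift_timer(name):
    """Turn the dotted name from slide 18 into tags like those on slide 19.

    Assumed layout (inferred from the slide, not from any spec):
    stats.timers.<server>.proxy-server.<swift_type>.<method>.<code>.timing.<stat>
    """
    parts = name.split(".")
    if len(parts) != 9 or parts[:2] != ["stats", "timers"]:
        raise ValueError("unexpected metric name: %r" % name)
    _, _, server, _, swift_type, method, code, _, stat = parts
    return {
        "server": server,
        "http_method": method,
        "http_code": code,
        "unit": "ms",            # statsd timers report milliseconds
        "metric_type": "gauge",  # as on slide 19
        "stat": stat,
        "swift_type": swift_type,
    }


def matches(tags, query):
    """Return True if every term of the query matches the tag set (AND).

    Supported terms (simplified subset): key=value and key:regex.
    """
    for term in query.split():
        m = re.match(r"([^=:]+)([=:])(.*)", term)
        if m is None:
            raise ValueError("unsupported query term: %r" % term)
        key, op, value = m.groups()
        if key not in tags:
            return False
        if op == "=" and tags[key] != value:
            return False
        if op == ":" and re.search(value, tags[key]) is None:
            return False
    return True
```

Because every dimension is its own tag, a query such as `http_method:(PUT|GET) swift_type=object` needs no knowledge of where those fields sat in the original dotted name.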
