A text without a context is a pretext. – D. A. Carson
If I were to give you a graph with a single point on it and ask you to extract trending information from it, you’d think I’d gone mad. With only a single point of data, how could anyone extract anything meaningful from it? And yet we do it all the time. We extrapolate meaning from a single occurrence, ignoring the larger context.
After seeing a friend’s post on social media, haven’t we all caught ourselves thinking that our friend’s life seemed to be so much more exciting than our own? Or what about the stock market? As I write this, we appear to be entering a recession, and because of this, how many people are pulling their money out of the market fearing things will only get worse?
However, when we take a step back to incorporate information from more than a single source or a snapshot in time we gain perspective and see emerging patterns. After all, people rarely share their fears, failures, and struggles on social media – at least, not genuinely – and the stock market is cyclical with regard to recessions, and bull and bear markets.
In my last article, The “How”s, “What”s, and “Why”s of Elixir Telemetry, we looked at the value of monitoring our projects, how to do so with Telemetry, and the dangers of tracking the wrong metrics. After reading the article, however, you may have struggled to determine what to do with the data you captured.
Monitoring and capturing data is only part of the equation. In this article, we’ll look at the difference between measurements and metrics, what makes a good metric, things to keep in mind when choosing metrics, and finally how to start capturing metrics with Elixir’s Telemetry.Metrics library.
“Metric” is a difficult word to define and is oftentimes confused with “measurement”. Whereas a measurement is a unit-specific value for something (e.g. length, duration, volume, etc.), a metric is used to track and assess measurements (e.g. summaries, averages, distributions, etc.).
As explained in the book “Lean Analytics”, there are four rules of thumb for what makes a good metric: a good metric is comparative, it’s understandable, it’s a ratio or a rate, and it changes the way you behave.
Perhaps this is obvious since measuring implies comparison, but it’s not uncommon to erroneously compare two unrelated metrics. Benchmarks are excellent examples of a good comparison, because you are comparing two similar things such as comparable libraries or different versions of the same library. A poor comparison might be lines of code written by two different developers: who can write more doesn’t translate to something valuable.
Programmers understand terms like “code complexity”, “technical debt”, and “refactoring”, but most people don’t. While complexity can be quantified – sort of – technical debt can’t be, because it must be compared to an unknowable ideal version of the software. When choosing a metric to track, choose those which can be reasoned about and understood. “If people can’t remember [the metric] and discuss it, it’s much harder to turn a change in the data into a change in the culture.”
Unless the measurements you’re tracking are grouped in some way, all you’re really doing is collecting a bunch of numbers. Instead, we need to track our metrics as they relate to time or resources. Examples of common ratios are throughput (transactions per time period) and bandwidth (bits per available resource). There are three reasons “ratios tend to be the best metrics”:
Why do you want to monitor your processes? Is it to show off charts on Twitter and to management? Or instead, is it to enable you to understand the health of your project? As discussed in the previous article, “the whole point of capturing metrics about your system is to enable you to make good decisions about what to do next.” Furthermore, good metrics change your behavior “precisely because [they’re] aligned to your goals”.
If merely capturing measurements about our system isn’t enough, and instead we need to be capturing metrics, then what are we to do with the Telemetry library? After all, Telemetry only captures measurements. Thankfully, the Telemetry team has provided us with a library to help us think about the measurements and metrics we need to capture.
When I began exploring the Telemetry.Metrics library I made the mistake of assuming it would accumulate metrics for me. I misunderstood the first sentence on the README: “Telemetry.Metrics provides a common interface for defining metrics based on :telemetry events.” (Emphasis: mine). Telemetry.Metrics does little more than provide and validate the different metric types and offer unit conversions for time-based measurements. This makes sense in retrospect: how could the Telemetry team know the many different ways we would want to aggregate data?
Telemetry.Metrics currently provides the following five metric types:

- counter: a count of the total number of events
- sum: a running sum of a selected measurement
- last_value: the measurement from the most recent event
- summary: statistics (such as minimum, maximum, and percentiles) calculated from the measurements
- distribution: a histogram of measurements grouped into buckets
To start tracking one or more of the five metric types, you’ll need to create a reporter. To reiterate, in case you’re as thick as I am:
…the metric definitions themselves are not enough, as they only provide the specification of what is the expected end-result. The job of subscribing to events and building the actual metrics is a responsibility of reporters. This is the crucial part of this library design - it doesn’t aggregate events itself but relies on 3rd party reporters to perform this work in a way that makes the most sense for a particular monitoring system.
Building a reporter isn’t difficult, and to demonstrate we’ll build one to capture metrics from Plug.Telemetry. When complete, we’ll be able to include our reporter in a Plug or Phoenix project, enabling it to capture the following information:

- the number of requests served
- the total bytes transmitted
- the duration of the most recent page load
- the minimum and maximum page-load times
- a distribution of page-load times across preset buckets
Note: This is strictly for demonstration purposes and should not be used as-is in production.
As covered in the Writing Reporters documentation, a reporter has four responsibilities:
We’ll build our reporter using these responsibilities as our guide.
Before we can even begin creating our reporter we’ll need to install Telemetry and set up Plug.Telemetry. First things first: add the following to the deps/0 function in your project’s mix.exs file:
defp deps() do
  [
    {:telemetry, "~> 0.4.0"},
    {:telemetry_metrics, "~> 0.3"}
  ]
end
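After adding these, run mix deps.get to pull the new packages into your project.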
With Telemetry installed, we are now free to include Plug.Telemetry in our project’s pipeline:
defmodule MyApp.Router do
  use MyAppWeb, :router

  pipeline :browser do
    ...
    plug Plug.Telemetry, event_prefix: [:myapp, :plug]
    ...
  end

  ...
end
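With this in place, Plug.Telemetry will emit two events for every request that passes through the pipeline: [:myapp, :plug, :start] when the request enters the plug, and [:myapp, :plug, :stop], which carries a duration measurement, once the response is sent. It’s that stop event our metrics will be built on.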
With the basics out of the way, we can start building our reporter.
Although not necessary, it’s recommended that “most reporters should be backed by a process”. With this in mind, we’ll create our reporter as a GenServer. First, however, we need to determine which events we want to capture. Since this is a tutorial, let’s go ahead and capture them all.
In the start/2 function of our project’s Application module, let’s assign the metrics we want to monitor to the metrics variable and then provide that to what will eventually become our PlugReporter GenServer:
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    metrics = [
      Telemetry.Metrics.counter("myapp.plug.stop.duration"),
      Telemetry.Metrics.sum("myapp.plug.stop.duration"),
      Telemetry.Metrics.last_value("myapp.plug.stop.duration", unit: {:native, :millisecond}),
      Telemetry.Metrics.summary("myapp.plug.stop.duration", unit: {:native, :millisecond}),
      Telemetry.Metrics.distribution("myapp.plug.stop.duration", buckets: [200, 500, 1000], unit: {:native, :millisecond})
    ]

    children = [
      ...
      {MyApp.PlugReporter, metrics: metrics}
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end

  ...
end
For each metric, we provide the same event/measurement combination, myapp.plug.stop.duration. The event is myapp.plug.stop, while duration is the measurement. This is saying we want to capture “the time it takes to parse the request, dispatch it to the correct handler, and execute the handler”, but we only want to do so for events matching “myapp.plug.stop”.
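You can see that split for yourself by inspecting the struct a metric definition returns; a quick check in iex (these are the same fields our reporter will rely on later):

metric = Telemetry.Metrics.counter("myapp.plug.stop.duration")

metric.event_name
#=> [:myapp, :plug, :stop]

metric.measurement
#=> :duration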
For three of the metrics, last_value, summary, and distribution, we also include the “unit” option instructing Telemetry.Metrics to transform the duration value to milliseconds from the “native” value (i.e. nanoseconds). Finally, we provide a “buckets” option to the distribution metric allowing us to group durations by preset values. By doing this, we can create a histogram of page loads by specified values enabling us to see how frequently pages fall into, or out of, desired ranges.
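To make that conversion concrete, here’s a small sketch of my own (the numbers are made up, and while the :native unit corresponds to nanoseconds on most systems, it is OS-dependent):

# A made-up duration in :native units (about 1.2 ms on a nanosecond-native system).
duration_native = 1_200_000

# Converting it by hand; the unit option makes Telemetry.Metrics hand our reporter
# a measurement function that performs the equivalent conversion.
duration_ms = System.convert_time_unit(duration_native, :native, :millisecond)
#=> 1

# With buckets: [200, 500, 1000], a 1 ms load falls in the first (<= 200 ms) bucket,
# a 750 ms load in the 1000 bucket, and anything slower in the overflow bucket.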
Next, we need to build the PlugReporter so we can attach handlers to the different Telemetry events.
As we’ve seen, metrics are the accumulation of measurements as they relate to time, occurrence, or resource. If we are to track this accumulation, we’ll need a means of storing the measurements. We could store these values with a GenServer, but using ETS will provide more flexibility.
Here then is our initial reporter:
defmodule MyApp.PlugReporter do
  use GenServer
  require Logger

  alias Telemetry.Metrics.{Counter, Distribution, LastValue, Sum, Summary}

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts[:metrics])
  end

  def init(metrics) do
    Process.flag(:trap_exit, true)
    :ets.new(:metrix, [:named_table, :public, :set, {:write_concurrency, true}])

    groups = Enum.group_by(metrics, & &1.event_name)

    for {event, metrics} <- groups do
      id = {__MODULE__, event, self()}
      :telemetry.attach(id, event, &handle_event/4, metrics)
    end

    {:ok, Map.keys(groups)}
  end
end
From the code above, you can see most of the heavy lifting occurs in the init/1 function. Here, we perform four actions: trap “exits” to be handled by our terminate/2 callback, initialize a new ETS table, attach our telemetry events to an event handler, and finally return state. Let’s look at each in more detail.
First, we trap exit signals, which gives our GenServer the opportunity to handle necessary cleanup before finally terminating. We will handle this cleanup in the terminate/2 callback we’ll define later.
Next, we initialize an ETS table as a public, “named table”, which stores its data in a set. Naming the table lets us refer to it by name rather than by reference, and using a set ensures each key is stored only once. Finally, we turn on “write concurrency” to allow concurrent write access.
Our third action is to “attach” event handlers to the metrics we’re tracking. We do so by looping through each event group, creating a unique ID for each, and associating those IDs with the handled event.
Finally, we set the state which we can use later to detach our event handlers.
Each time an event comes in, we want to handle it according to each of the metrics we’re watching. To do this, we need to iterate through the list of metrics, passing the measurements and metadata to each.
def handle_event(_event_name, measurements, metadata, metrics) do
  metrics
  |> Enum.map(&handle_metric(&1, measurements, metadata))
end
With this in place, we can match each metric to its own function. But before we do that, we need to ensure we use the correct measurement value. As we saw when defining our metrics, we can specify a unit conversion. When we do, Telemetry.Metrics returns a function instead of a key for the measurement. Therefore, we need to handle both cases, and we do that with the following function:
defp extract_measurement(metric, measurements) do
  case metric.measurement do
    fun when is_function(fun, 1) ->
      fun.(measurements)

    key ->
      measurements[key]
  end
end
The %Counter{} metric is “used to keep a count of the total number of events.” Because we are using ETS, we can take advantage of its update_counter/4 function. This function “is guaranteed to be atomic and isolated” and simply increments the existing counter value by the provided amount. The fourth argument provides the default initial value if none exists, so we don’t need to worry about adding to nil.
def handle_metric(%Counter{}, _measurements, _metadata) do
  :ets.update_counter(:metrix, :counter, 1, {:counter, 0})

  Logger.info "Counter - #{inspect :ets.lookup(:metrix, :counter)}"
end
While not terribly useful in the context of web applications, there are times when you want to store the last measurement captured by your application. Rather than incrementing a counter value, we’ll use insert/2 to store the last value. If a value is already present, insert/2 overwrites it.
def handle_metric(%LastValue{} = metric, measurements, _metadata) do
  duration = extract_measurement(metric, measurements)
  key = :last_pageload_time

  :ets.insert(:metrix, {key, duration})

  Logger.info "LastValue - #{inspect :ets.lookup(:metrix, key)}"
end
Like %Counter{}, we can use update_counter/4 with the %Sum{} metric to increment the value. However, instead of incrementing the value by 1, here we increment the sum by the total byte_size of the page body.
def handle_metric(%Sum{}, _measurements, %{conn: conn}) do
  key = :bytes_transmitted
  body = IO.iodata_to_binary(conn.resp_body)

  :ets.update_counter(:metrix, key, byte_size(body), {key, 0})

  Logger.info "Sum - #{inspect :ets.lookup(:metrix, key)}"
end
The %Summary{} metric is “used to track and calculate statistics”. Values such as minimum, maximum, standard deviations, and percentiles are excellent choices to capture. You may wish to capture averages here, but unless they are paired with percentiles, the values may be misleading. “[T]he average hides outliers …, which are the ones you are interested in.” (Latency: A Primer)
In the handle_metric/3 function below, we are simply updating (using insert/2) the :summary key with the min and max values for duration.
def handle_metric(%Summary{} = metric, measurements, _metadata) do
  duration = extract_measurement(metric, measurements)

  summary =
    case :ets.lookup(:metrix, :summary) do
      [summary: {min, max}] ->
        {
          min(min, duration),
          max(max, duration)
        }

      _ ->
        {duration, duration}
    end

  :ets.insert(:metrix, {:summary, summary})

  Logger.info "Summary - #{inspect summary}"
end
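If you do also want an average, one option (a sketch of my own, not part of the reporter above) is to carry a count and a running total alongside the min and max, then derive the mean whenever you read the value back:

def handle_metric(%Summary{} = metric, measurements, _metadata) do
  duration = extract_measurement(metric, measurements)

  # Store {min, max, count, total} so both the extremes and a mean can be derived.
  summary =
    case :ets.lookup(:metrix, :summary) do
      [summary: {min, max, count, total}] ->
        {min(min, duration), max(max, duration), count + 1, total + duration}

      _ ->
        {duration, duration, 1, duration}
    end

  :ets.insert(:metrix, {:summary, summary})
end

# Reading it back later:
#   {_min, _max, count, total} = :ets.lookup_element(:metrix, :summary, 2)
#   average = total / count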
The last metric you may wish to track is %Distribution{}. As already discussed, this metric allows you to provide a collection of integers by which to group the measurements. For example, if we defined the buckets as [200, 500, 1000] as we did above in the MyApp.Application module, we can then group the page views based on how long it took to build them: less than 200 ms, less than 500 ms, less than 1,000 ms, and then everything over that.
def handle_metric(%Distribution{} = metric, measurements, _metadata) do
  duration = extract_measurement(metric, measurements)

  update_distribution(metric.buckets, duration)

  Logger.info "Distribution - #{inspect :ets.match_object(:metrix, {{:distribution, :_}, :_})}"
end

defp update_distribution([], _duration) do
  key = {:distribution, "1000+"}
  :ets.update_counter(:metrix, key, 1, {key, 0})
end

defp update_distribution([head | _buckets], duration) when duration <= head do
  key = {:distribution, head}
  :ets.update_counter(:metrix, key, 1, {key, 0})
end

defp update_distribution([_head | buckets], duration) do
  update_distribution(buckets, duration)
end
For the update_distribution/2 function, we take advantage of Elixir’s pattern-matching magic to recursively check whether the duration is less than a specific bucket. If it is, we increment the value for that bucket. If not, we look at the next value until there are none left, in which case we increment the value for “1000+”.
The last step of building a reporter is to handle termination, leaving our system in a clean state if things go south. In our example, we do this by implementing the GenServer.terminate/2 callback. We’re already “trapping” exits and providing the list of events through our GenServer’s state; now we just need to loop through those events, detaching from each.
def terminate(_, events) do
  events
  |> Enum.each(&:telemetry.detach({__MODULE__, &1, self()}))

  :ok
end
With the final step in place, we have a complete, if not particularly useful, reporter to track Plug events. If you were to build this for a production system you would, at the very least, want to track each of the metrics by the page on which the event occurred and some sort of timestamp (hour, minute, or second depending on the site’s traffic). Furthermore, you would want to break each of the metrics out into their own module, and also put serious thought into which metrics are the most valuable to you and which you should ignore.
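To illustrate that first suggestion, a hypothetical variation of the counter clause (my own sketch, reusing the :metrix table and the %Plug.Conn{} struct available in the event metadata) might group counts by page and by minute:

def handle_metric(%Counter{}, _measurements, %{conn: conn}) do
  # Bucket the count by request path and by minute so trends show up over time.
  minute = div(System.os_time(:second), 60)
  key = {:counter, conn.request_path, minute}

  :ets.update_counter(:metrix, key, 1, {key, 0})
end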
Knowing that measurements by themselves provide only an isolated perspective, and with a functional reporter in hand, you may now be left wondering what purpose there is in using the Telemetry library outside of Telemetry.Metrics. There are at least three reasons:
Notice that in the first two cases, they are only valuable with context. In the first case, you must watch the numbers as you test your feature over and over. This is a metric: performance over iteration. In the second case, you must already know the boundaries in order to know what measurements are outside of them and thus trigger an event.
In the same way that a photograph captures only the briefest moment in time, individual measurements provide only the slightest glimpse into your system. To see beyond that glimpse we must aggregate measurements into metrics, focusing on those which are comparative, understandable, ratios, and most importantly, those which change the way we behave.