
Know before they tell you.
Your checkout success rate dropped 12% overnight. You know this because a customer posted on Twitter. Six hours of lost revenue. CEO asking questions. You pull logs, grep for errors, find nothing obvious because failures were scattered across services with no common signature.
Metrics would have caught this at 3am. A dashboard showing checkout.success_rate dipping below threshold triggers an alert. You investigate before anyone notices.
Three Types
Counters go up. Never down. Requests served, errors caught, bytes sent.
Histograms track distributions. Response times, payload sizes. You get averages, percentiles, and counts from a single instrument.
Gauges go up and down. Active connections, queue depth, memory usage. They represent current state.
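The distinction shows up at the API level: a counter only ever adds, while a gauge overwrites. A minimal sketch — `setGauge` and its backing store are hypothetical, named in the spirit of the helpers built later in this article:

```typescript
type Attributes = Record<string, string | number | boolean>;

const gauges: Record<string, number> = {};

// A gauge stores the latest value; re-recording overwrites instead of accumulating.
export function setGauge(name: string, value: number, attributes?: Attributes) {
  const key = attributes ? `${name}:${JSON.stringify(attributes)}` : name;
  gauges[key] = value;
}

setGauge("db.pool.active_connections", 12);
setGauge("db.pool.active_connections", 9); // went down -- legal for a gauge, never for a counter
```

If you find yourself wanting to decrement a counter, you probably wanted a gauge.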
What to Measure
Start with two metrics: request count and request duration. These give you throughput and latency across every endpoint.
```ts
// These two metrics cover most operational concerns
incrementCounter("http.server.request.count", {
  "http.request.method": method,
  "http.route": route,
  "http.response.status_code": status,
});

recordDuration("http.server.request.duration", elapsed, {
  "http.request.method": method,
  "http.route": route,
});
```

Add dimensions to slice the data. A bare request counter tells you almost nothing. Break it down by http.request.method, http.route, and http.response.status_code and you can answer questions like "which endpoint has the most 500s?"
Once operational metrics are in place, add business metrics. Orders completed, payments processed, cart abandonment rate. These connect infrastructure health to revenue.
```ts
// Business metrics connect code to outcomes
incrementCounter("order.completed", {
  "payment.method": paymentMethod,
  "deployment.region": region,
});

incrementCounter("cart.abandoned", { "checkout.step": "payment" });
```

The Abstraction
Start with named functions that match your domain. Application code calls these. The implementation can change without touching call sites.
```ts
type Attributes = Record<string, string | number | boolean>;

const counters: Record<string, number> = {};
const histograms: Record<string, number[]> = {};

export function incrementCounter(name: string, attributes?: Attributes) {
  const key = serializeKey(name, attributes);
  counters[key] = (counters[key] ?? 0) + 1;
  console.log("[metric:counter]", name, attributes);
}

export function recordDuration(name: string, ms: number, attributes?: Attributes) {
  const key = serializeKey(name, attributes);
  (histograms[key] ??= []).push(ms);
  console.log("[metric:histogram]", name, ms, attributes);
}

export async function timed<T>(
  name: string,
  fn: () => Promise<T>,
  attributes?: Attributes
): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    recordDuration(name, performance.now() - start, attributes);
  }
}

function serializeKey(name: string, attributes?: Attributes): string {
  if (!attributes) return name;
  const sorted = Object.entries(attributes).sort(([a], [b]) => a.localeCompare(b));
  return `${name}:${sorted.map(([k, v]) => `${k}=${v}`).join(",")}`;
}
```

This logs to console during development. In production, swap the implementation. The call sites stay the same.
Using It
```ts
import { incrementCounter, timed } from "./metrics";

export async function placeOrder(userId: string, items: CartItem[]) {
  return timed("order.processing.duration", async () => {
    const order = await createOrder(userId, items);
    incrementCounter("order.created", {
      "order.status": "completed",
      "order.item_count": items.length,
    });
    return order;
  });
}
```

The timed helper wraps any async function and records duration automatically. Counters track discrete events with attributes for filtering.
Wire Up a Backend
The abstraction pays off when you connect a real metrics backend. Same function signatures, different implementation.
```sh
npm install @opentelemetry/sdk-node \
  @opentelemetry/api \
  @opentelemetry/exporter-metrics-otlp-http
```

```ts
import { metrics, Attributes } from "@opentelemetry/api";

const meter = metrics.getMeter("checkout-api");

const counters = {
  "order.created": meter.createCounter("order.created"),
  "payment.processed": meter.createCounter("payment.processed"),
  "error.count": meter.createCounter("error.count"),
};

const histograms = {
  "order.processing.duration": meter.createHistogram("order.processing.duration", { unit: "ms" }),
  "payment.processing.duration": meter.createHistogram("payment.processing.duration", { unit: "ms" }),
};

export function incrementCounter(
  name: keyof typeof counters,
  attributes?: Attributes
) {
  counters[name].add(1, attributes);
}

export function recordDuration(
  name: keyof typeof histograms,
  ms: number,
  attributes?: Attributes
) {
  histograms[name].record(ms, attributes);
}

export async function timed<T>(
  name: keyof typeof histograms,
  fn: () => Promise<T>,
  attributes?: Attributes
): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    recordDuration(name, performance.now() - start, attributes);
  }
}
```

OpenTelemetry exports to Sentry or any OTLP-compatible backend. Your application code imports from ./metrics and never knows the difference.
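One piece still has to exist somewhere: until a MeterProvider is registered, getMeter returns a no-op meter and nothing is exported. A wiring sketch, assuming a recent @opentelemetry/sdk-metrics (pulled in as a dependency of sdk-node) where readers are passed to the constructor; the endpoint URL is a placeholder for your collector:

```typescript
import { metrics } from "@opentelemetry/api";
import { MeterProvider, PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";

// Push metrics to an OTLP endpoint once a minute.
const provider = new MeterProvider({
  readers: [
    new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: "http://localhost:4318/v1/metrics" }),
      exportIntervalMillis: 60_000,
    }),
  ],
});

// Must run before any getMeter() call records data.
metrics.setGlobalMeterProvider(provider);
```

Run this once at process startup, before the metrics module is imported anywhere else.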
One caution: never add user.id to metric attributes. Every unique value creates a new time series. 10,000 users means 10,000 time series per metric. Your bill and your queries will suffer.
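Unbounded values can usually be collapsed into a fixed set of buckets before they reach an attribute. A sketch — latencyBucket and its thresholds are illustrative, not a standard:

```typescript
// Collapse an unbounded value into a handful of buckets so
// attribute cardinality stays fixed regardless of traffic.
function latencyBucket(ms: number): string {
  if (ms < 100) return "fast";
  if (ms < 1000) return "acceptable";
  return "slow";
}

// Bad: one time series per user.
//   incrementCounter("order.created", { "user.id": userId });
// Good: at most three series per metric.
//   incrementCounter("order.created", { "order.latency_bucket": latencyBucket(elapsed) });
```

If you need per-user detail, that belongs in logs or traces, where high cardinality is expected, not in metrics.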