Monday, June 22, 2020

SLx (SLI, SLO, SLA, SLT) and Toil in a nutshell

These terms are drawn from Google's SRE principles:

  • SLx (SL"I/O/A/T")
  • SLI - Service Level Indicator
  • SLO - Service Level Objective
  • SLA - Service Level Agreement
  • SLT - Service Level Targets
  • Error Budget
  • Toil/Toil Budget

SLx

SLx metrics can be tracked with good monitoring tools and displayed on dashboards, e.g. Grafana or New Relic.

SLI (Service Level Indicator)

SLIs are quantitative measurements of a service's behavior, for example:
  • Request latency
  • Batch throughput
  • Failures per request
Example: 95th percentile latency of homepage requests over the past 5 minutes < 300 ms
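
To make the example concrete, here is a minimal sketch of how such an SLI check could be evaluated for one 5-minute window. The function names and the sample latencies are hypothetical, and the nearest-rank percentile method is an assumption; monitoring tools may interpolate differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    # Multiply before dividing so integer inputs give an exact rank.
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def latency_sli_ok(latencies_ms, pct=95, threshold_ms=300):
    """True if the pct-th percentile latency is below the threshold."""
    return percentile(latencies_ms, pct) < threshold_ms

# Hypothetical homepage latencies (ms) collected over the past 5 minutes.
window = [120, 140, 95, 180, 210, 250, 130, 160, 110, 450]
print(latency_sli_ok(window))
```

With only ten samples, a single slow request dominates the 95th percentile, so real windows usually need many more data points to be meaningful.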

SLO (Service Level Objective)

Binding target for a collection of SLIs.

Example: the 95th percentile homepage SLI will succeed 99.9% of the time over the trailing year.
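
One way to read this example is as a fraction of measurement windows in which the SLI check passed. The sketch below assumes that interpretation; the function names and the window counts are hypothetical.

```python
def slo_compliance(window_results):
    """Fraction of measurement windows in which the SLI check passed."""
    if not window_results:
        raise ValueError("no measurement windows")
    passed = sum(1 for ok in window_results if ok)
    return passed / len(window_results)

def slo_met(window_results, target=0.999):
    """True if the observed compliance reaches the SLO target."""
    return slo_compliance(window_results) >= target

# Hypothetical history: 10,000 windows, 8 of which failed the SLI check.
history = [True] * 9992 + [False] * 8
print(slo_compliance(history), slo_met(history))
```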

SLA (Service Level Agreement)

Business agreement between a customer and service provider typically based on SLOs.

Example: service credits are owed if the 95th percentile homepage SLI succeeds less than 99.5% of the time over the trailing year.
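
The SLA threshold (99.5%) is looser than the SLO target (99.9%), which leaves a margin between missing an internal objective and owing credits. A small sketch of that relationship, using a hypothetical function and labels, and the two percentages from the examples in this post:

```python
def classify(compliance, slo_target=0.999, sla_threshold=0.995):
    """Classify observed compliance against the SLO and the SLA.

    The thresholds come from the examples in this post; the labels
    are illustrative only.
    """
    if compliance < sla_threshold:
        return "SLA breached"   # service credits are owed
    if compliance < slo_target:
        return "SLO missed"     # internal target missed, no credits
    return "SLO met"

for observed in (0.9995, 0.997, 0.990):
    print(observed, classify(observed))
```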

SLT (Service Level Targets)

A service level target is a key element of an SLA between you as a service provider and an end-user customer.
SLTs measure your performance as a service provider and are designed to avoid disputes between the two parties that stem from misunderstandings.

Error budget

The rate at which the SLOs can be missed, tracked on a daily or weekly basis. For a 99.9% SLO, the error budget is the remaining 0.1%.

The main advantage of an error budget is that it is a quantitative measurement shared between the product and SRE teams, which means we can balance innovation (feature rollouts) and stability (reliability) at an appropriate level.
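
As a rough illustration of the arithmetic, a 99.9% SLO over a 365-day year allows about 525.6 minutes of missed objective. The sketch below uses hypothetical function names and a simple "release only while budget remains" policy, which is an assumption rather than a prescribed rule.

```python
def allowed_downtime_minutes(slo_target, days):
    """Total error budget for the period, expressed in minutes."""
    return (1 - slo_target) * days * 24 * 60

def budget_remaining(total_windows, failed_windows, slo_target=0.999):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total_windows
    return 1 - failed_windows / allowed_failures

def can_release(total_windows, failed_windows, slo_target=0.999):
    """Assumed policy: ship new features only while budget remains."""
    return budget_remaining(total_windows, failed_windows, slo_target) > 0

print(allowed_downtime_minutes(0.999, 365))
# Hypothetical: 10,000 windows so far, 8 of them failed.
print(budget_remaining(10_000, 8), can_release(10_000, 8))
```

When the remaining budget goes negative, the same number gives both teams a shared reason to slow rollouts and focus on reliability.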

Toil

Toil is the kind of work tied to running a production service that tends to be:
  • Manual
  • Repetitive
  • Automatable
  • Tactical
  • Devoid of enduring value
  • Scaling linearly as the service grows