Pattern: Service Level Agreement


The final version of this pattern is featured in our book Patterns for API Design: Simplifying Integration with Loosely Coupled Message Exchanges.

a.k.a. Quality-of-Service Policies, Explicit and Structured Quality Goals

Context

An API contract or an API Description has been defined for the API, including the functional interface specification (i.e., request and response messages with parameters) of the operations. However, the dynamic behavior of the API operations has not yet been articulated precisely in terms of qualitative and quantitative Quality-of-Service (QoS) characteristics. The support of the service along its lifecycle has not been articulated precisely either (e.g., guaranteed lifetime and mean time to repair).

Problem

How can an API client learn about the specific quality-of-service characteristics of an API and its endpoint operations? How can these characteristics, and the consequences of not meeting them, be defined and communicated in a measurable way?

Forces

Partially conflicting concerns make it hard to specify QoS characteristics in a way that is acceptable both for clients and providers:

  • Business agility and vitality
  • Attractiveness from the consumer point of view
  • Availability
  • Performance and scalability
  • Security and privacy
  • Government regulations and legal obligations
  • Cost-efficiency and business risks from a provider point of view

Pattern forces are explained in depth in the book.

Solution

As an API product owner, establish a structured, quality-oriented Service Level Agreement that defines testable service-level objectives.
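
To illustrate what "structured" and "testable" can mean in practice, the following Python sketch represents each service-level objective as data: a measured indicator, a threshold, and a comparison. It is a minimal sketch only. The class names, field names, indicator keys, and the 99.9% availability figure are assumptions made for this illustration; they are not prescribed by the pattern.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ServiceLevelObjective:
    """One testable objective (all names here are hypothetical)."""
    name: str          # human-readable name of the objective
    indicator: str     # key of the measured value this objective is tested against
    threshold: float   # agreed limit
    unit: str          # unit of the threshold, kept for documentation purposes
    compare: Callable[[float, float], bool] = operator.le  # default: measured <= threshold

    def is_met(self, measurements: Dict[str, float]) -> bool:
        """Return True if the measured indicator satisfies the objective."""
        return self.compare(measurements[self.indicator], self.threshold)


@dataclass(frozen=True)
class ServiceLevelAgreement:
    """Groups objectives and states the consequence of not meeting them."""
    service: str
    objectives: List[ServiceLevelObjective]
    penalty: str       # free-text consequence, for example a service credit

    def violations(self, measurements: Dict[str, float]) -> List[str]:
        """Names of all objectives that the measurements do not satisfy."""
        return [o.name for o in self.objectives if not o.is_met(measurements)]


# Illustrative values only; the availability objective is an assumption.
sla = ServiceLevelAgreement(
    service="payroll",
    objectives=[
        ServiceLevelObjective("response time", "response_time_s", 0.93, "s"),
        ServiceLevelObjective("availability", "uptime_pct", 99.9, "%", operator.ge),
    ],
    penalty="service credit",
)

measured = {"response_time_s": 1.2, "uptime_pct": 99.95}
print(sla.violations(measured))  # prints ['response time']
```

Keeping the comparison explicit per objective makes it possible to check "at most" objectives (response time) and "at least" objectives (availability) with the same code path.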

Sketch

[Figure: solution sketch for this pattern from pre-book times]

Example

Imagine a fictitious SaaS provider, offering a salary administration software including an API for a payroll service. The provider states that:

“The payroll service has a response time of maximally 0.93 seconds.”

The response time might need some clarification:

“The response time is measured from the time the request arrives at the API endpoint until the response has been fully processed.”

Note that this does not include the time it takes for the request and the response to travel across the network between the client and the provider’s API endpoint. Furthermore, the provider assures:

“The Payroll SLO will be met for 99% of the requests, otherwise the customer will receive a discount credit of 10% on the current billing period. To receive a credit the customer must submit a claim to our customer support center including the dates and times of the incident.”
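
The payroll example can be checked mechanically. The Python sketch below tests the stated objective (0.93 seconds for 99% of the requests) against a list of server-side response times and computes the 10% credit. The function names are hypothetical, and evaluating the 99% target once per billing period is an assumption, since the example does not state the measurement window. The claim submission step is not modeled.

```python
from typing import Sequence

RESPONSE_TIME_LIMIT_S = 0.93  # "response time of maximally 0.93 seconds"
REQUIRED_FRACTION = 0.99      # "met for 99% of the requests"
CREDIT_RATE = 0.10            # "discount credit of 10%"


def fraction_within_limit(response_times_s: Sequence[float],
                          limit_s: float = RESPONSE_TIME_LIMIT_S) -> float:
    """Share of requests whose measured response time met the limit."""
    if not response_times_s:
        raise ValueError("no requests were measured in this period")
    within = sum(1 for t in response_times_s if t <= limit_s)
    return within / len(response_times_s)


def credit_due(response_times_s: Sequence[float], invoice_amount: float) -> float:
    """Credit for the billing period, or 0.0 if the SLO was met.

    Assumption: the 99% target is evaluated over all requests of one
    billing period; the example does not specify the window.
    """
    if fraction_within_limit(response_times_s) >= REQUIRED_FRACTION:
        return 0.0
    return CREDIT_RATE * invoice_amount


# Made-up measurements: 985 fast requests and 15 slow ones out of 1000.
period = [0.4] * 985 + [1.5] * 15
print(fraction_within_limit(period))  # prints 0.985, below the 0.99 target
print(credit_due(period, invoice_amount=200.0))
```

Because the agreed response time is measured at the API endpoint, the input values would have to come from provider-side measurements; times observed by the client also include network latency.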

Are you missing implementation hints? Our publications provide them (for selected patterns).

Consequences

The resolution of pattern forces and other consequences are discussed in our book.

Known Uses

Many public APIs on the Web do not expose explicit SLAs but ask their users to agree to their terms and conditions, which may cover related topics. Usually, no hard guarantees are given; the SLOs are only outlined and not specified formally. However, many public cloud providers have explicit SLAs, for instance Amazon Web Services (AWS) and Microsoft Azure. At the time of writing, known uses include the following providers and offerings:

  • Amazon EC2 commits to an SLO of a “Monthly Uptime Percentage” that is specified in terms of “minutes during the month in which Amazon EC2 [..] was in the state of Region Unavailable”, which is further specified in the agreement.
  • Microsoft Azure SLA for Functions also defines a “Monthly Uptime Percentage” that is calculated as “Monthly Uptime % = (Maximum Available Minutes-Downtime)/(Maximum Available Minutes) x 100”. The SLA goes on to further specify downtime and “Maximum Available Minutes”. It limits the liability of Microsoft for downtimes to service credits as the only remedy for SLA violations.
  • The combination of a precise uptime guarantee with service credits as the only compensation is a commonly used SLA variant. One example using it is Singlewire.
  • Google Compute Engine gives a similar uptime guarantee but makes it conditional on the client having its instances “hosted across two or more zones in the same region combined with the inability to launch replacement Instances in any zone in that region”. Downtime is measured as “a period of one or more consecutive minutes of Downtime. Partial minutes or Intermittent Downtime for a period of less than one minute will not be counted towards any Downtime Periods.”
  • SLAs are commonly used in strategic outsourcing and application management services (Miksovic and Zimmermann (2011)), for instance to govern help desk response times and patch delivery (e.g., bug fixes of different severities).
  • Known uses can also be found in the database/information management community; see Chapter 5 of Lehner and Sattler (2013).
  • Optimizely defines in its Service Agreement that it “agrees to maintain commercially reasonable technical and organizational measures designed to secure its systems from unauthorized disclosure and modification” and lists a few such measures explicitly like “storing Customer Data on servers located in a physically secured location” and “using firewalls, access controls, and similar security technology”. The security-related parts of an SLA often stay at this level, with informally specified SLOs, because security is a quality that is hard to quantify.
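
The uptime formula quoted in the Microsoft Azure item above can be written down directly. The Python sketch below does so and adds one possible reading of the downtime counting rule quoted in the Google Compute Engine item: only whole minutes of an outage count, and outages shorter than one minute are ignored. That reading, the function names, and the sample outage durations are assumptions for illustration; the actual agreements define these terms in more detail.

```python
from typing import Iterable


def monthly_uptime_percentage(max_available_minutes: float,
                              downtime_minutes: float) -> float:
    """(Maximum Available Minutes - Downtime) / Maximum Available Minutes x 100."""
    if max_available_minutes <= 0:
        raise ValueError("max_available_minutes must be positive")
    if not 0 <= downtime_minutes <= max_available_minutes:
        raise ValueError("downtime must lie between 0 and the available minutes")
    return (max_available_minutes - downtime_minutes) / max_available_minutes * 100


def counted_downtime_minutes(outage_durations_s: Iterable[float]) -> int:
    """Whole minutes of downtime per outage; partial minutes are dropped.

    Assumption: this is one interpretation of the quoted rule that partial
    minutes and downtime of less than one minute are not counted.
    """
    return sum(int(duration // 60) for duration in outage_durations_s)


# Made-up outages in a 30-day month: 45 s, 150 s, and one hour.
minutes_in_month = 30 * 24 * 60                           # 43200
downtime = counted_downtime_minutes([45, 150, 3600])      # 0 + 2 + 60 = 62
print(downtime)                                           # prints 62
print(round(monthly_uptime_percentage(minutes_in_month, downtime), 3))
```

How downtime is counted matters as much as the percentage itself: under this reading, many short interruptions would not reduce the reported uptime at all.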

More Information

Beyer et al. (2016) devote an entire chapter to Service Level Objectives, including measurements for them, called Service Level Indicators (SLIs). A post in the Google Cloud Platform Blog covers SLA, SLO, and SLI management as well.

Related Patterns

A Service Level Agreement accompanies the API Description. SLAs may govern the usage of instances of many patterns in this pattern language, such as those in the representation and quality categories.

The details of Rate Limits and Pricing Plans can be included in a Service Level Agreement.

References

Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. 2016. Site Reliability Engineering: How Google Runs Production Systems. 1st ed. O’Reilly.
Cervantes, Humberto, and Rick Kazman. 2016. Designing Software Architectures: A Practical Approach. 1st ed. Addison-Wesley.
C-SIG. 2014. “Cloud Service Level Agreement Standardisation Guidelines.” Cloud Select Industry Group, Service Level Agreements Subgroup; European Commission. https://ec.europa.eu/newsroom/dae/redirection/document/6138.
Dimension Data. 2013. “Comparing Public Cloud Service Level Agreements.”
Fehling, Christoph, Frank Leymann, Ralph Retter, Walter Schupeck, and Peter Arbitter. 2014. Cloud Computing Patterns: Fundamentals to Design, Build, and Manage Cloud Applications. Springer.
Lehner, Wolfgang, and Kai-Uwe Sattler. 2013. “Cloud-Specific Services for Data Management.” In Web-Scale Data Management for the Cloud, 137–60. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4614-6856-1_5.
Miksovic, Christoph, and Olaf Zimmermann. 2011. “Architecturally Significant Requirements, Reference Architecture, and Metamodel for Knowledge Management in Information Technology Services.” In Proc. 9th Working IEEE/IFIP Conference on Software Architecture (WICSA), 270–79. https://doi.org/10.1109/WICSA.2011.43.