A common SRE practice is to monitor the latency of a service (what we might call service time) and produce three SLIs:
- P50 – the 50th percentile latency of all requests in each 60-second sample
- P90 – the 90th percentile latency of all requests in each 60-second sample
- P99 – the 99th percentile latency of all requests in each 60-second sample
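As a minimal sketch of how these three SLIs could be computed, the Python below groups request latencies into 60-second windows and takes the nearest-rank percentile of each window. The function names `percentile` and `latency_slis`, the `(timestamp, latency)` input shape and the choice of the nearest-rank method are assumptions made for illustration; monitoring tools often use interpolated or histogram-based estimates instead.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_slis(requests, window_seconds=60, percentiles=(50, 90, 99)):
    """Compute one set of percentile SLIs per sample window.

    requests: iterable of (timestamp_seconds, latency_seconds) pairs
              (a hypothetical input shape chosen for this sketch).
    Returns {window_start_seconds: {"P50": ..., "P90": ..., "P99": ...}}.
    """
    windows = defaultdict(list)
    for timestamp, latency in requests:
        # Bucket each request by the window in which it was observed.
        windows[int(timestamp // window_seconds)].append(latency)

    return {
        index * window_seconds: {
            f"P{p}": percentile(latencies, p) for p in percentiles
        }
        for index, latencies in sorted(windows.items())
    }


if __name__ == "__main__":
    # Made-up sample data: (timestamp in seconds, latency in seconds).
    sample = [(1, 0.20), (15, 0.35), (42, 2.10), (61, 0.25), (95, 0.40)]
    for start, slis in latency_slis(sample).items():
        print(start, slis)
```

Each window produces its own P50, P90 and P99, so a single slow minute shows up as a spike in that minute's figures rather than being averaged away over a longer period.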
The actual percentile figures chosen can, of course, be tailored to our own needs. With these numbers we might use P50 for trending, P90 in the SLO we advertise to the business, and P99 as an internal measure to highlight potential future problems. To put the P90 SLI in the context of an SLO, we might say to business managers:
Our target performance for the Sales Management application is that 90 percent of all response times in each minute will be less than 3 seconds.
We can monitor not only the service response time presented to users (sometimes referred to as the Total Request Latency) but also the underpinning services, using the same SLIs and setting internal SLOs accordingly. For example, an internal SLO might read like this:
Our target performance for the Sales Management database is that 90 percent of database queries in each minute must complete in less than 100ms.
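One way to evaluate an SLO of this shape, sketched here under stated assumptions, is to count for each minute the fraction of queries that finished under the threshold and compare it with the 90 percent target. The function name `slo_compliance`, the `(timestamp, latency_ms)` input shape and the report layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def slo_compliance(samples, threshold_ms=100.0, target=0.90, window_seconds=60):
    """Check a latency SLO window by window.

    samples: iterable of (timestamp_seconds, latency_ms) pairs
             (a hypothetical input shape chosen for this sketch).
    A window meets the SLO when at least `target` of its queries
    completed in strictly less than `threshold_ms`.
    """
    windows = defaultdict(list)
    for timestamp, latency_ms in samples:
        windows[int(timestamp // window_seconds)].append(latency_ms)

    report = {}
    for index, latencies in sorted(windows.items()):
        fast = sum(1 for latency in latencies if latency < threshold_ms)
        fraction_fast = fast / len(latencies)
        report[index * window_seconds] = {
            "fraction_fast": fraction_fast,
            "slo_met": fraction_fast >= target,
        }
    return report


if __name__ == "__main__":
    # Made-up data: minute 0 has 9 of 10 fast queries, minute 1 has 8 of 10.
    minute_0 = [(i, 50) for i in range(9)] + [(9, 250)]
    minute_1 = [(60 + i, 50) for i in range(8)] + [(68, 250), (69, 250)]
    for start, result in slo_compliance(minute_0 + minute_1).items():
        print(start, result)
```

Counting the fraction of fast queries directly avoids having to pick a percentile estimation method; with a nearest-rank P90 the two checks agree, since P90 is below the threshold exactly when at least 90 percent of the samples are.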