As SREs, we recommend monitoring four types of metric, known as the Four Golden Signals.

The list, in priority order, is:

Traffic – demand on a system or service, measured in transactions per second or a similar, appropriate unit.

Errors – how many failures we are getting. This could be an HTTP 500 (Internal Server Error) response, but could equally be incorrect output.

Latency – response time or service time. We measure the latency of failed requests separately from that of successful ones, on the basis that if it’s going to fail you want it to fail fast. Also, you don’t want error latency values skewing normal latency.

Saturation – to some extent the same as utilisation, e.g. CPU utilisation, although it could also be queue length, free memory, etc. Basically, anything that indicates load vs. capacity.
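
To make the four signals concrete, here is a minimal Python sketch that summarises one observation window. It is an illustration only, not a recommended implementation: the `Request` record, the `golden_signals` function and its parameters are all hypothetical names, treating any status of 500 or above as a failure is an assumption (it would not catch incorrect output), and queue length against queue capacity is just one possible stand-in for saturation.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, queue_length, queue_capacity):
    """Summarise one observation window as the four golden signals."""
    # Assumption: a 5xx status marks a failure. Incorrect output would
    # need an application-level check that is not modelled here.
    failed = [r for r in requests if r.status >= 500]
    succeeded = [r for r in requests if r.status < 500]

    return {
        # Traffic: demand, as requests per second over the window.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: share of requests that failed.
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        # Latency: kept separate so fast failures do not skew the
        # figure for successful requests, and vice versa.
        "latency_ok_ms": (
            median(r.duration_ms for r in succeeded) if succeeded else None
        ),
        "latency_error_ms": (
            median(r.duration_ms for r in failed) if failed else None
        ),
        # Saturation: load vs. capacity, here queue depth vs. queue size.
        "saturation": queue_length / queue_capacity,
    }


if __name__ == "__main__":
    sample = [
        Request(100, 200),
        Request(120, 200),
        Request(140, 200),
        Request(5, 500),
    ]
    print(golden_signals(sample, window_seconds=2,
                         queue_length=5, queue_capacity=20))
```

Splitting the requests into two groups before computing latency is the part that matters most here: a single failure that returns in 5 ms would otherwise pull the overall figure down and hide how long successful requests really take.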