System Overview

The System Overview APIs let you query summary metrics that describe the general health of your cluster. Common questions include:

  • Which services are experiencing the highest rate of errors?

  • Which services are consuming the most bandwidth / traffic?

  • Which services are experiencing an unusual increase or decrease in traffic?

These queries can be further filtered based on a number of criteria, enabling you to get these metrics for only the services that you are interested in.

Specifying a Time Duration and Number of Samples

The Flowmill service is built around a time-series database that supports flexible aggregation over time. Each query specifies both the time range to cover and the number of data points to return.

V1 - Start, End, Step Method

In version 1 of the System Overview APIs, time ranges are specified using an inclusive start timestamp, an exclusive end timestamp, and the duration that a single data point should cover (the step). The start and end may be adjusted to accommodate the number of steps requested. Additionally, samples will be aggregated / summarized as appropriate to reduce the number of data points returned.
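As a rough illustration of the start, end, step model, the sketch below splits a half-open range into step-sized windows. The function name `v1_windows` is hypothetical, timestamps are assumed to be integer seconds, and pushing the end later to fit a whole number of steps is only one plausible reading of "adjusted"; the service may align the range differently.

```python
def v1_windows(start, end, step):
    """Split the half-open range [start, end) into step-sized windows.

    Assumption: if the range is not a whole number of steps, the end is
    moved later so that every window has the same width. The real service
    may adjust start and end in some other way.
    """
    if step <= 0 or end <= start:
        raise ValueError("need step > 0 and end > start")
    # Integer ceiling division: number of steps needed to cover the range.
    count = (end - start + step - 1) // step
    return [(start + i * step, start + (i + 1) * step) for i in range(count)]


if __name__ == "__main__":
    # A 100-second range with a 30-second step needs four windows.
    for window in v1_windows(0, 100, 30):
        print(window)
```

Each tuple is one data point's window, inclusive at its start and exclusive at its end.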

V2 - End, Duration, Number of Samples Method

In version 2 of the System Overview APIs, time ranges are specified using an exclusive end timestamp, the duration of the time range requested and the number of samples to return. Samples will be aggregated / summarized as appropriate to reduce the number of data points returned.
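The end, duration, number of samples model can be sketched the same way. Here the start is derived as end minus duration and the range is divided evenly among the samples. The function name `v2_windows` is hypothetical, and even division is an assumption; the service may round window boundaries differently.

```python
def v2_windows(end, duration, num_samples):
    """Divide the range [end - duration, end) into num_samples equal windows.

    Assumption: windows are equal width and the end timestamp is exclusive,
    as described for the v2 APIs. Boundary rounding is not specified here.
    """
    if duration <= 0 or num_samples <= 0:
        raise ValueError("need duration > 0 and num_samples > 0")
    start = end - duration
    width = duration / num_samples
    return [(start + i * width, start + (i + 1) * width)
            for i in range(num_samples)]


if __name__ == "__main__":
    # One hour ending at t=3600, returned as four 15-minute samples.
    for window in v2_windows(3600, 3600, 4):
        print(window)
```

Compared with version 1, the caller fixes the number of samples directly instead of deriving it from a step size.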

Types of Metrics Available

There are currently four summary metric types available for querying:

HTTP Status

HTTP status codes are automatically collected for any service sending or receiving HTTP traffic. The HTTP_METRIC summary type enables querying for all services experiencing non-zero 500 status code counts.

To also query for successful status codes, such as 200, set the percent_bad field to 0 in your API request.

Packet Loss

Packet loss metrics are automatically collected on all connections. The PACKET_LOSS summary type enables querying for all services experiencing packet loss over 1%. This value is configurable via the percent_bad field in the request.

DNS Status

DNS status metrics are automatically collected for all DNS lookups. The DNS_STATUS summary type enables querying for all services experiencing non-zero DNS lookup failures. Querying for all DNS outcomes is available by setting the percent_bad field to 0.

Connection Failure

Successful and failed outcomes are automatically collected for all connection attempts. The CONNECTION_FAILURE summary type enables querying for all services experiencing a failure rate greater than 5% on connection attempts. Querying for all connection attempt outcomes is available by setting the percent_bad field to 0.
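The summary types and the percent_bad override described above could be assembled into a request body along the following lines. This is only a sketch: the summary type names and the percent_bad field come from this page, but the `summary_type` key, the helper name `build_summary_request`, and the overall body shape are hypothetical, so check the API reference for the actual schema.

```python
# Summary type names as listed in this section.
SUMMARY_TYPES = ("HTTP_METRIC", "PACKET_LOSS", "DNS_STATUS", "CONNECTION_FAILURE")


def build_summary_request(summary_type, percent_bad=None):
    """Return a request body for one summary type.

    Assumption: the body is a flat mapping and the type is sent under a
    'summary_type' key (hypothetical). Leaving percent_bad unset keeps the
    service's default threshold for that type.
    """
    if summary_type not in SUMMARY_TYPES:
        raise ValueError(f"unknown summary type: {summary_type}")
    body = {"summary_type": summary_type}
    if percent_bad is not None:
        if percent_bad < 0:
            raise ValueError("percent_bad must not be negative")
        body["percent_bad"] = percent_bad
    return body


if __name__ == "__main__":
    # Default threshold: only services with packet loss over 1%.
    print(build_summary_request("PACKET_LOSS"))
    # percent_bad of 0: all connection attempt outcomes.
    print(build_summary_request("CONNECTION_FAILURE", percent_bad=0))
```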

Types of Groupings Available

Flowmill works on aggregated types and defaults to grouping services that share the same role and availability zone. Using the v2 APIs, you can choose alternate groupings via the source_groupings and destination_groupings fields. The groupings available are:

  • role

  • az

  • env

  • ns
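A small helper can validate grouping names before they are sent. The sketch below uses the four groupings listed above and the documented default of role plus availability zone. The helper name `grouping_fields` is hypothetical, and passing each field as a list of strings is an assumption about the v2 request format.

```python
# Grouping names as listed above.
GROUPINGS = ("role", "az", "env", "ns")
# Documented default: same role within the same availability zone.
DEFAULT_GROUPINGS = ("role", "az")


def grouping_fields(source=DEFAULT_GROUPINGS, destination=DEFAULT_GROUPINGS):
    """Return the source_groupings / destination_groupings request fields.

    Assumption: each field takes a list of grouping names. Unknown names
    are rejected locally rather than sent to the service.
    """
    for name in (*source, *destination):
        if name not in GROUPINGS:
            raise ValueError(f"unknown grouping: {name}")
    return {
        "source_groupings": list(source),
        "destination_groupings": list(destination),
    }


if __name__ == "__main__":
    # Group sources by namespace, keep the default for destinations.
    print(grouping_fields(source=["ns"]))
```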