The System Overview APIs let you query summary metrics that describe the general health of your cluster. Common questions include:
Which services are experiencing the highest rate of errors?
Which services are consuming the most bandwidth / traffic?
Which services are experiencing an unusual increase or decrease in traffic?
These queries can be further filtered based on a number of criteria, enabling you to get these metrics for only the services that you are interested in.
The Flowmill service is built around a time-series database capable of flexible aggregations over time. Queries can be configured to cover a desired range of time and to return a given number of data points.
In version 1 of the System Overview APIs, time ranges are specified using an inclusive start timestamp, an exclusive end timestamp, and the duration that a single data point should cover. The start and end may be adjusted to accommodate the number of steps requested. Additionally, samples will be aggregated / summarized as appropriate to reduce the number of data points returned.
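The version 1 parameters can be illustrated with a little arithmetic. The sketch below is not Flowmill code: the function name v1_buckets is hypothetical, timestamps are assumed to be whole seconds, and "adjusted" is assumed to mean widening the range outward to multiples of the step, which the description above does not specify.

```python
from typing import List, Tuple


def v1_buckets(start: int, end: int, step: int) -> List[Tuple[int, int]]:
    """Split [start, end) into consecutive half-open windows of `step` seconds.

    Assumption (not confirmed by the Flowmill docs): start is rounded down
    and end is rounded up to the nearest multiple of `step`.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if end <= start:
        raise ValueError("end must be after start")
    aligned_start = (start // step) * step
    aligned_end = -(-end // step) * step  # ceiling division, then scale back
    # Each tuple is (inclusive window start, exclusive window end).
    return [(t, t + step) for t in range(aligned_start, aligned_end, step)]
```

Under these assumptions, a request for start 100, end 400 and a 60-second step would be widened to the range 60 to 420 and yield six data points.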
In version 2 of the System Overview APIs, time ranges are specified using an exclusive end timestamp, the duration of the time range requested and the number of samples to return. Samples will be aggregated / summarized as appropriate to reduce the number of data points returned.
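The version 2 parameters map to sample windows in a similar way. This is a minimal sketch rather than the service's actual logic: the function name v2_buckets is hypothetical, timestamps and durations are assumed to be whole seconds, and the duration is assumed to divide evenly by the number of samples.

```python
from typing import List, Tuple


def v2_buckets(end: int, duration: int, samples: int) -> List[Tuple[int, int]]:
    """Divide [end - duration, end) into `samples` equal half-open windows.

    Assumption: duration is an exact multiple of samples; how the real
    service rounds uneven divisions is not documented here.
    """
    if duration <= 0 or samples <= 0:
        raise ValueError("duration and samples must be positive")
    if duration % samples != 0:
        raise ValueError("duration must divide evenly by samples in this sketch")
    step = duration // samples
    start = end - duration
    # Each tuple is (inclusive window start, exclusive window end).
    return [(start + i * step, start + (i + 1) * step) for i in range(samples)]
```

For example, an end timestamp of 3600, a duration of 600 and 5 samples gives five 120-second windows, the first starting at 3000 and the last ending at 3600.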
There are currently four summary metric types available for querying:
HTTP status codes are automatically collected for any service sending or receiving HTTP traffic. The HTTP_METRIC summary type enables querying for all services experiencing non-zero 500 status code counts. To also query for successful status codes, such as 200, set the percent_bad field to 0 in your API request.
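As a rough illustration of such a request, the sketch below builds a JSON body that selects the HTTP_METRIC summary type and sets percent_bad to 0. Only percent_bad and the HTTP_METRIC value come from the text above; the summary_type field name, the helper name build_summary_request, and the overall body shape are assumptions made for this example and should be checked against the actual API reference.

```python
import json
from typing import Optional


def build_summary_request(summary_type: str, percent_bad: Optional[float] = None) -> str:
    """Return a JSON request body as a string.

    "summary_type" is a hypothetical field name used for illustration only;
    "percent_bad" is the field named in the documentation.
    """
    body = {"summary_type": summary_type}
    if percent_bad is not None:
        # Omit the field entirely to fall back on the service's default threshold.
        body["percent_bad"] = percent_bad
    return json.dumps(body)


# Ask for all HTTP outcomes, including successful status codes.
request_body = build_summary_request("HTTP_METRIC", percent_bad=0)
```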
Packet loss metrics are automatically collected on all connections. The PACKET_LOSS summary type enables querying for all services experiencing packet loss over 1%. This value is configurable via the percent_bad field in the request.
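To make the role of a percent_bad threshold concrete, here is a client-side sketch of how such a filter might behave. It does not describe Flowmill's implementation: the function name, the input shape, and the inclusive comparison are all assumptions. The comparison is inclusive so that a threshold of 0 returns every service, in line with the HTTP_METRIC description; whether the real service compares this way is not stated.

```python
from typing import Dict, List, Tuple


def services_at_or_over(counts: Dict[str, Tuple[int, int]], percent_bad: float) -> List[str]:
    """Return service names whose bad percentage is at least `percent_bad`.

    `counts` maps a service name to (bad_events, total_events).
    Services with no events at all are skipped to avoid dividing by zero.
    """
    selected = []
    for name, (bad, total) in counts.items():
        if total == 0:
            continue
        if 100.0 * bad / total >= percent_bad:
            selected.append(name)
    return selected


# Hypothetical sample data: (lost packets, total packets) per service.
sample = {"web": (5, 1000), "db": (30, 1000), "cache": (0, 500)}
```

With this sample data, a threshold of 1 selects only db (3% loss), while a threshold of 0 selects all three services.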
DNS status metrics are automatically collected for all DNS lookups. The DNS_STATUS summary type enables querying for all services experiencing non-zero DNS lookup failures. Querying for all DNS outcomes is available by setting the
Successful and failed connection attempts are automatically collected. The CONNECTION_FAILURE summary type enables querying for all services experiencing a failure rate greater than 5% on connection attempts. Querying for all connection attempt outcomes is configurable by setting the percent_bad field to
Flowmill works on aggregated types and will default to grouping services with the same role and within the same availability zone. Using the v2 APIs, you can choose alternate groupings via the destination_groupings fields. The groupings available are: