Sauron Self-Alerting

Overview

Sauron offers a set of pre-built alerting rules concerning the health of the Sauron instance itself. These self-alerts are defined in Grafana and routed to Sauron contacts/owners via Alertmanager. However, you can configure these alerts, including whether or not to enable them using the APIs mentioned below.

Enabling/Disabling the Self-Alerting Feature

The overall Self-Alerting feature can be disabled or enabled like this:

curl -XPUT -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/actions?action=disable
curl -XPUT -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/actions?action=enable

And to view the enabled status:

curl -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/status
{"enabled":true}

A Sauron instance is created out of the box with Self-Alerting enabled.

Add custom email address for Self-Alerting

This api will add custom email address to self-alerting which is used in Alertmanager by smtp-receiver to send emails, if toEmails is not provided then contact-email will be used to send the mail.

curl -XPUT -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/
toEmails?emails=abc@gmail.com

For multiple emails:

curl -XPUT -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/
toEmails?emails=abc@gmail.com,xyz@gmail.com
Listing Available Alerts

To list the name and description of available alerts:

curl -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/alerts
[{"description":"Alert on the Prometheus head series dropping greatly","name":"prometheusseriesdropping"}]
Enabling/Disabling Specific Alerts

To disable or enable a specific alert, for example, the prometheusseriesdropping alert:

curl -XPUT -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/alerts/prometheusseriesdropping/actions?action=disable
curl -XPUT -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/alerts/prometheusseriesdropping/actions?action=enable

And to view the enabled status:

curl -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/alerts/prometheusseriesdropping/status
{"enabled":true}

Once you enable an alert, it'll appear activated in Grafana within a minute or two. You'll be able to find the alert in the My Dashboard related to the component that the alert deals with. For example, for prometheusseriesdropping, navigate to the "My Prometheus" Dashboard. An active alert on a panel is indicated by a heart to the left of the panel name:

self alert example

Tweaking Configuration of Specific Alerts

We have limited support for modifying the parameters of the Sauron Self-Alerts. The alert parameters that can be modified are: * severity: an arbitrary string that you can set as needed, based on your Alertmanager configuration (see below) * forMinutes: how many minutes the condition needs to fail for before the alert fires * threshold: the threshold value of the alert condition

Again, using the prometheusseriesdropping alert as an example, to fetch the current configuration:

curl -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/alerts/prometheusseriesdropping
{"forMinutes":15,"severity":"warning","threshold":"-80"}

This particular alert fires on an 80% drop in metrics. If we want to change this to a 50% drop instead:

curl -XPUT -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting/alerts/prometheusseriesdropping -d '{"severity": "warning", "forMinutes": 15, "threshold": "-50.0"}'
{"forMinutes":15,"severity":"warning","threshold":"-50.0"}
Using Severity in Alertmanager

As mentioned above, you can set the Grafana alert severity of the Sauron Self-Alerts to whatever values you want. A value of "warning" is the default for Sauron Self-Alerts. But it's possible for you to configure more complex Alertmanager configurations that actually map the severity of Grafana alerts to the severity of Ocean alerts, if you choose. For example, an Alertmanager configuration like this will pass the severity of a Grafana Alert onto Ocean as-is, defaulting to a value of "warning":

receivers:
  - name: Testing
    webhook_configs:
    - xxxxxxxxxxx
      severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity| toLower }}{{ else }}warning{{ end }}'
Enabling/Disabling Future Self-Alerts By Default

The Sauron team expects to add more Self-Alerts over time. Once added, a new alert will immediately appear via a call to /v1/selfalerting/alerts (mentioned above). But whether you want these new alerts to be enabled when they become available (which is the default behavior) is up to you. If you'd like to have these future alerts disabled when they become available, you can configure that like this:

curl -XPUT -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting -d '{"enableNewAlerts": false}'

To view whether this is already set:

curl -u sauron:mypassword https://api.handu-phx.handu.developers.oracledx.com/v1/selfalerting
{"enableNewAlerts":false}
Grafana Notification Channel for Alertmanager

Your Grafana is configured out of the box with a Notification Channel for Alertmanager. This is the Notification Channel used by the Sauron Self-Alerts, but you are free to use it for your own Grafana alerts as well.

Alertmanager Configuration

Since Sauron Self-Alerts are routed to Alertmanager, they require an Alertmanager configuration on your part in order to route to something like Ocean, Slack, or email. The following are complete examples of configuring routing to:

If you already have an Alertmanager route and receiver configured, any Self-Alerts will enter that existing receiver stream. If instead you wish to separate out Self-Alerts from your ordinary alerts, you can use the mysauron label to do this. Let's say you have a "Service-Production" receiver configured to route to Ocean:

route:
  receiver: Service-Production
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 3m
receivers:
- name: Service-Production
  webhook_configs:
  - xxxxxxxxxxxxxxxxxx

and you want to configure Self-Alerts to be sent to Slack instead of Ocean. You can do that by adding a Slack receiver (abbreviated example, see here for a full example of a Slack configuration), and a route that sends alerts with the mysauron label to that receiver:

global:
  smtp_smarthost: smtp.us-ashburn-1.oraclecloud.com:587
  smtp_from: sauron-alert@sauron.us-ashburn-1.oracledx.com
  smtp_auth_username: ocid1.user.oc1.xxxxxxxxxx.com
  smtp_auth_password: somepasswd
route:
  receiver: Service-Production
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 3m
  routes:
  - receiver: Slack-Receiver
    group_by: ['alertname']
    match_re:
      mysauron: .+
receivers:
- name: Service-Production
  webhook_configs:
  - xxxxxxxxxxxxxxxxxx
- name: Slack-Receiver
  slack_configs:
  - icon_url: https://alertmanager.handu-phx.handu.developers.oracledx.com
    send_resolved: true
    ...
Prometheus Alerts

In this section, we will list proactive action on Prometheus which can be taken by customer and reduce the noise by itself.

Prometheus Disk Usage High

Default Prometheus storage will be 500 GB per instance and we provide highly available Prometheus-as-a-service which requires two instances. This alert will be triggered whenever Prometheus disk usage reached more than 90% in last 15 minutes.

Possible Action
  • Review your Prometheus configuration. There could be a possibility that you are scraping aggressively, consider increasing the scrape interval.
  • Your metrics might have high cardinality, it's causing so much disk usage. You can open your Prometheus status page and look for highest cardinality Labels and drop those labels that have higher than 1,000 cardinalities. It is advised to keep just 1 label per metric name that has more than 1,000 cardinalities.
  • If the above actions don't work for you, consider to shorten the retention period. You will need to have the Sauron team to change the retention period for you.

If none of the options work for you, please contact Sauron team.

My Sauron - Time Series Dropping

We have configured this alert when Prometheus TSDB Head Series has dropped by 80% in last 1 hour. This means there is some issue at customer end, their scrape either not pushing the data or some scrape configuration changes which consequence drop in seeding/pulling/scarping metric data.

Possible Action

This only needs to be take care by the customer until and unless there were some major outage at backend. Customer can recheck its scrape configuration, server configuration, connectivity to Prometheus/PushProx/Shuttleproxy servers.

OpenSearch Alerts

In this section, we will list proactive action on OpenSearch which can be taken by the customer and reduce the noise by itself.

OpenSearch Disk Usage High

Default threshold will be set 70%. Once it crossed 75%, then Sauron team will get the alert.

Possible Action
  • Please check whether DEBUG mode is enabled or not in any of your apps. If yes, disable it.
  • If you have indices that named differently, such as
    • index name without timestamp
    • index name with timestamp that is outside the supported 3 timestamp-patterns please delete all those indices and fix the log shipper client settings to use one of the 3 supported index patterns:
    • -%{+yyyy.MM.dd}
    • -%{+yyyy-MM-dd}
    • -%{+yyyy_MM_dd}
  • Optimize the data that is being sent to OpenSearch. You can follow the steps described in the documentation about tuning for disk usage.
  • If you are sending logs/metrics from different regions to a single Sauron endpoint (causing increased clients and traffic), Please request new Sauron instances in the specific regions.
  • Consider shortening the retention period if possible. You will need to have the Sauron team to change the retention period for you.
  • It might be possible that logs ingestion actually grown due to increased number of clients. Then raise the ticket on support channel to increase the OpenSearch disk size.
OpenSearch Shard Size Too Large

Size of some OpenSearch shards has been reached to 50 GB and need to be lowered to 30 GB.

Possible Action
  • Please consider increasing the shards count of the indices.
  • Try to filter out the unnecessary messages from the client-end.
  • Split one big index into several smaller indices.
Alertmanager Alerts
Alertmanager Notification Failed

Alertmanager tried to send an alert notification through a customer-configured receiver, but that attempt ended in failure, and the notification was not sent.

Possible Action

When Alertmanager fails to send an alert notification, the Alertmanager logs will normally include some information about why the attempt failed. You can use the Sauron API to fetch the logs.

Some common failure modes are:

  • SMTP authentication failed:
    my-receiver-name/email[0]: notify retry canceled after 16 attempts: *smtp.plainAuth auth: 535 Authentication credentials invalid
    

Use the Sauron API to update your Alertmanager configuration to specify valid SMTP credentials.

  • OCEAN authentication failed:
    my-receiver-name/webhook[0]: notify retry canceled due to unrecoverable error after 1 attempts: unexpected status code 401: https://oceanclient.ocs.oraclecloud.com/api/v1/webhooks/json/prometheus-alertmanager: <html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\"/>\n<title>Error 401 Unauthorized</title>\n</head>\n<body><h2>HTTP ERROR 401 Unauthorized</h2>\n<table>\n<tr><th>URI:</th><td>/api/v1/webhooks/json/prometheus-alertmanager</td></tr>\n<tr><th>STATUS:</th><td>401</td></tr>\n<tr><th>MESSAGE:</th><td>Unauthorized</td></tr>\n<tr><th>SERVLET:</th><td>jersey</td></tr>\n</table>\n\n</body>\n</html>\n
    

Request a new OCEAN token, and use the Sauron API to update your Alertmanager configuration with the new token. See the OCEAN External Webhook Onboarding Guide for more information.

  • Slack authentication failed:
    my-receiver-name/slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: channel \"my-channel\": unexpected status code 400: invalid_token
    

Use the Sauron API to update your Alertmanager configuration to specify valid Slack credentials.

  • Slack reports channel not found:
    my-receiver-name/slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: channel \"my-channel\": unexpected status code 404: channel_not_found
    

Use the Sauron API to update your Alertmanager configuration to specify a valid Slack channel.

  • Slack reports channel has been archived:
    my-receiver-name/slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: channel \"#my-archived-channel\": unexpected status code 410: channel_is_archived
    

Either unarchive the channel in Slack, or use the Sauron API to update your Alertmanager configuration to specify a channel that is not archived.

  • Unable to reach Slack:
    my-receiver-name/slack[0]: notify retry canceled after 17 attempts: Post \"<redacted>\": proxyconnect tcp: dial tcp :0: connect: connection refused
    

Check the Slack URL, and ensure no proxy_url was supplied in the Alertmanager configuration.

OCEAN Token Expiring

An OCEAN token in the customer-supplied Alertmanager configuration has expired, or will expire soon.

Possible Action

Get a new OCEAN token, and use the Sauron API to update the Alertmanager configuration to use the new token. See the OCEAN External Webhook Onboarding Guide for information about how to get a new token.