Frequently Asked Questions
General
What is Sauron?
Sauron is a CSSAP-approved, OCI-based operations platform for monitoring any
workload, but especially workloads running in Kubernetes. The Sauron platform
provides logging, metrics, data visualization, alerting, DNS, certificate
management, and external monitoring components. The Sauron team provides a 24x7x365
managed service to multiple teams within Oracle. Other usage models are
available; please contact us on the #sauron-support channel in Slack with any questions.
The platform consists of the following components:
- Prometheus: Metrics collection and persistence engine with alert generation
- Thanos: Allows Prometheus to be highly available (HA) and operate at scale
- PushProx: Allows pulling metrics from network-locked tenancies
- Alertmanager: Alert router
- Grafana: Metrics visualizer
- Pushgateway: Incoming metrics collector
- OpenSearch: Text intake and storage
- OpenSearch Dashboards: Log visualizer
- API: Exposes API to modify configurations
- Help: Sauron documentation
Value-added features that our service provides include:
- Automated provisioning and life cycle management
- Automated Backups
- Automated DNS management
- HA via Kubernetes and OCI block volume provisioner
- DR via backup recovery process
- Automated SSL Certificate management
- API endpoints exposed to manage each individual instance
- Metric and log based alerting
Sauron uses only CNCF/open-source technology. Sauron is in operational use today by multiple teams within Oracle, including teams that are currently "live" providing services to Oracle customers in the cloud.
At the moment, Sauron is not available to external Oracle customers.
How do I get started?
First, contact our team. The preferred way of contacting us is the #sauron-support channel in Slack.
This channel is in the Proddev-paas-fmw workspace. If you cannot access it,
please email sauron_dev_ww_grp@oracle.com and we'll share the channel with your workspace.
In order to create a Sauron system, our team will require some information:
- Short team name or acronym (included in endpoint domain names), e.g. fgbuorrsys
- Email of the VP of your team
- Email addresses of contacts for the Sauron instance
- Cost center
- List of regions to provision Sauron
Actual provisioning of a Sauron system takes minutes, so most requests for new Saurons can be honored within a few hours. Once the endpoints are provisioned, our automated provisioner generates a message to the contact email addresses containing details about how to connect.
Would you recommend migrating to Sauron before being fully on OCI?
Some teams that are not yet fully on OCI are already using Sauron.
What is the pricing model?
At this time we are not yet cross-charging for Sauron usage. However, internal cross-charging will be established at some point in the future. At a minimum, it is expected that the cross-charging will include the OCI charges for running your Sauron instance.
What OCI services does Sauron integrate with off the shelf?
Sauron is internally integrated with most of the native OCI services: LoadBalancers, BlockVolume, Object Storage, Compute, VCN, HealthCheck and others based on the needs of our Sauron platform.
How are responsibilities divided between your team and my team?
Our team:
- provisions your requested Sauron instance
- regularly backs up all monitoring data and system logs to OCI Object Storage
- monitors the health of your Sauron installation 24x7x365
- provides support via Slack.
Your team:
- identifies and pushes relevant metrics/logs of interest to your Sauron
- defines your alerting rules
- installs rules via the Sauron API endpoint
- responds to requests from the Sauron team if problems occur
- builds your own Grafana and OpenSearch Dashboards dashboards.
How is Sauron packaged?
Currently, each managed Sauron runs in its own isolated Kubernetes namespace in a given OCI region. We have network policies and other Kubernetes policies defined to ensure that the pods, storage, and other resources in your namespace are not accessible to anyone else running on the same cluster. We have one or more Kubernetes clusters running in all commercial OCI regions and in some government regions.
The implementation is completely invisible to our users. The customer view of a Sauron system is simply a set of endpoints, including a self-management API endpoint that allows customer teams to configure their own alerting rules and other features.
What do you offer to help me integrate with Sauron?
Every Sauron comes with personalized help. The help is focused on integration. Our integration documentation is contained in the help. You may be using your Sauron's help to read this FAQ. If not, you are reading this at a centralized location. Try our help or contact us as described above.
Is Sauron CSSAP approved?
Sauron is CSSAP approved. Our CSSAP approval id is 11822.
What SLAs does the Sauron service provide?
Our aim is to provide a 99.99% SLA. We have business dashboards that monitor the SLA for all our managed instances. The key tenets of our team are all about the customer:
- Security
- Operational excellence
- Customer Features
The Sauron team uses a 24x7 on-call "follow the sun" 8am-8pm schedule with twice-daily hand-offs between North America and APAC.
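To put the availability target in concrete terms, the sketch below converts an availability percentage into an allowed-downtime budget. It assumes that the 99.99 figure above is a percentage; the function name and the 30-day and 365-day windows are illustrative choices, not part of Sauron's SLA definition.

```python
def downtime_budget_minutes(availability_pct: float, window_days: float) -> float:
    """Return the downtime, in minutes, that an availability target allows
    over a window of the given length."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Illustrative windows only: one 30-day month and one 365-day year.
    for days in (30, 365):
        budget = downtime_budget_minutes(99.99, days)
        print(f"{days:>3} days at 99.99%: {budget:.2f} minutes of downtime")
```

Under these assumptions, a 99.99% target leaves roughly 4.3 minutes of downtime per 30 days, or about 52.6 minutes per year.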
Does Sauron provide a central console to monitor all my endpoints and their current configuration?
Sauron provides a console endpoint to monitor all your endpoints. The console is disabled by default; if you want to enable it, please contact #sauron-support.
How does a service integrate with Sauron?
Logs:
- Sauron team recommends using any client that communicates with OpenSearch. Filebeat and other "Beats", Logstash, and Fluentd are popular among Sauron users.
- Please see https://help.handu-phx.handu.developers.oracledx.com/logs/ for more information.
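As an illustration of what such a client ultimately sends, the sketch below builds a request body for the OpenSearch `_bulk` API, which takes newline-delimited JSON: an action line followed by a document line. The index prefix `myapp-logs`, the daily date suffix, and the field names are hypothetical; the help page above describes the actual endpoint, credentials, and index naming for your instance. In practice, a shipper such as Filebeat or Fluentd builds and sends these requests for you.

```python
import json
from datetime import datetime, timezone


def build_bulk_body(records, index_prefix="myapp-logs"):
    """Build an NDJSON body for the OpenSearch _bulk API.

    Each record goes to a daily index named <prefix>-YYYY.MM.DD
    (a hypothetical naming scheme) derived from its "@timestamp" field,
    which must be a datetime object.
    """
    lines = []
    for record in records:
        ts = record["@timestamp"]
        index = f"{index_prefix}-{ts.strftime('%Y.%m.%d')}"
        # Serialize the timestamp as ISO 8601 so the document is valid JSON.
        doc = {**record, "@timestamp": ts.isoformat()}
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{
        "@timestamp": datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
        "level": "ERROR",
        "message": "disk full",
    }]
    print(build_bulk_body(sample), end="")
```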
Metrics:
- To pull metrics, we recommend that you set up an intermediate Prometheus server instance ("aggregating Prometheus") in your tenancy. Sauron will scrape from this server. You provide metric data to the aggregating Prometheus via statsd, node_exporter, or other well-known metric exporters.
- To push metrics, we recommend setting up a PushProx client on each metrics node.
- Please see https://help.handu-phx.handu.developers.oracledx.com/metrics/ for more information.
In the Sauron environment:
- If pulling metrics into Sauron, you will need to configure federation targets and scrape intervals in the Sauron Prometheus config using the Sauron APIs. This will allow Sauron to scrape from your intermediate Prometheus.
- Configure your alert routes and receivers (for example, the Ocean integration).
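For orientation, the sketch below assembles a standard Prometheus federation scrape job of the kind described in the first item above: it scrapes the `/federate` endpoint of another Prometheus server, selects series with `match[]` parameters, and keeps the original labels via `honor_labels`. The target hostname and the selector are hypothetical, and how the Sauron API expects this configuration to be submitted is not covered here; see the metrics help page for the actual procedure.

```python
import json


def federation_scrape_config(targets, match_selectors, scrape_interval="60s"):
    """Return a Prometheus scrape job that federates from other Prometheus servers.

    targets         -- host:port of each aggregating Prometheus (hypothetical names)
    match_selectors -- series selectors passed as match[] to /federate
    """
    if not targets or not match_selectors:
        raise ValueError("at least one target and one match[] selector are required")
    return {
        "job_name": "federate",
        "scrape_interval": scrape_interval,
        "honor_labels": True,          # keep labels from the source Prometheus
        "metrics_path": "/federate",
        "params": {"match[]": list(match_selectors)},
        "static_configs": [{"targets": list(targets)}],
    }


if __name__ == "__main__":
    job = federation_scrape_config(
        targets=["agg-prometheus.example.internal:9090"],  # hypothetical host
        match_selectors=['{job=~".+"}'],                   # federate all jobs
    )
    # Printed as JSON for readability; Prometheus itself is configured in YAML.
    print(json.dumps({"scrape_configs": [job]}, indent=2))
```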
Where does Sauron run? Does it run in my tenancy?
Sauron does not run in your tenancy. The Sauron team maintains its own OCI tenancy where all Sauron instances run. Sauron users do not need to worry about maintaining the infrastructure where Sauron runs.
What kind of load can Sauron handle?
For metrics, we have seen customers' services generate up to 20 million time series at a 60-second scrape interval.
For logs, we have seen customers' services generate up to 2 terabytes of indices per day.
Retentions
What is the default online retention period for my Prometheus metrics in Sauron? Is it configurable?
The default online retention period for Prometheus metrics in Sauron is 30 days. This value is configurable. Please contact #sauron-support for help.
What is the default online retention period for my logs in Sauron? Is it configurable?
The default online retention period for OpenSearch logs is 7 days. This value is configurable. Please contact #sauron-support for help.
What is the default backup retention period for metrics and logs? Are they configurable?
The default backup retention period for both metrics and logs is 90 days. These values are configurable. Please contact #sauron-support for help.
What is the total retention period for metrics and logs?
Total retention period = Online retention period + Backup retention period
For metrics, by default, you can look back 30 + 90 = 120 days. For logs, by default, you can look back 7 + 90 = 97 days.
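The arithmetic above can be sketched as a small helper that also classifies a given date as still online, restorable from backup, or expired. The default values come from this FAQ; the function names and the inclusive handling of the boundary days are assumptions made for illustration.

```python
from datetime import date

# Defaults stated in this FAQ (all configurable per instance).
METRICS_ONLINE_DAYS = 30
LOGS_ONLINE_DAYS = 7
BACKUP_DAYS = 90


def total_retention_days(online_days, backup_days=BACKUP_DAYS):
    """Total retention period = online retention + backup retention."""
    return online_days + backup_days


def data_location(data_date, today, online_days, backup_days=BACKUP_DAYS):
    """Classify data from data_date as 'online', 'backup', or 'expired'.

    Boundary days are treated as inclusive, which is an assumption.
    """
    age_days = (today - data_date).days
    if age_days < 0:
        raise ValueError("data_date is in the future")
    if age_days <= online_days:
        return "online"
    if age_days <= total_retention_days(online_days, backup_days):
        return "backup"
    return "expired"


if __name__ == "__main__":
    print("metrics look-back:", total_retention_days(METRICS_ONLINE_DAYS), "days")
    print("logs look-back:", total_retention_days(LOGS_ONLINE_DAYS), "days")
    print(data_location(date(2024, 1, 1), date(2024, 3, 1), METRICS_ONLINE_DAYS))
```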
Can I restore older metrics from the backups outside of the online retention period?
You can restore older metrics as long as they are within the total retention period. Please contact #sauron-support for help.
Can I restore older logs from the backups outside of the online retention period?
You can restore older logs as long as they are within the total retention period. Please see our help for details.
How can I download my logs from OpenSearch?
Please see our help for details.
Where are the backups stored? Are they encrypted?
Backups are stored in OCI Object Storage. OCI Object Storage provides automated encryption in transit and at rest.
Will Sauron automatically retire indices from OpenSearch?
Sauron will automatically prune OpenSearch indices based on certain timestamp patterns in the index name.
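To illustrate the idea of pruning by index name, the sketch below selects indices whose name ends in a date older than the retention period. This FAQ does not specify which timestamp patterns Sauron recognizes; the `-YYYY.MM.DD` suffix used here is the common daily-index convention of Beats and Logstash and is an assumption, as are the function name and the sample index names.

```python
import re
from datetime import date, datetime, timedelta

# Assumed naming convention: <anything>-YYYY.MM.DD
DATE_SUFFIX = re.compile(r"-(\d{4}\.\d{2}\.\d{2})$")


def indices_to_prune(index_names, today, retention_days=7):
    """Return the indices whose date suffix is older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        match = DATE_SUFFIX.search(name)
        if not match:
            continue  # no recognizable date suffix: leave the index alone
        try:
            index_date = datetime.strptime(match.group(1), "%Y.%m.%d").date()
        except ValueError:
            continue  # digits that do not form a real date, e.g. 2024.13.45
        if index_date < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    names = ["app-2024.01.01", "app-2024.01.10", "app-2024.01.14", "app-config"]
    print(indices_to_prune(names, today=date(2024, 1, 15)))
```

One practical consequence under this assumption: an index without a date in its name is never selected, so naming indices with a date suffix is what makes them eligible for automatic pruning.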
What is the maximum size a Prometheus instance can reach before performance degrades significantly?
Storage used is proportional to the ingestion rate, i.e. how many samples per second are sent to Prometheus.
As a rough estimate, 10k samples/s would need about 650 GB of storage for 1 year of retention, assuming 2 bytes per sample.
Our recommendation has always been to stay under 1 TB. Once you cross the 500 GB threshold with block volumes (BVs), query times keep increasing, and Prometheus becomes pretty much unusable beyond 1 TB. Prometheus does not downsample by default, so all data is stored at full precision at the configured scrape interval of 30s or 1m.
The best way to support longer retention periods is to use Thanos with Object Storage and downsampling enabled.
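The rough math above can be reproduced with a short estimator. It uses the 2 bytes per sample assumed in this answer and decimal gigabytes; the function names are illustrative, and real usage also depends on series churn and compression, which this sketch ignores.

```python
SECONDS_PER_DAY = 86_400


def samples_per_second(active_series, scrape_interval_seconds):
    """Ingestion rate produced by scraping every active series once per interval."""
    return active_series / scrape_interval_seconds


def estimate_storage_gb(rate_samples_per_second, retention_days, bytes_per_sample=2.0):
    """Estimate Prometheus storage in decimal GB for a given ingestion rate."""
    total_samples = rate_samples_per_second * retention_days * SECONDS_PER_DAY
    return total_samples * bytes_per_sample / 1e9


if __name__ == "__main__":
    rate = 10_000  # samples per second, as in the example above
    print(f"{estimate_storage_gb(rate, 365):.2f} GB for 1 year")
    # 600,000 series scraped every 60 seconds produce the same rate.
    print(samples_per_second(600_000, 60), "samples/s")
```

With these inputs the estimate comes to about 631 GB, in line with the roughly 650 GB quoted above and already past the 500 GB threshold where query times start to climb.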
Integrations
Does Sauron support OCI metrics querying and alerting?
Sauron does support OCI metrics querying and alerting. Please see our help for details.
Does Sauron support ingesting data from AWS?
Sauron supports ingesting data from AWS. We have customers doing this. Their services run on AWS and they use Sauron to monitor them. We will be glad to help if you hit any issue.
Does Sauron provide direct integration with Slack?
Sauron provides direct integration with Slack. Please see our help for further details.
Can Sauron send my alerts directly to email i.e. Alertmanager/Grafana email integration?
Sauron can send your alerts directly to email. Please see our help for further details.
Can I trigger an alert based on logs in OpenSearch?
You can trigger alerts based on logs in OpenSearch. Please see our help for further details.
Is there a Sauron SDK or CLI available?
Sauron exposes an API endpoint that provides various APIs to configure and manage your Sauron instance.
Please see our help for further details.
Can Sauron Prometheus pull/federate metrics from dev boxes running inside Oracle network?
If you have a metrics server running in a locked-down or air-gapped tenancy such that the scrape target is not reachable from Sauron Prometheus, we can use PushProx, which works on the same pull model using a client/proxy connection.
Please see our help for more details.
Does Sauron support RBAC for Grafana to control access to dashboards, etc.?
Sauron supports RBAC integration with Grafana. Please see our help for more details.
Where can I read more about exporters and Beats?
- Metrics: Exporters and Integrations
- Logs: Lightweight Data Shippers
Now that I have my metrics and logs in Sauron, how do I configure my alerts?
More details here: https://help.handu-phx.handu.developers.oracledx.com/alerts/
Now that the endpoints are set up for my Sauron, how do I create data sources in Grafana?
Sauron Grafana is configured with a default data source which points to Sauron Thanos, which seamlessly merges the data from multiple highly-available Prometheus instances within a Sauron instance. If you need to configure other types of data source, you may follow the detailed instructions here.
Security
We take a lot of pride in making sure our platform is as secure as it can be. By default, all Sauron endpoints are secured by TLS and a strong password. We also make it easy to integrate all Sauron UI endpoints with Oracle SSO for additional security. Using OIM entitlements, we can also restrict access to Sauron endpoints to only the users that you define and manage.
Does Sauron support RBAC in addition to SSO integration?
Grafana provides a limited RBAC capability. We plan to enable some basic RBAC for all endpoints in the future.
How does Sauron handle security patching?
We have automation to apply security patches to all our instances on a monthly basis and/or as they become available.
I want to integrate my endpoints with Oracle SSO. How do I do it?
Please see our help for details on how to enable SSO.
I want to change HTTP basic authentication credentials for my Sauron. How do I do it?
The Sauron admin user password can be changed through the API server.
How do you ensure all software you run on your cluster is free of known CVEs?
All software we use is:
- built by ourselves directly from trusted sources. We make necessary fixes during PLS approval if any CVE is identified.
- built with tools obtained from trusted sources (e.g. the most recent Golang from Oracle's yum server).
- built on top of hardened and up-to-date Oracle Linux container images (e.g. oraclelinux:8-slim).
- scanned by Trivy against the latest CVE database on a daily basis.
- promptly upgraded if any CVE is reported by Trivy scan.
Do you make the built-from-source images you use available to internal customers?
Please check this page for a list of images that Sauron customers can use.
However, please note that PLS approvals obtained by the Sauron team for using such images within Sauron do not extend to our customers' business use cases.
You are still responsible for PLS approvals for your own business use case.
Architecture
Do you have a high-level architecture overview?
Please see https://help.handu-phx.handu.developers.oracledx.com/architecture/ for an architecture overview.
Is Sauron HA?
All our managed endpoints have HA enabled out of the box. OpenSearch clusters have "N" configured master/ingest/data nodes, which run with node and availability domain (AD) anti-affinity policies. Even if compute instances go down or an entire availability domain in an OCI region goes down, our OpenSearch service will continue to run with no downtime.
Prometheus instances also have at least two replica servers. All queries are made through Thanos, which seamlessly merges data from the replica instances.
The Sauron team also routinely scales OpenSearch servers in customer instances to handle increased load and add data capacity. These scaling activities are completely transparent to the customer.
Note: our ability to provide high availability in single-AD regions is naturally limited by the physical topology of the region.
How does Sauron handle upgrades?
- We run all Sauron instances on OCI OKE clusters. We upgrade to the latest Kubernetes version offered by OKE shortly after it becomes available. These migrations cause no downtime aside from the possible need to reconnect to a Grafana or OpenSearch Dashboards instance.
- We use the same strategy to perform regular OCI Compute patching on all Kubernetes worker nodes in order to comply with Oracle security policy.
- We obtain PLS approvals for all software. We upgrade the endpoint servers (OpenSearch, Prometheus, etc.) in compliance with Oracle policy about staying current.
Does Sauron have Disaster Recovery support?
Sauron disaster recovery support consists of recovering from backup. Backups are stored in OCI Object Storage in the region where the Sauron instance is hosted. If all OCI connectivity to a region is lost, Sauron does not have the ability to recover data from that region until connectivity is restored.
Migration
Does the Sauron Team provide support for migration?
Only general advice.
There are simple scripts you can write to export Grafana dashboards and other artifacts from one instance to another. Also, OpenSearch provides some APIs and utilities for export and import.
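As one example of such a script, the sketch below exports every dashboard from a Grafana instance through Grafana's HTTP API (`/api/search` to list dashboards, `/api/dashboards/uid/<uid>` to fetch each one). It assumes HTTP basic authentication is enabled on the endpoint; the base URL and credentials are placeholders, and the helper names are illustrative. The HTTP call is passed in as a function so the export logic can be tried without a live instance.

```python
import base64
import json
import urllib.request


def make_fetcher(base_url, username, password):
    """Return a function that GETs a Grafana API path and parses the JSON reply."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()

    def fetch(path):
        request = urllib.request.Request(
            base_url.rstrip("/") + path,
            headers={"Authorization": f"Basic {token}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return fetch


def export_dashboards(fetch):
    """Return {uid: dashboard JSON model} for every dashboard the user can see."""
    exported = {}
    # type=dash-db restricts the search to dashboards (no folders).
    for hit in fetch("/api/search?type=dash-db"):
        detail = fetch(f"/api/dashboards/uid/{hit['uid']}")
        exported[hit["uid"]] = detail["dashboard"]
    return exported


# Example usage (placeholder URL and credentials):
#   fetch = make_fetcher("https://grafana.example.internal", "user", "password")
#   for uid, model in export_dashboards(fetch).items():
#       with open(f"{uid}.json", "w") as out:
#           json.dump(model, out, indent=2)
```

Instances with very many dashboards may need to page through the search results using the API's `limit` and `page` parameters. The exported models can then be loaded into the target instance with Grafana's dashboard import API or UI.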
All our metrics data is stored in InfluxDB in our self-hosted cluster; do you support migration to Prometheus?
No. To our knowledge, there is no easy migration path from InfluxDB to Prometheus, even though both are time series databases. Grafana supports both InfluxDB and Prometheus data sources. In the short term, you may need to configure Grafana to use both data sources until you have sufficient data in your Sauron Prometheus and can turn off the InfluxDB data source pointing to your hosted InfluxDB cluster.
Support
I am facing issues with my Sauron; what do I do?
Post your query or issue in the #sauron-support channel in Slack. Our on-call engineer will get back to you as soon as possible.
I didn't find an answer to my question in the FAQ; what do I do?
You might want to go through the help at https://help.handu-phx.handu.developers.oracledx.com/. If
you still have unanswered questions, please post your query or issue in the
#sauron-support channel in Slack. Our on-call engineer will get back to you as soon as possible. We are always happy to help.