In this article, I will share my thoughts, feedback, and ideas following the work I have done on monitoring operands for the SF-Operator, in the hope that other developers looking to add deep insights and automated service tuning to their own operators may build upon my experience.
If you can't measure it, you can't size it properly
Orchestrating applications with Kubernetes opens up a world of possibilities. Among the biggest game changers, in my opinion, we have:
- Upgrade strategies that also simplify rollback should a problem arise
- Horizontal scaling and load balancing your workload, two Ops tasks that are often dreaded (I know I dread them!), become much simpler to handle. More often than not, it's just a matter of changing the replica count in a manifest; your cluster handles the rest under the hood.
Scaling up or down, however, requires knowing when it should occur. While Kubernetes' Horizontal Pod Autoscaler can trigger scaling on CPU or memory usage automatically, application deployers with deeper knowledge of their software may want to react to more specific, measurable events. And that is where monitoring embedded into an operator comes into play.
Operator developers can define their Pods to include a way to emit metrics. They can also use the operator's controllers to configure metrics collection, so that a Prometheus instance will know automatically how to scrape these metrics. Finally, with operations knowledge, the operator can include interesting alerts that will trigger when the application operates outside of its expected behavior.
And when you deploy an application with such an operator, you get all that operating knowledge for free!
The Prometheus Operator
The prometheus operator, unsurprisingly, is the cornerstone of enabling monitoring with operators. It provides a declarative API (i.e. "give me a Prometheus instance!", or "monitor this pod!") that makes it really simple to set up a monitoring environment and work with monitoring resources in an operator's source code.
I would recommend installing the prometheus operator on any Kubernetes cluster that will run applications. You can then spin up a Prometheus instance that will collect metrics emitted on a given namespace and/or from resources matching specific labels.
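As a sketch of what this declarative API looks like, here is a minimal Prometheus custom resource that selects monitoring targets by the presence of a label. The resource name, service account, and the sf-monitoring label are illustrative, not taken from the SF-Operator code base:

```yaml
# Hypothetical example: a Prometheus instance managed by the prometheus
# operator, picking up any PodMonitor or PrometheusRule in its namespace
# that carries the "sf-monitoring" label.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: sf-prometheus
spec:
  serviceAccountName: prometheus
  podMonitorSelector:
    matchExpressions:
      - key: sf-monitoring
        operator: Exists
  ruleSelector:
    matchExpressions:
      - key: sf-monitoring
        operator: Exists
```

The prometheus operator reconciles this into a fully configured, running Prometheus StatefulSet, so you never write a prometheus.yml scrape configuration by hand.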
On OpenShift, the prometheus operator can optionally be installed at deployment time, which will result in a cluster-wide instance of Prometheus that can collect application metrics automatically.
Exposing your operands' metrics
In the development of the SF-Operator, we face three categories of operands when it comes to metrics:
- The operand's underlying application(s) emit prometheus metrics
- The operand's underlying application(s) do not emit relevant metrics, so we fall back on Pod-related metrics
- The operand's underlying application(s) emit statsD metrics
Let's dive into the details of each case.
The Operand emits prometheus metrics
This is the case for Zuul. It is truly the simplest case since it is enough to:
- ensure emitting the metrics is enabled in the operand's configuration
- ensure the right port is declared in the relevant container spec
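Declaring the port boils down to a few lines in the Pod template. The snippet below is an illustrative excerpt (container name, image, and port number are examples, not the actual SF-Operator values); naming the port matters because a PodMonitor can later reference it by name:

```yaml
# Illustrative container spec excerpt for an operand that serves
# Prometheus metrics natively.
containers:
  - name: zuul-web
    image: quay.io/zuul-ci/zuul-web
    ports:
      - name: zuul-metrics   # referenced by name in a PodMonitor
        containerPort: 9090  # example port; use the one the app exposes
```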
We could also add a route to enable an external Prometheus to scrape the metrics endpoint, but since we target OpenShift we make the assumption that a Prometheus instance that is internal to the cluster will be used.
The Operand emits statsD metrics
This is the case with Nodepool and Zuul. For simplicity's sake, we would like to aggregate all metrics in Prometheus. This can be done easily with a sidecar container running StatsD Exporter. All you need is a mapping configuration file that tells the exporter how to translate statsD metrics into prometheus metrics - in particular, where the labels are encoded in the original metric's dotted name. Once again, all you need then is to expose the exporter's port and your metrics are ready to be scraped.
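A mapping rule looks like the following sketch (the metric name is a plausible Zuul-style statsD metric, used here purely for illustration): a wildcard in the dotted statsD name is captured and re-emitted as a proper Prometheus label.

```yaml
# Hypothetical statsd_exporter mapping configuration: the "*" in the
# match expression is captured as $1 and becomes the "hostname" label
# on the resulting Prometheus metric.
mappings:
  - match: "zuul.executor.*.builds"
    name: "zuul_executor_builds"
    labels:
      hostname: "$1"
```

With this file mounted into the sidecar, a statsD metric like zuul.executor.node01.builds is exposed as zuul_executor_builds{hostname="node01"} on the exporter's HTTP endpoint.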
As with the Node Exporter sidecar we use for Pod-related metrics, we created a helper function, "MkStatsdExporterSideCarContainer", that makes it easy to emit statsd metrics from a Pod in a Prometheus-friendly format.
Making sure the metrics will be collected
In the previous section, we made sure our metrics can be scraped from our Pods. Thanks to the prometheus operator, we can go one step further and tell any Prometheus instance running on the cluster how to pick these metrics up.
The prometheus operator defines the PodMonitor and ServiceMonitor custom resources that, as their names suggest, define how to monitor a given pod or service. Since we did not deem it necessary to create Services for each monitoring-related port, we opted to manage PodMonitors in the SF-Operator. All you need is to specify the names of the "monitoring" ports to scrape on the Pod, and to set a label selector (in our case, every PodMonitor related to a SF deployment has a label called sf-monitoring set to the name of the monitored application).
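Putting it together, a PodMonitor is a short resource; this sketch uses illustrative names (the selector label, port name, and resource name are not necessarily those used in the SF-Operator):

```yaml
# Hypothetical PodMonitor: scrape the named metrics port of every Pod
# matching the selector. The sf-monitoring label lets a Prometheus
# instance discover this PodMonitor automatically.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: logserver-monitor
  labels:
    sf-monitoring: logserver
spec:
  selector:
    matchLabels:
      app: logserver
  podMetricsEndpoints:
    - port: logserver-metrics  # must match a named container port
```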
If a cluster-wide Prometheus instance exists, for example if you're using an OpenShift cluster with this feature enabled, you can then access metrics from your SF deployment as soon as it is deployed. Otherwise you can use the sfconfig prometheus CLI command to deploy a tenant-scoped Prometheus instance with the proper label selector configured to scrape only SF-issued metrics.
Injecting monitoring knowledge into the operator
So far, we've seen how deploying our application with an operator allowed us to also pre-configure the monitoring stack. We're emitting metrics and collecting them, but what should we do with this window on our system?
We should, obviously, define alerts so that we can know when the application is not running optimally, or worse. And as you probably guessed already, there's a prometheus-operator defined Custom Resource for that: the PrometheusRule.
The resource is very straightforward to use, as can be seen in the log server controller for example. Once again, we scope our PrometheusRules to the sf-monitoring label and they will be picked up automatically by the right Prometheus instance.
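For illustration, here is what such a rule could look like; the alert name, job label, and thresholds are hypothetical, not copied from the SF-Operator's actual rules:

```yaml
# Hypothetical PrometheusRule: alert when the log server's volume has
# less than 10% free space for 30 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: logserver-rules
  labels:
    sf-monitoring: logserver
spec:
  groups:
    - name: logserver.rules
      rules:
        - alert: LogserverDiskAlmostFull
          expr: |
            node_filesystem_avail_bytes{job="logserver"}
              / node_filesystem_size_bytes{job="logserver"} < 0.10
          for: 30m
          labels:
            severity: critical
          annotations:
            description: "Less than 10% free space left on the log server volume."
```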
What's great is that with these rules, developers of an operator can inject their knowledge and expertise about an application's expected behavior. My team and I have been running Zuul and Nodepool at scale for several large deployments for years, so we know a thing or two about what's interesting to monitor and what should warrant immediate remediation action. Now we can easily add this knowledge in a way that future deployers can benefit from almost immediately.
At the time of this writing, the foundations of the monitoring stack in SF-Operator have just landed in the code base. Now that they are in place, I'd like to experiment further with the following:
Controller metrics

The kubebuilder documentation about metrics explains how to publish default performance metrics for each controller in an operator. It is also possible to add and emit custom metrics.
On a purely operational level, these metrics are less interesting to us than the operands' metrics. However, it would probably be good to keep an eye on increases of controller_runtime_reconcile_errors_total and on the evolution of controller_runtime_reconcile_time_seconds to catch performance fluctuations.
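Watching those two metrics could itself be expressed as a small rule group; this is a sketch with made-up alert names and thresholds:

```yaml
# Hypothetical rules on controller-runtime metrics: flag any reconcile
# error, and flag unusually slow reconciliations.
groups:
  - name: operator.rules
    rules:
      - alert: ReconcileErrors
        expr: increase(controller_runtime_reconcile_errors_total[10m]) > 0
        labels:
          severity: warning
      - alert: SlowReconciliation
        expr: |
          histogram_quantile(0.9,
            rate(controller_runtime_reconcile_time_seconds_bucket[10m])) > 30
        labels:
          severity: warning
```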
Event-driven autoscaling with KEDA

This is where the fun begins! The KEDA operator greatly expands the capabilities of Kubernetes' Horizontal Pod Autoscaler. While HPA relies on basic metrics like Pod CPU or memory use (or requires some additional effort to work with custom metrics), KEDA allows you to trigger your autoscaling with a lot more event types.
And among them... Prometheus queries.
We could provide predefined KEDA triggers based on relevant queries; for example, spawning new executors when the NotEnoughExecutors alert fires.
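A KEDA Prometheus trigger could look roughly like this; the target name, Prometheus address, query, and thresholds are all hypothetical, meant only to show the shape of the resource:

```yaml
# Hypothetical KEDA ScaledObject: scale the executor workload based on
# a Prometheus query rather than CPU/memory.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: zuul-executor-scaler
spec:
  scaleTargetRef:
    name: zuul-executor
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://sf-prometheus:9090
        query: sum(zuul_executors_accepting)  # made-up metric name
        threshold: "1"
```

KEDA then creates and manages the underlying HPA for us, polling Prometheus and adjusting the replica count as the query's value crosses the threshold.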
Log server autoresize
So far we have only considered metrics-driven scaling of pods horizontally. This works especially well for stateless applications, or stateful applications that have a strategy to configure the first deployed pod as a primary node or master, and every extra pod as a replica or slave. But the log server application isn't stateless (logs are stored) and a primary/replicas architecture would be hard, if not impossible, to implement correctly with HTTPD and SSH. And as stated before, Apache and SSH are virtually never bottlenecks for the Log server; but storage is. Kubernetes, and OpenShift as well for that matter, do not seem to address this need for storage autoscaling.
But since we deploy the Log server via an operator, it might be possible to circumvent this limitation like so:
- in the Log server controller's reconcile loop, use the RESTClient library or some other way to query the /metrics endpoint on the node exporter sidecar, or simply run du or similar
- compute how much free space is available
- if free space stays under 10% for a given period, increase the size of the log server's persistent volume by a predefined increment
- reconcile again later to check free space, and repeat
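One thing worth noting about the third step: growing a volume in place only works if the PersistentVolumeClaim's StorageClass has allowVolumeExpansion set to true, in which case the controller only needs to patch the requested size. The names and sizes below are illustrative:

```yaml
# Hypothetical PVC after one autoresize step: the controller bumped
# spec.resources.requests.storage from 10Gi to 15Gi; the storage
# provisioner expands the underlying volume (StorageClass must have
# allowVolumeExpansion: true).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: logserver-logs
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: expandable-storage  # example class with expansion enabled
  resources:
    requests:
      storage: 15Gi
```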
If these experimentations are successful, the day to day operation of our Zuul deployments is going to be so much easier!
I must say that working with the operator framework and monitoring, while a bit daunting initially, is starting to make a lot of sense in the long run. It is even beginning to feel exciting, considering all the possibilities it opens to make the operations side of my work much easier.
I feel like orchestration with Kubernetes and OpenShift is to managing applications what packaging RPMs has been to installing said applications: a lot of effort for packagers and operator developers, but deployers' lives are made so much easier for it. Kubernetes and OpenShift take it to the next level by adding the opportunity to inject lifecycle and management "intelligence", leading potentially to applications being able to "auto-pilot", freeing your time to focus on the really cool stuff.
I am really looking forward to experimenting and discovering more of what operators can offer.