Google Cloud Operations for GKE by Example

John Tucker · Published in codeburst · Jan 17, 2021 · 10 min read


Exploring Kubernetes monitoring on Google Cloud Platform (GCP) through a concrete example.

Please note: For exploring Kubernetes logging on GCP, there is another article: Google Kubernetes Engine Logging by Example.

When we create a Kubernetes cluster on Google Kubernetes Engine (GKE), Google Cloud Operations for GKE is enabled by default.

Google Kubernetes Engine (GKE) includes native integration with Cloud Monitoring and Cloud Logging. When you create a GKE cluster, Cloud Operations for GKE is enabled by default and provides a monitoring dashboard specifically tailored for Kubernetes.

— GCP — Overview of Google Cloud’s operations suite for GKE

Please note: As Legacy Logging and Monitoring is deprecated and will be decommissioned on March 31, 2021, this article does not cover it.

One area of confusion is the relationship between Stackdriver and Google Cloud Monitoring and Cloud Logging.

Our suite of operations products has come a long way since the acquisition of Stackdriver back in 2014. The suite has constantly evolved with significant new capabilities since then, and today we reach another important milestone with complete integration into the Google Cloud Console. We’re now saying goodbye to the Stackdriver brand, and announcing an operations suite of products, which includes Cloud Logging, Cloud Monitoring, Cloud Trace, Cloud Debugger, and Cloud Profiler.

— GCP — All together now: our operations products in one place

While the following video, Kubernetes Apps on Day 2: Effective Monitoring and Troubleshooting with Stackdriver (Cloud Next ‘18), is a couple of years old, it provides an excellent overview of Google Cloud Operations for GKE.

From the video, we learn that we can explore three different aspects of Google Cloud’s operations suite for GKE:

  • Basic Usage
  • Integration with Grafana
  • Integration with Prometheus

If you wish to follow along, you will need access to a GKE cluster with Cloud Operations for GKE enabled and to have downloaded the hello-cloud-ops-gke project.

Basic Usage

To explore the basic usage of Cloud Operations for GKE, we will deploy a sample workload running two Apache web servers exposed by the app-1 service.

From the project’s root folder, we start by creating the namespace-1 namespace by executing:

$ kubectl apply -f namespace-1-namespace.yaml
namespace/namespace-1 created
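
For reference, the namespace manifest is about as small as Kubernetes manifests get; a minimal sketch of what namespace-1-namespace.yaml contains (the project’s file is the source of truth):

apiVersion: v1
kind: Namespace
metadata:
  name: namespace-1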

We then create the sample workload in the namespace by executing:

$ kubectl apply -f app-1
service/app-1 created
configmap/app-1 created
deployment.apps/app-1 created
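
The project’s manifests are the reference; still, a trimmed sketch of what the app-1 deployment might look like helps ground the graphs below. The httpd image and container port are assumptions; the resource values (two replicas, 40m/60m CPU requests/limits and 16Mi memory per container) match the figures discussed next:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-1
  namespace: namespace-1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-1
  template:
    metadata:
      labels:
        app: app-1
    spec:
      containers:
        - name: apache
          image: httpd:2.4          # assumption; the project may use a different Apache image
          ports:
            - containerPort: 8080   # the app-1 service maps port 80 to this target port
          resources:
            requests:
              cpu: 40m
              memory: 16Mi
            limits:
              cpu: 60m
              memory: 16Mi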

We can load a web page served by the workload by first port forwarding our workstation’s port 8080 to the app-1 service by executing:

$ kubectl port-forward service/app-1 8080:80 -n namespace-1
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

We then load the web page in a browser.

While we could repeatedly reload the browser to generate traffic to the Apache web servers, it is more efficient to automate the process by creating a job.

$ kubectl apply -f load-job.yaml
job.batch/load created

This job generates 100 queries per second of traffic to the Apache web servers for five minutes.
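
The project’s load-job.yaml is authoritative; as a rough sketch of how such a load generator could be expressed (the busybox wget loop and the activeDeadlineSeconds cutoff are assumptions and will not hit exactly 100 queries per second):

apiVersion: batch/v1
kind: Job
metadata:
  name: load
  namespace: namespace-1
spec:
  activeDeadlineSeconds: 300   # stop generating load after five minutes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: load
          image: busybox
          command:
            - /bin/sh
            - -c
            - while true; do wget -q -O /dev/null http://app-1.namespace-1.svc.cluster.local; done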

We can monitor the sample workload using the Kubernetes Engine > Workloads menu in Google Cloud Console.

The CPU graph charts various sums of metrics of the two apache containers:

  • The blue line, with a value of 0.12, is the sum of their CPU limits; each container contributing 60m or 0.06 CPU
  • The red CPU line, with a value of 0.08, is the sum of their CPU requests; each container contributing 40m or 0.04 CPU
  • The orange CPU line is the sum of their usage; here we see the spike in CPU usage due to the traffic generated by the job

The Memory graph similarly charts the sum of memory metrics of the two apache containers; each container contributing 16Mi for both memory limits and requests. The combined usage is approximately a constant 12Mi.

We can also monitor the sample workload using the Monitoring > Dashboards > GKE menu. We can quickly drill down to monitor the sample workload; here we filter by cluster and namespace.

From the filtered screen, we select the app-1 workload and select Metrics.

Things to observe:

  • Here, instead of being summed, the two apache container metrics are shown separately
  • Here the usage graphs are presented as a percentage of the request, percentage of the limit, and usage

Finally, we can create custom graphs using the Monitoring > Metrics Explorer menu. Here we recreate the CPU usage graph from above.
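
One possible Metrics Explorer configuration for this graph (field names vary slightly between console versions):

Resource type: Kubernetes Container
Metric: CPU usage time (kubernetes.io/container/cpu/core_usage_time)
Filter: cluster_name = cluster-1 AND namespace_name = namespace-1
Group By: pod_name
Aligner: rate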

Things to observe:

  • The CPU usage time metric is actually a cumulative metric and this graph represents the rate of change of it over time
  • Custom graphs can be saved and organized into custom dashboards

Integration with Grafana

In the previous examples, we used Cloud Monitoring to both store and visualize the metric data. Here we use Grafana to visualize the metric data stored in Cloud Monitoring.

Grafana is open source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics no matter where they are stored. In plain English, it provides you with tools to turn your time-series database (TSDB) data into beautiful graphs and visualizations.

— Grafana Labs — Getting Started

Grafana ships with built-in support for Google Cloud Monitoring. Just add it as a data source and you are ready to build dashboards for your Google Cloud Monitoring metrics.

— Grafana Labs — Cloud Monitoring

In this example, we follow the instructions for using a Google Service Account key to authenticate to Cloud Monitoring rather than using the GCE default service account (or, since we will run the Grafana workload on the cluster, Workload Identity). The latter is the preferred approach, but it is a bit more complicated.
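
As a sketch of creating such a key with gcloud (the service account name is made up here and PROJECT_ID is a placeholder; the Grafana documentation lists the exact roles required, Monitoring Viewer being the essential one):

$ gcloud iam service-accounts create grafana-monitoring
$ gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:grafana-monitoring@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/monitoring.viewer
$ gcloud iam service-accounts keys create key.json \
    --iam-account grafana-monitoring@PROJECT_ID.iam.gserviceaccount.com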

With the Google Service Account JSON key file on hand, we will deploy a Grafana workload exposed by the grafana service.

We then create the Grafana workload in the namespace by executing:

$ kubectl apply -f grafana
service/grafana created
deployment.apps/grafana created
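
The project’s grafana folder is the reference; a minimal sketch of such a deployment and service (the stock grafana/grafana image is an assumption):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: namespace-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana
          ports:
            - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: namespace-1
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000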

We can load the Grafana UI by first port forwarding our workstation’s port 3000 to the grafana service by executing:

$ kubectl port-forward service/grafana 3000:3000 -n namespace-1
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000

Using a browser we log in to the Grafana UI with the username/password of admin/admin.

Using the Google Service Account JSON, we can add the Google Cloud Monitoring data source.

Please note: The Google Cloud Monitoring data source ships with a pre-built GKE Cluster Monitoring dashboard that can be imported; we, however, are not going to use it for this example.

To make the graph more interesting, we can first generate traffic to the Apache web servers.

$ kubectl apply -f load-job.yaml
job.batch/load created

We create a new dashboard and panel; we will recreate the graph of the app-1 workload’s CPU usage as we did in Cloud Monitoring. The query is constructed with the following settings:

Service: Kubernetes
Metric: CPU usage time
Filter: resource.label.cluster_name = cluster-1 AND resource.label.namespace_name = namespace-1
Aggregation: none
Advanced Options > Aligner: rate
Alias By: {{resource.label.pod_name}}

Things to observe:

  • This graph includes all the pods in the namespace-1 namespace
  • Unlike Cloud Monitoring, it does not appear we can filter by user metadata labels, i.e., the labels we add to Kubernetes pods

Integration with Prometheus

So far we have been monitoring Kubernetes container metrics; how might we monitor workload-specific metrics? As an example, we might want to monitor the rate of requests handled by the Apache servers.

The solution we will consider here is to have the Apache servers provide metrics in the Prometheus exposition format and configure Cloud Monitoring to consume them.

Prometheus is a monitoring tool often used with Kubernetes. If you configure Cloud Operations for GKE and include Prometheus support, then the metrics that are generated by services using the Prometheus exposition format can be exported from the cluster and made visible as external metrics in Cloud Monitoring.

— GCP — Using Prometheus

As we will have to modify our workload a bit for this example, we delete the workload that we have been using.

$ kubectl delete -f app-1
service "app-1" deleted
configmap "app-1" deleted
deployment.apps "app-1" deleted

Unfortunately, the official instructions, Using Prometheus, are fairly terse; here we will go through the steps in more detail.

First, we need to enable Workload Identity for the GKE cluster as documented in Using Workload Identity. If one is using a test cluster, it is easier to recreate the cluster with Workload Identity enabled.
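
For reference, Workload Identity can be enabled at cluster creation time with the --workload-pool flag (PROJECT_ID is a placeholder); enabling it on an existing cluster additionally requires updating the node pools:

$ gcloud container clusters create cluster-1 \
    --workload-pool=PROJECT_ID.svc.id.goog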

Next, we need to create a GCP service account with permissions to interact with Cloud Monitoring and Cloud Logging as documented in Hardening your cluster’s security; we name the service account prometheus.
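
A sketch of creating this service account and granting it the metric and log writer roles (PROJECT_ID is a placeholder; the hardening guide lists the full set of recommended roles):

$ gcloud iam service-accounts create prometheus
$ gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:prometheus@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/monitoring.metricWriter
$ gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:prometheus@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/logging.logWriter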

We then need to allow the Kubernetes service account (we will create later) to impersonate the GCP service account as documented in Using Workload Identity. Here the namespace is namespace-1 and the Kubernetes service account name will be prometheus.
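
Per the Workload Identity documentation, this amounts to granting the roles/iam.workloadIdentityUser role on the GCP service account to the Kubernetes service account’s member identity (PROJECT_ID is a placeholder):

$ gcloud iam service-accounts add-iam-policy-binding \
    prometheus@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[namespace-1/prometheus]"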

The sample workload is updated with a sidecar container, apache-exporter, that provides Apache server metrics in the Prometheus exposition format.
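
In the deployment, the exporter is simply an additional container alongside apache; a trimmed sketch of that container (the bitnami/apache-exporter image and the mod_status scrape URI are assumptions, and mod_status must be enabled in the Apache configuration):

        - name: apache-exporter
          image: bitnami/apache-exporter   # assumption; any apache_exporter image works
          args:
            - --scrape_uri=http://localhost:8080/server-status?auto   # Apache's mod_status endpoint
          ports:
            - containerPort: 9117          # Prometheus exposition format metrics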

We then create the updated sample workload in the namespace by executing:

$ kubectl apply -f app-1-prometheus
service/app-1 created
configmap/app-1 created
deployment.apps/app-1 created

As before, the app-1 service exposes the Apache servers on port 80; in addition, it exposes the Apache servers’ metrics in Prometheus exposition format on port 9117.

We can examine these metrics by first port forwarding our workstation’s port 9117 to the app-1 service’s port 9117 by executing:

$ kubectl port-forward service/app-1 9117:9117 -n namespace-1
Forwarding from 127.0.0.1:9117 -> 9117
Forwarding from [::1]:9117 -> 9117

We then load the metrics in a browser.
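
The output is plain text in the Prometheus exposition format; an illustrative fragment (values and HELP text will differ) looks something like:

# HELP apache_accesses_total Current total apache accesses
# TYPE apache_accesses_total counter
apache_accesses_total 4321
# HELP apache_up Could the apache server be reached
# TYPE apache_up gauge
apache_up 1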

We now need to run a Prometheus server in the cluster; the workload is a bit complicated.

Things to observe:

  • service: prometheus: The service exposes the Prometheus web UI on port 9090 and the sidecar’s Prometheus exposition format metrics on port 9091
  • configmap: prometheus: This is the Prometheus configuration, which configures Prometheus to scrape its own metrics as well as all the Kubernetes service endpoints (the pods) that expose metrics in the Prometheus exposition format (including our updated apache workload). I found the article Monitoring Your Apps in Kubernetes Environment with Prometheus helpful in creating this configuration
  • clusterrole: prometheus: This defines the Kubernetes permissions required by the Prometheus server; the appropriate permissions are explained in the previously mentioned article
  • serviceaccount: prometheus: The Prometheus server runs as this Kubernetes service account. The service account is also bound to the GCP service account we created earlier using Workload Identity; we need to update the project id in the service account identifier for our GCP project (a trimmed sketch of this binding follows the list)
  • clusterrolebinding: prometheus: This is what binds the clusterrole to the serviceaccount
  • deployment: prometheus: The prometheus and sidecar container configurations are described in the article Using Prometheus; here we need to update the project id, cluster location, and cluster name
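
To make the moving parts concrete, here is a trimmed sketch of the Workload Identity binding and the sidecar configuration; PROJECT_ID, CLUSTER_LOCATION, and the sidecar image tag are placeholders, and the project’s manifests plus the Using Prometheus article remain the reference:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: namespace-1
  annotations:
    # binds this Kubernetes service account to the GCP service account via Workload Identity
    iam.gke.io/gcp-service-account: prometheus@PROJECT_ID.iam.gserviceaccount.com
---
# trimmed deployment snippet: the Stackdriver sidecar ships the scraped metrics to Cloud Monitoring;
# the two containers share the /data volume so the sidecar can read Prometheus's write-ahead log
      containers:
        - name: prometheus
          image: prom/prometheus
          args:
            - --config.file=/etc/prometheus/prometheus.yaml
            - --storage.tsdb.path=/data
        - name: sidecar
          image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.8.0
          args:
            - --stackdriver.project-id=PROJECT_ID
            - --stackdriver.kubernetes.location=CLUSTER_LOCATION
            - --stackdriver.kubernetes.cluster-name=cluster-1
            - --prometheus.wal-directory=/data/wal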

We then create the Prometheus workload in the namespace by executing:

$ kubectl apply -f prometheus
service/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
serviceaccount/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
configmap/prometheus created
deployment.apps/prometheus created

We can load the Prometheus web UI by first port forwarding our workstation’s port 9090 to the prometheus service by executing:

$ kubectl port-forward service/prometheus 9090:9090 -n namespace-1
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090

We then load the Prometheus web UI in a browser and inspect the rate of Apache server access of the app-1 workload.
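
For example, a PromQL expression like the following charts the per-second access rate (the one-minute window is arbitrary):

rate(apache_accesses_total[1m])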

Time series are identified by a name, here apache_accesses_total, and a set of labels, here instance and job. Under the hood, however, we can observe that these metrics also have label names beginning with __ that are reserved for internal use.

The sidecar container of the Prometheus workload is the process that uploads these Prometheus metrics into GCP Cloud Monitoring.

The Stackdriver collector for Prometheus constructs a Cloud Monitoring MonitoredResource for your Kubernetes objects from well-known Prometheus labels.

— GCP — Using Prometheus

The metrics are accessible under the Kubernetes Container resource type with metric names prefixed with external/prometheus/, e.g., external/prometheus/apache_accesses_total.

Through trial and error, we can determine that these metrics also include the following labels, constructed from the related internal-use Prometheus labels.

  • cluster_name
  • namespace_name
  • pod_name
  • container_name

To make the graph more interesting, we can first generate traffic to the Apache web servers.

$ kubectl apply -f load-job.yaml
job.batch/load created

We can now create a custom graph using the Monitoring > Metrics Explorer menu to chart the rate of Apache server access of the app-1 workload.
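
A possible configuration for this chart, mirroring the earlier CPU example (field names vary slightly between console versions):

Resource type: Kubernetes Container
Metric: external/prometheus/apache_accesses_total
Filter: cluster_name = cluster-1 AND namespace_name = namespace-1
Group By: pod_name
Aligner: rate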

The custom graph reflects this traffic: again, 100 queries per second across the two apache containers.

Wrap Up

Hope you found this useful; it was a rather challenging article to write.
