Part VIII: Prometheus/Grafana and Loki
To monitor the cluster and aggregate its logs, we’ll set up several tools…
Prometheus/Grafana
First, we’ll create a login account for accessing the Grafana UI. Create a working directory and a username/password pair to store in a Kubernetes secret:
cd ~/workspace/picluster
poetry shell
mkdir -p ~/workspace/picluster/monitoring/kube-prometheus-stack
cd ~/workspace/picluster/monitoring/kube-prometheus-stack/
kubectl create namespace monitoring
echo -n ${USER} > ./admin-user
echo -n 'PASSWORD' > ./admin-password  # Change to desired password
kubectl create secret generic grafana-admin-credentials --from-file=./admin-user --from-file=admin-password -n monitoring
rm admin-user admin-password
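To confirm the secret landed as expected, you can decode one of its keys back out; this is just a read-back of what was stored above:
# Should print the username stored in the secret
kubectl -n monitoring get secret grafana-admin-credentials -o jsonpath='{.data.admin-user}' | base64 --decode; echo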
Add the helm repo:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Install kube-prometheus-stack using the credentials secret we just created and Longhorn for persistent storage (allocating 50Gi):
helm install prometheusstack prometheus-community/kube-prometheus-stack --namespace monitoring \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName="longhorn" \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]="ReadWriteOnce" \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage="50Gi" \
  --set grafana.admin.existingSecret=grafana-admin-credentials
Wait for everything to come up with either of these:
kubectl --namespace monitoring get pods -l "release=prometheusstack"
kubectl get all -n monitoring
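If you would rather block until the pods are ready than poll by hand, a kubectl wait along these lines should work; it relies on the same release label used in the first command above:
kubectl -n monitoring wait pod -l "release=prometheusstack" --for=condition=Ready --timeout=10m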
At this point, you can look at the Longhorn console and see the 50Gi volume created for Prometheus. As with any Helm install, you can dump the values used by the chart with the following command, edit them, and then do a helm upgrade with the -f option and the updated YAML file:
helm show values prometheus-community/kube-prometheus-stack > values.yaml
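For example, after editing values.yaml, the upgrade mentioned above would look roughly like this (same release name as the install command earlier):
helm upgrade prometheusstack prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml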
Next, change the Grafana service from ClusterIP to NodePort (or LoadBalancer, if you have set that up):
kubectl edit -n monitoring svc/prometheusstack-grafana
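If you prefer a non-interactive change over kubectl edit, a one-shot patch of the service type should give the same result:
kubectl -n monitoring patch svc prometheusstack-grafana -p '{"spec": {"type": "NodePort"}}'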
From a browser, you can use any node’s IP and the NodePort shown in the service output to access the UI and log in with the credentials you created above:
NAME                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-operated     ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   5m34s
service/prometheus-operated       ClusterIP   None            <none>        9090/TCP                     5m33s
service/prometheusstack-grafana   NodePort    10.233.22.171   <none>        80:32589/TCP                 5m44s
...
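If you just want the assigned NodePort, a jsonpath query along these lines should pull it out (this assumes the HTTP port is the first port listed on the service):
# Prints the NodePort assigned to the Grafana service
kubectl -n monitoring get svc prometheusstack-grafana -o jsonpath='{.spec.ports[0].nodePort}'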
In this example, that would be “http://10.11.12.190:32589”. A Prometheus data source is already set up, and you can use it to examine the cluster. Under Dashboards, there are some predefined dashboards; you can also build your own or import ones obtained elsewhere.
I found a repo from David Calvert (https://github.com/dotdc/grafana-dashboards-kubernetes.git) with some nice dashboards for nodes, pods, etc. I cloned the repo into my monitoring directory, and then from the Grafana UI clicked the plus sign at the top of the main page, selected “Import Dashboard”, clicked the drag/drop pane, navigated to one of the dashboard JSON files (k8s-views-global.json is nice), selected the predefined “Prometheus” data source, and clicked “Import”. This gives a screen with info on the nodes, network, etc.
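For reference, the clone step was just:
cd ~/workspace/picluster/monitoring
git clone https://github.com/dotdc/grafana-dashboards-kubernetes.git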
TODO: Setting up Prometheus to use HTTPS only.
Loki
For log aggregation, we can install Loki, persisting its data to Longhorn. I must admit I struggled to get this working; the steps below are repeatable, but there may be better ways to do it:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.storageClassName=longhorn \
  --set loki.persistence.size=20Gi \
  --set 'promtail.tolerations[0].key=CriticalAddonsOnly' \
  --set 'promtail.tolerations[0].operator=Exists' \
  --set 'promtail.tolerations[0].effect=NoExecute' \
  --set 'promtail.tolerations[1].key=node-role.kubernetes.io/control-plane' \
  --set 'promtail.tolerations[1].operator=Exists' \
  --set 'promtail.tolerations[1].effect=NoSchedule'
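As with the Prometheus volume, you can confirm that Longhorn bound the 20Gi claim for Loki, for example:
kubectl -n monitoring get pvc | grep loki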
Check the monitoring namespace and wait for a promtail pod to be running on each node (a quick filter for this is shown below). Once they are running, go to the Grafana UI and create a new data source of type “Loki”. For the URL, use “http://loki:3100” and click “Save & test”. I’m not sure why the Helm install didn’t create this data source automatically, or why the manual creation fails on the “test” part of “Save & test”, but the source is there and seems to work.
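For the promtail check mentioned above, a simple filter of the pod list does the job:
# One promtail pod should show up per node, all in Running state
kubectl -n monitoring get pods -o wide | grep promtail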
To use it, go to the Explore section and provide a query. With the default “builder” query mode, you can select a label (e.g. “pod”) and then the instance you are interested in. You can also switch from “builder” to “code” and run a query such as:
{stream="stderr"} |= `level=error`
This will show error logs from all nodes. You can click the “>” symbol at the left of an entry to expand it and show its fields.
Another query, which counts error lines over a five-minute window, is:
count_over_time({stream="stderr"} |= `level=error` [5m])
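You can also combine labels with the text filter to narrow the search; for example, restricting the same error filter to a single pod (the pod name here is just a placeholder):
{pod="some-pod-name", stream="stderr"} |= `level=error`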
Customizing Log Aggregation
You can customize the promtail configuration so that queries can use custom labels, instead of searching for specific text in log messages. To do that, first obtain the current promtail configuration:
cd ~/workspace/picluster/monitoring
kubectl get secret -n monitoring loki-promtail -o jsonpath="{.data.promtail\.yaml}" | base64 --decode > promtail.yaml
Edit this file, and under the “scrape_configs” section, you will see “pipeline_stages”:
scrape_configs:
  # See also https://github.com/grafana/loki/blob/master/production/ksonnet/promtail/scrape_config.libsonnet for reference
  - job_name: kubernetes-pods
    pipeline_stages:
      - cri: {}
    kubernetes_sd_configs:
We will add a new “match” stage under the “- cri:” line to do the matching for an app called “api”, as sketched below. TODO: Set up an app that logs JSON and then describe the full process. Use https://www.youtube.com/watch?v=O52dseg2bJo&list=TLPQMjkxMTIwMjOvWB8m2JEG4Q&index=7 for reference.
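As a rough sketch of where this is heading (to be filled in along with the TODO above), a match stage for a hypothetical “api” app whose containers log JSON might look something like the following; the selector, the JSON field names, and the promoted label are all placeholders:
pipeline_stages:
  - cri: {}
  # Hypothetical: only apply these stages to streams with the label app="api"
  - match:
      selector: '{app="api"}'
      stages:
        # Parse the JSON log line and extract the fields of interest
        - json:
            expressions:
              level: level
              msg: msg
        # Promote the extracted "level" value to a queryable Loki label
        - labels:
            level:
If you go this route, the edited promtail.yaml then has to be put back into the loki-promtail secret (for example, by recreating the secret from the file) and the promtail pods restarted so they pick up the change.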
Uninstalling
To remove Prometheus and Grafana, you must delete several CRDs, run helm uninstall, and remove the credentials secret:
kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheusagents.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd scrapeconfigs.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com
helm uninstall -n monitoring prometheusstack
kubectl delete secret -n monitoring grafana-admin-credentials
To remove Loki, you can helm uninstall:
helm uninstall -n monitoring loki
To clean up anything else that remains, you can remove the namespace:
kubectl delete ns monitoring