Part VIII: Prometheus/Grafana and Loki
To monitor the cluster and aggregate its logs, we’ll set up several tools…
Prometheus/Grafana
First, we’ll create a login account for accessing the Grafana UI. Create a working directory and a username/password pair to store in a Kubernetes secret:
cd ~/workspace/picluster
poetry shell
mkdir -p ~/workspace/picluster/monitoring/kube-prometheus-stack
cd ~/workspace/picluster/monitoring/kube-prometheus-stack/
kubectl create namespace monitoring
echo -n ${USER} > ./admin-user
echo -n 'PASSWORD' > ./admin-password  # Change to desired password
kubectl create secret generic grafana-admin-credentials --from-file=./admin-user --from-file=admin-password -n monitoring
rm admin-user admin-password
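To confirm the secret landed as expected, you can decode one of its keys back out; this is just a read-back of what was stored above:
# Should print the username stored in the secret
kubectl -n monitoring get secret grafana-admin-credentials -o jsonpath='{.data.admin-user}' | base64 --decode; echo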
Add the helm repo:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Install kube-prometheus-stack using the credentials secret we just created and Longhorn for persistent storage (allocating 50Gi):
helm install prometheusstack prometheus-community/kube-prometheus-stack --namespace monitoring \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName="longhorn" \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes[0]="ReadWriteOnce" \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage="50Gi" \
  --set grafana.admin.existingSecret=grafana-admin-credentials
Wait for everything to come up with either of these:
kubectl --namespace monitoring get pods -l "release=prometheusstack"
kubectl get all -n monitoring
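If you would rather block until the pods are ready than poll by hand, a kubectl wait along these lines should work; it relies on the same release label used in the first command above:
kubectl -n monitoring wait pod -l "release=prometheusstack" --for=condition=Ready --timeout=10m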
At this point, you can look at the Longhorn console and see the 50Gi volume created for Prometheus. As with any Helm install, you can dump the values used by the chart with the following command, edit them, and then do a helm upgrade with the -f option and the updated YAML file:
helm show values prometheus-community/kube-prometheus-stack > values.yaml
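For example, after editing values.yaml, the upgrade mentioned above would look roughly like this (same release name as the install command earlier):
helm upgrade prometheusstack prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml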
Next, change the Grafana service from ClusterIP to NodePort (or LoadBalancer, if you have set that up):
kubectl edit -n monitoring svc/prometheusstack-grafana
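If you prefer a non-interactive change over kubectl edit, a one-shot patch of the service type should give the same result:
kubectl -n monitoring patch svc prometheusstack-grafana -p '{"spec": {"type": "NodePort"}}'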
From a browser, you can use any node’s IP and the NodePort shown in the service output to access the UI and log in with the credentials you created above:
NAME                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-operated     ClusterIP   None            <none>        9093/TCP,9094/TCP,9094/UDP   5m34s
service/prometheus-operated       ClusterIP   None            <none>        9090/TCP                     5m33s
service/prometheusstack-grafana   NodePort    10.233.22.171   <none>        80:32589/TCP                 5m44s
...
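If you just want the assigned NodePort, a jsonpath query along these lines should pull it out (this assumes the HTTP port is the first port listed on the service):
# Prints the NodePort assigned to the Grafana service
kubectl -n monitoring get svc prometheusstack-grafana -o jsonpath='{.spec.ports[0].nodePort}'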
In this example, that would be “http://10.11.12.190:32589”. A Prometheus data source is already set up, and you can use it to examine the cluster. Under Dashboards, there are some predefined dashboards; you can also build your own or import ones obtained elsewhere.
I found a repo from David Calvert (https://github.com/dotdc/grafana-dashboards-kubernetes.git) with some nice dashboards for nodes, pods, etc. I cloned the repo into my monitoring directory, and then from the Grafana UI clicked the plus sign at the top of the main page, selected “Import Dashboard”, clicked the drag/drop pane, navigated to one of the dashboard JSON files (k8s-views-global.json is nice), selected the predefined “Prometheus” data source, and clicked “Import”. This gives a screen with info on the nodes, network, etc.
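For reference, the clone step was just:
cd ~/workspace/picluster/monitoring
git clone https://github.com/dotdc/grafana-dashboards-kubernetes.git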
TODO: Setting up Prometheus to use HTTPS only.
Loki
For log aggregation, we can install Loki, persisting its data to Longhorn. I must admit I struggled to get this working; the steps below are repeatable, but there may be better ways to do it:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.storageClassName=longhorn \
  --set loki.persistence.size=20Gi \
  --set 'promtail.tolerations[0].key=CriticalAddonsOnly' \
  --set 'promtail.tolerations[0].operator=Exists' \
  --set 'promtail.tolerations[0].effect=NoExecute' \
  --set 'promtail.tolerations[1].key=node-role.kubernetes.io/control-plane' \
  --set 'promtail.tolerations[1].operator=Exists' \
  --set 'promtail.tolerations[1].effect=NoSchedule'
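As with the Prometheus volume, you can confirm that Longhorn bound the 20Gi claim for Loki, for example:
kubectl -n monitoring get pvc | grep loki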
Check the monitoring namespace and wait for a promtail pod to be running on each node (a quick filter for this is shown below). Once they are running, go to the Grafana UI and create a new data source of type “Loki”. For the URL, use “http://loki:3100” and click “Save & test”. I’m not sure why the Helm install didn’t create this data source automatically, or why the manual creation fails on the “test” part of “Save & test”, but the source is there and seems to work.
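For the promtail check mentioned above, a simple filter of the pod list does the job:
# One promtail pod should show up per node, all in Running state
kubectl -n monitoring get pods -o wide | grep promtail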
To use it, go to the Explore section and provide a query. With the default “builder” query mode, you can select a label (e.g. “pod”) and then the instance you are interested in. You can also switch from “builder” to “code” and run a query such as:
{stream="stderr"} |= `level=error`
This will show error logs from all nodes. You can click the “>” symbol at the left of an entry to expand it and show its fields.
Another query, which counts error lines over a five-minute window, is:
count_over_time({stream="stderr"} |= `level=error` [5m])
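You can also combine labels with the text filter to narrow the search; for example, restricting the same error filter to a single pod (the pod name here is just a placeholder):
{pod="some-pod-name", stream="stderr"} |= `level=error`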
Customizing Log Aggregation
You can customize the promtail configuration so that queries can use custom labels, instead of searching for specific text in log messages. To do that, first obtain the current promtail configuration:
cd ~/workspace/picluster/monitoring
kubectl get secret -n monitoring loki-promtail -o jsonpath="{.data.promtail\.yaml}" | base64 --decode > promtail.yaml
Edit this file, and under the “scrape_configs” section, you will see “pipeline_stages”:
scrape_configs:
  # See also https://github.com/grafana/loki/blob/master/production/ksonnet/promtail/scrape_config.libsonnet for reference
  - job_name: kubernetes-pods
    pipeline_stages:
      - cri: {}
    kubernetes_sd_configs:
We will add a new “match” stage under the “- cri:” line to do the matching for an app called “api”, as sketched below. TODO: Set up an app that logs JSON and then describe the full process. Use https://www.youtube.com/watch?v=O52dseg2bJo&list=TLPQMjkxMTIwMjOvWB8m2JEG4Q&index=7 for reference.
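As a rough sketch of where this is heading (to be filled in along with the TODO above), a match stage for a hypothetical “api” app whose containers log JSON might look something like the following; the selector, the JSON field names, and the promoted label are all placeholders:
pipeline_stages:
  - cri: {}
  # Hypothetical: only apply these stages to streams with the label app="api"
  - match:
      selector: '{app="api"}'
      stages:
        # Parse the JSON log line and extract the fields of interest
        - json:
            expressions:
              level: level
              msg: msg
        # Promote the extracted "level" value to a queryable Loki label
        - labels:
            level:
If you go this route, the edited promtail.yaml then has to be put back into the loki-promtail secret (for example, by recreating the secret from the file) and the promtail pods restarted so they pick up the change.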
Uninstalling
To remove Prometheus and Grafana, you must delete several CRDs, run helm uninstall, and remove the credentials secret:
kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheusagents.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd scrapeconfigs.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com
helm uninstall -n monitoring prometheusstack
kubectl delete secret -n monitoring grafana-admin-credentials
To remove Loki, you can helm uninstall:
helm uninstall -n monitoring loki
To clean up anything else that remains, you can remove the namespace:
kubectl delete ns monitoring