Kubernetes-Native Synthetic Monitoring with Kuberhealthy
Martin · 2023-04-18 · via Martin Heinz's Blog

Synthetic monitoring can be a great tool for proactively identifying performance problems, checking availability of servers, monitoring DNS resolution and much more.

However, when it comes to synthetic testing, engineers often rely on 3rd party platforms such as Datadog or New Relic that provide this type of monitoring. If you're running your applications and services on Kubernetes though, you can spin up a synthetic monitoring platform yourself using Kuberhealthy. In this article we will take a look at how to deploy it, configure it, create synthetic checks and set up monitoring and alerting, all inside your own cluster.

Deploy It

Before we go ahead and deploy it, let's first answer one question: what actually is Kuberhealthy?

Kuberhealthy is a CNCF incubator project. It's a Kubernetes operator that provides the KuberhealthyCheck custom resource, which lets you create builtin or custom synthetic checks that test whether your cluster, its components or even external services are running as expected.

To deploy it, we first need to satisfy some prerequisites:

Kuberhealthy exposes synthetic check results as Prometheus metrics, so naturally we need a Prometheus stack running on our cluster. If you're running applications and services on Kubernetes, chances are you're also running a Prometheus stack, but in case you're not, or you want to follow along in a playground cluster, you can use the following:


minikube delete && minikube start \
  --kubernetes-version=v1.26.1 \
  --memory=6g \
  --bootstrapper=kubeadm \
  --extra-config=kubelet.authentication-token-webhook=true \
  --extra-config=kubelet.authorization-mode=Webhook \
  --extra-config=scheduler.bind-address=0.0.0.0 \
  --extra-config=controller-manager.bind-address=0.0.0.0

minikube addons disable metrics-server

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack -f values.yaml

# https://raw.githubusercontent.com/kuberhealthy/kuberhealthy/master/deploy/grafana/dashboard.json
kubectl apply -f https://gist.githubusercontent.com/MartinHeinz/6c52f677d7a2fd3a3ff1819190ecd59d/raw/\
    f53a93c425027d9051d1bcdfc3568c6f4a7d9505/kuberhealthy-dashboard-cm.yaml

The above creates a new minikube cluster with the flags necessary for running kube-prometheus-stack (Prometheus Operator), which it then deploys using a Helm Chart. You may notice that we also provided values.yaml during chart installation - this file includes the Alertmanager configuration that we will use later on to send Slack notifications/alerts for failing Kuberhealthy checks. This values.yaml can be found in this gist.

Finally, we also deployed a custom Grafana dashboard for Kuberhealthy. This dashboard is available in the Kuberhealthy repository, and here we deploy it as a ConfigMap which will be automatically read by kube-prometheus-stack.

With all of this deployed, we can access Prometheus, Grafana and Alertmanager using:


kubectl port-forward -n default svc/monitoring-kube-prometheus-prometheus 9090
kubectl port-forward -n default svc/monitoring-grafana 3000:80  # user: admin, password: prom-operator
kubectl port-forward -n default svc/monitoring-kube-prometheus-alertmanager 9093

With Prometheus out of the way, let's now deploy Kuberhealthy:


helm repo add kuberhealthy https://kuberhealthy.github.io/kuberhealthy/helm-repos
helm install -n kuberhealthy kuberhealthy kuberhealthy/kuberhealthy --create-namespace --values values.yaml

Kuberhealthy is also available as a Helm Chart, which makes it easy to deploy. We again provide a values.yaml (a different file from the one used for kube-prometheus-stack) to tweak the configuration:


# https://github.com/kuberhealthy/kuberhealthy/tree/master/deploy/helm/kuberhealthy
prometheus:
  enabled: true

  serviceMonitor:
    enabled: true
    release: monitoring
    namespace: default
    endpoints:
      # https://github.com/kuberhealthy/kuberhealthy/issues/726
      bearerTokenFile: ''
  prometheusRule:
    enabled: true
    release: monitoring
    namespace: default

check:
  daemonset:
    enabled: false
  deployment:
    enabled: false
  dnsInternal:
    enabled: false

We use values.yaml to enable integration with Prometheus and to specify how the serviceMonitor and prometheusRule provided by Kuberhealthy should be configured so that Prometheus recognizes them. This is done by setting the release field to the name of the Prometheus Helm release (monitoring, from helm install monitoring ...).

Kuberhealthy also comes with some builtin checks enabled by default - we disable those for the time being (check stanza).

Now we can test if it's running:


kubectl port-forward -n kuberhealthy svc/kuberhealthy 8080:80

curl localhost:8080 | jq .
{
  "OK": true,
  "Errors": [],
  "CheckDetails": {},
  "JobDetails": {},
  "CurrentMaster": "kuberhealthy-6b897c89cf-2jpt7"
}

curl 'localhost:8080/metrics'

# HELP kuberhealthy_running Shows if kuberhealthy is running error free
# TYPE kuberhealthy_running gauge
kuberhealthy_running{current_master="kuberhealthy-6b897c89cf-2jpt7"} 1
# HELP kuberhealthy_cluster_state Shows the status of the cluster
# TYPE kuberhealthy_cluster_state gauge
kuberhealthy_cluster_state 1
...

Generally, you shouldn't need to query these endpoints, because Kuberhealthy metrics are automatically scraped by Prometheus, but it can be handy for doing a manual check or debugging.

Here we curl the kuberhealthy service, which gives us the JSON status of the Kuberhealthy cluster. By default this also returns info for all checks in the cluster. If you want to filter by namespace, you can instead use e.g. curl 'localhost:8080/?namespace=default' for the default namespace.

In the response from the /metrics endpoint we see two metrics - kuberhealthy_cluster_state and kuberhealthy_running. The value of the former provides an "aggregated status" of all checks, meaning that if any of the checks in the cluster returns 0 (fail), then kuberhealthy_cluster_state will also be 0. The latter - kuberhealthy_running - tells us whether the Kuberhealthy cluster itself is running.
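
For a quick manual check, the text returned by /metrics can be read with a few lines of Python. The sketch below is a simplified parser, not a full implementation of the Prometheus exposition format: it assumes there are no timestamps and no spaces inside label values. The names parse_metrics and SAMPLE are made up for this illustration, and the sample text is copied from the output above.

```python
SAMPLE = """\
# HELP kuberhealthy_running Shows if kuberhealthy is running error free
# TYPE kuberhealthy_running gauge
kuberhealthy_running{current_master="kuberhealthy-6b897c89cf-2jpt7"} 1
# HELP kuberhealthy_cluster_state Shows the status of the cluster
# TYPE kuberhealthy_cluster_state gauge
kuberhealthy_cluster_state 1
"""


def parse_metrics(text):
    """Map each metric name to its value, ignoring labels and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.partition("{")[0]  # drop the {label="..."} part
        values[name] = float(value)
    return values


metrics = parse_metrics(SAMPLE)
healthy = (metrics["kuberhealthy_running"] == 1
           and metrics["kuberhealthy_cluster_state"] == 1)
print("healthy" if healthy else "unhealthy")
```

Both metrics have to be 1 for the cluster to count as healthy here, since one covers Kuberhealthy itself and the other the aggregated check results.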

Configuring Checks

With deployment and configuration done, we can start deploying our checks:


apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: ping-check
  namespace: kuberhealthy
spec:
  runInterval: 30m
  timeout: 10m
  podSpec:
    containers:
      - env:
          - name: CONNECTION_TIMEOUT
            value: "10s"
          - name: CONNECTION_TARGET
            value: "tcp://google.com:443"
        image: kuberhealthy/network-connection-check:v0.2.0
        name: main

Each check is an instance of the KuberhealthyCheck custom resource, and each of them specifies runInterval, timeout and podSpec. These three fields set how often the check should run, after how long it should fail due to timeout, and the YAML spec of the Pod that will run the check. For the podSpec, the important parts are image and env. The image decides which check we will run, e.g. kuberhealthy/network-connection-check performs a ping check, while the kuberhealthy/ssl-expiry-check image runs an SSL certificate expiration check. The env variables passed to the Pod (and container) configure how the check will be run, e.g. which host it should query (CONNECTION_TARGET).
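
If you end up with many similar checks, the manifest can also be generated. The following is a minimal sketch that builds the same ping-check resource as a Python dict and prints it as JSON; make_check is a hypothetical helper written for this example, not part of Kuberhealthy, and the defaults simply mirror the values used above.

```python
import json


def make_check(name, image, env, run_interval="30m", timeout="10m",
               namespace="kuberhealthy"):
    """Build a KuberhealthyCheck manifest as a plain dict."""
    return {
        "apiVersion": "comcast.github.io/v1",
        "kind": "KuberhealthyCheck",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "runInterval": run_interval,  # how often the check runs
            "timeout": timeout,           # when a run is considered failed
            "podSpec": {
                "containers": [{
                    "name": "main",
                    "image": image,       # decides which check is run
                    "env": [{"name": k, "value": str(v)}
                            for k, v in env.items()],
                }]
            },
        },
    }


ping = make_check(
    "ping-check",
    "kuberhealthy/network-connection-check:v0.2.0",
    {"CONNECTION_TIMEOUT": "10s", "CONNECTION_TARGET": "tcp://google.com:443"},
)
print(json.dumps(ping, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output can be piped straight into kubectl apply -f -.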

After deploying this check, a check Pod will be created in the kuberhealthy namespace. We can check its logs:


kubectl logs -n kuberhealthy ping-check-1679831147
time="2023-03-26T11:45:53Z" level=info msg="Found instance namespace: kuberhealthy"
time="2023-03-26T11:45:53Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."
time="2023-03-26T11:45:53Z" level=info msg="Check time limit set to: 9m48.53977037s"
time="2023-03-26T11:45:53Z" level=info msg="CONNECTION_TARGET_UNREACHABLE could not be parsed."
time="2023-03-26T11:45:53Z" level=info msg="Running network connection checker"
time="2023-03-26T11:45:53Z" level=info msg="Successfully reported success to Kuberhealthy servers"
time="2023-03-26T11:45:53Z" level=info msg="Done running network connection check for: tcp://google.com:443"

And it was successful!

We've now tested a simple ping. What else can we deploy?


apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: duration-check
  namespace: kuberhealthy
spec:
  runInterval: 5m
  timeout: 10m
  podSpec:
    containers:
      - name: main
        image: kuberhealthy/http-check:v1.5.0
        imagePullPolicy: IfNotPresent
        env:
          - name: CHECK_URL
            value: "https://httpbin.org/delay/9"
          - name: COUNT
            value: "5"
          - name: SECONDS
            value: "1"
          - name: REQUEST_TYPE
            value: "GET"
          - name: PASSING
            value: "80"
---
apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: http-content-check
  namespace: kuberhealthy
spec:
  runInterval: 60s
  timeout: 2m
  podSpec:
    containers:
      - image: kuberhealthy/http-content-check:v1.5.0
        imagePullPolicy: IfNotPresent
        name: main
        env:
          - name: "TARGET_URL"
            value: "https://httpbin.org/anything/whatever"
          - name: "TARGET_STRING"
            value: "whatever"
          - name: "TIMEOUT_DURATION"
            value: "30s"

Here we use 2 more builtin checks - http-check and http-content-check - which serve a similar use case as the ping check. The first one lets you make HTTP request(s) to a specified URL, with a customizable number of requests, expected passing percentage, timeout for individual requests, as well as request type.

The other lets you check whether some value (TARGET_STRING) is present in the response from a requested URL.
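
To make the interplay of COUNT and PASSING in the http-check more concrete, here is a small sketch of the pass/fail decision. It assumes that a run passes when the share of successful requests is at least PASSING percent. Whether the actual image uses "at least" or "more than" is an assumption on my part, and run_passes is a name made up for this example.

```python
def run_passes(results, passing_percent):
    """Return True if enough individual requests succeeded.

    results: list of booleans, one per request (COUNT entries).
    passing_percent: the PASSING threshold, between 0 and 100.
    """
    if not results:
        return False  # no requests were made, so nothing can pass
    success_rate = 100 * sum(results) / len(results)
    return success_rate >= passing_percent


# COUNT=5 and PASSING=80, as in the duration-check above
print(run_passes([True, True, True, True, False], 80))   # 4 of 5 succeeded
print(run_passes([True, True, True, False, False], 80))  # 3 of 5 succeeded
```

With these settings a single failed request is tolerated, while two failures out of five fail the whole run.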

Another useful synthetic test is to check whether the SSL certificate of a website is about to expire. There's a builtin check for that too:


apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: website-ssl-expiry-30d
  namespace: kuberhealthy
spec:
  runInterval: 24h
  timeout: 15m
  podSpec:
    containers:
      - env:
          - name: DOMAIN_NAME
            value: "martinheinz.dev"
          - name: PORT
            value: "443"
          - name: DAYS
            value: "30"
          - name: INSECURE
            value: "false"  # Switch to 'true' if using 'unknown authority' (intranet)
        image: kuberhealthy/ssl-expiry-check:v3.2.0
        imagePullPolicy: IfNotPresent
        name: main
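
To illustrate what such an expiry check boils down to, here is a minimal sketch using only Python's standard library. It is not the code of the ssl-expiry-check image; days_left and fetch_not_after are hypothetical helpers, and the 30-day comparison mirrors the DAYS value above.

```python
import socket
import ssl
import time


def days_left(not_after, now=None):
    """Days until a certificate's notAfter timestamp (negative if expired)."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host, port=443, timeout=10):
    """Open a TLS connection and return the peer certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expires_soon(host, days=30):
    """True if the certificate served by host expires in fewer than `days` days."""
    return days_left(fetch_not_after(host)) < days


# Offline demonstration with fixed timestamps (no network access needed)
now = ssl.cert_time_to_seconds("Dec  6 09:34:43 2017 GMT")
print(days_left("Jan  5 09:34:43 2018 GMT", now))
```

Calling expires_soon("martinheinz.dev") performs the real lookup against the domain used in the check above. The default SSL context verifies the certificate chain, which roughly corresponds to leaving INSECURE set to "false".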

There are a couple more builtin checks you can use, but I don't want to go over every single one of them. Instead, I would recommend you check out the check registry, which includes a list of all the available builtin checks along with examples.

Monitoring

With all the checks in place, it's time to start monitoring their results. To do so, we will write some PromQL queries. Besides the kuberhealthy_cluster_state and kuberhealthy_running mentioned earlier, Kuberhealthy provides kuberhealthy_check{check='namespace/check-name'} and kuberhealthy_check_duration_seconds{check='namespace/check-name'} for each check. We will use those to build our monitoring rules and alerts.

Prometheus Graphs

For availability reasons, Kuberhealthy runs multiple replicas of the operator, which means that we will get multiple results (series) - one for each operator Pod - for each queried check.

To avoid that, we will only query results (series) from the current master Pod in the Kuberhealthy cluster, which provides the authoritative data. To do so, we will use the following query:


label_replace(kuberhealthy_check{check="kuberhealthy/ping-check"}, "current_master", "$1", "pod", "(.+)")
    * on (current_master) group_left()
    topk(1, kuberhealthy_running{}) < 1

Let's explain what's happening here. The kuberhealthy_running metric has a current_master label that refers to the master Pod in the Kuberhealthy cluster, while the kuberhealthy_check metric has a pod label which refers to the Pod from which it originates. We only want to query kuberhealthy_check series whose pod label value equals the current_master label value of the kuberhealthy_running metric, but to be able to match 2 metrics against each other, they need to have a label with the same name. So, we use the label_replace function to copy the value of the pod label in the kuberhealthy_check metric into a new current_master label. Now both metrics have a current_master label and we can query only the ones that match.
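
If the PromQL is hard to follow, the same matching logic can be emulated on plain Python data. This is only a rough model of the query above, not how Prometheus evaluates it internally. The pod names kuberhealthy-a and kuberhealthy-b are invented for the example, and authoritative is a hypothetical helper.

```python
def authoritative(check_series, running_series):
    """Keep only the check series reported by the current master Pod."""
    # topk(1, kuberhealthy_running): take the series with the highest value
    top = max(running_series, key=lambda s: s["value"])
    master = top["labels"]["current_master"]

    matched = []
    for series in check_series:
        labels = dict(series["labels"])
        # label_replace(...): copy the pod label into current_master
        labels["current_master"] = labels["pod"]
        # on (current_master): only series with an equal label value match
        if labels["current_master"] == master:
            matched.append({"labels": labels,
                            "value": series["value"] * top["value"]})
    return matched


checks = [
    {"labels": {"check": "kuberhealthy/ping-check", "pod": "kuberhealthy-a"}, "value": 1},
    {"labels": {"check": "kuberhealthy/ping-check", "pod": "kuberhealthy-b"}, "value": 1},
]
running = [{"labels": {"current_master": "kuberhealthy-b"}, "value": 1}]

result = authoritative(checks, running)
print([s["labels"]["pod"] for s in result])
```

Only the series coming from kuberhealthy-b survives the join, which is the "authoritative" result the query is after. Since kuberhealthy_running is 1 when Kuberhealthy is healthy, the multiplication leaves the check value unchanged.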

Alternatively, instead of taking results only from current master, you can simply use topk(1, kuberhealthy_check{check="kuberhealthy/ping-check"}) to grab the first series, which should work fine almost always, assuming the data is consistent across all cluster Pods.

Now, when you query any of the above kuberhealthy_* metrics, you will notice that they have a status label with value of 0/1. While a metric value obviously describes whether the check succeeded or failed, the status label describes whether there was an error (0) or not (1). This is important for monitoring, because there is a big difference between a failing check and a broken one.

Now we know what queries we can build, but we need to create PrometheusRule(s) from them to be able to set up monitoring/alerting:


apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: synthetics
  namespace: default
  labels:
    prometheus: prometheus
    release: monitoring
spec:
  groups:
    - name: synthetics
      rules:
        - alert: PingFailed
          expr: >
            label_replace(kuberhealthy_check{check="kuberhealthy/ping-check"}, "current_master", "$1", "pod", "(.+)")
            * on (current_master) group_left() topk(1, kuberhealthy_running{}) < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HTTP Ping failed
            description: "Kuberhealthy was not able to reach tcp://google.com:443"
        - alert: SslExpiryLessThan30d
          expr: topk(1, kuberhealthy_check{check='kuberhealthy/website-ssl-expiry-30d'}) < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Server certificate will expire in less than 30 days
            description: "Server certificate will expire in less than 30 days"
        - alert: DurationCheckFailed
          expr: >
            avg without (endpoint, container, service, current_master, exported_namespace, job)
            (label_replace(kuberhealthy_check_duration_seconds{check='kuberhealthy/duration-check'}, "current_master", "$1", "pod", "(.+)")
            * on (current_master) group_left() topk(1, kuberhealthy_running{})) > 50
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Check taking more than 50s to complete
            description: Check taking more than 50s to complete ().

The important parts above really are just the expr stanzas. In the first one we use the query that takes data from the Kuberhealthy master Pod and expects the value to be 1, otherwise the rule/alert gets triggered (after 5 minutes). In the second one we take the simpler approach and just grab the first value using topk.

In the third rule we use the kuberhealthy_check_duration_seconds metric to compute the average time it takes to run the check, and we have it fail if it's more than 50 seconds. A word of caution for the "duration" metrics though - they describe the lifetime of the whole check Pod, not the individual check attempts - you should take that into consideration when deciding the threshold for rule success/fail.
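
All three rules use for: 5m, which means the expression has to keep returning a result for 5 minutes before the alert fires. The sketch below is a simplified model of that behaviour (pending first, firing once the condition has held long enough). It ignores details such as evaluation intervals and alert resolution, and firing is a name made up for this example.

```python
def firing(evaluations, for_seconds=300):
    """Return True if the alert would be firing after the last evaluation.

    evaluations: list of (timestamp_seconds, condition_true) in time order.
    """
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None         # condition cleared: reset pending state
        elif active_since is None:
            active_since = timestamp    # condition became true: alert is pending
    if active_since is None:
        return False
    return evaluations[-1][0] - active_since >= for_seconds


# One evaluation per minute; the check fails the whole time
steady = [(t, True) for t in range(0, 301, 60)]
print(firing(steady))

# Same, but the check briefly recovers at t=120s
flapping = [(t, t != 120) for t in range(0, 301, 60)]
print(firing(flapping))
```

The second case shows why a short recovery delays the alert: the 5 minute clock starts over each time the expression stops matching.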

Prometheus Alerts

Finally, if you used the provided Alertmanager configuration shown in the beginning, you should be able to receive Slack alerts such as:

[Image: Slack alert]

In addition to these binary (true/false) rules, Kuberhealthy docs provide examples of calculating availability, utilization and latency from the available metrics.

Writing Custom Checks

So far we only used the builtin checks, which to be honest will be sufficient for most of the tests. However, it's possible to build your own. A custom check could - for example - test a service that requires authentication, or be a database-native check that validates whether it's possible to connect to/run a query against a DB.

Building a custom check would warrant a separate article, so to avoid making this article way too long, I will just leave you with the docs link, which explains how you can build a check image in the language of your choice.

Also, if you want some inspiration or a starting point, I have a repository with custom check(s), such as jq-check, which uses a jq query to check whether an HTTP JSON response contains the expected data/value.
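
To give a rough idea of what a custom check involves, here is a minimal Python sketch. It validates a JSON response body and builds a status payload with the same OK and Errors fields seen in the status JSON earlier. The reporting part is an assumption based on my reading of the Kuberhealthy docs: the KH_REPORTING_URL and KH_RUN_UUID environment variables and the kh-run-uuid header should be verified against the docs before use. All function names here are hypothetical.

```python
import json
import os
import urllib.request


def find_errors(body, key, expected):
    """Return a list of error strings if the JSON object's key != expected."""
    actual = json.loads(body).get(key)
    if actual != expected:
        return [f"expected {key}={expected!r}, got {actual!r}"]
    return []


def build_report(errors):
    """Build the status payload: OK is true only when there are no errors."""
    return {"OK": not errors, "Errors": list(errors)}


def send_report(payload):
    """POST the payload to Kuberhealthy (env var and header names are assumptions)."""
    request = urllib.request.Request(
        os.environ["KH_REPORTING_URL"],
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "kh-run-uuid": os.environ["KH_RUN_UUID"]},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Offline demonstration with a made-up response body
print(build_report(find_errors('{"status": "down"}', "status", "up")))
```

Inside a real check Pod, the script would fetch the response body from the service under test and finish by calling send_report with the payload.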

Conclusion

If you're running applications and services on Kubernetes, chances are you're also running a Prometheus stack and relying on application metrics for monitoring. While this is a good practice, it's not necessarily sufficient, as application metrics aren't suitable for monitoring everything, such as SSL expiration or server/database connectivity.

Tools like Kuberhealthy are great for filling these gaps in monitoring, while allowing you to use the same, familiar interface - that is - Kubernetes and Prometheus.

Finally, while Kuberhealthy runs on Kubernetes and aims at testing the Kubernetes cluster itself, it is not confined only to the cluster. It can also be used to test external resources, such as databases or services in the Cloud, or legacy applications that don't expose metrics.