Building an LLM-Powered Log Triage Pipeline with Python and DeepSeek-R1
Prajwol Adhi · 2026-05-10 · via DEV Community

Introduction

I have Prometheus and Grafana monitoring my homelab. I have Alertmanager sending Discord notifications when a node goes down. But there was a gap in the middle that kept bugging me.

Prometheus tells me that something is wrong. CPU is high. A container restarted. A scrape target is unreachable. What it does not tell me is why. For that, you need to read the logs. And reading Docker logs across multiple containers, multiple times a day, is the kind of task that feels productive for about ten minutes before you start skimming and missing things.

So I built something to read them for me. A Python script that runs every 15 minutes, pulls Docker container logs, checks for anything that looks critical, and sends the critical stuff to a small language model running on my Oracle Cloud instance. The model reads the raw log entry and writes a plain-English summary. That summary gets posted to a Discord channel.

Instead of me reading through hundreds of log lines and hoping I notice the important one, an LLM reads them and only bothers me when something actually matters.

This is not a fancy AI agent with tool use and multi-step reasoning. It is a straightforward automation — rules-based triage plus an LLM for summarization. But it solves a real problem I was actually having, and it taught me a lot about how to practically integrate an LLM into an infrastructure workflow.


Why not just use Alertmanager for everything?

Fair question. Alertmanager handles the metrics side well — if CPU spikes above 90% for five minutes, or if a node goes unreachable, it fires an alert. But metrics and logs are different things.

A container can be running fine from a metrics perspective — CPU normal, memory stable, responding to health checks — but still be logging errors internally. Maybe it is failing to connect to an upstream API. Maybe it is retrying a database connection every 30 seconds. Maybe there is a deprecation warning that will become a breaking change next release. None of that shows up in Prometheus metrics. All of it shows up in logs.

The log triage pipeline covers the gap between "the container is running" and "the container is healthy."


Chapter 1: The Architecture

The pipeline has six components, spread across two machines and the links between them:

On my local server (Waco, Texas):

  • The Python script that reads Docker logs and classifies severity
  • A cron job that runs the script every 15 minutes
  • Docker, whose containers produce the logs

On the Oracle Cloud instance (Phoenix, Arizona):

  • Ollama, serving the DeepSeek-R1 1.5B model as a REST API

In between:

  • Tailscale, connecting both machines over an encrypted mesh VPN
  • Discord webhooks, receiving the final alert messages

The separation is intentional. The LLM runs on the Oracle instance because it has 24GB of RAM, enough to load a small model comfortably. My local server has less headroom, and I did not want model inference competing with the Docker services it is supposed to be monitoring.

The Python script calls the Ollama API over Tailscale, so the traffic never touches the public internet. The model endpoint is not exposed to anyone outside my Tailscale network.


Chapter 2: Setting Up Ollama and DeepSeek-R1

Ollama makes self-hosting a language model surprisingly painless. On the Oracle instance, the setup was:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull deepseek-r1:1.5b


That is it. Ollama downloads the model and serves it as a REST API on port 11434. You can test it immediately:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Summarize this log entry: ERROR: database connection refused at 10.0.0.5:5432, retrying in 30s",
  "stream": false
}'


And it responds with a natural-language summary of what the log entry means.
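The same smoke test can be run from Python, which is closer to how the pipeline talks to Ollama later on. This is a minimal sketch using only the standard library. The helper names `build_generate_payload` and `generate` are made up for illustration, and the URL assumes Ollama is listening on its default port on the same machine.

```python
import json
import urllib.request

# Same endpoint as the curl test; adjust the host if Ollama runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(prompt, model="deepseek-r1:1.5b"):
    """Build the JSON body for a non-streaming /api/generate request."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, url=OLLAMA_URL, timeout=60):
    """POST the prompt to Ollama and return the text of the "response" field."""
    body = json.dumps(build_generate_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8")).get("response", "")


# Example call (needs a running Ollama server; wording varies between runs):
# print(generate("Summarize this log entry: ERROR: database connection "
#                "refused at 10.0.0.5:5432, retrying in 30s"))
```

Keeping the payload builder separate from the network call makes it easy to check the request shape without a model loaded.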

I chose the 1.5B-parameter model for a reason. It is small enough to run on the Oracle ARM instance without maxing out memory, and fast enough that inference takes a few seconds per log entry rather than minutes. For summarizing log lines, you do not need GPT-4-level intelligence. You need something that can read a stack trace and say "the database connection is failing" in plain English. The 1.5B model does that reliably.

A larger model would produce slightly more polished summaries, but the latency and memory tradeoff is not worth it for an automation that runs every 15 minutes. I would rather have fast and good enough than slow and perfect.


Chapter 3: The Python Script — Rules First, LLM Second

This is where the design decision that matters most lives. The script does not send every log line to the LLM. That would be slow, expensive on compute, and pointless — most log lines are routine. Instead, it uses a two-stage approach:

Stage 1: Rules-based severity classification. The script reads the last 15 minutes of logs from each Docker container using docker logs --since 15m. It then checks each line against a set of keyword patterns:

  • Lines containing error, fatal, critical, OOM, killed, panic, exception → classified as critical
  • Lines containing warn, timeout, retry, refused → classified as warning
  • Everything else → ignored

This is intentionally simple. I am not trying to build a perfect classifier. I am trying to filter out the 95% of log lines that say things like "request completed in 12ms" so the LLM only has to deal with the 5% that might actually matter.
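The keyword rules above can be sketched in a few lines. This is an illustrative version, not the exact code from my script: the function names are hypothetical, matching is a plain case-insensitive substring check, and critical keywords are tested first so a line containing both "error" and "retry" counts as critical.

```python
import subprocess

# Keyword lists taken from the two bullets above; compared in lowercase.
CRITICAL_KEYWORDS = ("error", "fatal", "critical", "oom", "killed", "panic", "exception")
WARNING_KEYWORDS = ("warn", "timeout", "retry", "refused")


def classify_line(line):
    """Return "critical", "warning", or "ignored" for a single log line."""
    lowered = line.lower()
    if any(keyword in lowered for keyword in CRITICAL_KEYWORDS):
        return "critical"
    if any(keyword in lowered for keyword in WARNING_KEYWORDS):
        return "warning"
    return "ignored"


def recent_log_lines(container, window="15m"):
    """Read the last `window` of logs for one container via the Docker CLI."""
    result = subprocess.run(
        ["docker", "logs", "--since", window, container],
        capture_output=True, text=True, timeout=10,
    )
    # Containers may write to either stream, so both are collected.
    return (result.stdout + result.stderr).splitlines()


def critical_lines(lines):
    """Keep only the lines that should be escalated to the LLM."""
    return [line for line in lines if classify_line(line) == "critical"]
```

Substring matching is cheap but blunt: "oom" also matches words like "zoom", which is one source of the false positives discussed later.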

Stage 2: LLM summarization. Only the lines classified as critical get sent to DeepSeek. The prompt is straightforward:

prompt = f"""You are a DevOps engineer reviewing system logs.
Summarize the following log entry in one or two sentences.
Explain what happened and whether immediate action is needed.

Log entry:
{log_line}"""


The model returns a summary like: "The Grafana container failed to authenticate with its PostgreSQL backend. The connection was refused, suggesting the database container may be down or the credentials have changed. Immediate investigation recommended."

That summary is what gets posted to Discord — not the raw log line, but the plain-English interpretation of it.
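One practical detail: DeepSeek-R1 wraps its reasoning in `<think>` tags before the actual answer, so the raw response needs cleaning before it goes to Discord (the complete script at the end does this too). A minimal sketch of the prompt builder and the cleanup step, with hypothetical function names; unlike a plain split, it also handles the case where generation was cut off before the closing tag.

```python
def build_prompt(log_line):
    """Fill the summarization prompt shown above with one log line."""
    return (
        "You are a DevOps engineer reviewing system logs.\n"
        "Summarize the following log entry in one or two sentences.\n"
        "Explain what happened and whether immediate action is needed.\n\n"
        "Log entry:\n"
        f"{log_line}"
    )


def strip_reasoning(raw):
    """Drop a leading <think>...</think> block and return the final answer."""
    if "</think>" in raw:
        return raw.split("</think>")[-1].strip()
    if "<think>" in raw:
        # Reasoning was cut off before the closing tag, so no answer survived.
        return ""
    return raw.strip()
```

Returning an empty string for a truncated response lets the caller decide whether to retry or to fall back to posting the raw log line.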


Chapter 4: The Discord Integration

Discord webhooks are probably the simplest notification integration you can set up. You create a webhook URL in your Discord server settings, and then posting to it is one HTTP request:

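import os

import requests

def send_discord_alert(summary, container_name, severity):
    # Read the webhook URL from the environment rather than hardcoding it
    webhook_url = os.environ["DISCORD_WEBHOOK_URL"]

    payload = {
        "embeds": [{
            "title": f"🔴 {severity.upper()}: {container_name}",
            "description": summary,
            "color": 15158332  # red
        }]
    }

    # Time out instead of hanging the cron run if Discord is unreachable
    requests.post(webhook_url, json=payload, timeout=5)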


The embed format gives you a clean, colored card in Discord rather than a wall of text. Critical alerts show up in red. Warnings could show up in yellow if I ever decide to surface those too — for now I only send critical ones to keep the noise low.

The webhook URL is stored as an environment variable, not hardcoded. I learned this the hard way earlier in the project when I accidentally shared webhook URLs in a chat and had to regenerate them. Treat webhook URLs like API keys — anyone with the URL can post to your channel.
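If warnings ever do get surfaced, the color can be picked from the severity. A rough sketch of a payload builder, under a few assumptions: the name `build_embed_payload` is hypothetical, the yellow value is my pick rather than anything from the running script, and the summary is cut to 4096 characters because that is Discord's limit for an embed description.

```python
# Decimal RGB values; red is the value used above, yellow is an assumed choice.
SEVERITY_COLORS = {"critical": 15158332, "warning": 16776960}
MAX_DESCRIPTION_CHARS = 4096  # Discord's limit for an embed description


def build_embed_payload(summary, container_name, severity):
    """Build the webhook JSON body for one alert."""
    return {
        "embeds": [{
            "title": f"{severity.upper()}: {container_name}",
            # Truncate so an overly long summary does not get the request rejected.
            "description": summary[:MAX_DESCRIPTION_CHARS],
            # Unknown severities fall back to the warning color.
            "color": SEVERITY_COLORS.get(severity, SEVERITY_COLORS["warning"]),
        }]
    }
```

The returned dictionary can be passed straight to `requests.post(webhook_url, json=...)`.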


Chapter 5: The Cron Job

The script runs every 15 minutes via cron on my local server:

*/15 * * * * /usr/bin/python3 /home/user/scripts/log-triage.py >> /var/log/log-triage.log 2>&1


Fifteen minutes is a balance between responsiveness and noise. Every 5 minutes would catch things faster but generate more Discord traffic during noisy periods (like when I am actively deploying something and containers are restarting). Every hour would miss things for too long. Fifteen minutes means I find out about a critical issue within fifteen minutes — which for a homelab is perfectly fine.

The output gets appended to its own log file, which is a bit meta — the log triage tool has its own logs. But it is useful for debugging when the script itself fails, which happened more than once during development.


Chapter 6: What I Learned Building This

The rules-based first stage is doing most of the work. I originally planned to send all logs to the LLM and let it figure out what was important. That was a mistake. The model was slow, the responses were inconsistent for routine log lines, and the Discord channel was flooded with summaries of perfectly normal events. Adding the keyword filter in front cut the LLM calls by about 95% and made the whole pipeline actually useful.

This is a pattern I have seen in every discussion about production LLM systems: you almost always want a cheap, fast filter in front of the expensive, slow model. Let the simple rules handle the simple cases. Only escalate to the LLM when something actually needs interpretation.

Small models are fine for specific tasks. There is a temptation to reach for the biggest model you can run. But for log summarization, the 1.5B parameter model produces perfectly adequate output. It occasionally misses nuance that a larger model would catch, but the summaries are accurate enough to tell me whether I need to investigate further. For an alerting pipeline, "accurate enough to trigger investigation" is the right bar — not "perfect analysis."

Self-hosting has real advantages for this use case. I could have called an external API like OpenAI or Anthropic instead of running my own model. But there are three reasons I did not:

  1. Cost: at 96 runs per day, even cheap API calls add up over months. The Oracle instance is free tier.
  2. Privacy: I am sending my infrastructure logs to the model. Even in a homelab, I would rather not send container logs to a third-party API.
  3. Latency: the Ollama instance responds in 2-3 seconds over Tailscale. An API call over the internet would be similar, but with more variable latency and the possibility of rate limiting.

This is not an AI agent. I want to be clear about what this is and what it is not. An agent makes decisions and takes actions: it might read a log, decide the database needs restarting, and execute the restart. This pipeline does not do that. It reads logs, summarizes them, and tells me about them. I am still the one who decides what to do. That is a deliberate choice, because I am not comfortable with automated remediation on infrastructure I actually depend on. Maybe in a future iteration.

What Could Be Better

There are obvious improvements I have not made yet:

Smarter classification. The keyword matching is crude. "Error" in a log line is not always an error — sometimes it is a log line about error handling working correctly, like "recovered from error successfully." A more sophisticated approach would use regex patterns tuned per container, or even a small classifier model. For now, the false positive rate is low enough that I live with it.
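To make the idea concrete, here is one way per-container regex rules could look. Everything in it is an assumption for illustration: the rule table, the `is_critical` name, and the specific patterns (the nginx rule relies on nginx tagging error-log lines with a bracketed level such as `[error]`). Each rule is a pair of an escalation pattern and an optional veto pattern.

```python
import re

# (escalate_pattern, veto_pattern or None). Word boundaries keep "errors" in
# "0 errors found" from matching the way a plain substring check would.
DEFAULT_RULE = (
    re.compile(r"\b(error|fatal|panic|exception)\b", re.IGNORECASE),
    re.compile(r"recovered from error", re.IGNORECASE),
)

# Hypothetical per-container overrides.
CONTAINER_RULES = {
    "nginx-proxy": (re.compile(r"\[(error|crit|alert|emerg)\]"), None),
}


def is_critical(container, line):
    """Escalate when the container's pattern matches and no veto pattern does."""
    escalate, veto = CONTAINER_RULES.get(container, DEFAULT_RULE)
    if not escalate.search(line):
        return False
    return not (veto is not None and veto.search(line))
```

With this table, "recovered from error successfully" is no longer escalated, while a real `level=error` line still is.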

Log aggregation with Loki. Right now, the script runs docker logs on each container individually. If I set up Grafana Loki, all container logs would flow into a central store, and the script could query Loki instead of Docker directly. That is a cleaner architecture and it is on my roadmap for a future phase.
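I have not built this yet, so treat the following as a sketch of what querying Loki might look like, not working pipeline code. It uses Loki's `query_range` HTTP endpoint. The base URL, the `container` label name (which depends on how the log shipper is configured), and all function names are assumptions.

```python
import json
import time
import urllib.parse
import urllib.request

LOKI_URL = "http://localhost:3100"  # assumed address; 3100 is Loki's default HTTP port


def build_query_range_url(container, minutes=15, now=None, base_url=LOKI_URL):
    """Build a query_range URL covering the last `minutes` for one container."""
    end_ns = int((time.time() if now is None else now) * 1_000_000_000)
    start_ns = end_ns - minutes * 60 * 10**9
    params = {
        "query": f'{{container="{container}"}}',  # label name is an assumption
        "start": str(start_ns),  # Loki accepts nanosecond Unix timestamps
        "end": str(end_ns),
        "limit": "1000",
    }
    return f"{base_url}/loki/api/v1/query_range?{urllib.parse.urlencode(params)}"


def extract_lines(response_body):
    """Flatten a query_range streams response into a plain list of log lines."""
    data = json.loads(response_body)
    lines = []
    for stream in data.get("data", {}).get("result", []):
        # Each value is a [timestamp, line] pair.
        lines.extend(value[1] for value in stream.get("values", []))
    return lines


def fetch_recent_lines(container):
    """Query Loki and return the recent log lines for one container."""
    with urllib.request.urlopen(build_query_range_url(container), timeout=10) as resp:
        return extract_lines(resp.read().decode("utf-8"))
```

The rest of the pipeline would stay the same: the lines returned here would go through the same keyword filter before anything reaches the model.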

Alert deduplication. If a container logs the same error repeatedly (like a connection retry every 30 seconds), the script will send the same alert multiple times. I should add a simple cache that tracks recently seen errors and suppresses duplicates within a time window.
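Because cron starts a fresh process every run, the cache has to live on disk rather than in memory. A minimal sketch of that idea, assuming a small JSON state file and a one-hour suppression window; the class name, file format, and window length are all hypothetical.

```python
import hashlib
import json
import os
import time


class AlertDeduplicator:
    """Suppress repeat alerts for the same container and message within a window."""

    def __init__(self, path, window_seconds=3600):
        self.path = path
        self.window_seconds = window_seconds

    def _load(self):
        """Read the fingerprint -> last-sent-timestamp map from disk."""
        if not os.path.exists(self.path):
            return {}
        with open(self.path, "r", encoding="utf-8") as handle:
            return json.load(handle)

    def should_send(self, container, message, now=None):
        """Return True and record the alert unless it was sent recently."""
        now = time.time() if now is None else now
        key = hashlib.sha256(f"{container}|{message}".encode("utf-8")).hexdigest()
        # Drop entries older than the window so the file does not grow forever.
        seen = {k: ts for k, ts in self._load().items()
                if now - ts < self.window_seconds}
        is_new = key not in seen
        if is_new:
            seen[key] = now
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(seen, handle)
        return is_new
```

In the main loop this would wrap the Discord call, roughly `if dedup.should_send(container, line): send the alert`. Log lines usually carry timestamps, so those would need to be stripped before hashing, otherwise every retry looks like a new message.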


The Monitoring Stack So Far

This pipeline sits alongside the rest of the observability stack I have been building across the hybrid cloud project:

  • Prometheus scrapes system metrics (CPU, memory, disk, network) from three geographically distributed nodes — my local server in Texas, Oracle Cloud in Arizona, and a shell server in the Netherlands.
  • Grafana visualizes those metrics on dashboards.
  • Alertmanager fires alerts to Discord when metric-based rules trigger (like a node going unreachable).
  • This Python pipeline covers the log side: reading container logs, summarizing critical entries with DeepSeek, and posting summaries to Discord.

Together, they give me visibility into both the system-level health (metrics) and the application-level behavior (logs) of the homelab. Not bad for infrastructure running on a laptop and a free-tier cloud instance.

Appendix: The Complete Script

Here is a cleaned-up version of the script. Replace the placeholder values with your own container names, Ollama endpoint, and Discord webhook URL.

#!/usr/bin/env python3
"""
LLM-Augmented Log Triage Pipeline
Rules-based severity classification + DeepSeek-R1 summarization.
Runs via cron every 15 minutes.
"""
import subprocess
import requests
import os
from datetime import datetime

# ── Configuration ──────────────────────────────────────────────
DISCORD_WEBHOOK_URL = os.environ.get("DISCORD_WEBHOOK_URL")
if not DISCORD_WEBHOOK_URL:
    raise ValueError("DISCORD_WEBHOOK_URL not set")

OLLAMA_URL   = "http://<your-ollama-host>:11434/api/generate"
OLLAMA_MODEL = "deepseek-r1:1.5b"

# Containers to monitor — adjust to match your Docker stack
CONTAINERS = [
    "prometheus",
    "grafana",
    "alertmanager",
    "nginx-proxy",
    "adguard",
]

# ── Stage 1: Rules-based triage ────────────────────────────────

# Keywords that trigger LLM analysis
ESCALATE_KEYWORDS = [
    "fatal", "panic", "oom", "killed", "out of memory",
    "disk full", "no space left", "corruption", "segfault",
    "exception", "unauthorized", "authentication failed",
    "permission denied", "container exited",
    "exit code 1", "exit code 2",
]

# Known-harmless patterns to ignore before keyword matching
IGNORE_PATTERNS = [
    "filter update",          # adguard routine
    "nginx reloaded",         # proxy routine
    "certificate renewed",    # TLS renewal noise
    "checkpoint",             # prometheus WAL compaction
    "compacted",              # prometheus normal
    "watching for new ooms",  # cadvisor startup
]


def get_container_logs(container, lines=30):
    """Pull the last N lines of logs from a Docker container."""
    try:
        result = subprocess.run(
            ["docker", "logs", "--tail", str(lines), container],
            capture_output=True, text=True, timeout=10
        )
        output = (result.stdout + result.stderr).strip()
        return output[-1500:] if output else "No output."  # keep the newest 1500 characters
    except Exception as e:
        return "Error: " + str(e)


def should_analyze(logs):
    """
    Rules-based filter. Strips known-harmless patterns first,
    then checks for escalation keywords.
    Returns (needs_analysis: bool, matched_keyword: str or None).
    """
    logs_lower = logs.lower()

    for pattern in IGNORE_PATTERNS:
        if pattern in logs_lower:
            logs_lower = logs_lower.replace(pattern, "")

    for keyword in ESCALATE_KEYWORDS:
        if keyword in logs_lower:
            return True, keyword

    return False, None


# ── Stage 2: LLM summarization ────────────────────────────────

def analyze_with_ai(container, logs, trigger_keyword):
    """Send critical logs to DeepSeek for plain-English summarization."""
    prompt = (
        "You are an SRE. A Docker container triggered an alert.\n\n"
        f"Container: {container}\n"
        f"Trigger keyword found: {trigger_keyword}\n\n"
        f"Logs:\n{logs}\n\n"
        "Explain in 2-3 sentences:\n"
        "1. What is the actual problem?\n"
        "2. How severe is it: critical or warning?\n"
        "3. What should the engineer do?\n"
    )

    try:
        resp = requests.post(
            OLLAMA_URL,
            json={
                "model": OLLAMA_MODEL,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": 0.1,
                    "num_predict": 1000,
                    "num_ctx": 1024,
                }
            },
            timeout=300
        )
        resp.raise_for_status()
        raw = resp.json().get("response", "").strip()

        # DeepSeek-R1 wraps reasoning in <think> tags — strip them
        if "<think>" in raw:
            raw = raw.split("</think>")[-1].strip()

        # Determine severity from the model's response
        raw_lower = raw.lower()
        severity = "warning"
        if "critical" in raw_lower and "not critical" not in raw_lower:
            severity = "critical"

        return {"analysis": raw, "severity": severity}

    except Exception as e:
        return {"analysis": "AI analysis failed: " + str(e), "severity": "warning"}


# ── Discord alerting ───────────────────────────────────────────

def send_discord_alert(container, trigger_keyword, analysis_result):
    """Post a formatted embed to Discord with the LLM summary."""
    severity = analysis_result.get("severity", "warning")
    colors = {"critical": 0xF85149, "warning": 0xE3B341}

    payload = {
        "embeds": [{
            "title": f"Alert — {container}",
            "color": colors.get(severity, 0xE3B341),
            "fields": [
                {"name": "Container",       "value": f"`{container}`",       "inline": True},
                {"name": "Severity",        "value": severity.upper(),       "inline": True},
                {"name": "Trigger keyword", "value": f"`{trigger_keyword}`", "inline": False},
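                # Discord caps field values at 1024 characters and rejects empty ones
                {"name": "AI Analysis",     "value": (analysis_result.get("analysis") or "No analysis returned.")[:1024], "inline": False},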
                {"name": "Time",            "value": datetime.now().strftime("%Y-%m-%d %H:%M:%S"), "inline": False},
            ],
            "footer": {"text": "Rules triage + DeepSeek-R1 1.5B"}
        }]
    }

    try:
        resp = requests.post(DISCORD_WEBHOOK_URL, json=payload, timeout=5)
        resp.raise_for_status()  # report 4xx/5xx responses instead of dropping them silently
    except Exception as e:
        print(f"Discord failed: {e}")


# ── Main loop ──────────────────────────────────────────────────

def main():
    print(f"\n[{datetime.now().strftime('%H:%M:%S')}] Log triage starting...")
    escalated = 0

    for container in CONTAINERS:
        logs = get_container_logs(container)
        needs_analysis, keyword = should_analyze(logs)

        if not needs_analysis:
            continue

        result = analyze_with_ai(container, logs, keyword)
        send_discord_alert(container, keyword, result)
        escalated += 1

    print(f"Done. {escalated}/{len(CONTAINERS)} containers escalated.")


if __name__ == "__main__":
    main()


To run it on a 15-minute schedule, add a cron job:

crontab -e


*/15 * * * * DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/your-webhook-here" /usr/bin/python3 /path/to/log-triage.py >> /var/log/log-triage.log 2>&1



What is Next

The hybrid cloud series continues with Part 3: K3s Kubernetes Cluster — setting up a K3s cluster with my local server as the control plane and the Oracle Cloud instance as a worker node, connected over Tailscale. Once that is running, I plan to containerize this log triage pipeline itself and deploy it as a Kubernetes workload, shipped through the CI/CD pipeline I built in Part 1. That would close the loop — the monitoring tool running inside the system it monitors, delivered through the same pipeline as everything else.

Stay tuned, and happy building.