Something is slowly eating your application's memory in production. It's not crashing — not yet. It's growing, steadily, pod by pod, until the system decides to kill it and your users start seeing errors. You've been here before. You know the answer is buried somewhere — across your monitoring dashboards, your container metrics, your application profiler, and the notes someone on the team made the last time this happened. Maybe those notes are in a wiki. Maybe they're in someone's head. Maybe they left the company six months ago.
The problem isn't a lack of data. It's that no single person can hold all of it in their head at once.
We run a Kubernetes-based infrastructure on Google Cloud. When we hit this exact problem recently, we didn't reach for the usual playbook: a few days of tabbing between dashboards. Instead, we opened a terminal and had a conversation.
Not with a cloud console chatbot. Not with a generic AI assistant that knows about Kubernetes in theory. With an AI operating from our terminal, inside our infrastructure repository, connected to our actual monitoring tools, reading our actual engineering notes — and working through the diagnosis with us.
This is what AI-assisted infrastructure management actually looks like in 2026. It's less dramatic than the hype suggests, and more useful than most people expect.
What the AI Actually Knows

When we ask the AI to investigate an issue, it's not guessing. It's not working from general knowledge about Kubernetes and hoping for the best. It has access to the real thing.
It can read our deployment manifests to see exactly how the application is configured — resource requests, health checks, replica counts, rolling update strategies. It can query the live cluster to see what's actually running right now, which pods are healthy, which are under pressure, how much memory and CPU they're consuming. It can pull incident timelines from our uptime monitoring to see when alerts fired and what the impact looked like from the outside. It can check error patterns from our error tracking to see what users were experiencing. It can review performance profiles from our APM tool to see where the application is spending its time and memory.
And then there's the part that made the biggest difference. We gave it access to our engineering work logs — a running journal where we record what changed, when, why we changed it, what we tried that didn't work, and what we ruled out along the way. It can trace the git history of our configuration repository, seeing not just what the infrastructure looks like today, but how it evolved over months and what prompted each change.
All of this in one conversation. No context-switching. No trying to remember which dashboard had that metric, or when exactly that config change went in, or what the reasoning was behind it.
Normally this context is spread across half a dozen tools and at least one person's memory. The AI doesn't have better judgement than an experienced operator. It has better recall. And in infrastructure diagnosis, recall is usually the bottleneck.
How We Built This

The setup has four layers. None of them are particularly complex on their own. The power is in the combination.
The AI in the Terminal

We use Claude Code, Anthropic's command-line AI assistant, running directly inside our Kubernetes configuration repository. From there it can read our deployment manifests, Helm charts, and kustomize overlays. It can query live cluster state through kubectl and check GCP project configuration through gcloud. It understands the structure because it's sitting inside it.
This matters more than it sounds. A generic AI assistant can answer questions about Kubernetes. An AI that's sitting in your repo can answer questions about your Kubernetes — your specific resource requests, your health check configuration, your deployment strategy, your networking setup.
The Safety Guardrails

Here's the part that makes infrastructure teams nervous: an AI that can run kubectl commands against your production cluster. So we built guardrails.
A pre-execution hook intercepts every command before it runs. Read-only operations — get, describe, logs, top — flow through without interruption. The AI can look at anything freely. But anything that would modify the cluster — apply, delete, scale, patch — requires explicit human approval before it executes.
It's a simple bash script — pattern matching on command strings, returning a permission prompt for anything mutating. The hook runs outside the AI's control; it can't be overridden or suppressed. We've included the full script in the appendix if you want to adapt it for your own setup.
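The core idea fits in a few lines. Here's a minimal, illustrative sketch of the classification logic; the hardened script we actually run covers helm, gcloud, and obfuscation as well:

```shell
#!/usr/bin/env bash
# Minimal sketch of the guard idea: classify a command string as "allow"
# (read-only) or "ask" (mutating) with a pattern match. Illustrative only.
classify() {
  local cmd="$1"
  if echo "$cmd" | grep -qE 'kubectl.*\b(apply|delete|scale|patch)\b'; then
    echo "ask"    # mutating: require explicit human approval
  else
    echo "allow"  # read-only: flows through without interruption
  fi
}

classify "kubectl get pods"          # prints: allow
classify "kubectl delete pod web-0"  # prints: ask
```

In the real hook the decision is emitted as JSON for the AI runtime to act on; the sketch just prints it.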
The AI can diagnose. It can propose. But it cannot act without a human in the loop. This is deliberate, and not just for safety. We'll come back to why.
The Observability Bridge

The missing piece in most AI-assisted workflows is context. An AI that can only see your config files is useful. An AI that can also see your monitoring data is transformative.
We connected our AI environment to our key observability tools using MCP — the Model Context Protocol, an open standard that lets AI assistants communicate with external services through standardised connectors. Each connection has its own authentication, its own access controls, and its own scope.
Through MCP, our AI assistant can query our uptime monitoring (Better Stack) for incident timelines, pull error patterns from our error tracking service, and review performance profiles from our APM tool. Instead of switching between four browser tabs trying to correlate timestamps, the AI queries all of them in one conversation.
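Concretely, an MCP connection is declared in a small config file that the AI assistant reads on startup. The sketch below is illustrative: the `betterstack-mcp` command and the token variable name are placeholders, not our real connector.

```json
{
  "mcpServers": {
    "betterstack": {
      "command": "betterstack-mcp",
      "env": { "BETTERSTACK_API_TOKEN": "${BETTERSTACK_API_TOKEN}" }
    }
  }
}
```

Each server runs as its own process with its own credentials, which is what gives each connection an independent scope.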
The Institutional Memory

This is the layer nobody else seems to be talking about, and it might be the most important one.
We maintain engineering work logs — a running markdown journal of what changed, when, why, and what was tried along the way. When we tune a resource request, the log records the reasoning. When we investigate an issue and rule something out, that's recorded too. When we make an architectural decision, the trade-offs go in the log.
The AI can read these logs. It can also read the git history of our configuration repository, seeing not just the current state but how it evolved and why.
This means when the AI investigates an issue, it's not starting from a blank slate. It knows that resource requests were tuned three weeks ago, and why. It knows that a previous investigation already ruled out certain causes. It knows what was tried, what worked, and what didn't. It's building on accumulated knowledge — the kind of context that usually lives in one person's head and walks out the door when they leave.
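To make that concrete, here's what a typical entry in such a work log might look like (this one is invented for illustration):

```markdown
## 2026-01-14: web pods, raised memory requests

- Bumped `resources.requests.memory` from 512Mi to 768Mi after sustained
  memory-pressure alerts.
- Ruled out: connection pool leak (pool size stayed flat across the window).
- Still open: usage climbs again after ~3 days; suspect allocation pattern,
  not a leak.
```

The entries are cheap to write in the moment and compound in value: every "ruled out" line is a dead end the next investigation never has to walk down.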
What This Looks Like in Practice

Diagnosing Memory Growth

Back to our memory problem. We asked the AI to investigate why our application pods were consuming steadily more memory with no plateau.
In one conversation, it pulled memory trajectories from the cluster, showing the growth pattern across pods. It cross-referenced these with allocation profiles from our APM tool, identifying which endpoints were responsible for the heaviest memory allocation. It checked the memory allocator configuration and verified it was set up correctly. Then it read back through our work logs and found the context: when resource requests had last been adjusted, what the reasoning was, and what previous investigations had already ruled out.
The conclusion: this wasn't a memory leak in the traditional sense. It was fragmentation-driven growth caused by a handful of high-allocation endpoints — a search page generating dozens of redundant database queries, a billing page creating millions of short-lived objects per request. The allocator was working correctly; it was just being overwhelmed by the allocation pattern.
That's a nuanced diagnosis. It requires correlating container-level metrics with application-level profiling data, understanding how memory allocators behave under pressure, and knowing the history of how the system got to its current state. Doing it manually would mean days of switching between tools, re-reading old notes, and trying to hold the full picture in your head. The AI did it in one conversation because it could see everything at once.
Correlating an Incident

A different scenario: our uptime monitoring flagged an alert. Rather than opening the monitoring dashboard, then the cluster console, then the deployment history, then the work logs, we asked the AI to investigate.
It pulled the incident timeline from Better Stack, cross-referenced it with pod events and recent deployments from the cluster, and checked our work logs for any related changes. In one conversation it correlated the monitoring alert with what was actually happening in the infrastructure — and surfaced the relevant context from our engineering notes that explained why the affected component was configured the way it was.
The same investigation, done manually, means opening four tools, mentally correlating timestamps, and hoping you remember (or can find) the relevant context. The AI held it all in one place.
The Question We're Not Asking Enough

Everything above sounds like a straightforward win. And in the short term, it is. But there's an uncomfortable question underneath it, and we think the industry needs to start taking it seriously.
If the machine is doing the diagnostic reasoning — correlating the metrics, reading the logs, forming hypotheses, identifying root causes — what happens to the operator's skills over time?
There's a well-documented parallel in aviation. Pilots who rely heavily on autopilot systems gradually lose their manual flying skills. The phenomenon is studied enough that the industry now mandates regular hand-flying to maintain proficiency. The automation makes flying safer on average, but it creates a dangerous gap when the automation fails and the human needs to take over.
We haven't had this conversation in infrastructure yet. But we should.
Our read-only guardrails aren't just a safety feature. They're a forcing function. When the AI proposes a diagnosis and a remediation, a human still has to understand the reasoning well enough to approve the action. You can't just say "fix it" — you have to evaluate whether the fix is right. That friction is deliberate. It keeps operators engaged with the diagnostic process rather than passively accepting conclusions.
But even this has limits. Reviewing someone else's reasoning isn't the same as forming your own hypotheses. Over time, if you're always evaluating AI-generated diagnoses rather than building your own mental models, your diagnostic instincts atrophy.
And there's a subtler risk. The institutional memory layer — our engineering work logs — is only valuable because humans recorded their reasoning. If engineers stop maintaining those logs because "the AI handles it now," the AI loses its most valuable context source. The knowledge loop breaks. The AI becomes less effective precisely because it was effective enough that people stopped feeding it.
We don't have a neat answer for this. We think it's one of the most important questions in operational engineering right now, and we're not hearing enough people ask it.
Why This Matters for How We Hire

We wrote this article because we live in this world every day. When we're assessing engineers at Hyerhub, we understand what it means to debug memory behaviour in a containerised application, to design safety guardrails for AI-assisted operations, or to maintain the kind of engineering discipline that keeps institutional knowledge alive.
We're not matching keywords on CVs. We know what good looks like because we do the work ourselves.
If you're a company looking for engineers who operate at this level — or you're an engineer looking for a recruitment partner that genuinely understands what you do — that's what Hyerhub is for.
Appendix: The Full Guard Hook

Earlier we described our pre-execution hook in outline. Here's the full version, hardened against obfuscation attempts — where the AI might construct commands using variables, eval, or subshell expansion to avoid pattern matching.
#!/usr/bin/env bash
#
# PreToolUse hook: ensures mutating kubectl, helm, and gcloud commands
# always require user permission, even if they're in the allow list.

INPUT=$(cat)
COMMAND=$(echo "$INPUT" | jq -r '.tool_input.command // empty')

ask_permission() {
  local reason="$1"
  echo "{\"permissionDecision\":\"ask\",\"permissionDecisionReason\":\"$reason\"}"
  exit 0
}

# --- Obfuscation detection ---

# Variable assignments storing tool names (e.g. K=kubectl; $K apply)
if echo "$COMMAND" | grep -qE '\b\w+=(kubectl|helm|gcloud)\b'; then
  ask_permission "Variable assignment of infrastructure tool requires approval"
fi

# Variable or subshell expansion before mutating verbs
# (e.g. $K apply, ${CMD} delete, $(echo kubectl) apply)
if echo "$COMMAND" | grep -qE \
  '(\$\{?\w+\}?|\$\([^)]+\))\s*(apply|delete|create|patch|edit|replace|scale|rollout|drain|cordon|uncordon|taint|annotate|label|set|expose|autoscale|run|cp|exec|debug|install|uninstall|upgrade|push|rollback|update|deploy|resize|restart|restore|start|stop)\b'; then
  ask_permission "Variable-based command execution requires approval"
fi

# eval or shell-in-shell combined with infrastructure tools or mutating verbs
# (e.g. eval "kubectl apply", bash -c "helm install")
if echo "$COMMAND" | grep -qE '\b(eval|bash\s+-c|sh\s+-c)\b'; then
  if echo "$COMMAND" | grep -qE \
    '(kubectl|helm|gcloud|apply|delete|create|patch|scale|deploy|install|uninstall|upgrade|rollout|drain)\b'; then
    ask_permission "Obfuscated infrastructure command requires approval"
  fi
fi

# --- Direct command detection ---

# Mutating kubectl commands
if echo "$COMMAND" | grep -qE \
  'kubectl.*\b(apply|delete|create|patch|edit|replace|scale|rollout|drain|cordon|uncordon|taint|annotate|label|set|expose|autoscale|run|cp|exec|debug)\b'; then
  ask_permission "Mutating kubectl command requires approval"
fi

# Mutating helm commands
if echo "$COMMAND" | grep -qE \
  'helm.*\b(install|uninstall|upgrade|push|rollback|repo add|repo remove)\b'; then
  ask_permission "Mutating helm command requires approval"
fi

# Mutating gcloud commands
if echo "$COMMAND" | grep -qE \
  'gcloud.*\b(create|delete|update|patch|set|unset|add|remove|deploy|resize|reset|restart|restore|start|stop|suspend|resume|enable|disable|bind|import|export|move|attach|detach|rollback|clear|failover|promote|clone)\b'; then
  ask_permission "Mutating gcloud command requires approval"
fi
The obfuscation checks run first, catching evasion attempts before the direct pattern checks. Since the consequence is a permission prompt rather than a hard block, false positives are low-cost — the operator just approves and moves on. Feel free to adapt this to your own tooling.
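For completeness, a hook like this gets registered in the assistant's project settings so it runs before every shell command. The snippet below reflects Claude Code's hooks configuration as we understand it; the script path is hypothetical, so substitute wherever you keep yours:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/guard-infra.sh"
          }
        ]
      }
    ]
  }
}
```

The `matcher` restricts the hook to shell tool calls, which is the only place these commands can originate.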