Every tutorial shows you how to run kubectl apply and deploy a pod. Almost none of them tell you what happens when that pod gets OOM-killed at 3 AM on a Saturday, your PersistentVolume fills up, or a node goes down during a rolling update. This is the guide we wish existed when we started running Kubernetes in production.
Resource Requests and Limits: Get Them Right
The single most impactful thing you can do for Kubernetes stability is setting accurate resource requests and limits. Requests determine scheduling — the scheduler places pods on nodes with enough available resources. Limits prevent runaway containers from starving their neighbors.
Here is what most teams get wrong: they either set no limits (dangerous — one container can consume an entire node) or set limits too low (pods get OOM-killed under normal load). The right approach is to observe actual usage over 2-4 weeks in production, then set requests at the p50 and limits at the p99 plus a 20% buffer.
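As a concrete sketch, a container spec following that rule might look like this. The numbers are purely illustrative — substitute the p50 and p99 values you actually observed:

```yaml
# Illustrative values only — derive your own from 2-4 weeks of usage data.
resources:
  requests:
    cpu: "250m"      # observed p50 CPU usage
    memory: "512Mi"  # observed p50 memory usage
  limits:
    cpu: "1"         # observed p99 plus ~20% buffer
    memory: "1Gi"    # observed p99 plus ~20% buffer
```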
Use Vertical Pod Autoscaler (VPA) in recommendation mode to get data-driven suggestions. Do not blindly apply its recommendations — review them and understand why your application uses the resources it does.
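A minimal VPA in recommendation-only mode looks like this; setting updateMode to "Off" makes it compute recommendations without ever evicting or resizing pods. The workload name is a placeholder:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa     # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout       # your workload here
  updatePolicy:
    updateMode: "Off"    # recommend only; never evict or resize pods
```

Read the suggestions with kubectl describe vpa checkout-vpa and compare them against what you know about the application before acting on them.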
Health Checks Are Not Optional
Every production pod needs three probes configured:
- Startup probe: Prevents Kubernetes from killing slow-starting containers. Essential for Java applications with long JVM warmup times.
- Liveness probe: Detects containers that are running but broken (deadlocked, stuck in an infinite loop). Kubernetes restarts them automatically.
- Readiness probe: Determines when a pod is ready to receive traffic. During rolling updates, this prevents traffic from hitting pods that have not finished initializing.
The most common mistake we see is using the same endpoint for all three probes. Your liveness probe should check if the process is alive. Your readiness probe should check if the application can serve requests — including database connectivity, cache availability, and downstream service health.
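Here is what that separation looks like in a container spec. The ports and paths (/healthz for a cheap aliveness check, /ready for the deeper dependency check) are illustrative:

```yaml
# Paths and port are illustrative — use your application's own endpoints.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # 30 failures x 10s period = up to 5 min to start
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz       # cheap check: is the process alive and responsive?
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready         # deeper check: database, cache, downstream services
    port: 8080
  periodSeconds: 5
```

The startup probe gates the other two: liveness and readiness checks only begin once the startup probe has succeeded, so a slow JVM warmup never triggers a restart loop.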
Observability: The Three Pillars
You cannot operate what you cannot observe. In a microservices architecture on Kubernetes, observability is not a nice-to-have — it is a survival requirement.
Metrics (Prometheus + Grafana)
At minimum, you need the RED metrics for every service: Rate (requests per second), Errors (error rate), and Duration (latency distribution). Use Prometheus with kube-state-metrics and node-exporter for cluster-level metrics. Build Grafana dashboards that show the health of your system at a glance.
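One way to standardize RED is a set of Prometheus recording rules. This sketch assumes your services expose the common http_requests_total counter (with a status label) and an http_request_duration_seconds histogram — swap in your own metric names:

```yaml
# Prometheus recording rules for RED metrics.
# Metric and label names are assumptions — match them to your instrumentation.
groups:
  - name: red-metrics
    rules:
      - record: service:request_rate:5m        # Rate
        expr: sum(rate(http_requests_total[5m])) by (service)
      - record: service:error_ratio:5m         # Errors (share of 5xx responses)
        expr: >
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
      - record: service:latency_p99:5m         # Duration (p99 latency)
        expr: >
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
```

Dashboards and alerts then query the precomputed series instead of re-evaluating the expensive expressions on every refresh.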
Logging (Loki or ELK)
Use structured JSON logging with correlation IDs that flow across service boundaries. Every log line should include the request ID, service name, and trace ID. Use Loki for cost-effective log aggregation or ELK if you need full-text search across terabytes of logs.
Tracing (Jaeger or Tempo)
Distributed tracing shows you exactly how a request flows through your system. When a customer reports that checkout is slow, tracing tells you whether the bottleneck is in the cart service, the payment gateway, or the inventory check. OpenTelemetry is the standard — instrument once, export to any backend.
Network Policies: Default Deny
By default, every pod in a Kubernetes cluster can talk to every other pod. This is terrifying from a security perspective. If an attacker compromises one service, they can reach every other service in the cluster.
Start with a default-deny network policy in every namespace, then explicitly allow only the traffic that should flow. The order service can talk to the payment service. The payment service cannot talk to the user profile service. Document your network policies as part of your architecture diagrams.
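A sketch of that pattern, using hypothetical namespace and label names: first a default-deny policy that selects every pod, then a policy that admits only order-service traffic into the payment service.

```yaml
# Default-deny for all pods in the namespace, both directions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments          # hypothetical namespace
spec:
  podSelector: {}              # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Explicitly allow the order service to reach the payment service.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-orders-to-payments
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment             # hypothetical label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: orders
          podSelector:
            matchLabels:
              app: order       # hypothetical label
```

One caveat: default-deny egress also blocks DNS, so pair it with a policy allowing UDP/TCP 53 to your cluster DNS, or pods will fail name resolution.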
Secrets Management
Kubernetes Secrets are base64-encoded, not encrypted; unless you enable encryption at rest, they sit in etcd in that form, and anyone with RBAC access to read secrets in a namespace can decode them trivially. For production workloads, use an external secrets manager:
- HashiCorp Vault with the Vault Agent Injector for dynamic secret generation
- AWS Secrets Manager with the External Secrets Operator
- Sealed Secrets for GitOps workflows where secrets need to live in version control
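As an example of the second option, here is a sketch of an ExternalSecret that syncs a credential from AWS Secrets Manager into a native Kubernetes Secret. Names and the remote path are hypothetical, and the API version varies between operator releases, so check the docs for the version you run:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payment-db-credentials     # hypothetical name
spec:
  refreshInterval: 1h              # re-sync from Secrets Manager hourly
  secretStoreRef:
    name: aws-secrets-manager      # a SecretStore configured separately
    kind: SecretStore
  target:
    name: payment-db-credentials   # the Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: prod/payment/db       # hypothetical path in Secrets Manager
        property: password
```

The application only ever sees a normal Secret; rotation happens upstream and the operator propagates it on the next refresh.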
Rotate secrets automatically. Audit secret access. Never hardcode credentials in container images or environment variables in your deployment manifests.
Rolling Updates and Rollbacks
Kubernetes rolling updates are powerful but have sharp edges. Key settings to configure:
- maxUnavailable: How many pods can be down during an update. Set to 0 for zero-downtime deployments.
- maxSurge: How many extra pods can be created during the update. Set to 25% for a balance of speed and resource usage.
- minReadySeconds: How long a new pod must be ready before the old one is terminated. Prevents premature rollover.
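Putting those three settings together in a Deployment looks like this (names and the image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                 # hypothetical name
spec:
  replicas: 4
  minReadySeconds: 15            # a new pod must stay ready 15s before an old one is terminated
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop below the desired replica count
      maxSurge: 25%              # at most one extra pod at a time (25% of 4)
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:v42   # hypothetical image
```

With maxUnavailable at 0 and maxSurge at 25%, the rollout creates one new pod, waits for it to be ready (and stay ready for minReadySeconds), then retires one old pod, and repeats.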
Always test your rollback procedure. Run kubectl rollout undo in staging before you need it in production. Ensure your database migrations are backward-compatible so rollbacks do not break data integrity.
Cost Optimization
Kubernetes makes it easy to over-provision. Without active cost management, your cloud bill will grow 30-50% faster than your traffic. Key strategies:
- Right-size your nodes: Use the cluster autoscaler with appropriate instance types. Spot instances for stateless workloads can save 60-70%.
- Set resource requests accurately: Over-requesting wastes money. Under-requesting causes instability. Measure and adjust quarterly.
- Use namespace resource quotas: Prevent any single team from consuming more than their fair share of cluster resources.
- Implement pod disruption budgets: Ensure critical services maintain minimum availability during node scaling events.
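The last two items are small manifests. A sketch of each, with hypothetical names, quotas, and labels:

```yaml
# Cap a team namespace's total resource footprint.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a              # hypothetical namespace
spec:
  hard:
    requests.cpu: "20"           # illustrative limits — size to your cluster
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
---
# Keep at least 2 checkout pods running through node drains and scale-downs.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout              # hypothetical label
```

The quota makes over-provisioning visible at the team boundary; the disruption budget makes the cluster autoscaler and node maintenance respect your availability floor.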
The Reality Check
Kubernetes is powerful, but it is not magic. It does not eliminate operational complexity — it shifts it. Instead of managing servers, you manage clusters. Instead of debugging process crashes, you debug pod scheduling, networking, and storage.
The teams that succeed with Kubernetes in production are the ones that invest in training, tooling, and operational discipline. They treat their cluster like production infrastructure — with monitoring, alerting, runbooks, and on-call rotations.
If you are running Kubernetes in production and want a second opinion on your setup, reach out. We do cluster health assessments and can identify quick wins for reliability and cost optimization.