DevOps & Deployment Essentials: Your Practical CI/CD Guide

If you're still deploying code by SSHing into a server and running git pull, this guide is for you.
Modern deployment isn't magic. It's automation. Consistency. Confidence.
Let's build that.

Part 1: The Pipeline – From Code to Production
Every production application needs this flow:
Code → Build → Test → Deploy → Monitor
↓
Feedback & Rollback if needed
Each stage is automated. When something fails, you catch it before users do.
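To make the "stop before users see it" idea concrete, here is a minimal JavaScript sketch of a pipeline runner that executes stages in order and halts at the first failure. The `runPipeline` function and the stage objects are hypothetical stand-ins for real build and test commands, not part of any CI tool.

```javascript
// Hypothetical sketch: run stages in order and stop at the first failure.
// Each stage is { name, run }, where run() returns true on success.
function runPipeline(stages) {
  const completed = [];
  for (const stage of stages) {
    let ok = false;
    try {
      ok = stage.run() === true;
    } catch (err) {
      ok = false; // a thrown error counts as a failed stage
    }
    if (!ok) {
      return { success: false, failedStage: stage.name, completed };
    }
    completed.push(stage.name);
  }
  return { success: true, failedStage: null, completed };
}

// Example: the test stage fails, so deploy never runs.
const pipelineResult = runPipeline([
  { name: 'build', run: () => true },
  { name: 'test', run: () => false },
  { name: 'deploy', run: () => true },
]);
console.log(pipelineResult);
```

Later stages are never invoked once one fails, which is the same guarantee a CI system gives you when jobs depend on each other.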
Stage 1: Code Commit (Git Workflow)
Your repository structure matters. A lot.
my-app/
├── .github/workflows/ # CI/CD pipelines
├── src/ # Source code
├── tests/ # Unit & integration tests
├── docker/ # Dockerfile & related
├── k8s/ # Kubernetes manifests
├── terraform/ # Infrastructure as code
├── README.md
└── .gitignore
Branch strategy that works:
main (production)
↑ (merge only via PR)
develop (staging)
↑ (merge feature branches)
feature/new-dashboard (your work)
feature/user-auth (teammate's work)
hotfix/critical-bug (urgent fix)
Rule: Never push to main directly. Always go through develop, create a pull request, get code review, run automated tests.
```bash
# You're working on a feature
git checkout -b feature/new-dashboard
git add .
git commit -m "feat(dashboard): add charts"
git push origin feature/new-dashboard

# Create PR on GitHub/GitLab
# Automated tests run
# Code review happens
# Merge to develop
# Automated deploy to staging
# Test on staging
# When ready, merge develop → main
# Automated deploy to production
```
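The branch rule can also be written down as a small policy check. The sketch below is illustrative only: `isMergeAllowed` is a hypothetical helper, and it assumes hotfix branches follow the same path through develop as feature branches. In practice you would enforce this with branch protection settings on your Git host.

```javascript
// Hypothetical policy check for the branch strategy above.
// Assumption: hotfix/* branches also merge into develop first.
function isMergeAllowed(source, target) {
  if (target === 'main') {
    return source === 'develop'; // production only receives what staging has seen
  }
  if (target === 'develop') {
    return source.startsWith('feature/') || source.startsWith('hotfix/');
  }
  return false; // no other merge targets exist in this workflow
}

console.log(isMergeAllowed('feature/new-dashboard', 'develop'));
console.log(isMergeAllowed('feature/new-dashboard', 'main'));
```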

Stage 2: Build (Docker)
Stop deploying Python/Node/Go directly. Use containers.
```dockerfile
# Dockerfile
FROM node:18-alpine

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies
RUN npm ci

# Copy source
COPY src ./src

# Health check
HEALTHCHECK --interval=30s --timeout=5s \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => {if (r.statusCode !== 200) throw new Error(r.statusCode)})"

# Expose port
EXPOSE 3000

# Start app
CMD ["npm", "start"]
```
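The HEALTHCHECK above expects the app to answer on `/health` with a 200. As a rough sketch of what that endpoint might look like, the handler below returns 200 when every dependency check passes and 503 otherwise. The `handleHealth` function and the `checks` map are hypothetical names chosen for illustration.

```javascript
// Hypothetical /health handler matching the HEALTHCHECK above.
// `checks` maps a dependency name to a function that returns true when healthy.
function handleHealth(req, res, checks = {}) {
  if (req.url !== '/health') {
    res.statusCode = 404;
    res.end('not found');
    return;
  }
  const failing = Object.keys(checks).filter((name) => {
    try {
      return checks[name]() !== true;
    } catch {
      return true; // a check that throws is treated as failing
    }
  });
  const healthy = failing.length === 0;
  res.statusCode = healthy ? 200 : 503;
  res.setHeader('Content-Type', 'application/json');
  res.end(JSON.stringify({ status: healthy ? 'ok' : 'degraded', failing }));
}
```

With Node's built-in `http` module this could be wired up as `http.createServer((req, res) => handleHealth(req, res, checks)).listen(3000)`.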
Why Docker:

✅ "Works on my machine" → "Works everywhere"
✅ Lock the versions of every dependency
✅ Easy to scale (run 10 copies)
✅ Security isolation
✅ Simple rollback (just switch image version)

Build optimization:
```dockerfile
# Bad: Bloated image
FROM node:18
COPY . .
RUN npm install
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]
# Result: ~500MB image

# Good: Multi-stage build
# Both stages use the Alpine base so native modules compiled in the
# builder still load in the final image.
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY src ./src
RUN npm run build

FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package.json .
EXPOSE 3000
CMD ["npm", "start"]
# Result: ~120MB image
```

Stage 3: Test (Automated)
Your CI pipeline runs tests automatically:
```yaml
# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [develop]

jobs:
  test:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: user
          POSTGRES_PASSWORD: testpass
          POSTGRES_DB: testdb
        ports:
          - 5432:5432 # Make the service reachable from steps on the runner
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-node@v3
        with:
          node-version: '18'

      - name: Install dependencies
        run: npm ci

      - name: Run linting
        run: npm run lint

      - name: Run unit tests
        run: npm run test:unit

      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgres://user:testpass@localhost:5432/testdb

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage/lcov.info

  build:
    needs: test # Only run if tests pass
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Build Docker image
        run: |
          docker build -t myapp:${{ github.sha }} .
          docker tag myapp:${{ github.sha }} myapp:latest

      - name: Push to registry
        run: |
          echo "${{ secrets.REGISTRY_TOKEN }}" | docker login -u "${{ secrets.REGISTRY_USER }}" --password-stdin
          docker push myapp:${{ github.sha }}
```

What this does:

Runs linting (catch style issues)
Runs unit tests (catch logic errors)
Runs integration tests with real database (catch integration issues)
Only if ALL pass, build Docker image
Push to registry

One failing test = no deploy. That's the point.
Stage 4: Deploy (Multiple Strategies)
Blue-Green Deployment (Safest)
Blue (current): v1.2.3 (users hitting this)
Green (new): v1.3.0 (being deployed)

Steps:

Deploy v1.3.0 to green
Run smoke tests on green
If good: Switch traffic from blue → green
If bad: Switch back to blue (instant rollback)
Keep blue running for 1 hour (safety net)
Canary Deployment (Progressive)
Version 1.2.3: 95% of traffic
Version 1.3.0: 5% of traffic

Monitor:

Error rates
Response times
Business metrics

If all good, shift traffic:

1.3.0: 25% → 50% → 100%

If problems appear, rollback immediately
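The canary loop described above can be sketched in a few lines of JavaScript. Everything here is an assumption made for illustration: `runCanary` is a hypothetical function, `getErrorRate` stands in for whatever your monitoring returns for the new version, and the 1% threshold is an arbitrary example value.

```javascript
// Hypothetical canary rollout: shift traffic in steps, roll back on bad metrics.
// getErrorRate(percent) is an assumed callback that returns the new version's
// error rate (0..1) while it serves `percent` of traffic.
function runCanary(getErrorRate, steps = [5, 25, 50, 100], maxErrorRate = 0.01) {
  const history = [];
  for (const percent of steps) {
    const errorRate = getErrorRate(percent);
    history.push({ percent, errorRate });
    if (errorRate > maxErrorRate) {
      // Problems appeared: send all traffic back to the old version.
      return { status: 'rolled_back', newVersionTraffic: 0, history };
    }
  }
  return { status: 'promoted', newVersionTraffic: 100, history };
}

// Example: errors spike once the canary reaches 25% of traffic.
const canaryOutcome = runCanary((percent) => (percent >= 25 ? 0.08 : 0.002));
console.log(canaryOutcome.status, canaryOutcome.history);
```

A real rollout would also wait between steps and watch latency and business metrics, not just the error rate.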
Rolling Deployment (Traditional)
Deploy gradually:

Take 1 instance down, deploy new version
Bring it up
Repeat for next instance
Users never experience full downtime

Downsides: Temporarily running mixed versions (harder to debug)
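To see why mixed versions are unavoidable during a rolling deployment, here is a small simulation. The `rollingUpdate` function is a hypothetical illustration that records what the fleet looks like after each instance is replaced.

```javascript
// Hypothetical rolling update: replace one instance at a time and record
// the fleet after each step to show the mixed-version window.
function rollingUpdate(instances, newVersion) {
  const fleet = [...instances];
  const snapshots = [];
  for (let i = 0; i < fleet.length; i++) {
    fleet[i] = newVersion; // instance i goes down and comes back on the new version
    snapshots.push([...fleet]);
  }
  return snapshots;
}

const rollingSteps = rollingUpdate(['v1.2.3', 'v1.2.3', 'v1.2.3'], 'v1.3.0');
rollingSteps.forEach((fleet, step) => console.log(`step ${step + 1}:`, fleet.join(', ')));
```

Every snapshot except the last contains both versions, so any request during the rollout may be served by either one.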

Part 2: Container Orchestration (Kubernetes Essentials)
You don't need to be a Kubernetes expert. You need to know:
Basic Concepts
Pod = Smallest unit (like a container wrapper)
Service = How pods talk to each other + expose to outside
Deployment = How you define what you want running
ConfigMap = Configuration (not secrets)
Secret = Passwords, API keys, etc. (base64-encoded by default, not encrypted unless you enable encryption at rest)
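One detail about Secrets is worth seeing for yourself: values in a Secret manifest's `data` field are base64-encoded, and base64 is an encoding rather than encryption. The short sketch below uses Node's built-in `Buffer`; the password string is a made-up example.

```javascript
// Base64 is reversible by anyone who can read the value: no key is involved.
const plainSecret = 's3cr3t-password'; // made-up example value
const encodedSecret = Buffer.from(plainSecret, 'utf8').toString('base64');
const decodedSecret = Buffer.from(encodedSecret, 'base64').toString('utf8');

console.log(encodedSecret); // what would appear under `data:` in the manifest
console.log(decodedSecret); // the original value, recovered without any key
```

This is why access to Secrets should be restricted and why they should stay out of version control.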
Simple Kubernetes Deployment
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3 # Run 3 copies
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:latest
          ports:
            - containerPort: 3000

          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10

          # Resource limits
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"

          # Environment variables
          env:
            - name: LOG_LEVEL
              value: "info"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: LoadBalancer # Expose to internet
  selector:
    app: myapp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
```
What happens:

Kubernetes creates 3 pods running your app
If one crashes, it's replaced automatically
Load balancer distributes traffic
Rolling updates: New version gradually replaces old
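The "replaced automatically" behavior comes from a control loop that keeps comparing what you asked for with what is running. The sketch below is a heavily simplified illustration of that idea, not how Kubernetes is implemented; `reconcile` and the pod objects are hypothetical.

```javascript
// Hypothetical reconcile step: compare the desired replica count with the
// pods that are actually running and decide what to start, stop, or clean up.
function reconcile(desiredReplicas, pods) {
  const running = pods.filter((p) => p.status === 'Running');
  const diff = desiredReplicas - running.length;
  return {
    toCreate: Math.max(diff, 0),
    toDelete: Math.max(-diff, 0),
    cleanUp: pods.filter((p) => p.status !== 'Running').map((p) => p.name),
  };
}

// Example: one of three pods has crashed, so one replacement is needed.
const reconcilePlan = reconcile(3, [
  { name: 'myapp-a', status: 'Running' },
  { name: 'myapp-b', status: 'Failed' },
  { name: 'myapp-c', status: 'Running' },
]);
console.log(reconcilePlan);
```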

Auto-scaling
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Auto-scales from 3 to 10 pods based on CPU/memory usage.
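The scaling decision itself is simple arithmetic. Kubernetes documents the core calculation as desired = ceil(current × currentMetric / targetMetric), taking the largest result when several metrics are configured. The sketch below applies that formula with the 3 to 10 bounds from the manifest; it deliberately ignores the tolerance band and stabilization windows that the real autoscaler applies, and `desiredReplicas` is a hypothetical helper name.

```javascript
// Simplified HPA calculation: one proposal per metric, take the largest,
// then clamp to the configured min and max replica counts.
function desiredReplicas(currentReplicas, metrics, min = 3, max = 10) {
  const proposals = metrics.map((m) =>
    Math.ceil(currentReplicas * (m.currentUtilization / m.targetUtilization))
  );
  const wanted = Math.max(...proposals);
  return Math.min(max, Math.max(min, wanted));
}

// Example: CPU is at 140% of requests against a 70% target, memory is at 40%
// against an 80% target, and 3 pods are running.
const hpaDecision = desiredReplicas(3, [
  { currentUtilization: 140, targetUtilization: 70 },
  { currentUtilization: 40, targetUtilization: 80 },
]);
console.log(hpaDecision);
```

CPU proposes 6 replicas and memory proposes 2, so the busier metric wins and the Deployment scales to 6.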

Part 3: Monitoring & Alerting
Deployed code that isn't monitored is just waiting to fail silently.
The Three Pillars of Observability

Logs (What happened)
```javascript
// Structured logging
logger.info('User login', {
  userId: user.id,
  timestamp: new Date(),
  ipAddress: req.ip,
  duration: 245 // ms
});

// Output:
// {"level":"info","message":"User login","userId":"123","timestamp":"2024-05-20T...","ipAddress":"192.168.1.1","duration":245}
```
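If you don't have a logging library in place yet, the core of structured logging is small. The following is a minimal sketch, with `formatLog` as a hypothetical helper: it emits one JSON object per line so that log tooling can filter on fields instead of parsing free text.

```javascript
// Hypothetical minimal structured logger: one JSON object per line.
// Fields passed by the caller can override the default timestamp.
function formatLog(level, message, fields = {}) {
  return JSON.stringify({
    level,
    message,
    timestamp: new Date().toISOString(),
    ...fields,
  });
}

console.log(formatLog('info', 'User login', { userId: '123', duration: 245 }));
```

A production logger adds log levels, redaction of sensitive fields, and output destinations, but the line format stays the same.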

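Metrics (What's the state)
```javascript
// Application metrics (assumes the prom-client package)
const { Histogram } = require('prom-client');

const httpDuration = new Histogram({
  name: 'http_request_duration_ms',
  help: 'Duration of HTTP requests in ms',
  labelNames: ['method', 'route', 'status_code'],
  // The default buckets are sized for seconds, so set millisecond buckets
  buckets: [50, 100, 250, 500, 1000, 2500, 5000]
});

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    // req.route is undefined when no route matched (for example a 404)
    const route = req.route ? req.route.path : 'unmatched';
    httpDuration.labels(req.method, route, res.statusCode).observe(duration);
  });
  next();
});
```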
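It helps to know what a histogram actually stores. Prometheus-style histograms keep cumulative buckets: each bucket counts every observation less than or equal to its upper bound, plus a running count and sum. The sketch below is a hypothetical in-memory version of that idea, not the prom-client implementation.

```javascript
// Hypothetical histogram with cumulative buckets: each bucket counts the
// observations that are <= its upper bound.
function createHistogram(bounds) {
  const sorted = [...bounds].sort((a, b) => a - b);
  const counts = new Array(sorted.length).fill(0);
  let total = 0;
  let sum = 0;
  return {
    observe(value) {
      total += 1;
      sum += value;
      sorted.forEach((bound, i) => {
        if (value <= bound) counts[i] += 1;
      });
    },
    snapshot() {
      const buckets = {};
      sorted.forEach((bound, i) => {
        buckets[bound] = counts[i];
      });
      buckets['+Inf'] = total; // the last bucket always holds every observation
      return { buckets, count: total, sum };
    },
  };
}

const latencyHistogram = createHistogram([100, 500, 1000]);
[50, 245, 700, 3000].forEach((ms) => latencyHistogram.observe(ms));
console.log(latencyHistogram.snapshot());
```

Percentiles such as p95 are later estimated from these bucket counts, which is why bucket boundaries should bracket the latencies you care about.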

Traces (How requests flow)
```javascript
// Distributed tracing
const span = tracer.startSpan('database.query');
span.setTag('query', sql);
let result;
try {
  result = await db.query(sql);
} finally {
  span.finish(); // Always close the span, even if the query throws
}

// Shows: Request → Service A → Service B → Database
// Plus: Time spent at each step
```
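To show what a tracer records under the hood, here is a toy in-memory version. It is a sketch only: `createTracer` and `trace` are hypothetical names, the clock is injectable so the example is deterministic, and it handles synchronous functions where a real tracer would also propagate context across services.

```javascript
// Hypothetical in-memory tracer: records how long each named step takes.
// `now` is injectable so timings can be controlled in tests.
function createTracer(now = () => Date.now()) {
  const spans = [];
  return {
    spans,
    trace(name, fn) {
      const start = now();
      try {
        return fn();
      } finally {
        // Recorded even if fn throws, so failed steps still show up.
        spans.push({ name, durationMs: now() - start });
      }
    },
  };
}

// Example with a fake clock: the request spends 5 ms of its own time,
// then 20 ms inside the database query.
let fakeClock = 0;
const demoTracer = createTracer(() => fakeClock);
const traceResult = demoTracer.trace('request', () => {
  fakeClock += 5;
  return demoTracer.trace('database.query', () => {
    fakeClock += 20;
    return 'rows';
  });
});
console.log(traceResult, demoTracer.spans);
```

Inner spans finish first, so the database span is recorded before the request span that contains it.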
Setting Up Alerts That Matter
```yaml
# Prometheus alert rules
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        # Share of requests returning 5xx over the last 5 minutes
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_active / db_connections_max > 0.9
        for: 2m
        annotations:
          summary: "Database connection pool 90% full"
      - alert: HighLatency
        # histogram_quantile needs per-bucket rates aggregated by the le label
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) > 1000
        for: 5m
        annotations:
          summary: "p95 latency > 1 second"
```
Key principle: Alert on symptoms, not facts.

❌ Alert: "CPU > 80%"
✅ Alert: "p95 latency > 1s" (high CPU matters only if users see it)
❌ Alert: "Disk 85% full"
✅ Alert: "Disk full in 24 hours at current rate" (gives time to act)
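The "disk full in 24 hours" alert is a simple projection rather than a threshold. As a rough sketch of the arithmetic, assuming linear growth between two usage samples, the hypothetical helpers below estimate the time remaining and decide whether to alert.

```javascript
// Hypothetical linear projection: hours until the disk is full, based on two
// usage samples. Returns Infinity when usage is flat or shrinking.
function hoursUntilFull(capacityGb, earlier, later) {
  const hours = (later.timeMs - earlier.timeMs) / 3600000;
  if (hours <= 0) return Infinity; // samples must be in chronological order
  const growthPerHour = (later.usedGb - earlier.usedGb) / hours;
  if (growthPerHour <= 0) return Infinity;
  return (capacityGb - later.usedGb) / growthPerHour;
}

function shouldAlertOnDisk(capacityGb, earlier, later, horizonHours = 24) {
  return hoursUntilFull(capacityGb, earlier, later) < horizonHours;
}

// Example: a 100 GB disk grew from 80 GB to 85 GB in two hours.
const diskEarlier = { timeMs: 0, usedGb: 80 };
const diskLater = { timeMs: 2 * 3600000, usedGb: 85 };
console.log(hoursUntilFull(100, diskEarlier, diskLater));
console.log(shouldAlertOnDisk(100, diskEarlier, diskLater));
```

At 2.5 GB per hour the remaining 15 GB lasts 6 hours, so this disk alerts at 85% full while a slow-growing disk at the same level would not.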

Part 4: Rollback & Recovery
Everything fails. What matters is how fast you recover.
Automated Rollback
```yaml
# GitHub Actions
- name: Deploy to production
  run: kubectl set image deployment/myapp myapp=myapp:v1.3.0
- name: Wait for rollout
  run: kubectl rollout status deployment/myapp --timeout=5m
- name: Run smoke tests
  run: npm run test:smoke
- name: If tests fail, rollback
  if: failure()
  run: kubectl rollout undo deployment/myapp
```
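The same deploy, verify, undo logic can be expressed in code, which is useful if you script deployments outside a CI system. This is a minimal sketch under stated assumptions: `deployWithRollback` and the `actions` object are hypothetical, and the steps are modeled as synchronous functions that throw on failure to keep the example short.

```javascript
// Hypothetical sketch of the deploy / verify / undo flow above.
// `actions` supplies deploy(), waitForRollout(), smokeTest() and rollback();
// each step signals failure by throwing.
function deployWithRollback(actions) {
  try {
    actions.deploy();
    actions.waitForRollout();
    actions.smokeTest();
    return { status: 'deployed' };
  } catch (err) {
    // Mirrors `if: failure()`: any failed step triggers the undo.
    actions.rollback();
    return { status: 'rolled_back', reason: err.message };
  }
}

// Example: smoke tests fail, so the rollback runs.
const deployLog = [];
const deployOutcome = deployWithRollback({
  deploy: () => deployLog.push('deploy'),
  waitForRollout: () => deployLog.push('wait'),
  smokeTest: () => {
    throw new Error('smoke tests failed');
  },
  rollback: () => deployLog.push('rollback'),
});
console.log(deployOutcome, deployLog);
```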
Database Rollback
```bash
# If you deployed a database migration that breaks

# Option 1: Have down migration (safe)
npm run migrate:down
npm run migrate:up # New fixed version

# Option 2: Restore from a snapshot (loses writes made after the snapshot)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier prod-db-restored \
  --db-snapshot-identifier prod-db-2024-05-20-03-00
```
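Option 1 only works if every migration ships with a matching down step. The sketch below shows the shape of that contract using a toy in-memory schema; `createMigrator`, the migration objects, and the column names are all hypothetical and not tied to any migration library.

```javascript
// Hypothetical migration runner: every migration defines both up and down,
// so the most recent change can always be reverted.
function createMigrator(migrations, db) {
  let applied = 0; // number of migrations currently applied
  return {
    up() {
      while (applied < migrations.length) {
        migrations[applied].up(db);
        applied += 1;
      }
    },
    down() {
      if (applied === 0) return;
      applied -= 1;
      migrations[applied].down(db);
    },
    version: () => applied,
  };
}

const dropColumn = (db, name) => {
  db.columns = db.columns.filter((c) => c !== name);
};
const exampleMigrations = [
  { up: (db) => db.columns.push('email'), down: (db) => dropColumn(db, 'email') },
  { up: (db) => db.columns.push('last_login'), down: (db) => dropColumn(db, 'last_login') },
];

const exampleDb = { columns: ['id'] };
const migrator = createMigrator(exampleMigrations, exampleDb);
migrator.up(); // apply everything
migrator.down(); // revert only the latest migration
console.log(exampleDb.columns, migrator.version());
```

Note that a down step that drops a column also drops its data, which is why destructive migrations deserve a backup first.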
The Runbook
Create this before you need it:
```markdown
# Incident Runbook: Database is Slow

## Symptoms

- p95 latency > 5s
- Users complain app is slow
- CPU on database high

## Immediate Actions (0-5 min)

- Check if we can scale database (vertical scale)
- Find long-running queries (PostgreSQL: SELECT pid, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state = 'active' ORDER BY runtime DESC) and cancel them with pg_cancel_backend(pid)
- If the slow query came from a recent deploy, roll back

## Root Cause (5-30 min)

- Check recent deployments: When did this start?
- Check slow query log
- Check if query plan changed

## Resolution

- Option A: Optimize query (add index)
- Option B: Rollback problematic deploy
- Option C: Scale database

## Prevention

- Add query time monitoring
- Add alert for p95 latency
- Load test before deploy
```

Part 5: Common Mistakes (And How to Avoid Them)
Mistake 1: Deploying Changes That Are Too Big
❌ 10,000 lines of code changed in one deploy
✅ 200-500 lines per deploy
Reason: When something breaks, you know exactly what caused it.
Mistake 2: No Rollback Plan
❌ "We're committed now"
✅ Every deploy has a rollback procedure
Mistake 3: Testing Manually
❌ "We'll test in staging by hand"
✅ Automated tests run before every deploy
Mistake 4: Ignoring Logs/Metrics
❌ "The app is running, who cares about logs?"
✅ Structured logging and metrics from day 1
Mistake 5: Same Config Everywhere
❌ Production and staging use the same database
✅ Separate infrastructure, separate secrets, separate configs

The DevOps Checklist
Before you call it "production-ready":
✅ CI/CD Pipeline: Every commit triggers tests & build
✅ Automated Tests: Unit + integration tests pass before deploy
✅ Containerized: Docker image with multi-stage build
✅ Orchestrated: Runs on Kubernetes or managed service
✅ Health Checks: Liveness & readiness probes configured
✅ Monitoring: Logs, metrics, traces all flowing
✅ Alerts: Meaningful alerts (not noise)
✅ Rollback Plan: Can recover in < 5 minutes
✅ Secrets Management: Passwords never in code
✅ Documentation: Runbooks for common issues

Tools You'll Use

CI/CD: GitHub Actions, GitLab CI, Jenkins
Containerization: Docker, Podman
Orchestration: Kubernetes, Docker Swarm, AWS ECS
Monitoring: Prometheus, Datadog, New Relic
Logging: ELK Stack, Splunk, CloudWatch
Tracing: Jaeger, Zipkin, Datadog

Next Steps

Set up CI/CD: Start with GitHub Actions (free)
Containerize your app: Write a Dockerfile
Deploy to staging: Use Docker Compose locally, Kubernetes in cloud
Add monitoring: Start with basic metrics
Create runbooks: Document how to handle failures
Practice rollbacks: Actually execute a rollback (in staging first)

Resources

GitHub Actions Docs
Kubernetes Documentation
Docker Best Practices
Prometheus Monitoring

