What is Infrastructure as Code (IaC)?

Infrastructure as Code is the practice of managing and provisioning infrastructure through machine-readable configuration files rather than manual processes. Tools like Terraform, AWS CloudFormation, and Pulumi allow teams to version-control, review, and automate infrastructure changes with the same rigor as application code.

What is Terraform and why is it popular?

Terraform by HashiCorp is an IaC tool that uses declarative HCL (HashiCorp Configuration Language) to provision and manage infrastructure across multiple cloud providers. It was open source until 2023, when HashiCorp relicensed it under the Business Source License; OpenTofu is the community-maintained open-source fork. Its popularity stems from cloud-agnostic design, state management, plan/apply workflow, and a vast provider ecosystem supporting AWS, Azure, GCP, and hundreds of other services.

April 18, 2026 26 min read Cloud & DevOps

Top 25 Cloud & DevOps Interview Questions for 2026

Cloud and DevOps roles remain among the most sought-after positions in tech. Whether you are targeting a Cloud Engineer, DevOps Engineer, SRE, or Platform Engineer role, these 25 questions cover the essential topics — from container orchestration and CI/CD pipelines to Infrastructure as Code, monitoring, security, and microservices architecture. Each answer includes practical commands and real-world context.

Vishal Thakur

Section 1: Cloud Platforms (AWS, Azure, GCP)

Q1. What are the key differences between AWS, Azure, and GCP?

The three major cloud providers each have distinct strengths and ideal use cases:

AWS (Amazon Web Services): Market leader with ~32% share. Broadest service catalog (200+ services), most mature ecosystem, largest community. Strong in: compute (EC2), storage (S3), serverless (Lambda), and overall breadth. Best for organizations wanting maximum flexibility and the deepest feature set.
Azure (Microsoft): ~23% share. Deepest integration with Microsoft enterprise products — Active Directory, Office 365, Windows Server, .NET. Strong in: hybrid cloud (Azure Arc), enterprise identity (Entra ID), and PaaS offerings (App Service, Azure Functions). Best for enterprises with existing Microsoft investments.
GCP (Google Cloud): ~11% share. Leader in data analytics and ML/AI. Strong in: BigQuery (data warehouse), Vertex AI (ML platform), GKE (Kubernetes — Google created K8s). Best for data-intensive workloads, ML/AI projects, and Kubernetes-native architectures.

In 2026, multi-cloud strategies are increasingly common — most enterprises use 2-3 providers. The key interview skill is understanding trade-offs, not just memorizing service names.

Q2. What is the Shared Responsibility Model in cloud computing?

The Shared Responsibility Model defines the security boundary between the cloud provider and the customer. Understanding this is critical for cloud security interviews.

Cloud provider is responsible for security OF the cloud: physical data center security, hardware, hypervisor, network infrastructure, and the managed service layer.
Customer is responsible for security IN the cloud: data encryption, identity and access management (IAM), network configuration (security groups, NACLs), OS patching (for IaaS), application security, and compliance.

The boundary shifts based on the service model: In IaaS (EC2), you manage everything from the OS up. In PaaS (Elastic Beanstalk, App Service), the provider manages the OS and runtime. In SaaS (Office 365), the provider manages almost everything — you manage data and access. A common interview mistake is assuming the cloud provider handles all security — this misconception has caused numerous breaches from misconfigured S3 buckets and overly permissive IAM roles.
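
As a rough illustration of how the boundary moves, the sketch below models the stack as an ordered list of layers and records where customer responsibility begins for each service model. The layer names and cut-off points are simplifying assumptions that mirror the paragraph above, not an official provider matrix.

```python
# Illustrative layer names, ordered from the bottom of the stack to the top.
LAYERS = ["physical", "hypervisor", "os", "runtime", "application", "data_and_access"]

# First layer the customer manages in each model (assumption based on the text above).
CUSTOMER_STARTS_AT = {"iaas": "os", "paas": "application", "saas": "data_and_access"}


def customer_layers(model: str) -> list[str]:
    """Layers the customer is responsible for under the given service model."""
    start = LAYERS.index(CUSTOMER_STARTS_AT[model])
    return LAYERS[start:]


def provider_layers(model: str) -> list[str]:
    """Layers the cloud provider is responsible for under the given service model."""
    start = LAYERS.index(CUSTOMER_STARTS_AT[model])
    return LAYERS[:start]
```

For example, customer_layers("paas") returns only the application and data/access layers, which is why OS patching drops off the customer's list once a workload moves from EC2 to a managed platform.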

Q3. Explain the difference between IaaS, PaaS, SaaS, and FaaS.

IaaS (Infrastructure as a Service): Raw compute, storage, and networking. You manage everything from the OS up. Examples: AWS EC2, Azure VMs, GCP Compute Engine. Maximum control, maximum responsibility.
PaaS (Platform as a Service): Managed runtime and middleware. You deploy code; the provider handles OS, scaling, and patching. Examples: AWS Elastic Beanstalk, Azure App Service, Google App Engine. Faster development, less operational overhead.
SaaS (Software as a Service): Fully managed application. You configure and use it. Examples: Gmail, Salesforce, Slack. Zero infrastructure management.
FaaS (Function as a Service / Serverless): Event-driven compute — you write individual functions that execute in response to triggers. You pay only for execution time. Examples: AWS Lambda, Azure Functions, Google Cloud Functions. Ideal for event-driven architectures and variable workloads.

The trend in 2026 is toward higher abstraction levels — teams prefer PaaS and FaaS over IaaS to reduce operational burden. However, IaaS remains necessary for specialized workloads requiring fine-grained control (GPU clusters, custom networking, compliance requirements).

Q4. What is a VPC and how do you design a secure network architecture in the cloud?

A VPC (Virtual Private Cloud) is an isolated virtual network within a cloud provider where you launch resources. It gives you full control over IP addressing, subnets, route tables, and network gateways.

Secure VPC design principles:

Subnet segmentation: Use public subnets for internet-facing resources (load balancers) and private subnets for application servers and databases. Never place databases in public subnets.
Security Groups & NACLs: Security groups are stateful firewalls at the instance level (allow rules only). NACLs are stateless at the subnet level (allow and deny rules). Use both as defense in depth.
NAT Gateway: Allows private subnet resources to access the internet (for updates, API calls) without being directly reachable from the internet.
VPC Peering / Transit Gateway: Connect multiple VPCs securely. Transit Gateway provides hub-and-spoke topology for managing connections at scale.

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  tags = { Name = "production-vpc" }
}

resource "aws_subnet" "private" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
  tags = { Name = "private-subnet" }
}
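
Subnet segmentation starts with carving the VPC CIDR into non-overlapping blocks. The following is a minimal Python sketch using the standard ipaddress module; the plan_subnets helper, the /24 size, and the subnet counts are illustrative assumptions and are not derived from the Terraform above.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, new_prefix: int, public_count: int, private_count: int):
    """Split a VPC CIDR into equally sized, non-overlapping public and private subnets."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of consecutive blocks
    public = [str(next(blocks)) for _ in range(public_count)]
    private = [str(next(blocks)) for _ in range(private_count)]
    return {"public": public, "private": private}


# Two public and two private /24 subnets inside a /16 VPC.
plan = plan_subnets("10.0.0.0/16", 24, public_count=2, private_count=2)
```

Planning the ranges up front makes it easier to keep databases in private blocks and to leave room for additional availability zones later.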

Section 2: Docker & Kubernetes

Q5. What is Docker and how does it differ from virtual machines?

Docker is a containerization platform that packages applications and their dependencies into lightweight, portable containers. Containers share the host OS kernel, making them much more efficient than virtual machines.

Key differences from VMs:

Size: Docker images are typically 10-500 MB; VM images are 1-50+ GB (includes full OS).
Startup: Containers start in seconds; VMs take minutes to boot.
Isolation: VMs have stronger isolation (separate kernel); containers share the kernel but use namespaces and cgroups for isolation.
Resource efficiency: You can run 10-100 containers on a machine that might support only 5-10 VMs.

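# Small Alpine-based Node.js 20 image
FROM node:20-alpine
WORKDIR /app
# Copy manifests first so the dependency layer is cached between builds
COPY package*.json ./
# Install production dependencies only (--omit=dev replaces the deprecated --only=production)
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
# Drop root privileges; the official node image ships a "node" user
USER node
CMD ["node", "server.js"]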

Best practices: Use minimal base images (Alpine), multi-stage builds to reduce image size, run as non-root user, scan images for vulnerabilities with tools like Trivy or Snyk.

Q6. What is Kubernetes and what problems does it solve?

Kubernetes (K8s) is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. Originally developed by Google (based on their internal Borg system), it is now the industry standard for running containers at scale.

Problems Kubernetes solves:

Service discovery & load balancing: Automatically exposes containers using DNS or IP and distributes traffic.
Auto-scaling: Horizontal Pod Autoscaler scales pods based on CPU, memory, or custom metrics. Cluster Autoscaler adds/removes nodes.
Self-healing: Restarts failed containers, replaces unresponsive pods, and kills containers that fail health checks.
Rolling updates & rollbacks: Deploy new versions with zero downtime and roll back instantly if issues arise.
Secret & config management: Manages sensitive data and configuration separately from application code.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: myapp:v2.1
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "128Mi"
            cpu: "250m"
          limits:
            memory: "256Mi"
            cpu: "500m"
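
To make the Horizontal Pod Autoscaler bullet above more concrete, here is a small Python sketch of the scaling formula described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The function name, the min/max bounds, and the 10% tolerance default used here are assumptions for illustration; the real controller also handles missing metrics, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA calculation: scale in proportion to metric / target."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 3 replicas averaging 90% CPU against a 60% target, the sketch asks for 5 replicas; at 62% it stays at 3 because the ratio falls inside the tolerance band.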

Q7. Explain the Kubernetes architecture — control plane vs. worker nodes.

Kubernetes follows a master-worker architecture with a control plane managing the cluster and worker nodes running the actual workloads.

Control Plane components:

API Server (kube-apiserver): The front door to K8s — all communication goes through the API server. It validates and processes REST requests.
etcd: Distributed key-value store that holds all cluster state and configuration. The single source of truth.
Scheduler (kube-scheduler): Assigns pods to nodes based on resource requirements, affinity rules, and constraints.
Controller Manager: Runs control loops that watch cluster state and make changes to reach desired state (ReplicaSet controller, Node controller, etc.).

Worker Node components:

kubelet: Agent on each node that ensures containers are running as specified in pod specs.
kube-proxy: Network proxy that maintains network rules for pod communication.
Container Runtime: The software that runs containers — containerd (standard in 2026), CRI-O, or similar.
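
The scheduler's job can be pictured as "filter, then score". The Python sketch below is a deliberately simplified model, assuming made-up node data and a single scoring rule (most free capacity left after placement); the real kube-scheduler runs many filter and scoring plugins, including affinity and taints.

```python
def schedule(pod, nodes):
    """Return the name of a node that fits the pod, or None if no node fits."""
    # Filter: keep nodes with enough free CPU (millicores) and memory (MiB).
    feasible = [
        n for n in nodes
        if n["free_cpu_m"] >= pod["cpu_m"] and n["free_mem_mi"] >= pod["mem_mi"]
    ]
    if not feasible:
        return None  # in a real cluster the pod would stay Pending
    # Score: prefer the node with the most capacity left after placement.
    best = max(
        feasible,
        key=lambda n: (n["free_cpu_m"] - pod["cpu_m"], n["free_mem_mi"] - pod["mem_mi"]),
    )
    return best["name"]


# Hypothetical cluster state for illustration.
NODES = [
    {"name": "node-a", "free_cpu_m": 200, "free_mem_mi": 512},
    {"name": "node-b", "free_cpu_m": 1000, "free_mem_mi": 1024},
    {"name": "node-c", "free_cpu_m": 4000, "free_mem_mi": 64},
]
```

A pod requesting 250m CPU and 128Mi memory lands on node-b here: node-a lacks CPU and node-c lacks memory, which is why accurate resource requests matter for placement.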

Q8. What is the difference between a Pod, Deployment, Service, and Ingress in Kubernetes?

Pod: The smallest deployable unit — one or more containers that share networking and storage. Pods are ephemeral — they can be terminated and recreated at any time.
Deployment: Manages a set of identical pods (ReplicaSet). Handles rolling updates, rollbacks, and scaling. You almost never create pods directly — you create Deployments.
Service: Provides a stable network endpoint (IP/DNS) for a set of pods. Types: ClusterIP (internal only), NodePort (external via node port), LoadBalancer (cloud provider LB).
Ingress: Manages external HTTP/HTTPS access to services. Provides URL-based routing, SSL termination, and virtual hosting. Requires an Ingress Controller (NGINX, Traefik, or cloud-native options).

# Common kubectl commands
kubectl get pods -n production
kubectl describe deployment web-app
kubectl logs -f pod/web-app-abc123
kubectl scale deployment web-app --replicas=5
kubectl rollout status deployment/web-app
kubectl rollout undo deployment/web-app
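
The URL-based routing that an Ingress provides can be sketched as host matching plus longest-prefix path matching. This Python model is an assumption-laden simplification (hypothetical hosts and service names, prefix matching on path segments only); actual behavior depends on the Ingress Controller and the pathType in use.

```python
def _prefix_match(prefix: str, path: str) -> bool:
    """Match on whole path segments, so /api matches /api/v1 but not /apiv2."""
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def route(rules, host: str, path: str):
    """rules: list of (host, path_prefix, service). The longest matching prefix wins."""
    candidates = [r for r in rules if r[0] == host and _prefix_match(r[1], path)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: len(r[1]))[2]


# Hypothetical routing table for illustration.
RULES = [
    ("shop.example.com", "/", "web"),
    ("shop.example.com", "/api", "api"),
]
```

Requests for /api/v1/items go to the api Service, while everything else on the same host falls through to web.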

Section 3: CI/CD Pipelines & Git

Q9. What is a CI/CD pipeline and why is it important?

A CI/CD pipeline automates the software delivery lifecycle from code commit to production deployment:

Continuous Integration (CI): Automatically builds code, runs unit tests, performs static analysis, and validates every commit. Catches bugs early when they are cheapest to fix. Tools: Jenkins, GitHub Actions, GitLab CI, CircleCI.

Continuous Delivery (CD): Automatically deploys validated code to staging and prepares it for production release. The final production push may require manual approval. Continuous Deployment goes further — every passing commit is automatically deployed to production with no manual intervention.

Benefits: reduced manual errors, faster release cycles (from monthly to multiple times daily), immediate feedback to developers, consistent and repeatable deployments, and improved collaboration between Dev and Ops teams. In 2026, GitOps — using Git as the single source of truth for infrastructure and application state — is the dominant CD pattern, with tools like ArgoCD and Flux automating Kubernetes deployments from Git commits.

Q10. Write a basic GitHub Actions CI/CD workflow.

GitHub Actions uses YAML workflow files in .github/workflows/ to define automated pipelines triggered by events like pushes, pull requests, or schedules.

name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm test
      - run: npm run build

  deploy:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production
        run: |
          echo "Deploying to production..."
          # aws s3 sync ./dist s3://my-bucket
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Key concepts: jobs run in parallel by default; needs creates dependencies. secrets store sensitive values securely. Use environments for approval gates before production deployments.
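
The effect of needs on execution order can be modeled as a dependency graph. The sketch below uses Python's standard graphlib module to group jobs into waves that could run in parallel; the lint job is a hypothetical addition for illustration, and this is only a model of the ordering, not how GitHub's runner is implemented.

```python
from graphlib import TopologicalSorter


def execution_waves(needs):
    """needs maps each job to the set of jobs it depends on.

    Returns a list of waves; jobs in the same wave have no unmet
    dependencies and could run in parallel.
    """
    sorter = TopologicalSorter(needs)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# "lint" is hypothetical; the workflow above defines build-and-test and deploy.
JOBS = {
    "build-and-test": set(),
    "lint": set(),
    "deploy": {"build-and-test", "lint"},
}
```

Here build-and-test and lint form the first wave, and deploy waits until both have finished.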

Q11. What Git branching strategy do you use and why?

The most common branching strategies in 2026:

Trunk-Based Development: All developers commit to a single main branch frequently (at least daily). Feature flags control visibility. Fastest feedback loop and simplest model. Preferred by high-performing teams (Google, Meta).
GitHub Flow: Feature branches created from main, developed, then merged via pull requests. Simple and effective for most teams.
GitFlow: Long-lived main and develop branches, plus short-lived feature, release, and hotfix branches. More complex but provides clear separation for teams with formal release processes.

# GitHub Flow workflow
git checkout -b feature/user-auth
git add . && git commit -m "feat: add user auth"
git push origin feature/user-auth
# Create PR → review → merge → auto-deploy

# Useful Git commands for interviews
git rebase -i HEAD~3          # Interactive rebase last 3 commits
git cherry-pick abc123        # Apply specific commit
git stash && git stash pop    # Temporarily save changes
git bisect start              # Binary search for bug-introducing commit
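
The idea behind git bisect is a binary search over history. A minimal Python sketch of that search follows, assuming the newest commit is known to be bad and that every commit after the first bad one is also bad; the function and commit names are illustrative, not part of Git.

```python
def first_bad_commit(commits, is_bad):
    """commits is ordered oldest to newest; the last one is assumed to be bad.

    Returns the earliest commit for which is_bad(commit) is True.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the culprit is mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return commits[lo]
```

Because the range halves on each test, roughly log2(n) checks are needed, which is why bisect stays practical even across thousands of commits.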

Q12. What is GitOps and how does it work?

GitOps is an operational framework where Git is the single source of truth for both application code and infrastructure configuration. Changes to the desired state are made via Git commits, and automated controllers continuously reconcile the actual state to match.

How it works: Define your Kubernetes manifests or Helm charts in a Git repository. A GitOps operator (ArgoCD, Flux) watches the repo and automatically applies changes to the cluster when it detects drift between the desired state (Git) and actual state (cluster). If someone manually changes a resource in the cluster, the operator reverts it to match Git.
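
The reconciliation described above can be sketched as a diff between two dictionaries followed by corrective actions. This Python model is a simplification under stated assumptions: resources are plain dicts, and pruning of resources missing from Git is always on (in ArgoCD and Flux it is a configurable option). It is not the API of either tool.

```python
import copy


def diff(desired, actual):
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("prune", name))
    return actions


def reconcile(desired, actual):
    """Apply one reconciliation pass to `actual` in place and return what was done."""
    actions = diff(desired, actual)
    for verb, name in actions:
        if verb == "prune":
            del actual[name]
        else:
            actual[name] = copy.deepcopy(desired[name])
    return actions
```

If someone scales a Deployment by hand, the next pass reports an update for that resource and restores the replica count from Git; a second pass immediately afterwards finds nothing to do.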

Benefits: Complete audit trail (every change is a Git commit with author, timestamp, and review), easy rollbacks (revert the Git commit), enhanced security (no direct cluster access needed for deployments), and consistency across environments (promote by merging between branches). In 2026, GitOps with ArgoCD is the de facto standard for Kubernetes deployments in production environments.

Section 4: Infrastructure as Code & Configuration Management

Q13. What is Terraform and how does it work?

Terraform by HashiCorp is an Infrastructure as Code tool (source-available under the Business Source License since 2023, with OpenTofu as the open-source fork) that lets you define, provision, and manage cloud infrastructure using declarative configuration files written in HCL (HashiCorp Configuration Language).

Core workflow:

Write: Define resources in .tf files.
Plan: terraform plan previews changes without applying them — a safety net.
Apply: terraform apply creates/modifies/destroys resources to match the desired state.

provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"
  tags = { Name = "web-server" }
}

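resource "aws_s3_bucket" "data" {
  bucket = "my-app-data-2026"
}

# AWS provider v4+ configures versioning through a separate resource
resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}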

State management: Terraform tracks resource state in a terraform.tfstate file. In teams, use a remote backend with locking to share state and prevent conflicts: an S3 bucket (locked through a DynamoDB table or, in recent Terraform releases, S3-native lock files) or HCP Terraform (formerly Terraform Cloud). Never commit state files to Git, because they may contain secrets.
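
Why locking matters is easiest to see with two engineers applying at the same time. The Python sketch below uses an in-memory stand-in for a remote lock store; LockTable, apply_with_lock, and the state key are hypothetical names for illustration and do not correspond to any Terraform API.

```python
class LockTable:
    """In-memory stand-in for a remote lock store (hypothetical)."""

    def __init__(self):
        self._locks = {}

    def acquire(self, state_key: str, owner: str) -> bool:
        # Conditional write: succeeds only if nobody holds the lock.
        if state_key in self._locks:
            return False
        self._locks[state_key] = owner
        return True

    def release(self, state_key: str, owner: str) -> None:
        # Only the current holder may release the lock.
        if self._locks.get(state_key) == owner:
            del self._locks[state_key]


def apply_with_lock(table, state_key, owner, state, change):
    """Update shared state only while holding the lock; always release afterwards."""
    if not table.acquire(state_key, owner):
        raise RuntimeError(f"state {state_key!r} is locked by another run")
    try:
        state.update(change)
        state["serial"] = state.get("serial", 0) + 1
    finally:
        table.release(state_key, owner)
    return state
```

The second writer fails fast instead of overwriting the first writer's changes, which is the behavior a locking backend is meant to provide.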

Q14. What is the difference between Terraform and Ansible?

While both are automation tools, they serve different primary purposes:

Terraform — Infrastructure provisioning. Declarative (you define desired state). Manages cloud resources (VMs, networks, databases). Uses a state file to track resources. Idempotent by design. Best for: creating and managing cloud infrastructure.
Ansible — Configuration management and application deployment. Procedural (you define steps, though it has declarative modules). Manages software on existing servers (install packages, configure services, deploy apps). Agentless — connects via SSH. Best for: configuring servers after Terraform provisions them.

In practice, many teams use both together: Terraform provisions the infrastructure (VMs, networks, load balancers), and Ansible configures the software on those machines (install NGINX, deploy application code, manage configurations). This separation follows the principle of using the right tool for each layer of the stack.

# Ansible playbook example
- name: Configure web server
  hosts: webservers
  become: yes
  tasks:
    - name: Install Nginx
      apt:
        name: nginx
        state: present
        update_cache: yes

    - name: Copy app config
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Restart Nginx

  handlers:
    - name: Restart Nginx
      service:
        name: nginx
        state: restarted

Q15. What are Terraform modules and why should you use them?

Terraform modules are reusable, self-contained packages of Terraform configurations. They are the primary mechanism for organizing, encapsulating, and sharing infrastructure code — analogous to functions or classes in programming.

Why use modules: Reusability — write once, use across multiple environments (dev, staging, prod). Consistency — enforce organizational standards (naming conventions, tags, security rules). Abstraction — hide complexity behind a simple interface. Maintainability — update the module once, and all consumers benefit.

# Using a module
module "vpc" {
  source  = "./modules/vpc"
  cidr    = "10.0.0.0/16"
  env     = "production"
  region  = "us-east-1"
}

module "eks_cluster" {
  source       = "./modules/eks"
  vpc_id       = module.vpc.vpc_id
  subnet_ids   = module.vpc.private_subnet_ids
  cluster_name = "prod-cluster"
  node_count   = 3
}

Best practice: Use the Terraform Registry for community modules, but maintain internal modules for organization-specific patterns. Version-pin modules to prevent breaking changes.

Section 5: Linux, Networking & Security

Q16. What Linux commands should every DevOps engineer know?

Linux proficiency is foundational for DevOps. These commands come up in interviews and daily work:

# Process management
ps aux | grep nginx          # Find running processes
top                          # Real-time resource monitoring (or htop, if installed)
kill -9 <PID>                # Force kill a process
systemctl status nginx       # Check service status

# File & disk operations
df -h                        # Disk usage (human-readable)
du -sh /var/log/*            # Directory sizes
find /var/log -name "*.log" -mtime +30 -delete  # Delete logs older than 30 days (scope the path; avoid running from /)
tail -f /var/log/syslog      # Follow log output in real-time

# Networking
ss -tulnp                    # Show listening ports (replaces netstat)
curl -v https://api.example.com  # Debug HTTP requests
dig example.com              # DNS lookup
traceroute example.com       # Trace network path

# Text processing
grep -rn "error" /var/log/   # Search recursively with line numbers
awk '{print $1}' access.log  # Extract first column
sed -i 's/old/new/g' file    # In-place text replacement

Interview tip: Be prepared to troubleshoot a slow server on the spot. The typical workflow: check CPU/memory (top), disk (df -h), network (ss), logs (tail -f), and processes (ps aux).

Q17. Explain DNS resolution. What happens when you type a URL in a browser?

DNS (Domain Name System) translates human-readable domain names into IP addresses. The resolution process:

1. Browser cache: Check if the domain was resolved recently.
2. OS lookup: Check the hosts file (/etc/hosts on Linux) and any local resolver cache (for example systemd-resolved).
3. Recursive resolver: Query the ISP's or configured DNS resolver (e.g., 8.8.8.8).
4. Root name servers: The resolver asks root servers "who manages .com?"
5. TLD name servers: The .com TLD server responds "the authoritative server for example.com is ns1.example.com."
6. Authoritative name server: Returns the actual IP address (A record) for the domain.
7. Response cached: Each level caches the result based on TTL (Time To Live).
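
The steps above can be sketched as a toy recursive resolver with a TTL cache. Everything in this Python example is made up for illustration: the delegation tables, the server names, and the IP address (taken from the 192.0.2.0/24 documentation range). It models the lookup order only and sends no real DNS queries.

```python
import time

# Toy delegation data: root -> TLD server -> authoritative server -> record.
ROOT = {"com": "tld-com"}
TLD = {"tld-com": {"example.com": "ns1.example.com"}}
AUTHORITATIVE = {"ns1.example.com": {"example.com": ("192.0.2.10", 300)}}  # (ip, ttl seconds)


class Resolver:
    def __init__(self, clock=time.monotonic):
        self._cache = {}           # name -> (ip, expiry time)
        self._clock = clock
        self.upstream_lookups = 0  # how often the full walk was needed

    def resolve(self, name: str) -> str:
        hit = self._cache.get(name)
        if hit and hit[1] > self._clock():
            return hit[0]                               # answered from cache
        self.upstream_lookups += 1
        tld_server = ROOT[name.rsplit(".", 1)[-1]]      # root: who manages .com?
        auth_server = TLD[tld_server][name]             # TLD: who is authoritative?
        ip, ttl = AUTHORITATIVE[auth_server][name]      # authoritative: the A record
        self._cache[name] = (ip, self._clock() + ttl)   # cache until the TTL expires
        return ip
```

Repeated lookups within the TTL never leave the cache, which is why lowering a record's TTL ahead of a planned migration speeds up the cut-over.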

DevOps-relevant DNS concepts: A/AAAA records (IPv4/IPv6), CNAME (alias), MX (mail), TXT (verification, SPF), NS (name server delegation). In cloud environments, use managed DNS services (Route 53, Azure DNS, Cloud DNS) with health checks and weighted/geolocation routing for high availability.

Q18. What is the principle of least privilege and how do you implement it in cloud IAM?

The principle of least privilege (PoLP) states that every user, service, or application should have only the minimum permissions necessary to perform its function — nothing more. This is the foundation of cloud security.

Implementation in cloud IAM:

Use roles, not long-lived credentials: Attach IAM roles to EC2 instances and Lambda functions instead of embedding access keys. Use managed identities in Azure.
Granular policies: Instead of s3:*, specify s3:GetObject on the exact bucket and prefix needed.
Service accounts: Create dedicated service accounts for each application with scoped permissions.
Regular audits: Use AWS IAM Access Analyzer, Azure Privileged Identity Management, or GCP IAM Recommender to identify and remove unused permissions.
Temporary credentials: Use AWS STS AssumeRole for cross-account access with time-limited tokens.
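
A simple way to practice the "granular policies" point is to lint policy documents for wildcards. The Python sketch below checks an IAM-style JSON policy for Allow statements with wildcard actions or resources; the function name and the example bucket are hypothetical, and a real review should rely on tools such as IAM Access Analyzer rather than this heuristic.

```python
def overly_broad_statements(policy):
    """Return (statement index, reason) pairs for Allow statements that use wildcards."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append((i, "wildcard action"))
        if any(r == "*" for r in resources):
            findings.append((i, "wildcard resource"))
    return findings


# Example policy: statement 0 is too broad, statement 1 is scoped to one prefix.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/reports/*",
        },
    ],
}
```

Running a check like this in CI gives reviewers an early signal before a permissive policy reaches an account.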

Common interview scenario: "Your S3 bucket was publicly accessible and data was leaked. How would you prevent this?" Answer: Enable S3 Block Public Access at account level, use bucket policies with explicit deny for public access, enable CloudTrail logging, and set up automated alerts for policy changes.

Section 6: Monitoring, Microservices & Serverless

Q19. What is observability and how does it differ from monitoring?

Monitoring tells you when something is wrong — it tracks predefined metrics and alerts on thresholds (CPU > 90%, error rate > 1%). Observability tells you why something is wrong — it provides the ability to understand internal system state from external outputs, even for problems you did not anticipate.

The three pillars of observability:

Metrics: Numerical time-series data (request latency, CPU usage, error count). Tools: Prometheus, Datadog, CloudWatch.
Logs: Timestamped records of discrete events. Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Loki, Splunk.
Traces: End-to-end request paths across distributed services. Tools: Jaeger, Zipkin, OpenTelemetry, AWS X-Ray.

In 2026, OpenTelemetry has become the standard instrumentation framework — it provides a unified API for collecting metrics, logs, and traces, eliminating vendor lock-in. The best teams combine all three pillars into dashboards (Grafana) that allow rapid root cause analysis.

Q20. How do you set up monitoring with Prometheus and Grafana?

Prometheus + Grafana is the most popular open-source monitoring stack for cloud-native applications:

Prometheus: Pull-based metrics collection system. It scrapes metrics from instrumented applications at regular intervals, stores them in a time-series database, and supports powerful querying via PromQL.
Grafana: Visualization platform that connects to Prometheus (and many other data sources) to create dashboards, alerts, and reports.

# prometheus.yml - scrape configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'web-app'
    static_configs:
      - targets: ['web-app:8080']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

# Example PromQL queries for interviews:
# Request rate:    rate(http_requests_total[5m])
# Error rate:      rate(http_requests_total{status=~"5.."}[5m])
# 95th percentile: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# CPU usage:       100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Key alerting rules: set alerts for high error rates, increased latency (p99), pod restarts, disk usage > 80%, and certificate expiration. Use PagerDuty or OpsGenie for alert routing and on-call management.
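
Interviewers sometimes ask what rate() actually computes. The Python sketch below approximates it as the per-second increase of a counter over a window, treating any drop in value as a counter reset. It is a simplified model with an assumed function name; Prometheus additionally extrapolates to the window boundaries, which is omitted here.

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.

    Returns the average per-second increase, or None if it cannot be computed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (for example after a restart),
        # so the new value counts as the increase since the reset.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

Handling resets is the reason rate() should be applied to raw counters before any aggregation such as sum().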

Q21. What is microservices architecture and when should you use it?

Microservices architecture structures an application as a collection of small, independently deployable services, each owning a specific business domain. Each service runs in its own process, communicates via APIs (REST, gRPC, message queues), and can be developed, deployed, and scaled independently.

When to use microservices: Large teams that need independent deployment cycles, applications with components that scale differently (user service vs. analytics service), polyglot environments where different services benefit from different technologies, and organizations with mature DevOps practices.

When NOT to use: Small teams (< 10 developers), simple applications, teams without strong DevOps capabilities, and early-stage startups where rapid iteration matters more than architectural purity. Martin Fowler's "MonolithFirst" article makes the same case: start with a monolith and split out services once the boundaries are understood. The complexity cost (distributed tracing, service mesh, data consistency) is only justified at sufficient scale.

Q22. What is serverless computing and what are its trade-offs?

Serverless computing abstracts away all infrastructure management — you write functions, and the cloud provider handles provisioning, scaling, patching, and availability. You pay only for actual compute time (per-invocation and per-millisecond).

Advantages: Zero infrastructure management, automatic scaling (from zero to thousands of concurrent executions), pay-per-use pricing (no idle costs), and faster time-to-market for event-driven features.

Trade-offs:

Cold starts: First invocation after idle period has higher latency (100ms-2s depending on runtime). Mitigate with provisioned concurrency or warm-up pings.
Execution limits: AWS Lambda has a 15-minute timeout, a 10 GB memory limit, and a 6 MB request/response payload limit for synchronous invocations. Not suitable for long-running or compute-heavy workloads.
Vendor lock-in: Serverless architectures are tightly coupled to provider-specific services (Lambda + API Gateway + DynamoDB). Migration is complex.
Debugging complexity: Distributed, ephemeral functions are harder to debug and trace than traditional applications.

Ideal use cases: API backends, event processing, scheduled jobs, file processing triggers, and chatbot backends.

Q23. What is a service mesh and when do you need one?

A service mesh is a dedicated infrastructure layer that handles service-to-service communication in a microservices architecture. It uses sidecar proxies (one per service instance) to manage traffic routing, security, and observability transparently — without requiring application code changes.

What a service mesh provides: Mutual TLS (mTLS) encryption between services, traffic management (canary deployments, circuit breaking, retries), observability (automatic metrics, traces, and logs for inter-service communication), and policy enforcement (rate limiting, access control).

Popular service meshes: Istio (most feature-rich, complex), Linkerd (lightweight, simpler), Consul Connect (HashiCorp ecosystem).
When you need one: When you have 10+ microservices and need consistent security, observability, and traffic management across all of them.
When you don't: Fewer than 5-10 services, where the operational overhead of a service mesh outweighs the benefits. Start with simpler solutions (application-level retries, basic load balancing) and adopt a service mesh when complexity demands it.
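
Circuit breaking is one of the behaviors a mesh proxy applies on the application's behalf, and it is also something a small system can implement in application code. The Python sketch below shows the basic state machine (closed, open, half-open); the class name, thresholds, and timings are assumptions for illustration and do not reflect the configuration model of Istio or Linkerd.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open protects a struggling backend from retry storms and gives callers a quick, predictable error.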

Q24. How do you implement zero-downtime deployments?

Zero-downtime deployments ensure users experience no interruption during releases. Common strategies:

Rolling updates: Gradually replace old pods with new ones. Kubernetes default strategy — configurable via maxSurge and maxUnavailable. Simple but all traffic sees the new version eventually.
Blue-Green deployment: Run two identical environments (blue = current, green = new). Switch traffic from blue to green after validation. Instant rollback by switching back. Higher cost (two full environments).
Canary deployment: Route a small percentage of traffic (1-10%) to the new version. Monitor error rates and latency. Gradually increase traffic if metrics are healthy. Best for risk mitigation.
Feature flags: Deploy code to all servers but control feature visibility via configuration. Decouple deployment from release. Tools: LaunchDarkly, Unleash, Flagsmith.
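
Canary routing is often implemented by hashing a stable identifier into a bucket, so each user consistently sees the same version. The Python sketch below shows one way to do that; the function name and the choice of SHA-256 with 100 buckets are assumptions, and in practice the split is usually configured in a load balancer, service mesh, or feature-flag tool rather than hand-written.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary based on a hash bucket (0-99)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent
```

Because the bucket depends only on the user ID, raising the percentage from 5 to 20 keeps every existing canary user on the new version and only adds new ones, which keeps the experience stable during a gradual rollout.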

Critical requirement for all strategies: database backward compatibility. Schema migrations must be non-breaking — add columns before removing old ones, use expand-contract migration patterns, and never rename columns in a single deployment.
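
The expand-contract pattern can be demonstrated with an in-memory SQLite database. This Python sketch renames a column in three separately deployable steps; the users table and its column names are hypothetical, and the table rebuild in the final step is used only because it works on older SQLite versions.

```python
import sqlite3


def expand(conn):
    # Step 1 (expand): add the new column; existing readers of `fullname` are unaffected.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(conn):
    # Step 2 (migrate): copy existing values; application code writes both columns meanwhile.
    conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def contract(conn):
    # Step 3 (contract): once nothing reads `fullname`, rebuild the table without it.
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT)")
    conn.execute("INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


def column_names(conn):
    # Column 1 of each PRAGMA table_info row is the column name.
    return [row[1] for row in conn.execute("PRAGMA table_info(users)")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")
```

Each step ships in its own deployment, so the old and new application versions can run side by side at every point in the rollout.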

Q25. What is DevSecOps and how do you integrate security into CI/CD?

DevSecOps integrates security practices into every phase of the DevOps lifecycle — "shifting left" to catch vulnerabilities early rather than bolting on security at the end. Security becomes everyone's responsibility, not just the security team's.

Security gates in the CI/CD pipeline:

Code commit: Pre-commit hooks for secret scanning (git-secrets, detect-secrets). Prevent API keys and passwords from entering the codebase.
Build: SAST (Static Application Security Testing) — analyze source code for vulnerabilities. Tools: SonarQube, Semgrep, CodeQL.
Container build: Image scanning for known CVEs. Tools: Trivy, Snyk Container, Anchore. Enforce base image policies.
Test: DAST (Dynamic Application Security Testing) — test running applications for vulnerabilities. Tools: OWASP ZAP, Burp Suite.
Deploy: Infrastructure policy-as-code validation. Tools: OPA (Open Policy Agent), Checkov, tfsec for Terraform. Prevent misconfigured resources from reaching production.
Runtime: Runtime security monitoring (Falco), vulnerability scanning of running containers, and automated patching.
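
As an illustration of the secret-scanning gate, the Python sketch below flags lines that look like AWS access key IDs or private key headers. The two patterns are a small, assumed subset of what tools such as git-secrets or detect-secrets check, and the sample key is the placeholder value used in AWS documentation.

```python
import re

# A deliberately small pattern set; real scanners ship many more rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan(text: str):
    """Return (line number, pattern name) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


SAMPLE = 'region = "us-east-1"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'
```

A pre-commit hook can run a check like this on staged changes and block the commit when any finding is returned.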

The goal is automated, continuous security — not a manual gate that slows down delivery. In 2026, supply chain security (SBOM generation, signed artifacts) is also critical following high-profile supply chain attacks.