Own reliability of production systems: uptime, latency, error rates, saturation
Define and track SLOs (Service Level Objectives) and error budgets
On-call rotation: respond to, resolve, and write post-mortems for production incidents
Reduce toil: automate every manual operational task
Capacity planning: predict and provision infra before traffic spikes hit
Chaos engineering: intentionally break systems to find weak points (GameDay exercises)
Work embedded with development teams to enforce reliability from initial design
Programming: Python or Go (SRE roles require real coding ability, not just scripting)
Deep Linux / systems internals knowledge
Observability: Prometheus, Grafana, Datadog, Jaeger, OpenTelemetry
Kubernetes: advanced (resource quotas, PodDisruptionBudgets, HPA, VPA)
Distributed systems: consensus algorithms, CAP theorem, consistency models
Incident management: PagerDuty, OpsGenie, Statuspage management
Performance profiling: flame graphs, pprof, Linux perf tool
Metrics, dashboards, on-call alerts
Full-stack observability (enterprise)
Distributed tracing
On-call management and escalation
Controlled failure injection
Production container orchestration
Automation, tooling, internal services
Infrastructure management
SLI / SLO / SLA / Error Budget: define and explain each fluently (interview staple)
The Four Golden Signals: latency, traffic, errors, saturation
Distributed tracing and correlation IDs across microservices
Failure modes: cascading failures, thundering herd, retry storms, circuit breakers
The Google SRE Book: read it completely (free at sre.google)
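The failure modes listed above (cascading failures, retry storms) are commonly contained with a circuit breaker. The following is a minimal sketch, not a production implementation: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` defaults, and the injectable `clock` parameter are all illustrative assumptions.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed default, tune per dependency
        reset_timeout: float = 30.0,     # seconds to wait before a trial call
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # While open, fail fast instead of adding load to a struggling dependency.
        if self.state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in half-open, or too many consecutive
            # failures while closed, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each outbound dependency call in `breaker.call(...)` means that once the dependency is failing, callers get an immediate `CircuitOpenError` rather than piling on retries, which is the behavior that stops a retry storm from becoming a cascading failure.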
Read the Google SRE Book (free at sre.google); this is the foundational text
Deep Linux: syscalls, kernel parameters, ulimits, /proc internals, performance tools
Python proficiency: production-grade scripts with proper error handling, logging, typing
Prometheus + Grafana in Docker: scrape metrics, build dashboards, configure alert rules
Kubernetes advanced: resource management, RBAC, network policies, admission controllers
OpenTelemetry: instrument a sample app with distributed tracing end-to-end
Implement error budgets for a dummy service; track SLOs in a Grafana dashboard
Go basics: SRE roles at tier-1 companies frequently require Go proficiency
Chaos engineering lab: Litmus Chaos on a Kubernetes cluster for automated failure injection
Simulate on-call: contribute to open-source projects with production incident tracking
Apply for Junior SRE at mid-stage startups (Razorpay, Zepto, BrowserStack, Postman)
Incident post-mortem practice: write 10 detailed practice post-mortems based on real, publicly reported outages
Target FAANG SRE roles: Google, Meta, Amazon SRE pay ₹50L+ in India
Specialize: Database reliability, Network reliability, or Security reliability engineering
SRE knowledge is rare in India; mentoring others creates additional visibility
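For the "implement error budgets for a dummy service" step in the list above, the arithmetic can be sketched as follows. This assumes a request-based availability SLO (the fraction of requests that must succeed); the `ErrorBudget` class and its field names are hypothetical, and the request counts would normally come from a metrics backend such as Prometheus rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLI in the window

    def __post_init__(self) -> None:
        if not 0 < self.slo_target < 1:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_requests <= 0:
            raise ValueError("total_requests must be positive")
        if not 0 <= self.failed_requests <= self.total_requests:
            raise ValueError("failed_requests must be within [0, total_requests]")

    @property
    def sli(self) -> float:
        """Measured success ratio (the Service Level Indicator)."""
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates in this window: the error budget."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the budget left; a negative value means the SLO is breached."""
        return 1 - self.failed_requests / self.allowed_failures


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI={budget.sli:.5f} remaining={budget.budget_remaining:.0%}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures leaves about three quarters of the budget. Plotting `budget_remaining` over time is one way to turn this into the Grafana panel the step describes.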
| Level | India | Global | Note |
|---|---|---|---|
| Junior / 1-2 yr | ₹8L - ₹15L | $60K - $95K | SRE requires experience; fresher roles are rare |
| Mid-level / 3-5 yr | ₹15L - ₹28L | $95K - $150K | Certified, production incident experience |
| Senior / 5+ yr | ₹28L - ₹40L | $150K - $200K | FAANG or global SaaS companies |
All 4 golden signals monitoring
Automated failure injection
SLO tracking from Prometheus data
Runbook auto-diagnosis
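As a starting point for the golden-signals monitoring project above, the four signals can be computed from a window of request records. This is a rough sketch under stated assumptions: the `Request` record, the `golden_signals` function, and the choice of "in-flight requests divided by capacity" as the saturation measure are all illustrative, and a real system would usually derive these from Prometheus metrics instead.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request."""

    latency_ms: float
    status: int  # HTTP status code


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(
    requests: Sequence[Request],
    window_seconds: float,
    in_flight: int,
    capacity: int,
) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("no requests in window")
    return {
        # Latency: tail latency matters more than the mean for user experience.
        "latency_p99_ms": percentile([r.latency_ms for r in requests], 99),
        # Traffic: demand on the service, here as requests per second.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: share of requests that failed server-side (5xx).
        "error_rate": sum(r.status >= 500 for r in requests) / len(requests),
        # Saturation: assumed here to be concurrent requests over a known limit.
        "saturation": in_flight / capacity,
    }
```

A call such as `golden_signals(window, window_seconds=60, in_flight=8, capacity=10)` returns one dictionary per window, which maps directly onto four dashboard panels or four alert thresholds.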
CNCF · Paid (~$395)
Mandatory for any SRE role
Google · Paid (~$200)
SRE practices on GCP
Datadog · Free
Industry-standard observability
AWS · Paid (~$300)
Cloud reliability validation
High remote potential at senior levels. SRE is a rare skill globally, which creates strong negotiating power for remote work. Target: cloud-native companies, global SaaS products.
Low direct freelancing scope. SRE is a staff function. At senior level, consulting engagements ($150-$300/hr) are possible. Better goal: full-time remote employment.
Applying without coding ability: SRE is NOT sysadmin; real programming is mandatory
Skipping distributed systems theory: it appears heavily in senior interviews
Confusing SRE with DevOps: SRE = reliability engineering + coding; DevOps = pipeline + culture
Premium role, chronically underfilled globally. As distributed systems grow more complex, SRE demand rises. One of the highest-paid infrastructure tracks. India is currently undersupplied, which gives early entrants a first-mover advantage.