W Code - Pattern-Based DSA Learning Platform

Bhanu Bisht



Site Reliability Engineer

Own production reliability — uptime, latency, incident response, and chaos engineering at scale.

Demand: High

India Salary: ₹8L – ₹40L

Global Salary: $90K – $200K

Math: Medium

Coding: High

Remote: ★★★★☆

What You Actually Do Daily

Own reliability of production systems: uptime, latency, error rates, saturation

Define and track SLOs (Service Level Objectives) and error budgets

On-call rotation: respond to, resolve, and write post-mortems for production incidents

Reduce toil: automate every manual operational task

Capacity planning: predict and provision infra before traffic spikes hit

Chaos engineering: intentionally break systems to find weak points (GameDay exercises)

Work embedded with development teams to enforce reliability from initial design
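The error-budget item above comes down to simple arithmetic, and it helps to see it once. The following is a minimal sketch, assuming an availability-style SLO measured as a fraction of time; the function names `error_budget` and `budget_remaining` are illustrative and do not come from any particular tool.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     downtime: timedelta) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, window)
    if budget == timedelta(0):
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - downtime / budget


if __name__ == "__main__":
    window = timedelta(days=30)
    for target in (0.99, 0.999, 0.9999):
        minutes = error_budget(target, window).total_seconds() / 60
        print(f"SLO {target} over 30 days: {minutes:.1f} minutes of budget")
```

With a 99.9% target over 30 days the budget works out to about 43.2 minutes, which is why each extra "nine" is so expensive to promise.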

Skills You Need

Programming: Python or Go (SRE roles require real coding ability, not just scripting)

Deep Linux / systems internals knowledge

Observability: Prometheus, Grafana, Datadog, Jaeger, OpenTelemetry

Kubernetes: advanced (resource quotas, PodDisruptionBudgets, HPA, VPA)

Distributed systems: consensus algorithms, CAP theorem, consistency models

Incident management: PagerDuty, OpsGenie, Statuspage management

Performance profiling: flame graphs, pprof, Linux perf tool

Tools & Technologies

Prometheus + Grafana + Alertmanager: Metrics, dashboards, on-call alerts

Datadog / Dynatrace: Full-stack observability (enterprise)

Jaeger / Zipkin: Distributed tracing

PagerDuty / OpsGenie: On-call management and escalation

Chaos Monkey / Litmus Chaos: Controlled failure injection

Kubernetes (advanced): Production container orchestration

Go / Python: Automation, tooling, internal services

Terraform: Infrastructure management

Fundamentals (Non-Negotiable)

SLI / SLO / SLA / Error Budget — define and explain each fluently (interview staple)

The Four Golden Signals: latency, traffic, errors, saturation

Distributed tracing and correlation IDs across microservices

Failure modes: cascading failures, thundering herd, retry storms, circuit breakers

The Google SRE Book — read it completely (free at sre.google)
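Of the failure-mode terms listed above, the circuit breaker is the easiest to make concrete in code. The sketch below is one simple way to write it, assuming a consecutive-failure threshold and a fixed reset timeout. The class and parameter names are illustrative, it is not thread-safe, and production libraries typically add more (rolling windows, limited half-open probes).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow a probe call
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast instead of piling more load on a struggling dependency.
            raise CircuitOpenError("dependency marked unhealthy")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # Trip when the threshold is reached, or re-open at once
            # if a half-open probe fails.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Rejecting calls while the breaker is open is what stops a slow dependency from turning into a retry storm or a cascading failure upstream.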

Learning Roadmap

0–3 Months: Theory + Observability

▸ Read the Google SRE Book (free at sre.google); it is the foundational text

▸ Deep Linux: syscalls, kernel parameters, ulimits, /proc internals, performance tools

▸ Python proficiency: production-grade scripts with proper error handling, logging, typing

▸ Prometheus + Grafana in Docker: scrape metrics, build dashboards, configure alert rules
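Before wiring up Prometheus in Docker, it may help to see what a scrape target actually returns. The sketch below renders a counter in the Prometheus text exposition format and serves it at `/metrics` using only the standard library. The metric name `http_requests_total`, its labels, and port 8000 are assumptions for illustration; a real service would normally use an official client library rather than hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

# Hypothetical in-process counters keyed by (method, status).
REQUESTS: Dict[Tuple[str, str], int] = {("GET", "200"): 0, ("GET", "500"): 0}


def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(counts: Dict[Tuple[str, str], int]) -> str:
    """Render the counters as text-format lines, one sample per label set."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{escape_label(method)}",'
            f'status="{escape_label(status)}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve /metrics; point a Prometheus scrape job at this port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and opening `/metrics` in a browser shows the same plain text that Prometheus would scrape on each interval.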

3–6 Months: Kubernetes + Tracing

▸ Kubernetes advanced: resource management, RBAC, network policies, admission controllers

▸ OpenTelemetry: instrument a sample app with distributed tracing end-to-end

▸ Implement error budgets for a dummy service; track SLOs in a Grafana dashboard

▸ Go basics: SRE roles at tier-1 companies frequently require Go proficiency
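As a warm-up for the tracing step in this phase, the core idea of carrying one identifier through a request can be sketched with the standard library alone. This is not OpenTelemetry; it is a simplified stand-in that stamps a correlation ID on every log line and forwards it downstream. The header name `X-Correlation-ID` and the function `handle_request` are assumptions for illustration.

```python
import contextvars
import logging
import uuid
from typing import Dict

HEADER = "X-Correlation-ID"  # hypothetical header name; teams pick their own

# Holds the ID for the request currently being handled in this context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


log = logging.getLogger("demo")
_handler = logging.StreamHandler()
_handler.setFormatter(
    logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s"))
_handler.addFilter(CorrelationFilter())
log.addHandler(_handler)
log.setLevel(logging.INFO)
log.propagate = False


def handle_request(headers: Dict[str, str]) -> Dict[str, str]:
    """Reuse the caller's ID if present, otherwise mint one.

    Returns the headers to send on any downstream call so the same ID
    shows up in every service's logs.
    """
    cid = headers.get(HEADER) or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        log.info("handling request")
        return {HEADER: cid}
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request
```

A tracing system builds on the same propagation idea, adding spans, timing, and parent-child relationships on top of the shared identifier.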

6–12 Months: Chaos + Applications

▸ Chaos engineering lab: run Litmus Chaos on a Kubernetes cluster for automated failure injection

▸ Simulate on-call: contribute to open-source projects that track production incidents

▸ Apply for Junior SRE roles at mid-stage startups (Razorpay, Zepto, BrowserStack, Postman)

▸ Incident post-mortem practice: write 10 detailed practice post-mortems based on real outage reports

1–2 Years: Tier-1 Targeting

▸ Target FAANG SRE roles: Google, Meta, and Amazon SRE pay ₹50L+ in India

▸ Specialize: Database reliability, Network reliability, or Security reliability engineering

▸ SRE knowledge is rare in India, so mentoring others creates additional visibility

Salary Breakdown

Level	India	Global	Note
Junior / 1–2 yr	₹8L – ₹15L	$60K – $95K	SRE requires experience — rare fresher roles
Mid-level / 3–5 yr	₹15L – ₹28L	$95K – $150K	Certified, production incident experience
Senior / 5+ yr	₹28L – ₹40L	$150K – $200K	FAANG or global SaaS companies

Portfolio Projects

SRE Dashboard

Intermediate

All 4 golden signals monitoring

Prometheus · Grafana · Go
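To make the dashboard's scope concrete, here is a minimal sketch of how the four golden signals could be derived from raw request records. In the actual project these values would more likely come from Prometheus queries. The `Request` record, the worker-based saturation measure, and the choice of p95 are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: List[Request], window_s: float,
                   busy_workers: int, total_workers: int) -> Dict[str, float]:
    """Summarise one observation window into the four golden signals."""
    if not requests:
        raise ValueError("no requests in window")
    return {
        # Latency: a tail percentile, since averages hide slow requests.
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        # Traffic: demand on the system, here requests per second.
        "traffic_rps": len(requests) / window_s,
        # Errors: share of requests that failed server-side (5xx).
        "error_ratio": sum(r.status >= 500 for r in requests) / len(requests),
        # Saturation: how full the most constrained resource is.
        "saturation": busy_workers / total_workers,
    }
```

Each key maps naturally to one dashboard panel, which keeps the project small enough to finish while still covering all four signals.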

Chaos Engineering GameDay

Advanced

Automated failure injection

Litmus Chaos · Kubernetes · Python

Error Budget Tracker

Intermediate

SLO tracking from Prometheus data

Python · Prometheus · Grafana
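The core calculation behind a tracker like this is the burn rate: how fast the error budget is being spent relative to what the SLO allows. The sketch below assumes a request-based SLO and that the good and bad event counts have already been fetched (for example from Prometheus). The function names are illustrative.

```python
import math


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget is being spent exactly as fast as the SLO permits;
    higher values mean it will run out before the window ends.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed = 1.0 - slo_target
    if allowed <= 0.0:
        raise ValueError("a 100% SLO leaves no error budget")
    return (bad_events / total_events) / allowed


def hours_until_exhausted(rate: float, window_hours: float,
                          budget_left: float = 1.0) -> float:
    """Hours until the remaining budget is gone if the burn rate holds."""
    if rate <= 0.0:
        return math.inf
    return budget_left * window_hours / rate


if __name__ == "__main__":
    rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
    hours = hours_until_exhausted(rate, window_hours=30 * 24)
    print(f"burn rate {rate:.1f}x, budget gone in about {hours:.0f} hours")
```

Plotting the burn rate over time, rather than the raw error count, is what lets the dashboard show whether the service is on track to meet its SLO for the window.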

Incident Simulator

Advanced

Runbook auto-diagnosis

Python · PagerDuty API · Bash

Certifications

CKA

CNCF · Paid (~$395)

Strongly recommended for Kubernetes-heavy SRE roles

Google Professional Cloud DevOps Engineer

Google · Paid (~$200)

SRE practices on GCP

Datadog Fundamentals

Datadog · Free

Industry-standard observability

AWS DevOps Engineer Professional

AWS · Paid (~$300)

Cloud reliability validation

Remote Work Viability

★★★★☆ 4/5 Remote Friendliness

High remote potential at senior levels. SRE is a rare skill globally — this creates strong negotiating power for remote work. Target: cloud-native companies, global SaaS products.

Arc.dev · Wellfound · LinkedIn

Freelancing Potential

$150 – $300/hr (consulting)

Low direct freelancing scope. SRE is a staff function. At senior level, consulting engagements ($150–$300/hr) are possible. Better goal: full-time remote employment.

Toptal · Direct consulting

Common Mistakes to Avoid

Applying without coding ability — SRE is NOT sysadmin; real programming is mandatory

Skipping distributed systems theory — appears heavily in senior interviews

Confusing SRE with DevOps: SRE = reliability engineering + coding; DevOps = pipeline + culture

5-Year Outlook

Premium role, chronically underfilled globally. As distributed systems grow more complex, SRE demand spikes. One of the highest-paid infrastructure tracks. India currently undersupplied — first-mover advantage.