DevOps & Infrastructure | Pune, India | Full-Time

Site Reliability Engineer

Own the reliability, latency, and uptime of TAL Corp's production systems — making sure every inference request succeeds.

About This Role

When millions of people depend on TAL Corp, every millisecond of downtime matters. The SRE team ensures our systems are always available, always fast, and always observable.

Responsibilities
  • Define and enforce SLOs and error budgets across all production services
  • Build and maintain alerting, on-call runbooks, and incident response playbooks
  • Lead postmortems and drive systemic reliability improvements
  • Reduce toil by automating deployment, scaling, and failure recovery
  • Collaborate with engineering teams to bake reliability into new services from day one
Requirements
  • BS in Computer Science or Systems Engineering
  • 4+ years of experience in SRE, operations, or backend engineering
  • Strong understanding of distributed systems reliability patterns
  • Experience with Prometheus, Grafana, Datadog, or similar observability stacks
  • Proficiency in Python or Go for automation and tooling
Nice to Have
  • Experience with chaos engineering tools (Chaos Monkey, Gremlin)
  • Background in real-time or low-latency systems
  • Familiarity with ML model serving and inference optimisation

TAL Corp is an equal opportunity employer. We believe the best team reflects the full diversity of humanity — because we are building for all of it.

Apply Through Training

At TAL Corp you don't just send a résumé; you prove yourself. Apply by joining our training program: top performers who complete it are hired into this role.

  1. Register and start your 7-day program
  2. Train, build real skills, and earn a credential
  3. Top performers go straight into hiring
