Site Reliability Engineer

Own the reliability, latency, and uptime of TAL Corp's production systems — making sure every inference request succeeds.

About This Role

When millions of people depend on TAL Corp, every millisecond of downtime matters. The SRE team ensures our systems are always available, always fast, and always observable.

Responsibilities

▸ Define and enforce SLOs and error budgets across all production services
▸ Build and maintain alerting, on-call runbooks, and incident response playbooks
▸ Lead postmortems and drive systemic reliability improvements
▸ Reduce toil by automating deployment, scaling, and failure recovery
▸ Collaborate with engineering teams to bake reliability into new services from day one

Requirements

▸ BS in Computer Science or Systems Engineering
▸ 4+ years of experience in SRE, operations, or backend engineering
▸ Strong understanding of distributed systems reliability patterns
▸ Experience with Prometheus, Grafana, Datadog, or similar observability stacks
▸ Proficiency in Python or Go for automation and tooling

Nice to Have

▸ Experience with chaos engineering tools (Chaos Monkey, Gremlin)
▸ Background in real-time or low-latency systems
▸ Familiarity with ML model serving and inference optimisation

TAL Corp is an equal opportunity employer. We believe the best team reflects the full diversity of humanity — because we are building for all of it.

Apply Through Training

At TAL Corp you don't just send a résumé; you prove yourself. Apply by joining our training program: top performers who complete it are hired into this role.

1. Register & start your 7-day program
2. Train, build real skills, earn a credential
3. Top performers → straight into hiring
