ML Engineer — Inference Optimisation

Make TAL models faster and cheaper to run — closing the gap between research model quality and production serving cost.

About This Role

A model that can't be served cheaply can't reach everyone. The Inference team makes TAL models fast and inexpensive to serve so we can fulfil our universal access commitment.

Responsibilities

▸ Profile and optimise LLM inference latency and throughput using vLLM, TensorRT-LLM, and custom CUDA kernels
▸ Implement quantisation, speculative decoding, and KV cache optimisation techniques
▸ Build serving infrastructure that scales from zero to millions of requests
▸ Benchmark inference performance across hardware configurations
▸ Collaborate with the research team to ensure optimised models maintain quality

Requirements

▸ BS/MS in Computer Science or Electrical Engineering
▸ 3+ years of ML engineering with a focus on inference or model serving
▸ Deep understanding of GPU architecture, CUDA programming, and memory management
▸ Experience with vLLM, TensorRT, ONNX, or similar inference frameworks
▸ Proficiency in Python and C++

Nice to Have

▸ Experience with FlashAttention, PagedAttention, or similar attention optimisations
▸ Background in compiler design or hardware architecture
▸ Familiarity with quantisation-aware training (QAT) and GPTQ/AWQ

TAL Corp is an equal opportunity employer. We believe the best team reflects the full diversity of humanity — because we are building for all of it.

Apply Through Training

At TAL Corp you don't just send a résumé: you prove yourself. Apply by joining our training program; top performers who complete it are hired into this role.

1. Register & start your 7-day program
2. Train, build real skills, earn a credential
3. Top performers → straight into hiring
