AGI Research · Chennai, India · Full-Time

Data Scientist — Pretraining

Own the data pipelines and curation strategies that determine what TAL models learn — the quality of training data is the quality of the model.

About This Role

The best models are built on the best data. You will own everything from web-scale data collection to quality filtering, deduplication, and domain mixing.

Responsibilities
  • Design and maintain large-scale data ingestion and preprocessing pipelines
  • Develop quality scoring, deduplication, and toxicity filtering systems
  • Run ablation studies on data mixture and curriculum design
  • Build evaluation frameworks to measure data quality impact on model capabilities
  • Collaborate with research scientists on dataset design for specific capabilities

Requirements
  • MS or PhD in Computer Science, Statistics, or related field
  • 3+ years experience with large-scale data engineering
  • Expertise in Python, Spark, or similar distributed data processing tools
  • Familiarity with Common Crawl, C4, or similar web-scale datasets
  • Strong understanding of tokenisation and language data characteristics

Nice to Have
  • Experience with multilingual or code data pipelines
  • Background in information retrieval or corpus linguistics
  • Familiarity with data provenance and copyright compliance at scale

TAL Corp is an equal opportunity employer. We believe the best team reflects the full diversity of humanity — because we are building for all of it.

Apply Through Training

At TAL Corp you don't just send a résumé; you prove yourself. Apply by joining our training program. Top performers who complete it are hired into this role.

  1. Register & start your 7-day program
  2. Train, build real skills, earn a credential
  3. Top performers → straight into hiring
