Data Scientist — Pretraining

Own the data pipelines and curation strategies that determine what TAL models learn — the quality of training data is the quality of the model.

About This Role

The best models are built on the best data. You will own everything from web-scale data collection to quality filtering, deduplication, and domain mixing.

Responsibilities

▸ Design and maintain large-scale data ingestion and preprocessing pipelines
▸ Develop quality scoring, deduplication, and toxicity filtering systems
▸ Run ablation studies on data mixture and curriculum design
▸ Build evaluation frameworks to measure data quality impact on model capabilities
▸ Collaborate with research scientists on dataset design for specific capabilities

Requirements

▸ MS or PhD in Computer Science, Statistics, or a related field
▸ 3+ years of experience with large-scale data engineering
▸ Expertise in Python, Spark, or similar distributed data processing tools
▸ Familiarity with Common Crawl, C4, or similar web-scale datasets
▸ Strong understanding of tokenisation and language data characteristics

Nice to Have

▸ Experience with multilingual or code data pipelines
▸ Background in information retrieval or corpus linguistics
▸ Familiarity with data provenance and copyright compliance at scale

TAL Corp is an equal opportunity employer. We believe the best team reflects the full diversity of humanity — because we are building for all of it.

Apply Through Training

At TAL Corp you don't just send a résumé; you prove yourself. To apply, join our training program and complete it. Top performers are hired into this role.

1. Register & start your 7-day program
2. Train, build real skills, earn a credential
3. Top performers → straight into hiring
