Join the Evaluation Integrity team to help build the trusted quality signal behind every Siri release. Design and run human-in-the-loop annotation projects to evaluate the quality and authenticity of agentic user personae, the validity of agent-to-agent conversations, and the reliability of LLM-as-judge and rule-based evaluators against Siri’s product specifications.
Requirements
- Bachelor’s or Master’s degree in a quantitative or related field
- 3+ years of hands-on experience working with human-annotated datasets or human-in-the-loop evaluation methodologies
- 3+ years of experience using Python for data processing, analysis, and prototyping
- Experience designing, implementing, and communicating annotation schemas, rubrics, or ontologies
- Experience managing multiple concurrent dataset curation efforts
- Experience specifying or designing custom annotation tooling in collaboration with software engineers
To apply for this job, please visit jobs.apple.com.
