Senior Engineer, Enterprise AI Services
Thomson Reuters
- Toronto, ON
- Permanent
- Full-time
- Serve as the Kubernetes expert for AI services, defining and operating deployment standards for scalability, resilience, security, and performance.
- Own the AI observability platform, implementing tools such as Braintrust and Langfuse to support tracing, evaluation, analytics, and monitoring of LLM/ML workloads.
- Define and standardize telemetry across AI products, including traces, metrics, logs, evaluations, and feedback, while ensuring governance, privacy, and auditability requirements are met.
- Build telemetry pipelines, dashboards, and reporting that provide clear visibility into model performance, quality, safety, reliability, and cost.
- Establish monitoring, alerting, SLOs/SLIs, and incident response practices for AI systems, including root cause analysis and continuous improvement.
- Integrate observability and evaluation into CI/CD so new models, prompts, and workflows are automatically enrolled in monitoring and quality controls.
- Partner with Product, Data Science, AI Engineering, SRE, Platform, and Cloud teams to onboard new AI use cases, support experimentation and drift detection, and implement guardrails and policy enforcement.
- Strong understanding of LLM/ML fundamentals and production AI systems, including prompting, context windows, RAG, hallucinations, and model/provider variability and capacity.
- Hands-on experience with AI observability and evaluation platforms, with Braintrust and/or Langfuse strongly preferred. Solid background with modern observability tooling, ideally Datadog.
- Deep Kubernetes experience deploying and operating services in production, including Helm-based releases, ingress/service networking, scaling, and troubleshooting.
- Proficiency in Python plus at least one additional backend or platform language such as TypeScript/JavaScript, Go, or Java. C# is an asset.
- Experience running API workloads on AWS, Azure, or GCP, including custom endpoints and cloud-native production environments.
- 5+ years in SRE, Observability Engineering, ML Platform, or similar roles, including 2+ years supporting production LLM/ML systems, preferably in enterprise or regulated environments.
- Strong collaboration and communication skills, with the ability to work effectively across Product, Engineering, Data Science, SRE, and Cloud teams.
- Experience working in Agile environments, including iterative delivery, sprint planning, backlog refinement, and cross-functional team execution.
- Demonstrated curiosity about AI technologies and a strong willingness to learn, adapt, and stay current with the rapidly evolving LLM/ML landscape.
- Strong problem-solving skills, sound judgment during incidents, and a continuous improvement mindset.