ML Infrastructure Engineer
Later
- Vancouver, BC
- $145,000 per year
- Permanent
- Full-time
- Define and own the long-term ML infrastructure roadmap, ensuring it supports both current experimentation needs and future AI initiatives.
- Establish best practices for model lifecycle management, deployment standards, monitoring, and governance.
- Identify infrastructure gaps and proactively design scalable solutions to enable high-velocity ML development.
- Contribute to cross-functional technical planning, ensuring ML systems align with product and platform strategy.
- Design, build, and maintain production-grade model deployment and inference systems using CI/CD pipelines, containerized services (Docker), and API frameworks (e.g., Flask).
- Automate end-to-end ML lifecycle workflows including training pipelines, model validation, registry management, deployment, and rollback strategies.
- Implement robust monitoring systems for model performance, latency, drift detection, and infrastructure health using tools such as CloudWatch, Prometheus, and Grafana.
- Operate across AWS and GCP environments to manage training and inference workloads, including GPU-based infrastructure and BigQuery datasets.
- Develop and maintain infrastructure-as-code (Terraform, CloudFormation) to ensure scalable, repeatable, and secure cloud environments.
- Implement and optimize CI/CD workflows (e.g., GitHub Actions, GitLab CI, Bitbucket Pipelines) for ML and infrastructure automation.
- Partner closely with Data Scientists, Analysts, Platform Engineers, and Product Engineers to support end-to-end ML workflows.
- Translate data science experimentation needs into production-ready infrastructure solutions.
- Serve as the technical bridge between ML experimentation and productized deployment.
- Share knowledge and best practices to elevate ML maturity across teams.
- Stay current on emerging ML Ops practices, tools, and frameworks to continuously improve system reliability and efficiency.
- Evaluate and implement model-serving frameworks (e.g., TorchServe, Seldon, TensorRT) where appropriate.
- Contribute to governance, reproducibility, and auditability standards for ML systems.
- Experiment with new tooling and workflows to improve reproducibility, performance, and developer velocity.
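Several of the lifecycle responsibilities above (registry management, promotion, rollback strategies) can be sketched in miniature. The `ModelRegistry` class below is a hypothetical, in-memory illustration of the promote/rollback pattern, not the team's actual tooling; a production setup would typically lean on a managed registry such as MLflow or SageMaker Model Registry instead.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelRegistry:
    """Toy in-memory registry tracking which model version serves production."""
    versions: List[str] = field(default_factory=list)  # ordered registration history
    production: Optional[str] = None                   # version currently serving
    previous: Optional[str] = None                     # last-known-good, for rollback

    def register(self, version: str) -> None:
        """Record a newly validated model version."""
        self.versions.append(version)

    def promote(self, version: str) -> None:
        """Point production at a registered version, remembering the old one."""
        if version not in self.versions:
            raise ValueError(f"unregistered version: {version}")
        self.previous, self.production = self.production, version

    def rollback(self) -> str:
        """Revert production to the last-known-good version."""
        if self.previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.production, self.previous = self.previous, None
        return self.production
```

The design choice worth noting is that rollback is a metadata flip, not a redeploy: because every promoted version stays registered, reverting is fast and safe, which is what makes the "clear rollback strategies" goal achievable.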
- ML models move from experimentation to production quickly and reliably, with minimal manual intervention.
- CI/CD pipelines enable safe, repeatable deployments with clear rollback strategies.
- Model performance, drift, and infrastructure health are proactively monitored and observable.
- Infrastructure supports scalable GPU training and real-time inference without bottlenecks.
- Data scientists report improved velocity, reproducibility, and confidence in deploying models.
- ML systems are secure, compliant, and aligned with evolving product and AI strategy.
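The drift-monitoring outcome above ultimately reduces to a concrete statistic. One common choice is the Population Stability Index (PSI), which compares a live feature or score distribution against a training-time baseline. The sketch below is a minimal stdlib-only version; the bin count and the usual rule-of-thumb thresholds (below 0.1 stable, 0.1 to 0.25 moderate shift, above 0.25 significant drift) are conventions, not values from this posting.

```python
import math
from collections import Counter


def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample.

    Bins are derived from the baseline's min/max; a small epsilon avoids
    log(0) when a bin is empty on one side.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bucket(values):
        # Clamp out-of-range live values into the edge bins.
        counts = Counter(
            min(max(int((v - lo) / width), 0), bins - 1) for v in values
        )
        total = len(values)
        return [counts.get(i, 0) / total for i in range(bins)]

    eps = 1e-6
    score = 0.0
    for e, a in zip(bucket(expected), bucket(actual)):
        e, a = max(e, eps), max(a, eps)
        score += (a - e) * math.log(a / e)
    return score
```

In practice a check like this would run on a schedule against recent inference inputs, with the score exported to Prometheus or CloudWatch so drift alerts fire before model quality visibly degrades.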
- 4+ years of experience in ML Ops, ML infrastructure, backend engineering, or related roles supporting production ML systems.
- Experience working in cloud-native environments (AWS and/or GCP) with hands-on deployment of ML workloads.
- Proven track record designing and implementing CI/CD pipelines for ML systems.
- Strong experience with Amazon SageMaker, Docker, Flask-based APIs, and infrastructure automation tools.
- Hands-on experience with ML lifecycle tooling such as MLflow, SageMaker Studio, or Weights & Biases.
- Experience managing container orchestration platforms (Kubernetes, EKS, or GKE).
- Strong programming experience in Python (additional experience in Go, Java, or Scala is a plus).
- Experience working with infrastructure-as-code tools such as Terraform or CloudFormation.
- Familiarity with observability tools such as CloudWatch, Prometheus, Grafana, Datadog, or centralized logging platforms.
- Experience managing GPU-based workloads and scaling training/inference systems.
- Familiarity with data infrastructure tools such as BigQuery and cloud-native data pipelines.
- Bonus: Experience supporting LLMs or generative AI pipelines, distributed training systems, feature stores (e.g., Feast), real-time inference systems, or ML governance frameworks.
- A mindset focused on automation, reliability, performance, and continuous improvement in fast-scaling environments.
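As an illustration of the observability expectations listed above, a real-time inference SLO check often reduces to a percentile comparison over recent latency samples. The function names and the 250 ms budget below are hypothetical; a deployed system would pull these samples from its metrics backend rather than a Python list.

```python
import statistics


def p95_latency_ms(samples: list) -> float:
    """95th-percentile latency via statistics.quantiles (default exclusive method)."""
    return statistics.quantiles(samples, n=100)[94]


def inference_slo_ok(samples: list, budget_ms: float = 250.0) -> bool:
    """True when the p95 of observed inference latencies fits the SLO budget."""
    return p95_latency_ms(samples) <= budget_ms
```

Alerting on a tail percentile rather than the mean is the standard choice here, since a healthy average can hide the slow requests that users actually notice.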
- Driven by Impact: You deliver results that matter, prioritizing high-value work, meeting deadlines, and adapting quickly while keeping outcomes clear.
- Strategic & Customer-Centric: You anticipate risks and opportunities, connect decisions to long-term growth, and build trust through proactive insights.
- Curious & Growth-Oriented: You seek knowledge, ask sharp questions, and apply learnings fast, challenging the status quo with a mindset of improvement.
- Collaborative & Resilient: You thrive in change by staying resourceful, solution-focused, and positive, removing roadblocks, sharing insights, and keeping morale high.
- Accountable & Honest: You own your work, hold yourself and others to a high bar, and use transparent feedback to drive growth.
- Emotionally Intelligent: You build trust through empathy and collaboration, foster inclusion, and inspire others with grit, optimism, and integrity.