Databricks Machine Learning (ML) Administrator
Applied Materials
- Ontario
- Permanent
- Full-time
Responsibilities

- Deploy, configure, and maintain Databricks ML clusters (CPU/GPU), SQL Warehouses, and cluster policies optimized for ML workloads; apply autoscaling, pools, and runtime selection (including Databricks Runtime for ML).
- Administer Jobs and Pipelines that orchestrate training, evaluation, and batch/real‑time scoring; manage run‑as identities and default privileges to meet least‑privilege requirements.
- Establish and enforce compute access controls (attach/restart/manage) and workspace object permissions; standardize policies to prevent configuration drift.
- Govern MLflow Experiments and Registered Models with fine‑grained permissions (read/edit/manage), standardizing experiment tracking, model versioning, stage transitions, and approvals.
- Operate and secure model serving endpoints, including permissions for view, query, and manage actions; implement change control for deployments.
- Coordinate with data governance to implement metastore, catalog, schema, and table‑level permissions that support feature engineering, training, and evaluation while safeguarding sensitive data.
- Apply enterprise identity and access management patterns across account and workspace scopes (users, groups, service principals) using SCIM/SSO standards.
- Enforce workspace object ACLs, compute isolation modes, secret handling, and log‑access controls for ML clusters; implement Spark ACL settings per policy.
- Operationalize system tables/audit logs and usage analytics to meet regulatory and internal control requirements; partner with Security/GRC for periodic reviews.
- Monitor cluster health, job success/failure, serving endpoint SLOs, and capacity; establish alerting and incident runbooks for ML infrastructure.
- Lead post‑incident reviews and continuous improvement for platform reliability and developer productivity.
- Implement and iterate compute policies, budget policies, and usage dashboards to optimize GPU/CPU consumption for ML training and serving.
- Define and evangelize ML platform standards: environment baselines, cluster policies, experiment hygiene, model promotion flows, and serving change‑management.
- Partner with ML teams to align platform features (AutoML, Feature/Vector stores, model serving) to use cases and performance targets.
Qualifications

- 5+ years administering Databricks or comparable Spark-based ML/data platforms, with hands-on experience in workspace administration, compute policies, and MLflow governance.
- Proven expertise managing Databricks permissions (workspaces, clusters, jobs, experiments, registered models, serving endpoints) via the UI, REST API, and CLI.
- Strong understanding of Unity Catalog concepts and implementing catalog/schema/table access for ML workflows.
- Working knowledge of Python/Scala sufficient to understand notebooks, init scripts, and operational tooling (no application development required).
- Experience with SSO/SCIM, enterprise identity providers, and group‑based access patterns across account and workspace scopes.
- Familiarity with audit logging, system tables, and cost‑management techniques in Databricks.
- Databricks Platform Administrator accreditation (or equivalent) and experience with serverless/SQL warehouses, cluster pools, and model serving.
- Experience operationalizing run‑as service principals for jobs and pipelines and separating ownership vs. execution permissions.
- Exposure to infrastructure‑as‑code (e.g., Terraform) for permissions/policies and environment baselining.
- Understanding of data protection controls (masking, row/column access) and secure handling of secrets and logs in ML contexts.
Tools and Technologies

- Databricks Workspace & Account Console, Unity Catalog, Jobs, Pipelines, MLflow, Model Serving, Databricks Runtime for ML, SQL Warehouses.
- Databricks CLI and REST APIs for permissions and automation; optionally, infrastructure-as-code (e.g., Terraform) to manage policies and permissions as code.