
Site Reliability Developer 1
- Canada
- Permanent
- Full-time
- Support key ITIL processes, including incident management, request management, problem management, and change management.
- Define and document runbooks and standard operating procedures.
- Field operational requests from our Application Support team and other internal stakeholders.
- Triage and solve issues within defined SLAs to ensure an excellent customer experience and to unblock other development and support teams.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Identify and troubleshoot problems, investigate root causes, and champion fixes across the organization.
- Work with infrastructure-as-code (IaC) with a focus on continuous improvement.
- Collaborate with cross-functional team members on features and implementation within an agile environment.
- Report on SLAs and performance metrics as part of the Operations function.
- Participate in the on-call rotation.
Please note that this list reflects only a portion of our current technical stack; we constantly evolve and revisit it as we grow:
- A modern AWS cloud infrastructure managed through infrastructure-as-code (Terraform), configuration-as-code (Ansible), and CI/CD (Jenkins)
- RDS MySQL, Redshift, Redshift Spectrum, MongoDB, and Elasticsearch
- Kinesis, SQS, and RabbitMQ
- DevOps tools written in Python
- Back-end applications written using Java, Dropwizard, Spring Boot, and Hibernate
- Front-end applications written using TypeScript, JavaScript, React (Context API and Hooks), and Redux
- Monitoring with Datadog and CloudWatch
- Bachelor's degree in computer science, software engineering, or equivalent experience.
- 2+ years of experience in an IT operations, DevOps, SRE, or software engineering role.
- Experience with cloud computing services (AWS and Azure) and developing knowledge of setting up and managing cloud infrastructure.
- You can write code - in any language. You have implemented your work in a production environment and can back it up with examples.
- Experience with tools and platforms such as Ansible, build/release pipelines, Docker, GitHub, and Terraform.
- Developing knowledge of distributed systems in the cloud, using observability and telemetry to oversee code deployments and service level objectives (SLOs).
- Developing experience with the operational aspects of software systems, using telemetry, centralized logging, and alerting with tools such as CloudWatch, Datadog, and Prometheus.