
Senior Site Reliability Administrator (Intermediate Senior level)
- Mississauga, ON
- Permanent
- Full-time
- Use technical knowledge, creativity, and company practices to reduce the occurrence of incidents by developing proactive monitoring and alerting.
- Respond to incidents in accordance with Service Level Agreements (SLAs).
- Provide continuous feedback to development teams on system stability, defect analysis and system enhancements
- Develop runbooks and patterns to sustain applications in a production environment
- Participate in technical discussions and drive transition to sustain activities with the development teams
- Work with IT business and development partners to gather input for new capabilities to display, monitor and alert on key performance indicators (KPIs) by tracking business transactions (BT) in real time
- Partner with application owners to develop creative and effective solutions to mitigate risk and successfully remediate any audit issues, providing quality and timely responses
- Take ownership and accountability for the incident resolution process, participating in RCA and SWAT investigations.
- Plan for validation and verification of changes deployed by infrastructure and development teams.
- Participate in day-to-day, real-time, advanced-level technical support and troubleshooting of issues reported by the user/customer base.
- Provide guidance in resolving performance-related issues and designing solutions for any technical issues faced by the application.
- Establish and maintain a good relationship with team members, Product Development, Product Management, Customer Service, Client Management and other cross-functional teams.
- Participate in training and information sharing activities.
- Act as backup for other team members when necessary.
- Requires rotating shift work as needed.
- Participation in the on-call rotation is required to provide 7x24x365 support.
- Ability to understand and maintain scripting software
- Deep understanding of Linux systems
- Hands-on experience with cloud infrastructure; Google Cloud, AWS or Azure a plus
- Experience with PaaS technologies such as Cloud Foundry, Kubernetes, BOSH.
- Good understanding of and operational experience with container technologies like Docker, rkt, Mesos.
- Good understanding of and working experience with microservices and RESTful architecture.
- Experience with continuous delivery tools and practices like GitOps, Ansible, Rundeck or Argo CD to set up automated pipelines as needed.
- Strong working knowledge of aPaaS or Application operations best practices.
- Operational understanding or experience with message brokers such as Apache Kafka or RabbitMQ.
- Operational understanding or experience with search technologies such as Solr or Elasticsearch.
- Experience in supporting middleware technologies such as Apache, Tomcat, Spring.
- Experience with at least one scripting language such as shell, Perl, Python, JavaScript, etc.
- Experience with installing and configuring Apache and Tomcat.
- Experience in supporting Java applications built using frameworks such as Spring, Struts, Spark, etc.
- Experience and knowledge in RDBMS and NoSQL databases such as Oracle, Postgres, MariaDB and Cassandra.
- Deep expertise in monitoring distributed application architectures and the ability to correlate environment conditions and metrics to application events.
- Experience with APM tools such as New Relic, Dynatrace or AppDynamics.
- Experience with monitoring tools such as Zabbix or check_mk.
- Knowledge of and familiarity with centralized logging systems such as Graylog or Kibana.
- Strong understanding of ITIL principles; certification is a plus.
- Passion for “getting under the hood” of systems and technologies to understand their inner workings and fix what needs fixing, including diagnosing and troubleshooting user-facing service incidents and outages.
- Knowledge of and familiarity with API gateways such as Apigee and the OAuth 2.0 standard.
- Experience diagnosing and resolving problems in high-throughput web applications and network services.
- Proven problem solving and analytical ability.
- Excellent organizational/time management skills.
- Ability to handle multiple tasks concurrently.
- Ability to lead, drive and implement highly scalable and complex solutions.
- A strong understanding of security best practices.
- A proven record of being able to work independently and collaboratively.