Experience Inc. Jobs

Job Information

Covetrus North America, LLC Site Reliability Manager (No H1 transfers will be considered at this time) in Dublin, Ohio

Covetrus is dedicated to advancing the world of veterinary medicine and empowering veterinary healthcare teams across the companion, equine, and large-animal health markets. We provide a comprehensive suite of products, software, and services to help drive improved patient health, strong client relationships, and successful financial outcomes for veterinary professionals.

SUMMARY

The Site Reliability Manager is responsible for the stability and performance of our production environment. Our customers' experience is critical to our success, which makes the health of our platform the highest priority for our SRE team. This role is also responsible for software release management and for establishing monitoring and alerting criteria in keeping with best practices for a high-availability platform.

Candidates should have extensive knowledge of Dynatrace for monitoring and alerting, including the ability to create and modify dashboards and to use synthetic monitoring to alert on degradation of the user experience. Strong familiarity with Azure, the Azure portal, and Kafka is preferred, along with a good working knowledge of Splunk, Kong (application gateway), and Cloudflare (firewall). We depend heavily on PagerDuty for alerting the team and maintain a 24/7 on-call rotation. Since this is a leadership position, the ability to mentor the team and manage its continual improvement is essential.

ESSENTIAL DUTIES AND RESPONSIBILITIES

Include the following. Other duties may be assigned.

* Develop methodologies for monitoring and operating highly available and scalable services.
* Work with the DevOps team to create more scalable and resilient infrastructure.
* Proactively monitor and review application performance.
* Monitor specific metrics, set thresholds, and trigger alerts based on those thresholds.
* Collect and analyze logging and diagnostic information.
* Help develop better monitoring and incident resolution practices.
* Troubleshoot business and production issues.
* Properly document all incident responses.
* Provide updates and documentation to runbooks and operational manuals.
* Document mean time to recover (MTTR) and mean time to failure (MTTF).
* Participate in on-call rotations.
* Evaluate, build, and modify automation for deploying and operating production services.
* Provide leadership in reducing and resolving production incidents.
* Mentor and develop site reliability engineers.
* Create a culture of reliability and high availability in the Information Technology department to improve customer satisfaction.
* Train application development resources to build more resilient applications.
* Identify opportunities to improve all operations processes.
* Facilitate effective transition of services into production, ensuring that all requirements have been met in accordance with our Change Management standards (release management).
* Regularly review deployment configurations and make recommendations for optimal performance in terms of hardware and scaling.

SUPERVISORY RESPONSIBILITIES

* Technical leadership for a nine-person team

QUALIFICATIONS: EDUCATION AND/OR EXPERIENCE

* Bachelor's degree in software engineering or computer science and/or years of related experience
* Minimum 3 years in an SRE role for a highly available environment
* Minimum 1 year in a similar leadership role
* Experience with Kafka, Kong, and Elasticsearch is a plus

COMPETENCIES (SKILLS AND ABILITIES)

* Strong problem-solving and troubleshooting skills with a sense of urgency to restore services for our customers
* Versatile with a passion to learn
* Ability to understand the 'big picture'
* Strong skills with Dynatrace and Azure Portal
* History of self-improv
