Sovrn Holdings, Inc. Staff Site Reliability Engineer in Boulder, Colorado, United States

Job Information

At Sovrn, the Reliability team plays an integral role in building and maintaining our low-latency, high-performance, scalable infrastructure. Reliability engineering focuses on enabling frameworks that streamline full-stack delivery, using automation to increase feature velocity while meeting production service level commitments.

Oversee the management of AWS cloud infrastructure, including provisioning, configuration, and optimization across multiple accounts. Implement best practices for resource allocation, cost optimization, and scalability within the AWS environment. Monitor and maintain the health, performance, and security of AWS services and resources.

Design, implement, and maintain networking configurations for both cloud-based and office environments. Ensure seamless connectivity between cloud resources and on-premises infrastructure. Configure and manage virtual private clouds (VPCs), subnets, route tables, security groups, transit gateways, peering connections, access control lists (ACLs), and client VPN to support business needs.

Provide expertise and insights into container automation and deployment strategies (Kubernetes, EKS, ECS) to ensure reliability and efficiency. Analyze deployment processes to identify areas for improvement and optimization. Implement monitoring and alerting solutions to detect and respond to issues affecting containerized applications. Optimize container orchestration platforms such as Kubernetes, EKS, and ECS for improved performance, scalability, and reliability. Implement best practices for container lifecycle management, including deployment, scaling, and updates. Work closely with development teams to streamline CI/CD pipelines for containerized applications (Helm, Jenkins, GitHub, Artifactory).

Design and implement architectures to ensure high availability and fault tolerance of cloud-based services. Implement redundancy and failover mechanisms to minimize downtime and service disruptions. Conduct regular testing and simulations to validate the resilience of cloud environments.

Collaborate with development teams to understand application requirements and optimize deployment processes. Provide guidance on infrastructure requirements and best practices for deploying and scaling applications in the cloud.

Implement monitoring and logging solutions (CloudWatch, CloudTrail, Grafana, Prometheus, Datadog) to track the performance and health of network and cloud infrastructure. Troubleshoot and resolve issues related to network connectivity, resource utilization, and application performance. Develop and maintain incident response procedures to ensure timely resolution of critical issues.

Implement and enforce security best practices for cloud and network environments, including access control, encryption, and compliance (Systems Manager, GuardDuty, Security Hub). Conduct regular security assessments and audits to identify and address vulnerabilities.

Develop automation scripts and tools using Python, shell scripting, and other programming languages to streamline operational tasks. Perform system administration (Shell, Bash, Python). Automate infrastructure provisioning, configuration management (Ansible), and deployment processes to improve efficiency and reliability. Leverage infrastructure-as-code (IaC) tools such as Terraform to define and manage cloud resources programmatically. Continuously evaluate and implement improvements to enhance the reliability, performance, and scalability of cloud infrastructure.

(May telecommute from anywhere in the U.S.)

Minimum Requirements:

Education: Master's degree in Software Engineering, Computer Science or related engineering field.

Experience: Two (2) years of experience as a Site Reliability Engineer, Software Engineer, Systems Engineer, or in a related role.

Skills: Cloud platform (AWS); container orchestration tools (Kubernetes and EKS); Helm; Jenkins; GitHub; Ansible; CloudFormation or Terraform; monitoring tools (CloudWatch, Datadog, Grafana, or Prometheus); Shell; Bash; and Python.

APPLY TO: Email resume to: peopleteam@sovrn.com and reference Job #ME011
