Job Information

Microsoft Corporation Senior Site Reliability Engineer Manager in Bucharest, Romania

Come build and maintain the world’s computer as a member of the Microsoft Capacity Infrastructure Services team in Azure Core. The team ensures new servers are brought online (capacity buildout/provisioning) to enable Azure customers to leverage the latest offerings, see the illusion of infinite capacity, and grow the Azure business efficiently at hyperscale. You’ll also complete the cycle by safely taking old capacity offline (decommissioning/deprovisioning) and provisioning new capacity again in its place, thus ensuring the cloud remains healthy and current.

As a Senior Site Reliability Engineering Manager, you’ll grow your team of site reliability engineers and service engineers to work with a breadth of partners across Microsoft including developers in service teams, hardware engineers, network engineers, datacenter technicians, supply chain managers, and business leaders to rapidly debug and resolve issues delaying the carefully orchestrated buildout and decommissioning sequences. You’ll drive continuous improvements with these teams to prevent repeats and address common classes of issues across the Azure software stack through design reviews and problem management.

This opportunity will enable you to gain unparalleled system-wide knowledge of how the Azure cloud is built and maintained while growing your people management skillset. The contacts you make with experts will enable you to dive deep into services and new technologies and partner for improvements. You’ll be stretched to automate mitigations tactically at cloud scale and strategically analyze data to identify problem areas for driving improvements to meet business needs.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

  • Demonstrates end-to-end expertise in distributed systems design, interactions between cloud technology layers and components, functions of physical network devices, and dependencies at scale. Drives efforts within an organization to identify and recommend optimal configurations of cloud technology solutions and develops or modifies the code base that defines infrastructures to improve the reliability and operability of supported products.

  • Develops end-to-end technical expertise in the architecture, code, features, and operations of specific products as required to implement improvements in product availability, reliability, efficiency, observability, and/or performance. Drives code/design reviews with the engineering teams that develop and/or manage those products and shares learnings and recommendations across engineering teams working on related products within their organization.

  • Researches and maintains deep knowledge of industry trends and advances in large-scale distributed systems and cloud technologies; manages efforts to research, develop, implement, and optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve the availability, reliability, efficiency, observability, and/or performance of their team's supported products. Monitors the implementation of new tools, technologies, and processes as well as their impact on reliability, efficiency, observability, and/or performance to make recommendations for broader adoption within an organization.

  • Manages partnerships between Site Reliability Engineering (SRE) and product engineering teams to identify and implement changes to the code base to improve availability, reliability, efficiency, observability, and performance of related sets of products within an organization. Reviews and provides feedback on recommendations provided by SREs and ensures they have the technical expertise and data to justify and gain buy-in for their recommendations from product teams and owners.

  • Drives, and contributes to, the development of automation tools to reliably automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale within an organization; reviews existing and newly developed automation tools to evaluate and provide feedback on reusability, extendibility, and scalability. Ensures automation tools and systems developed within an organization are tested and the impact of their deployments is monitored.

  • Oversees a team of Site Reliability Engineers (SREs) using existing tools and/or models to identify contributing factors and points of failure affecting availability, reliability, performance, and/or efficiency of systems, platforms, and/or products; provides guidance, recommendations, and feedback to SREs to help them troubleshoot problems and to identify and test scalable solutions that can prevent the occurrence of similar issues in related products within their organization.

  • Participates in on-call rotations and manages teams of Site Reliability Engineers (SREs) responding to incidents during regular on-call rotations to identify the level of impact, troubleshoot issues, and deploy appropriate fixes to resolve root cause(s) and prevent recurrence across related products. Ensures that SREs within an organization have the technical knowledge and resources required to respond to incidents, that relevant engineering teams, stakeholders, and leaders are alerted to customer-impacting issues, that major issues are escalated to other teams as needed, and that key details related to incidents and their resolution are shared through post-mortem reports and during regular review meetings.

Qualifications

Required Qualifications:

  • Technical experience in software engineering, network engineering, or systems administration

  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND technical experience in software engineering, network engineering, or systems administration

  • OR Master's Degree in Computer Science, Information Technology, or related field AND technical experience in software engineering, network engineering, or systems administration

Other Requirements:

  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:

  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

  • Technical experience in software engineering, network engineering, or systems administration

  • OR Doctorate Degree in Computer Science, Information Technology, or related field

  • Technical experience working with large-scale cloud or distributed systems

  • People management experience

#azurecorejobs

Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).