Job Information
Nvidia Senior DevOps Engineer in Yokneam, Israel
We are seeking a Senior DevOps Engineer to join our Farm team and improve its growing services infrastructure. You will work with a team of passionate, skilled engineers who continuously strive to provide better tools for building and managing our infrastructure. Our team includes engineers at varying levels of experience. We need a motivated, hardworking, and focused individual with a real passion for operational excellence, data systems, and automation.
What you'll be doing:
Own the services you build, working with cross-functional teams
Test and deploy code frequently
Continuously improve infrastructure provisioning and management using automation
Identify areas to improve service resiliency through industry-standard practices
Support a globally distributed, on-prem environment (LSF)
Determine the root cause of production-level incidents and write corresponding high-quality RCA reports
Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence
Participate in the team's on-call rotation
What we need to see:
B.S. degree in Computer Science or a related technical field, or equivalent experience
8+ years of coding/scripting in at least two high-level programming languages (Python, Perl, Go, Ruby, Groovy, etc.)
Experience building and maintaining scalable web applications using modern front-end frameworks, back-end technologies, databases, APIs, and cloud platforms.
Good knowledge of operating services, including web servers, load balancers, relational/non-relational databases, messaging systems, and storage solutions
Deep understanding of the Linux operating system and TCP/IP fundamentals.
Knowledge of high-performance computing environments, including job schedulers (e.g., Slurm, PBS, or Grid Engine), parallel computing, and performance tuning.
Expertise with at least one major cloud service provider (AWS, GCP, Azure)
Proficient in implementing and managing monitoring tools like Grafana and Prometheus, ensuring system performance, reliability, and real-time data visualization.
Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC)
Detail-oriented, with great communication and documentation skills
Ways to stand out from the crowd:
Experience developing, fine-tuning, and deploying advanced LLM-based solutions for applications such as NLP, chatbots, content generation, or data analysis
Linux certification from a well-known vendor (Red Hat, Oracle, etc.)
Prior experience managing large-scale Kubernetes deployments in production
Strong skills in modern container networking and storage architecture