At Roche you can be fully yourself and are valued for your unique qualities. Our culture encourages personal expression, open dialogue, and genuine connections. Here you are valued, accepted, and respected for who you are. This creates an environment in which you can grow both personally and professionally. Together we want to prevent, stop, and cure diseases and ensure that everyone has access to healthcare, today and in the future. Join Roche, where every voice matters.
The Position
Job description
As an Infrastructure Provisioning and Management Engineer within the Accelerated Compute Engineering (ACE) team, you will be responsible for overseeing and advancing our core infrastructure management and provisioning tech stack. This role has a strong focus on driving configuration-as-code, infrastructure-as-code (IaC), and modern automated provisioning best practices across our high-performance compute (HPC) and industry-leading AI Factory.
You will own the lifecycle, deployment, and optimization of bare-metal and virtualized compute environments that power Roche's advanced computing initiatives. By treating infrastructure strictly as code and eliminating manual configurations, you will ensure our advanced clusters are highly reproducible, securely patched, and rapidly scalable to meet the evolving demands of computational science and large-scale AI workloads.
Description of the area
Hosting and Infrastructure (HI) provides mission-critical on-premise infrastructure, cloud hosting, connectivity, and technology products that enable all functions at every Roche site to develop, innovate, connect, and deliver compliant digital products across the Roche Enterprise.
The Value Streams - Accelerated Compute Engineering (ACE) Team drives both customer success and platform success by acting as a center of excellence and delivery for the High Performance Compute and AI infrastructure that supports AI and HPC use cases across Roche. The team facilitates seamless onboarding and adoption for business vertical customers who need accelerated compute. It helps these infrastructure consumers meet requirements for high availability, seamless data transfer, flexibility, and speed as the needs of AI change rapidly, so that they achieve rapid time-to-value.
Job Responsibilities
Automated Provisioning & Cluster Orchestration
- Design, deploy, and manage large-scale automated provisioning systems for multi-node HPC and AI Factory environments.
- Own and maintain the infrastructure management and provisioning tech stack underpinning the orchestration, monitoring, and provisioning of complex GPU and CPU workloads.
- Streamline bare-metal provisioning and node imaging pipelines to ensure minimal downtime and rapid expansion capabilities.
Infrastructure-as-Code (IaC) & Configuration Governance
- Enforce a strict configuration-as-code and infrastructure-as-code mindset, replacing manual interventions with repeatable automation scripts.
- Author, review, and maintain complex Ansible playbooks and roles for configuration management, patch deployment, and compliance drift remediation.
- Establish robust CI/CD pipelines using GitLab to test, validate, and deploy infrastructure changes safely across development, staging, and production clusters.
Operating System Engineering & Lifecycle Management
- In partnership with Enterprise OS teams, standardize and manage operating system builds, with dual proficiency across HPC and AI Factory platforms.
- Utilize solutions such as Red Hat Image Builder and NVIDIA Base Command Manager to create optimized, compliant, and secure custom golden images tailored for AI and high-performance computing workloads.
- Manage OS lifecycles, including kernel tuning, automated package updates, and vulnerability management, ensuring alignment with global security standards.
Platform Reliability & Collaboration
- Implement proactive monitoring and alerting for infrastructure provisioning health, node availability, and configuration drift.
- Address and help resolve complex, systemic infrastructure failures, contributing to post-mortem analyses to continuously improve platform resilience.
Qualifications
Education / Experience
- Bachelor’s or an advanced degree in Computer Science, Computer Engineering, or a similar technical discipline.
- 5+ years of experience in systems engineering, DevOps, or platform infrastructure roles, with a proven track record of managing enterprise Linux environments at scale.
- Deep, practical knowledge of operating system internals for both RHEL and Ubuntu.
Technical & Business Skills:
- Automation & Orchestration: Advanced capability with Ansible on the command line and experience building scalable infrastructure pipelines using GitLab CI/CD.
- Provisioning Tooling: Experience using NVIDIA Base Command Manager (Bright Cluster Manager) and Red Hat Image Builder (or related tools like Kickstart/Satellite).
- Modern Engineering Mindset: Strong adherence to git-based workflows, code-review methodologies, and infrastructure-as-code principles.
- Troubleshooting Depth: Ability to isolate complex, multi-layered faults bridging hardware, kernel configurations, and automation scripts.
Leadership & Mindset:
- Lean & Agile Mindset: Passionate about continuous improvement, eliminating technical debt, and automating repetitive tasks to achieve scale.
- Collaboration & Communication: Strong collaborative skills with an enterprise mindset, capable of working fluidly across team boundaries to drive platform success.
- Intellectual Curiosity: Highly self-motivated to explore and adopt emerging technologies in the fast-evolving landscape of HPC and AI infrastructure engineering.
Who we are
A healthier future drives us to innovate. More than 100,000 employees worldwide work together to advance science and ensure that everyone has access to healthcare, today and for generations to come. Through our efforts, more than 26 million people are treated with our medicines and more than 30 billion tests are conducted with our diagnostics products. We encourage one another to explore new possibilities, foster creativity, and set high goals in order to deliver life-changing healthcare solutions.
Together, we can build a healthier future.
Roche is an Equal Opportunity Employer.