We are seeking an experienced DevOps engineer to bridge the gap between software engineering and operations. The role emphasizes designing robust solutions, mentoring teams, and driving performance improvements for both internal and client systems through expertise in automation, scalability, and system reliability.
As a Senior DevOps Engineer, you will own the uptime and performance of critical infrastructure and applications while working closely with clients to align reliability goals with their business objectives.
What we offer.
- The chance to become part of a large, international agency group: Approx. 350 great employees in Germany, a modern, customer-centric business model and a true New Work environment. As part of MSQ, with a further 1,500 colleagues, we offer an international perspective with offices and sister agencies in the UK, USA and Asia
- Flexible working environment: Hybrid or in our beautiful office in Barcelona
- Agile organization: Self-organized teams, extensive onboarding, personal buddy and lots of support from experienced colleagues
- Work-life balance: Company bike leasing, Smart'n'fit weeks, fitness and self-care offers via EGYM Wellpass
- And of course: Modern hardware, a well-equipped tool landscape, company pension scheme, training budget, in-house training, company events, free drinks and good coffee
What you do.
System Reliability & Performance
- Own uptime and performance of critical infrastructure and apps
- Design scalable, fault-tolerant architectures; optimize efficiency
- Define/govern NFRs (availability, performance, maintainability) for systems
- Identify opportunities for optimization, scalability, and cost reduction
Automation & Infrastructure
- Design/implement automation for monitoring, incident response, and repetitive tasks
- Use IaC/CaC (Terraform, ARM, CloudFormation) for provisioning and data pipelines
- Design/deploy Docker/Kubernetes solutions on cloud platforms
- Manage CI/CD pipelines for cloud-native apps (GitHub Actions, Azure DevOps)
Cloud Infrastructure & Operations
- Contribute to architecture; implement highly available infrastructure on Azure/AWS
- Optimize resources for performance, cost, security
- Apply cloud-native best practices; ensure compliance
- Integrate monitoring/alerting (Datadog, CloudWatch, App Insights) for multi-cloud observability
Incident Management & Analysis
- Lead incident response, root cause analyses, and blameless postmortems
- Collaborate to embed observability in the development lifecycle
- Build executive/developer dashboards for key metrics
What you bring along.
Technical Expertise
- Proven experience in running and maintaining production systems with expertise in triaging and solving incidents
- Proficiency in automation and configuration management tools (e.g., Terraform, Ansible)
- Expertise in cloud platforms, particularly Azure and AWS, and their associated tools
- Strong programming skills, with a primary focus on Python, for developing automation scripts, creating custom tooling, and optimizing operational workflows
- Experience with modern observability platforms such as Datadog
- Strong network fundamentals with hands-on experience in Palo Alto next-generation firewalls, including configuration, monitoring, and troubleshooting in enterprise environments
- Experience with Microsoft identity solutions, including Microsoft Entra ID (formerly Azure Active Directory), identity governance, and integration with enterprise authentication systems
Skills & Experience
- A solid foundation in system architecture, with a focus on scalability and reliability
- Exceptional problem-solving skills and a data-driven mindset
- Familiarity with CI/CD pipelines and tools like GitHub Actions and Azure DevOps
Desirable Requirements
- Experience with container orchestration tools such as Kubernetes
- Knowledge of security best practices in cloud and hybrid environments
- Experience with identity and access management solutions (e.g., Okta, Active Directory, CyberArk), including role-based access control and authentication protocols
- Experience with network monitoring tools and infrastructure-as-code approaches to network configuration (e.g., Terraform for cloud networking, Ansible for network devices)