We are looking for an MLOps / AIOps / LLMOps / AgentOps Engineer to join a multidisciplinary Data & AI team. The main mission of this role is to design, operate, and continuously evolve our AIOps platform, ensuring that our AI products run in a reliable, scalable, and cost‑efficient way.
This position is strongly focused on platform, infrastructure, automation, observability, and operations rather than on building ML models or AI products themselves.
You will work with modern cloud technologies (mainly AWS, with some Azure exposure) and collaborate closely with Data Scientists, Data Engineers, and Product teams to bring AI solutions into production and keep them running smoothly.
We are open to candidates with strong expertise in at least one core area (e.g. cloud, DevOps, platform engineering, or ML operations), solid foundational knowledge in the others, and the motivation to grow across the full AI operations stack.
Key Responsibilities
Design, maintain, and evolve the AIOps platform supporting:
Traditional machine learning models in production
LLM‑based solutions such as RAG pipelines and AI Agents
Speech Analytics use cases (ASR, conversation analysis, NLP)
Build and operate ML and LLM pipelines with a strong focus on:
Reliability, automation, and observability
Model and LLM quality, performance, and drift monitoring
Cloud cost control and optimization
Implement LLMOps / AgentOps practices, including:
LLM evaluation and observability
Prompt management, traceability, and specialized logging
Agent integration, orchestration, and lifecycle management
Ensure continuous operation of AI products, including:
Alerts, dashboards, SLOs / SLIs
Scalability strategies and basic auto‑remediation mechanisms
Manage deployments in cloud environments (AWS / Azure) and container platforms (Docker / Kubernetes)
Collaborate closely with Data Scientists and Data Engineers to productionize robust, scalable AI solutions
Contribute to internal standards, automation, and best practices across the AI and data ecosystem
Required Skills (Must Have)
Hands‑on experience in MLOps, AIOps, or operating ML systems in production
Solid understanding of LLMOps and AgentOps concepts (RAG, agents, evaluation, monitoring)
Experience working with AWS and/or Azure in production environments
Practical knowledge of containers and Kubernetes (Docker, basic Helm usage, etc.)
Experience with CI/CD pipelines (GitHub Actions, GitLab CI, Azure DevOps, Jenkins, or similar)
Familiarity with observability and monitoring concepts (CloudWatch, OpenTelemetry, Prometheus, etc.)
Experience managing infrastructure as code (Terraform, Bicep, CDK, or similar)
Experience with Python and familiarity with the ML ecosystem (e.g. scikit‑learn, PyTorch), even if you are not a Data Scientist yourself
Good understanding of the ML / LLM lifecycle, from development to production and monitoring
Fluent English to work in an international environment
Nice to Have (Not Required, but Valuable)
Experience with ML/AI platforms such as SageMaker, Azure ML, MLflow, or Kubeflow
Exposure to Speech Analytics technologies (ASR, diarization, conversational NLP)
Experience with cloud cost optimization / FinOps, especially for AI workloads
Experience building or operating AI agents, copilots, or conversational systems
Familiarity with LLM frameworks (LangChain, LlamaIndex, Semantic Kernel, etc.)
Experience with workflow and orchestration tools (Airflow, Argo, Step Functions, Durable Functions)
Professional Skills & Mindset
Strong focus on reliability, automation, and scalability
Ability to collaborate effectively in multidisciplinary teams
Clear communication and a documentation‑oriented mindset
Platform mindset: building reusable, maintainable, and robust solutions
Proactive, analytical, and driven by continuous improvement
Strong sense of ownership and end‑to‑end responsibility
Motivation to learn and grow across the AI operations stack
Technology Environment
Cloud: AWS, Azure
Orchestration & Containers: Kubernetes, Docker
CI/CD: GitHub Actions, GitLab CI, Azure DevOps
Observability: Prometheus, Grafana, ELK/EFK, OpenTelemetry
Infrastructure as Code: Terraform, Bicep, CloudFormation
AI / ML Tools: MLflow, Azure ML, SageMaker, LangChain, LlamaIndex, Semantic Kernel
Primary Language: Python