Principal MLOps Engineer

Abu Dhabi, United Arab Emirates

Job Description

About KATIM
KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Space & Cyber Technologies cluster at EDGE, one of the world's leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat, and fulfils the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications. Our talented team of cross-functional experts continually takes on new challenges. We work with the energy of a start-up yet the discipline of a large business to make solutions and products work for our customers at scale.
Job Purpose (specific to this role)
The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of KATIM's AI infrastructure powering mission-critical, secure communications products. This role drives end-to-end MLOps strategy, from model governance and deployment automation to compliance enforcement, ensuring every AI capability adheres to zero-trust and sovereign-data principles. It bridges applied machine learning, software engineering, and DevSecOps, ensuring that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.
You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted. Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients.
You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.
AI-Augmented Product Development Model (Context for the Role)
We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization 3-4x larger. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.
Core Principles

  • Security is integrated into every decision, from architecture to deployment.
  • Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
  • Quality is measurable, enforced, and automated at every stage.
  • All system behaviors, including AI-assisted outputs, must be traceable, reviewable, and explainable. We do not ship "black box" functionality.
  • Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.
Key Responsibilities
AI MLOps Architecture & Governance (30%)
  • Define the MLOps architecture and governance framework across products.
  • Design secure, scalable AI platform blueprints covering data, training, serving and monitoring layers.
  • Standardize model registries, artifact signing, and deployment processes for air-gapped and on-prem environments.
  • Lead architectural designs and reviews for AI pipelines.
  • Design and maintain LLM inference infrastructure.
  • Manage model registries and versioning (MLflow, Weights & Biases).
  • Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM).
  • Optimize model performance and cost (quantization, caching, batching).
  • Build and maintain vector databases (Pinecone, Weaviate, Chroma).
  • Maintain awareness of hardware and inference optimization techniques.
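As a loose illustration of the caching and batching concerns named above, the sketch below uses a hypothetical stand-in for a model call; none of these names refer to an actual KATIM system or serving runtime.

```python
from functools import lru_cache
from typing import List

def _run_model(batch: List[str]) -> List[int]:
    # Hypothetical stand-in for a model forward pass; a real
    # deployment would call a serving runtime (e.g. vLLM or
    # TorchServe) here instead.
    return [len(text) for text in batch]

@lru_cache(maxsize=1024)
def cached_infer(text: str) -> int:
    # Caching avoids re-running inference for repeated inputs.
    return _run_model([text])[0]

def batched_infer(texts: List[str], batch_size: int = 8) -> List[int]:
    # Batching amortizes per-request overhead across many inputs,
    # a common lever for both latency and cost.
    results: List[int] = []
    for i in range(0, len(texts), batch_size):
        results.extend(_run_model(texts[i:i + batch_size]))
    return results
```

In production the same two levers are usually applied inside the serving layer (continuous batching, KV-cache reuse) rather than in application code.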
Agent & Tool Development (25%)
  • Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, and anomaly detection).
  • Build AI-assisted DevSecOps utilities to automatically enforce compliance, logging, and audit policies.
  • Build tool integrations for LLM agents (function calling, APIs)
  • Implement retrieval-augmented generation (RAG) pipelines
  • Create prompt management and versioning systems
  • Monitor and optimize agent performance
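A retrieval-augmented generation pipeline, as referenced above, reduces to two steps: rank stored documents by embedding similarity, then assemble the retrieved context into a prompt. A minimal sketch with a toy in-memory vector store (all names hypothetical; a real pipeline would use a vector database such as Weaviate or Chroma and a real embedding model):

```python
import math
from typing import Dict, List

def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: List[float],
             store: Dict[str, List[float]],
             top_k: int = 2) -> List[str]:
    # Rank stored document IDs by similarity to the query embedding.
    ranked = sorted(store, key=lambda doc: cosine(query_vec, store[doc]),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, contexts: List[str]) -> str:
    # Ground the LLM's answer in the retrieved context.
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{joined}\n\nQuestion: {question}"
```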
CI/CT/CD Pipelines (20%)
  • Build continuous integration pipelines for models and code
  • Implement continuous training (CT) workflows
  • Automate model deployment with rollback capabilities
  • Create staging and production deployment strategies
  • Integrate AI-assisted code review into CI/CD
  • Build a continuous evaluation loop for deployed models
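The heart of a continuous evaluation loop is a promotion gate: a candidate model ships only if every tracked metric clears its threshold, otherwise the pipeline keeps or rolls back to the current production model. A minimal sketch, assuming "higher is better" metrics (the metric names are illustrative only):

```python
from typing import Dict

def promotion_gate(metrics: Dict[str, float],
                   thresholds: Dict[str, float]) -> bool:
    # Every thresholded metric must meet or exceed its floor;
    # a missing metric counts as a failure.
    return all(metrics.get(name, float("-inf")) >= floor
               for name, floor in thresholds.items())

def decide(metrics: Dict[str, float],
           thresholds: Dict[str, float]) -> str:
    # Map the gate result to a deployment action.
    return "promote" if promotion_gate(metrics, thresholds) else "rollback"
```

In practice the same gate would run automatically in CI after each retraining job, with "lower is better" metrics (e.g. latency) handled by inverting the comparison.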
Infrastructure & Automation (15%)
  • Manage cloud infrastructure (Kubernetes, serverless)
  • Implement Infrastructure as Code (Terraform, Pulumi)
  • Build monitoring and observability systems (Prometheus, Grafana, DataDog)
  • Automate operational tasks with AI agents
  • Ensure security and compliance (OWASP, SOC 2), including AI-specific security concerns
Developer Enablement (10%)
  • Provide tools and libraries for engineers to adopt AI-augmented workflows securely.
  • Document AI/ML best practices and patterns
  • Conduct training on MLOps tools and workflows
  • Support engineers with AI integration challenges
  • Maintain development environment parity
  • Champion AI privacy, governance, and compliance practices
Education and Minimum Qualification
  • BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; Master's preferred.
  • 8+ years in DevOps, SRE, or platform engineering
  • 5+ years hands-on experience with ML/AI systems in production
  • Deep understanding of LLMs and their operational requirements
  • Experience building and maintaining CI/CD pipelines
  • Strong Linux/Unix systems knowledge
  • Cloud platform expertise (AWS, GCP, or Azure)
  • Experience with container orchestration (Kubernetes)
Key Skills
MLOps & AI:
  • LLM Integration: OpenAI API, Anthropic API, HuggingFace, Azure OpenAI
  • Model Serving: TensorFlow Serving, TorchServe, vLLM, Ollama
  • Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
  • Model Registries: MLflow, Kubeflow, AWS SageMaker
  • Vector Databases: Pinecone, Weaviate, Chroma, Milvus
  • Agent Frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
  • Fine-tuning: LoRA, QLoRA, prompt tuning
Data Engineering:
  • Pipelines: Airflow, Prefect, Dagster
  • Processing: Spark, Dask, Ray
  • Streaming: Kafka, Pulsar, Kinesis
  • Data Quality: Great Expectations, dbt
  • Feature Stores: Feast, Tecton
DevOps & Infrastructure:
  • Containers: Docker, Kubernetes, Helm
  • Cloud Platforms: AWS (SageMaker, Lambda, ECS) OR GCP (Vertex AI, Cloud Run) OR Azure (ML Studio)
  • IaC: Terraform, Pulumi, CloudFormation
  • CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
  • Orchestration: Kubernetes operators, Kubeflow
Monitoring & Observability:
  • Metrics: Prometheus, Grafana, CloudWatch
  • Logging: ELK Stack, Loki, CloudWatch Logs
  • Tracing: Jaeger, Zipkin, OpenTelemetry
  • Alerting: PagerDuty, Opsgenie
  • Model Monitoring: Arize, Fiddler, Evidently
Programming:
  • Python: Primary language for ML/AI
  • Libraries: NumPy, Pandas, PyTorch/TensorFlow, scikit-learn
  • Serving Frameworks: FastAPI, Flask
  • Go: For high-performance services and tooling
  • Shell Scripting: Bash, Python for automation
  • SQL: Advanced queries, optimization
AI-Assisted Operations:
  • Autonomous agents for incident response
  • AI-powered log analysis and anomaly detection
  • Automated root cause analysis
  • Intelligent alerting and noise reduction
Other Highly Desirable Skills:
  • Experience with LLM fine-tuning and deployment at scale
  • Background in data engineering or ML engineering
  • Startup or high-growth environment experience
  • Security certifications (CISSP, AWS Security)
  • Contributions to open source MLOps projects
  • Experience with multi-cloud or hybrid cloud
  • Prior software engineering experience
Success Metrics
  • Uptime: 99.9%+ availability for AI services
  • Deployment Frequency: Daily or on-demand deployments
  • Model Performance: Latency (p95 < 500ms), accuracy tracking
  • Cost Efficiency: Cost per inference, infrastructure utilization
  • Developer Velocity: Time to deploy new models, AI feature adoption rate
  • Incident Response: MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve)
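The p95 latency target above is the value below which 95% of request latencies fall. As a quick illustration of how such an SLO check might be computed (a sketch using the nearest-rank percentile method; real systems typically derive this from histogram metrics in Prometheus or similar):

```python
import math
from typing import List

def p95_latency_ms(samples_ms: List[float]) -> float:
    # Nearest-rank percentile: sort the samples and take the value
    # at the 95th-percentile rank (1-based).
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def meets_slo(samples_ms: List[float], slo_ms: float = 500.0) -> bool:
    # True when 95% of requests complete faster than the SLO target.
    return p95_latency_ms(samples_ms) < slo_ms
```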
#KATIM



Job Detail

  • Job Id
    JD2165752
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Abu Dhabi, United Arab Emirates
  • Education
    Not mentioned