Principal MLOps Engineer

Abu Dhabi, United Arab Emirates

Job Description

About KATIM
KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Space & Cyber Technologies cluster at EDGE, one of the world's leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat, and fulfils the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications. Our talented team of cross-functional experts continually takes on new challenges. We work with the energy of a start-up yet the discipline of a large business to make solutions and products work for our customers at scale.
Job Purpose (specific to this role)
The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of KATIM's AI infrastructure powering mission-critical, secure communications products. This role drives end-to-end MLOps strategy, from model governance and deployment automation to compliance enforcement, ensuring every AI capability adheres to zero-trust and sovereign-data principles. It bridges applied machine learning, software engineering, and DevSecOps, ensuring that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.
You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted. Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients.
You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.
AI-Augmented Product Development Model (Context for the Role)
We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization 3-4x larger. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.
Core Principles

  • Security is integrated into every decision, from architecture to deployment.
  • Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
  • Quality is measurable, enforced, and automated at every stage.
  • All system behaviors, including AI-assisted outputs, must be traceable, reviewable, and explainable. We do not ship "black box" functionality.
  • Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.
Key Responsibilities
AI MLOps Architecture & Governance (30%)
  • Define the MLOps architecture and governance framework across products.
  • Design secure, scalable AI platform blueprints covering data, training, serving and monitoring layers.
  • Standardize model registries, artifact signing, and deployment processes for air-gapped and on-prem environments.
  • Lead architectural designs and reviews for AI pipelines.
  • Design and maintain LLM inference infrastructure.
  • Manage model registries and versioning (MLflow, Weights & Biases).
  • Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM).
  • Optimize model performance and cost (quantization, caching, batching).
  • Build and maintain vector databases (Pinecone, Weaviate, Chroma).
  • Maintain awareness of hardware and inference optimization techniques.
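As a loose illustration of the caching and batching concerns named above, the sketch below uses a hypothetical stand-in for a model call; none of these names refer to an actual KATIM system or serving runtime.

```python
from functools import lru_cache
from typing import List

def _run_model(batch: List[str]) -> List[int]:
    # Hypothetical stand-in for a model forward pass; a real
    # deployment would call a serving runtime (e.g. vLLM or
    # TorchServe) here instead.
    return [len(text) for text in batch]

@lru_cache(maxsize=1024)
def cached_infer(text: str) -> int:
    # Caching avoids re-running inference for repeated inputs.
    return _run_model([text])[0]

def batched_infer(texts: List[str], batch_size: int = 8) -> List[int]:
    # Batching amortizes per-request overhead across many inputs,
    # a common lever for both latency and cost.
    results: List[int] = []
    for i in range(0, len(texts), batch_size):
        results.extend(_run_model(texts[i:i + batch_size]))
    return results
```

In production the same two levers are usually applied inside the serving layer (continuous batching, KV-cache reuse) rather than in application code.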
Agent & Tool Development (25%)
  • Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, and anomaly detection).
  • Build AI-assisted DevSecOps utilities to automatically enforce compliance, logging, and audit policies.
  • Build tool integrations for LLM agents (function calling, APIs)
  • Implement retrieval-augmented generation (RAG) pipelines
  • Create prompt management and versioning systems
  • Monitor and optimize agent performance
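A retrieval-augmented generation pipeline, as referenced above, reduces to two steps: rank stored documents by embedding similarity, then assemble the retrieved context into a prompt. A minimal sketch with a toy in-memory vector store (all names hypothetical; a real pipeline would use a vector database such as Weaviate or Chroma and a real embedding model):

```python
import math
from typing import Dict, List

def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: List[float],
             store: Dict[str, List[float]],
             top_k: int = 2) -> List[str]:
    # Rank stored document IDs by similarity to the query embedding.
    ranked = sorted(store, key=lambda doc: cosine(query_vec, store[doc]),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, contexts: List[str]) -> str:
    # Ground the LLM's answer in the retrieved context.
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{joined}\n\nQuestion: {question}"
```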
CI/CT/CD Pipelines (20%)
  • Build continuous integration pipelines for models and code
  • Implement continuous training (CT) workflows
  • Automate model deployment with rollback capabilities
  • Create staging and production deployment strategies
  • Integrate AI-assisted code review into CI/CD
  • Build a continuous evaluation loop for deployed models
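The heart of a continuous evaluation loop is a promotion gate: a candidate model ships only if every tracked metric clears its threshold, otherwise the pipeline keeps or rolls back to the current production model. A minimal sketch, assuming "higher is better" metrics (the metric names are illustrative only):

```python
from typing import Dict

def promotion_gate(metrics: Dict[str, float],
                   thresholds: Dict[str, float]) -> bool:
    # Every thresholded metric must meet or exceed its floor;
    # a missing metric counts as a failure.
    return all(metrics.get(name, float("-inf")) >= floor
               for name, floor in thresholds.items())

def decide(metrics: Dict[str, float],
           thresholds: Dict[str, float]) -> str:
    # Map the gate result to a deployment action.
    return "promote" if promotion_gate(metrics, thresholds) else "rollback"
```

In practice the same gate would run automatically in CI after each retraining job, with "lower is better" metrics (e.g. latency) handled by inverting the comparison.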
Infrastructure & Automation (15%)
  • Manage cloud infrastructure (Kubernetes, serverless)
  • Implement Infrastructure as Code (Terraform, Pulumi)
  • Build monitoring and observability systems (Prometheus, Grafana, DataDog)
  • Automate operational tasks with AI agents
  • Ensure security and compliance (OWASP, SOC 2), including AI-specific security concerns
Developer Enablement (10%)
  • Provide tools and libraries for engineers to adopt AI-augmented workflows securely.
  • Document AI/ML best practices and patterns
  • Conduct training on MLOps tools and workflows
  • Support engineers with AI integration challenges
  • Maintain development environment parity
  • Champion AI privacy, governance, and compliance practices
Education and Minimum Qualification
  • BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; Master's preferred.
  • 8+ years in DevOps, SRE, or platform engineering
  • 5+ years hands-on experience with ML/AI systems in production
  • Deep understanding of LLMs and their operational requirements
  • Experience building and maintaining CI/CD pipelines
  • Strong Linux/Unix systems knowledge
  • Cloud platform expertise (AWS, GCP, or Azure)
  • Experience with container orchestration (Kubernetes)
Key Skills
MLOps & AI:
  • LLM Integration: OpenAI API, Anthropic API, HuggingFace, Azure OpenAI
  • Model Serving: TensorFlow Serving, TorchServe, vLLM, Ollama
  • Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
  • Model Registries: MLflow, Kubeflow, AWS SageMaker
  • Vector Databases: Pinecone, Weaviate, Chroma, Milvus
  • Agent Frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
  • Fine-tuning: LoRA, QLoRA, prompt tuning
Data Engineering:
  • Pipelines: Airflow, Prefect, Dagster
  • Processing: Spark, Dask, Ray
  • Streaming: Kafka, Pulsar, Kinesis
  • Data Quality: Great Expectations, dbt
  • Feature Stores: Feast, Tecton
DevOps & Infrastructure:
  • Containers: Docker, Kubernetes, Helm
  • Cloud Platforms: AWS (SageMaker, Lambda, ECS) OR GCP (Vertex AI, Cloud Run) OR Azure (ML Studio)
  • IaC: Terraform, Pulumi, CloudFormation
  • CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
  • Orchestration: Kubernetes operators, Kubeflow
Monitoring & Observability:
  • Metrics: Prometheus, Grafana, CloudWatch
  • Logging: ELK Stack, Loki, CloudWatch Logs
  • Tracing: Jaeger, Zipkin, OpenTelemetry
  • Alerting: PagerDuty, Opsgenie
  • Model Monitoring: Arize, Fiddler, Evidently
Programming:
  • Python: Primary language for ML/AI
  • Libraries: NumPy, Pandas, PyTorch/TensorFlow, scikit-learn
  • Serving Frameworks: FastAPI, Flask
  • Go: For high-performance services and tooling
  • Shell Scripting: Bash, Python for automation
  • SQL: Advanced queries, optimization
AI-Assisted Operations:
  • Autonomous agents for incident response
  • AI-powered log analysis and anomaly detection
  • Automated root cause analysis
  • Intelligent alerting and noise reduction
Other Highly Desirable Skills:
  • Experience with LLM fine-tuning and deployment at scale
  • Background in data engineering or ML engineering
  • Startup or high-growth environment experience
  • Security certifications (CISSP, AWS Security)
  • Contributions to open source MLOps projects
  • Experience with multi-cloud or hybrid cloud
  • Prior software engineering experience
Success Metrics
  • Uptime: 99.9%+ availability for AI services
  • Deployment Frequency: Daily or on-demand deployments
  • Model Performance: Latency (p95 < 500ms), accuracy tracking
  • Cost Efficiency: Cost per inference, infrastructure utilization
  • Developer Velocity: Time to deploy new models, AI feature adoption rate
  • Incident Response: MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve)
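The p95 latency target above is the value below which 95% of request latencies fall. As a quick illustration of how such an SLO check might be computed (a sketch using the nearest-rank percentile method; real systems typically derive this from histogram metrics in Prometheus or similar):

```python
import math
from typing import List

def p95_latency_ms(samples_ms: List[float]) -> float:
    # Nearest-rank percentile: sort the samples and take the value
    # at the 95th-percentile rank (1-based).
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def meets_slo(samples_ms: List[float], slo_ms: float = 500.0) -> bool:
    # True when 95% of requests complete faster than the SLO target.
    return p95_latency_ms(samples_ms) < slo_ms
```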
#KATIM



Job Detail

  • Job Id
    JD2165752
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Abu Dhabi, United Arab Emirates
  • Education
    Not mentioned