to ensure the reliability, scalability, and performance of our production systems. The SRE will work closely with engineering, DevOps, and product teams to build highly available systems, automate operations, and improve system observability while maintaining service level objectives (SLOs).
Key Responsibilities
------------------------
Reliability & Operations
Ensure high availability, reliability, and performance of production systems.
Define, monitor, and manage SLIs, SLOs, and SLAs.
Lead incident response, root cause analysis (RCA), and post-incident reviews.
Implement proactive monitoring and alerting to prevent outages.
Automation & Engineering
Automate repetitive operational tasks using scripting and infrastructure-as-code.
Improve system reliability through engineering solutions rather than manual intervention.
Reduce toil by building tools, automation, and self-healing systems.
Cloud & Infrastructure
Design and manage scalable infrastructure on cloud platforms (AWS / Azure / GCP).
Manage containerized workloads using Docker and Kubernetes.
Implement and maintain CI/CD pipelines for safe and frequent deployments.
Monitoring & Observability
Build and maintain observability solutions using tools such as:
Prometheus, Grafana
ELK / OpenSearch
Datadog, New Relic
Track system performance and error budgets, and carry out capacity planning.
Security & Compliance
Ensure reliability practices align with security standards.
Participate in on-call rotations and ensure secure system operations.
Collaborate with security teams to implement secure infrastructure practices.
Required Skills & Qualifications
-------------------------------------
Bachelor's degree in Computer Science, Engineering, or a related field.
Strong experience in Linux/Unix system administration.
Proficiency in at least one scripting or programming language: Python, Go, Bash, or Java.
Experience with cloud platforms (AWS / Azure / GCP).
Hands-on experience with Kubernetes and container orchestration.
Knowledge of networking fundamentals (TCP/IP, DNS, load balancing).
Experience with monitoring, alerting, and incident management.
Preferred / Nice-to-Have Skills
-------------------------------------
Experience implementing SRE best practices based on Google's SRE principles.
Knowledge of Terraform, Ansible, or CloudFormation.
Experience with service mesh (Istio, Linkerd).
Understanding of chaos engineering tools (Gremlin, Chaos Mesh).
Experience in fintech, banking, or high-availability systems.