Application, Microservices, and Infrastructure Observability Engineer
Overall Objectives
Ensure comprehensive, end-to-end visibility into the health, performance, and reliability of applications, microservices, and infrastructure across on-premises and cloud environments.
Implement and manage modern observability tools to support real-time insights, distributed tracing, and predictive analytics for early issue detection and resolution.
Drive incident prevention, reduce Mean Time to Resolution (MTTR), and enhance system resilience through data-driven monitoring, automated alerts, and root cause analysis.
Collaborate with DevOps, Development, and Infrastructure teams to foster a performance-centric culture in high-transaction environments.
Role-Specific Responsibilities
Design, implement, and maintain observability solutions across applications, microservices, and infrastructure using tools such as Prometheus, Grafana, Dynatrace, and OpenTelemetry.
Leverage telemetry data (logs, metrics, traces) to identify and troubleshoot issues across compute, network, storage, and application layers.
Enable distributed tracing and service mapping to diagnose performance bottlenecks and inter-service dependencies in microservices architectures.
Support performance engineering by optimizing code-level performance, transaction processing, and infrastructure scalability during peak loads or major releases.
Define and implement automated remediation triggers and escalation paths to minimize manual intervention and improve incident response times.
General Functional Responsibilities
Ensure compliance with enterprise standards and regulatory frameworks (e.g., GDPR, PSD2) for monitoring and data collection.
Collaborate with infrastructure, application, and security teams to enhance data ingestion, correlation, and observability maturity (progressing from reactive to predictive monitoring).
Participate in post-incident reviews and performance retrospectives to identify trends, reduce MTTR, and improve overall reliability.
Provide out-of-hours support (L1/L2) for critical incidents as part of a rotating on-call schedule.
Required Skills & Qualifications
Strong expertise in observability platforms: Prometheus, Grafana, Dynatrace, OpenTelemetry, ELK/EFK Stack.
Proficiency in cloud platforms: AWS, Azure, or GCP, including cloud-native monitoring services.
Hands-on experience with Kubernetes, Docker, and containerized microservices environments.
Solid understanding of CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions, Azure DevOps).
Strong knowledge of infrastructure monitoring (compute, storage, network) and application performance monitoring (APM).
Familiarity with scripting and automation: Python, Bash, PowerShell, or Go.
Experience with incident management tools (PagerDuty, Opsgenie, ServiceNow) and alerting frameworks.
Good understanding of ITIL processes, incident response, and root cause analysis.
Strong communication and collaboration skills to work effectively with cross-functional teams.
Bachelor's degree in Computer Science, Engineering, or related field (or equivalent practical experience).