AWS Infrastructure Application Support
Location: Remote
About the job
As a Level 3 AWS Infrastructure Support Engineer, you will own overnight monitoring and response for Electronikmedia’s clients' AWS-based production environment. You will:
- Monitor system health using Datadog and AWS-native tools
- Investigate alerts and anomalies using established runbooks
- Resolve production incidents when possible
- Escalate complex issues quickly and accurately
- Maintain clean, auditable incident documentation
This role is ideal for someone who thrives in high-trust, high-impact operational environments.
Key Responsibilities
On-Call & Incident Response
- Provide initial response within 15 minutes for all high-priority production alerts
- Investigate, mitigate, and resolve production outages when feasible
- Escalate unresolved or complex issues using the defined escalation matrix
- Act as the owner of production system stability
Monitoring, Alerting & Observability
- Analyze and respond to Datadog monitor alerts across infrastructure and application layers
- Identify abnormal patterns, trend-line deviations, and early indicators of systemic risk
- Proactively notify stakeholders of significant performance or stability concerns
- Contribute insights for preventive and corrective actions
Root Cause & Trend Analysis
- Track recurring alerts and incidents
- Provide analysis and recommendations to reduce alert noise and improve system resilience
- Participate in weekly validation of Datadog alert configurations and thresholds
Communication & Documentation
- Maintain clear, concise, and timely communication during incidents
- Document all incidents, alarms, and observations in Jira during each shift
- Ensure handoff notes are complete and actionable for daytime engineering teams
Technical Environment
Core AWS Services
- ECS (Fargate)
- RDS
- ElastiCache
- EC2
- Lambda
- API Gateway
- S3
Tooling
- Datadog (monitoring, alerts, dashboards)
- Jira (incident tracking and documentation)
Qualifications & Experience
- 5+ years of hands-on AWS infrastructure administration and support
- Proven experience supporting production-grade, high-availability systems
- Strong background in incident response within enterprise or scale-up environments
Skills
- Deep operational knowledge of AWS services and distributed systems
- Strong troubleshooting and root-cause analysis skills under tight SLAs
- Ability to follow runbooks while also knowing when to think beyond them
- Calm, structured decision-making during production incidents
Certifications (Preferred)
- AWS Certified Solutions Architect – Associate or Professional
- AWS Certified DevOps Engineer – Professional (Nice to Have)
Service Level Expectations
- Alert Escalation SLA: ≤ 15 minutes for high-priority alarms
- Availability: Consistent overnight coverage (IST day shift)
- Reliability: Zero missed critical alerts during assigned coverage windows
Deliverables
- Monthly Service Performance Report, including:
  - Alerts monitored
  - Incidents resolved
  - Escalations
  - SLA adherence metrics
- Weekly Datadog Validation, ensuring alert accuracy and functionality