AWS Infrastructure Application Support

Location: Remote  

About the job

  1. As a Level 3 AWS Infrastructure Support Engineer, you will own overnight monitoring and response for Electronikmedia’s clients’ AWS-based production environment. You will:
    • Monitor system health using Datadog and AWS-native tools
    • Investigate alerts and anomalies using established runbooks
    • Resolve production incidents when possible
    • Escalate complex issues quickly and accurately
    • Maintain clean, auditable incident documentation
    This role is ideal for someone who thrives in high-trust, high-impact operational environments.
  2. Key Responsibilities: On-Call & Incident Response
    • Provide initial response within 15 minutes for all high-priority production alerts
    • Investigate, mitigate, and resolve production outages when feasible
    • Escalate unresolved or complex issues using the defined escalation matrix
    • Act as the owner of production system stability
  3. Monitoring, Alerting & Observability
    • Analyze and respond to Datadog monitor alerts across infrastructure and application layers
    • Identify abnormal patterns, trend-line deviations, and early indicators of systemic risk
    • Proactively notify stakeholders of significant performance or stability concerns
    • Contribute insights for preventive and corrective actions
  4. Root Cause & Trend Analysis
    • Track recurring alerts and incidents
    • Provide analysis and recommendations to reduce alert noise and improve system resilience
    • Participate in weekly validation of Datadog alert configurations and thresholds
  5. Communication & Documentation
    • Maintain clear, concise, and timely communication during incidents
    • Document all incidents, alarms, and observations in Jira during each shift
    • Ensure handoff notes are complete and actionable for daytime engineering teams
  6. Technical Environment: Core AWS Services
    • ECS (Fargate)
    • RDS
    • ElastiCache
    • EC2
    • Lambda
    • API Gateway
    • S3
  7. Tooling
    • Datadog (monitoring, alerts, dashboards)
    • Jira (incident tracking and documentation)

Qualifications & Experience

  • 5+ years of hands-on AWS infrastructure administration and support
  • Proven experience supporting production-grade, high-availability systems
  • Strong background in incident response within enterprise or scale-up environments

Skills

  • Deep operational knowledge of AWS services and distributed systems
  • Strong troubleshooting and root-cause analysis skills under tight SLAs
  • Ability to follow runbooks while also knowing when to think beyond them
  • Calm, structured decision-making during production incidents

Certifications (Preferred)

  • AWS Certified Solutions Architect – Associate or Professional
  • AWS Certified DevOps Engineer – Professional (Nice to Have)

Service Level Expectations

  • Alert Escalation SLA: ≤ 15 minutes for high-priority alarms
  • Availability: Consistent overnight coverage (IST Day Shift)
  • Reliability: Zero missed critical alerts during assigned coverage windows

Deliverables

  • Monthly Service Performance Report, including:
    • Alerts monitored
    • Incidents resolved
    • Escalations
    • SLA adherence metrics
  • Weekly Datadog Validation, ensuring alert accuracy and functionality