Incident Manager

  • BC/DR

  • Remote

  • Contract

About the job:
Title: Incident Manager
Start Date: Immediate
Position Type: Contract
Location: Remote across North America
  
Description
In this incident management function, you will manage incidents to resolution in a 24/7/365 environment using the client's incident management processes. You will guide incident and triage calls from a technical perspective, share technical details obtained from monitoring tools and dashboards to aid troubleshooting, document resolution activities, and provide timely status updates to stakeholders. You will also recommend and implement improved processes, assist with postmortem-related activities, and support operational improvement efforts. The role includes managing efforts to maintain applications in production, including troubleshooting stoppages, repairing bugs, documenting application performance, and coordinating with technology infrastructure management.
  
KEY JOB FUNCTIONS

  • Manage IT production incidents to resolution in a 24/7/365 environment using the client's incident management processes, and communicate incident status, impact and resolution actions to management.
  • Hands-on experience managing and monitoring applications deployed on Amazon Web Services (AWS).
  • Troubleshooting and resolving incidents on the AWS cloud infrastructure.
  • Experience building tools for monitoring and troubleshooting system resources in an AWS environment. Ability to triage AWS-related incidents using monitoring tools on AWS Cloud.
  • Experience with performance engineering of AWS Cloud applications.
  • Hands-on experience working with AWS services such as EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route 53, ECS, Lambda, S3, Batch, CloudWatch, CloudTrail, WAF, etc.
  • Hands-on experience with transaction-level monitoring using Dynatrace and Splunk.
  • Ability to perform transaction-level monitoring and troubleshooting on the AWS cloud platform.
  • Eyes-on-glass monitoring of the health of applications as well as the underlying infrastructure.
  • Ability to analyze dashboards and reporting/monitoring tools to look at trends and patterns in application health and performance.
  • Proactively look for hardware, software, and environmental alerts or malfunctions.
  • Effectively lead and guide incident triage calls from a technical perspective, analyzing different components of the infrastructure and application environment using a variety of monitoring tools and processes.
  • Troubleshoot incidents and quickly identify root causes using operations, wire data analytics, application performance management and event correlation monitoring tools.
  • Perform analysis of data, evaluating multiple application protocols including web, database, storage, and supporting infrastructure such as AWS, UNIX, DNS, LDAP, SSL, SMTP, and FTP.
  • Collaborate with technical teams and articulate troubleshooting steps effectively.
  • Participate in technical follow-up calls for critical incidents.
  • Assist with documentation of Root Cause Analysis (RCA) or Correction of Errors (COE) and data quality for all ECC-communicated incidents.
  • Ensure appropriate functional and management escalation takes place as per the standards and procedures.
  • Follow up on items that could negatively impact production operations, assist with postmortem-related activities, and support various efforts related to operational improvements.
  • Based on recommendations from management, implement new and improved processes, change processes, perform new tasks, create reports and address ad-hoc requests.
  • Participate in the on-call rotation. Ability to work any shift as needed, including weekends and nights.
  • Ability to report incident details and metrics to senior leadership.
  
EDUCATION
Bachelor's Degree or equivalent required.
  
MINIMUM EXPERIENCE
3+ years of related experience
