Site Reliability Engineer
DevOps/SecOps
India
Permanent / Full Time
JD: Site Reliability Engineer (SRE)
Intuitive is an "Engineering Company" delivering measurable value and key business outcomes. Intuitive is one of the fastest-growing companies in the Americas (recognized by CRN & Inc. 5000), focused on IT solutions and services supporting 130+ Enterprises globally. With a reputation as a Tiger Team and a Trusted Partner, Intuitive’s solution-centric SMEs, across its core Superpowers, help solve the most complex challenges and initiatives for Enterprises.
Intuitive is one of the top global professional services partners for AWS, GCP, Azure, PANW, Zscaler, Databricks, Snowflake, and its venture capital innovation portfolio.
Superpowers:
- Migration & Modernization
  - DC to Cloud Migration, Cloud Engineering, Cloud Native
  - App Modernization, DevSecOps/SRE
  - FinOps
- Data & AI/ML
  - Database Modernization
  - Data (Cloud Native + Databricks + Snowflake)
  - Machine Learning, AI/GenAI
- Cybersecurity
  - Application + Data + Infrastructure Security
  - GRC, MRA/IRM/Remediation
Job Description: We are seeking an experienced Site Reliability Engineer (SRE) to enhance operational efficiency, reliability, and observability across infrastructure and application landscapes. This role focuses on integrating advanced monitoring platforms, defining key performance metrics, and establishing comprehensive monitoring solutions to ensure system health and performance. The SRE will work closely with cross-functional teams to implement alerting mechanisms, improve scalability, and drive the adoption of best practices in observability and reliability engineering.
Roles and Responsibilities:
- Observability Platform Integration
  - Lead the transition to modern monitoring platforms, ensuring seamless integration with existing systems.
  - Define and implement observability strategies to enhance visibility into infrastructure and applications.
  - Collaborate with stakeholders to identify critical workloads and performance metrics.
- Monitoring and Alerting
  - Develop and implement monitoring solutions for applications, databases, and infrastructure, capturing metrics such as availability, performance, and resource utilization.
  - Establish alerting frameworks to detect anomalies, performance bottlenecks, and security incidents.
  - Integrate monitoring and alerting with ITSM tools for streamlined incident management.
- Performance Metrics and SLAs
  - Define and track Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) for key business systems.
  - Work with stakeholders to align metrics with business expectations and operational goals.
- Automation and Scalability
  - Leverage scripting and automation tools to streamline deployment of monitoring agents and configuration updates.
  - Optimize monitoring platforms for scalability and efficiency, ensuring they can accommodate evolving business needs.
- Dashboard Development
  - Design and maintain dashboards to provide real-time insights into system performance and health.
  - Ensure dashboards are intuitive and actionable, enabling teams to monitor critical metrics effectively.
- Cloud Infrastructure Performance
  - Maintain a deep understanding of cloud infrastructure and services.
  - Diagnose, troubleshoot, and optimize performance issues in cloud services, including compute, storage, and networking components.
  - Implement monitoring and tuning practices specific to cloud-native environments to ensure reliability and scalability.
- Documentation and Training
  - Develop comprehensive documentation for monitoring tools, configurations, and processes.
  - Conduct training sessions to ensure teams are proficient in utilizing observability platforms and interpreting metrics.
- Continuous Improvement
  - Continuously evaluate and enhance monitoring and observability solutions to meet changing organizational needs.
  - Incorporate feedback from stakeholders to refine alerting thresholds, dashboards, and metrics.
Mandatory Skills:
- Performance Monitoring:
  - Expertise with modern observability platforms (Sumo Logic).
  - Experience with Azure-native monitoring solutions and practices.
  - Deep understanding of Azure infrastructure and services, including diagnosing and tuning performance issues with such services.
  - Strong knowledge of monitoring methodologies for infrastructure, applications, and databases.
  - Experience monitoring Active Directory Domain Controllers, PeopleSoft applications, and Order Entry Systems (KPS), and integrating observability platforms with them.
  - Experience with log management, metric collection, and alerting configuration.
  - Ability to define and track SLAs, SLOs, and SLIs for business-critical systems.
  - Experience in monitoring network, application, and database performance metrics.
  - Strong understanding of network and security device monitoring, including SNMP, syslog, and NetFlow.
  - Hands-on experience in application performance monitoring for enterprise platforms like ERP or custom applications.
  - Familiarity with containerized environments and Kubernetes monitoring.
- Automation Skills:
  - Experience with scripting languages (e.g., Python, Bash, PowerShell) to automate monitoring setup and management.
  - Familiarity with infrastructure automation tools like Ansible and Terraform.
- Communication and Collaboration:
  - Strong collaboration skills to work with cross-functional teams and stakeholders.
  - Ability to communicate technical concepts to both technical and non-technical audiences.
- Incident Management:
  - Familiarity with ITSM tools (e.g., ServiceNow) for incident and problem management.
  - Proven experience in integrating alerting mechanisms with incident management workflows.