We are seeking an experienced and strategic Site Reliability Engineer (SRE) to drive the stability, reliability, and observability of our mission-critical systems. This role is crucial to ensuring high availability, performance, and operational excellence for our services. The SRE will be responsible for designing and implementing robust reliability frameworks, overseeing system monitoring and incident response, and leading key initiatives to improve system performance.
This role requires a strong leadership mindset, balancing proactive risk mitigation with rapid incident response. The ideal candidate will work closely with engineering, operations, and leadership teams to define and uphold service-level objectives (SLOs) and optimize system resilience.
Key Responsibilities & Objectives
- Define service-level indicators (SLIs) and enforce service-level objectives (SLOs) to measure and improve system health.
- Implement and manage comprehensive observability strategies, ensuring real-time visibility into system performance, availability, and health.
- Oversee incident management and response processes, ensuring quick mitigation of production issues and leading post-mortem investigations to drive systemic improvements.
- Optimize system reliability through failure analysis, capacity planning, and proactive risk assessment.
- Define and implement best practices for on-call management, reducing alert fatigue while ensuring critical issues are addressed efficiently.
- Contribute to root cause analyses (RCAs) by providing the technical details of each incident.
- Continuously refine operational runbooks, incident response plans, and system reliability guidelines to enhance organizational resilience.
- Analyze system performance trends, production issues, and historical outages to proactively address weaknesses before they impact customers.
- Drive cultural change within the organization, promoting a reliability-first mindset across all teams.
Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in a Site Reliability Engineering, Production Engineering, or Systems Engineering role.
- Proven expertise in managing high-availability, distributed systems in a production environment.
- Deep understanding of observability practices, including monitoring, logging, and tracing with tools such as Prometheus, Grafana, Datadog, New Relic, or OpenTelemetry.
- Extensive experience in incident response, RCAs, post-mortems, and continuous improvement processes.
- Strong background in capacity planning, load balancing, and performance tuning for large-scale applications.
- Experience with operational leadership, on-call management, and defining reliability strategies within complex environments.
- Familiarity with networking, security best practices, and risk management strategies for distributed architectures.
- Strong analytical and problem-solving skills to diagnose system failures and implement long-term solutions.
Preferred Skill Set
- Incident Management & Alerting: Experience with Jira Service Management, PagerDuty, Opsgenie, or equivalent tools.
- Cloud Infrastructure Management: Hands-on expertise with AWS, GCP, or Azure.
- Database Performance Optimization: Experience working with relational and NoSQL databases.
- Capacity Planning & Scalability Strategies: Ability to assess and predict infrastructure needs for growth.
- Technical Leadership & Communication: Proven ability to work cross-functionally and drive reliability initiatives at scale.