Senior Site Reliability Engineer
SRE Center of Excellence (COE)
Description:
NATURE OF WORK
- Lead architectural design and implementation of fault-tolerant, self-healing infrastructure across cloud and hybrid environments
- Drive organization-wide automation initiatives, eliminating manual operations through advanced IaC and CI/CD frameworks
- Own technical program leadership for reliability initiatives spanning multiple teams and services
- Manage OPEX and CAPEX budgets strategically, with accountability for cost optimization
- Apply deep expertise in compliance frameworks (CIS, PCI-DSS, BSP) to architect compliant solutions
- Establish and enforce cloud governance policies, account structures, and organizational standards across AWS/Azure/GCP environments
REQUIRED QUALIFICATIONS
- Expert-level proficiency in Kubernetes (CRDs, Operators, multi-tenancy, advanced scheduling)
- Advanced Terraform expertise (custom providers, module design, automated testing)
- Deep service mesh knowledge (Istio traffic management, circuit breaking, rate limiting, mTLS)
- Proven experience building Internal Developer Platforms (IDPs) with self-service workflows
- Advanced GitLab CI/CD and GitOps implementation (ArgoCD/FluxCD, multi-project pipelines)
- Expert-level experience implementing WAF, API gateways (Kong, Apigee, AWS API Gateway), and network security
- Strong software development skills in Go, Python, or Java, with the ability to review code for reliability impact
- Experience leading technical programs and cross-functional reliability initiatives
- Deep understanding of observability platforms (Dynatrace, Prometheus, OpenTelemetry) with custom integration experience
- Proven track record of architecting microservices with high-availability and resiliency patterns
- Experience implementing AWS Organizations, Control Tower, Service Control Policies, and multi-account governance frameworks
- Proficiency in cloud policy-as-code tools (AWS Config, OPA, Sentinel) and compliance automation
- Knowledge of cloud security standards (CIS Benchmarks, AWS Well-Architected Framework, Azure/GCP best practices)
- Advanced expertise in Dynatrace, Datadog, or Grafana for building enterprise observability solutions
- Experience implementing SLO-based alerting, error budgets, and burn-rate monitoring using Prometheus, Grafana, or commercial APM tools
- Proficiency in distributed tracing (Jaeger, Zipkin, OpenTelemetry) and log aggregation (ELK, Loki)
- Ability to design custom metrics, synthetic monitoring, and real user monitoring (RUM) strategies