Operations Engineer
The Technical Operations Engineer acts as a subject matter expert, owning root cause analysis, automation design, and resiliency improvements across core services. Core competencies include Advanced Troubleshooting (performing deep diagnostics across infrastructure, applications, and integrations), Root Cause & Postmortem Ownership (leading RCAs and implementing permanent fixes), Automation & Scripting Proficiency (building workflows or tools that eliminate recurring manual effort), Observability Architecture (designing meaningful dashboards, alerting strategies, and health checks), and Continuous Optimization (proactively identifying performance bottlenecks and resiliency gaps). Success is measured by a visible reduction in incident recurrence, automation coverage of 20–30% or more of repetitive tasks, improved service uptime, and documented best practices that elevate lower support tiers.
NATURE OF WORK
- Work on a shifting schedule (Morning and Mid shifts, 12x4) to ensure 24/7 coverage.
- Act as the escalation point for high-severity and complex technical incidents.
- Perform deep diagnostics across infrastructure, databases, applications, APIs, and integrations.
- Design and develop automation scripts, workflows, or tools to eliminate repetitive manual tasks.
- Integrate automation into operational processes, monitoring, and remediation workflows.
- Design dashboards, alerting strategies, and health checks that provide actionable insights.
- Reduce alert noise and improve the signal-to-noise ratio in monitoring and alerting systems.
- Work with engineering teams to strengthen system design, redundancy, and self-healing mechanisms.
- Document best practices, troubleshooting guides, runbooks, and technical standards.
REQUIRED QUALIFICATIONS
- Bachelor’s degree in Computer Science or a related field.
- 4+ years of experience in Technical Operations, Site Reliability Engineering, DevOps, or similar roles.
- Proficiency in scripting and automation (Python, Bash, JavaScript, or similar).
- Solid understanding of monitoring platforms, log aggregation, traces, and metrics.
- Experience with cloud platforms (AWS, GCP, or Azure).
- Familiarity with automation frameworks, CI/CD pipelines, or configuration management tools.
- Exposure to observability solutions (Datadog, Splunk, Dynatrace, Prometheus, etc.).
- Experience tuning performance for high-availability or distributed systems.