Operations Engineer
The technical operations engineer is responsible for ensuring reliable, resilient, and efficient IT operations by combining IT Service Management practices with advanced process automation, observability, and AI-driven monitoring. This role acts as the first line of defense for system reliability: detecting anomalies, resolving incidents within SLA, and preventing outages through proactive monitoring and automation.
The engineer uses modern ITSM frameworks, automation tools, and observability platforms to reduce manual overhead, improve service quality, and enable predictive operations. Success in this role is measured by improved system uptime, faster incident resolution (lower MTTR), fewer repetitive manual tasks, and the ability to collaborate effectively with business and technology teams.
Nature of Work:
- Work on a shifting schedule (Morning & Mid, 12x4) to ensure 24/7 coverage.
- Provide IT Service Management (incident, problem, change, request) and triage support.
- Drive process automation initiatives to reduce manual tasks and increase operational efficiency.
- Develop and maintain custom service checks to detect anomalies, performance degradation, and failures across systems and services.
- Drive observability practices by building dashboards, alerts, and KPIs for proactive monitoring.
- Apply AI-driven monitoring and analytics to predict and prevent incidents.
- Analyze system gaps, address recurring issues, and improve resiliency and reliability of IT services.
- Collaborate effectively with cross-functional business and technical teams.
Displayed Skill Mastery:
- IT Service Management frameworks and tools (ITIL, Jira Service Management, ServiceNow).
- Observability platforms (Splunk, Datadog, Dynatrace, Prometheus, Grafana).
- Security and compliance standards (PCI DSS, ISO 27001, IT Security fundamentals).
- Automation & scripting (Python, Ansible, Bash, JavaScript, PowerShell).
- Cloud infrastructure management (AWS, Azure, GCP) and hybrid environments.
- AI/ML-driven monitoring and automation tools (AIOps platforms).
- Strong analytical, problem-solving, and cross-team collaboration skills.
Qualifications:
- Bachelor’s degree in Computer Science or a related field.
- 2+ years of experience in IT Operations, Service Management, or Site Reliability Engineering.
- Hands-on experience with Splunk and other observability platforms is an advantage.
- At least 1 year of experience in cloud infrastructure management (AWS, Azure, or GCP).
- At least 1 year of experience in automation and scripting (Python, Ansible, JavaScript, or similar).
- Strong knowledge of Linux systems, IT security principles, and compliance standards (PCI DSS, ISO 27001).
- Exposure to AIOps / AI-based monitoring tools is an advantage.