Site Reliability Engineer
Lending & Credit Infra
Description:
NATURE OF WORK
- Implement structured engineering and operations processes to ensure system reliability, scalability, and performance through industry best practices and automation
- Drive automation-first environments by eliminating manual interventions, implementing Infrastructure-as-Code (IaC), and enhancing system observability
- Develop and maintain reusable infrastructure templates to simplify and standardize resource deployments, ensuring scalability, repeatability, and efficiency
- Manage and optimize allocated budgets (OPEX and CAPEX), balancing cost efficiency with system performance and reliability goals
- Ensure compliance and security by adhering to industry standards and regulatory requirements such as Center for Internet Security (CIS) guidance, PCI-DSS, and BSP compliance requirements, and by integrating security best practices into operational workflows
- Enhance delivery velocity and operational efficiency by streamlining processes, automating deployments, and driving a culture of continuous improvement
- Collaborate across teams to align reliability objectives with business goals, fostering strong partnerships between development, operations, and security teams
- Proactively monitor and optimize system performance, implementing observability solutions to detect and resolve issues before they impact customers
- Lead incident response and root cause analysis, ensuring production stability while continuously refining processes to minimize downtime and enhance resilience
REQUIRED QUALIFICATIONS
- Experience in Kubernetes administration and orchestration: deploying, scaling, and managing containerized applications in production environments
- Proficiency in CI/CD pipeline management, with hands-on experience in tools such as GitLab, ArgoCD, or similar
- Extensive experience with Infrastructure-as-Code (IaC) using Terraform to provision, manage and scale cloud infrastructure efficiently
- Familiarity with GitOps practices, ensuring declarative infrastructure and continuous deployment using tools like ArgoCD
- Experience with deployment strategies such as Blue/Green, Canary, Rolling, and Feature Toggles to manage risk and ensure smooth production rollouts
- Solid understanding of release management processes across multiple environments
- Hands-on experience with AWS cloud services, including but not limited to EC2, S3, RDS, Lambda, VPC, and IAM, as well as AWS cost optimization
- Solid understanding of networking concepts, security best practices, and compliance requirements within cloud environments
- Knowledge of incident management and SLO/SLI definitions, with a focus on maintaining high availability and reliability of production systems
- Expertise in monitoring and observability, including implementing observability solutions and distributed tracing
- Experience with log management and analysis, leveraging tools such as Splunk, CloudWatch or Loki for troubleshooting and insights
- Strong problem-solving skills and the ability to diagnose complex system issues under pressure
- Excellent collaboration and communication skills to work effectively with cross-functional teams, including developers, operations, and security
- Experience with Agile and DevOps methodologies, ensuring continuous improvement and delivery of reliable systems