Site Reliability Engineer

Lending & Credit Infra

Description:

NATURE OF WORK

Implement structured engineering and operations processes to ensure system reliability, scalability, and performance through industry best practices and automation
Drive automation-first environments by eliminating manual interventions, implementing Infrastructure-as-Code (IaC), and enhancing system observability
Develop and maintain reusable infrastructure templates to simplify and standardize resource deployments, ensuring scalability, repeatability, and efficiency
Manage and optimize allocated budgets (OPEX and CAPEX), balancing cost efficiency with system performance and reliability goals
Ensure compliance and security by adhering to industry standards and frameworks such as Center for Internet Security (CIS) standards, PCI-DSS certification, and BSP compliance, and by integrating security best practices into operational workflows
Enhance delivery velocity and operational efficiency by streamlining processes, automating deployments, and driving a culture of continuous improvement
Collaborate across teams to align reliability objectives with business goals, fostering strong partnerships between development, operations, and security teams
Proactively monitor and optimize system performance, implementing observability solutions to detect and resolve issues before they impact customers
Lead incident response and root cause analysis, ensuring production stability while continuously refining processes to minimize downtime and enhance resilience

REQUIRED QUALIFICATIONS

Experience in Kubernetes administration and orchestration: deploying, scaling, and managing containerized applications in production environments
Proficiency in CI/CD pipeline management, with hands-on experience in tools such as GitLab, ArgoCD, or similar
Extensive experience with Infrastructure-as-Code (IaC) using Terraform to provision, manage, and scale cloud infrastructure efficiently
Familiarity with GitOps practices, ensuring declarative infrastructure and continuous deployment using tools like ArgoCD
Experience with deployment strategies such as Blue/Green, Canary, Rolling, and Feature Toggles to manage risk and ensure smooth production rollouts
Solid understanding of release management processes across multiple environments
Hands-on experience with AWS cloud services, including but not limited to EC2, S3, RDS, Lambda, VPC, and IAM, as well as AWS cost optimization
Solid understanding of networking concepts, security best practices, and compliance requirements within cloud environments
Knowledge of incident management and SLO/SLI definitions, with a focus on maintaining high availability and reliability of production systems
Expertise in monitoring and observability, including the implementation of monitoring solutions and distributed tracing
Experience with log management and analysis, leveraging tools such as Splunk, CloudWatch, or Loki for troubleshooting and insights
Strong problem-solving skills and the ability to diagnose complex system issues under pressure
Excellent collaboration and communication skills to work effectively with cross-functional teams, including developers, operations, and security
Experience with Agile and DevOps methodologies, ensuring continuous improvement and delivery of reliable systems