Site Reliability Engineer

Lending & Credit Infra
Description: 

NATURE OF WORK

  • Implement structured engineering and operations processes to ensure system reliability, scalability, and performance through industry best practices and automation
  • Drive automation-first environments by eliminating manual interventions, implementing Infrastructure-as-Code (IaC), and enhancing system observability
  • Develop and maintain reusable infrastructure templates to simplify and standardize resource deployments, ensuring scalability, repeatability, and efficiency
  • Manage and optimize allocated budgets (OPEX and CAPEX), balancing cost efficiency with system performance and reliability goals
  • Ensure compliance and security by adhering to industry standards and frameworks such as Center for Internet Security (CIS), PCI-DSS, and BSP requirements, and by integrating security best practices into operational workflows
  • Enhance delivery velocity and operational efficiency by streamlining processes, automating deployments, and driving a culture of continuous improvement
  • Collaborate across teams to align reliability objectives with business goals, fostering strong partnerships between development, operations, and security teams
  • Proactively monitor and optimize system performance, implementing observability solutions to detect and resolve issues before they impact customers
  • Lead incident response and root cause analysis, ensuring production stability while continuously refining processes to minimize downtime and enhance resilience

REQUIRED QUALIFICATIONS

  • Experience in Kubernetes administration and orchestration: deploying, scaling, and managing containerized applications in production environments
  • Proficiency in CI/CD pipeline management, with hands-on experience in tools such as GitLab, ArgoCD, or similar
  • Extensive experience with Infrastructure-as-Code (IaC) using Terraform to provision, manage and scale cloud infrastructure efficiently
  • Familiarity with GitOps practices, ensuring declarative infrastructure and continuous deployment using tools like ArgoCD
  • Experience with deployment strategies such as Blue/Green, Canary, Rolling, and Feature Toggles to manage risk and ensure smooth production rollouts
  • Solid understanding of release management processes across multiple environments
  • Hands-on experience with AWS cloud services, including but not limited to EC2, S3, RDS, Lambda, VPC, and IAM, as well as AWS cost optimization
  • Solid understanding of networking concepts, security best practices, and compliance requirements within cloud environments
  • Knowledge of incident management and SLO/SLI definitions, with a focus on maintaining high availability and reliability of production systems
  • Expertise in monitoring and observability, including implementing monitoring solutions and distributed tracing
  • Experience with log management and analysis, leveraging tools such as Splunk, CloudWatch or Loki for troubleshooting and insights
  • Strong problem-solving skills and the ability to diagnose complex system issues under pressure
  • Excellent collaboration and communication skills to work effectively with cross-functional teams, including developers, operations, and security
  • Experience with Agile and DevOps methodologies, ensuring continuous improvement and delivery of reliable systems
