Site Reliability Engineer

Company: Converge ICT Solutions
Location: Pasig
Employment type: Temporary

Key Responsibilities

Production Operations & System Reliability
  • Maintain the reliability, availability, and performance of cloud-native, distributed production environments.
  • Write software and scripts to automate routine operational tasks, eliminating repetitive manual work ("toil").
  • Participate in system architectural reviews to ensure new features and services conform to scalability and reliability standards.
Observability & Telemetry Engineering
  • Build, maintain, and optimize the enterprise observability stack (metrics, logs, traces) to ensure comprehensive visibility into system health.
  • Create and refine real-time dashboards and proactive alerting rules to catch anomalies before they impact end-users.
  • Help define, track, and report on critical reliability metrics, including Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Incident Response & Blameless Post-Mortems
  • Participate in the team's on-call rotation, acting as a first responder to mitigate active production degradation and outages.
  • Lead and support technical troubleshooting efforts across application, network, and database layers during incidents.
  • Conduct thoroughly documented, blameless post-mortems to identify root causes and track long-term engineering fixes to prevent recurrence.
Infrastructure as Code & Deployment Automation
  • Provision, configure, and maintain multi-region cloud infrastructure exclusively using Infrastructure as Code (IaC).
  • Collaborate with DevOps and Platform teams to build and support safe, automated deployment pipelines (CI/CD) featuring canary or blue/green release patterns.
  • Ensure data safety by maintaining and regularly testing disaster recovery (DR), backup, and automated failover systems.
Capacity Planning & Chaos Engineering
  • Analyze system utilization trends and perform capacity planning to handle traffic surges efficiently and cost-effectively.
  • Introduce chaos engineering principles by intentionally injecting faults into controlled environments to uncover hidden system weaknesses.
  • Conduct load, stress, and performance testing on core infrastructure components to identify bottlenecks.
Collaboration & Knowledge Sharing
  • Work hand-in-hand with software engineering teams to help them architect their applications for high availability and fault tolerance.
  • Create and update high-quality documentation, operational runbooks, and architectural diagrams.
  • Champion an internal culture of engineering-driven operations and reliability best practices.

Job Requirements

Qualifications and Experience
  • Education: Bachelor’s degree in Computer Science, Computer Engineering, Information Technology, or a related field (or equivalent practical experience).
  • Experience: 3–5 years of experience in an SRE, DevOps, or Software Engineering role supporting production environments.
  • Systems & Networking: Deep understanding of Linux/Unix systems internals, networking protocols, and performance tuning.
  • Software Development: Solid background in software development with the ability to write clean, maintainable code for automation and internal tooling.
  • Work Setup: Must be willing to work strictly on-site.
Skills and Competencies
Technical Competencies
  • Programming/Scripting: High proficiency in languages such as Python, Go, Bash, or Java.
  • Cloud Architecture: Hands-on experience with major cloud platforms (AWS, GCP, or Azure), specifically around networking, IAM, and managed services.
  • Containers & Orchestration: Strong operational knowledge of Docker and production Kubernetes (EKS, GKE, AKS, or self-managed).
  • Infrastructure as Code: Proficiency with Terraform, OpenTofu, or Pulumi.
  • Observability Tools: Experience configuring and querying tools like Prometheus, Grafana, OpenTelemetry, ELK Stack, Datadog, or New Relic.
  • Core Networking: Solid understanding of TCP/IP, HTTP/S, DNS, load balancers, and CDN caching layers.
Soft Skills
  • Analytical Mindset: Exceptional troubleshooting skills; able to methodically isolate failures in complex, distributed architectures under high-pressure scenarios.
  • Collaboration: Excellent communication skills with a proven track record of bridging the gap between product developers and infrastructure engineers.
  • Continuous Learning: Curiosity and eagerness to keep up with the evolving cloud-native and CNCF ecosystem.
Preferred Certifications
  • Kubernetes: Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD).
  • Cloud Platforms: AWS Certified Solutions Architect (Associate or Professional), AWS Certified DevOps Engineer, or Google Cloud Professional Cloud DevOps Engineer.
  • HashiCorp: Terraform Associate.