Site Reliability Engineer

Union Bank of the Philippines, Pasig

An SRE is crucial for maintaining and improving software system efficiency through deployment automation and system optimization, ensuring consistent performance and reliability.

Ideal Candidate: Strong problem-solver eager to implement scalable, sustainable tech solutions.

Key Projects:

Infrastructure Scalability: Design and implement scalable, highly available systems for increasing loads.

CI/CD Pipelines: Create and optimize pipelines to automate testing and deployment.

Disaster Recovery: Develop and test plans for data integrity and system resilience.

Objectives:

Run production systems, monitoring availability and system health.

Build software and systems for platform infrastructure and applications.

Improve software reliability, quality, and time-to-market.

Measure and optimize system performance for future needs and innovation.

Provide primary operational support for large-scale distributed applications.

Responsibilities:

Gather and analyze system/application metrics for performance tuning and troubleshooting.

Partner with development teams to improve services through testing and release procedures.

Participate in system design, platform management, and capacity planning.

Create sustainable systems through automation.

Use SLOs to balance feature velocity against reliability.

Monitor system performance and optimize pipelines.

Implement service metrics for reliability, performance, and efficiency.

Develop and maintain CI/CD pipelines.

Automate tasks and create tools for team efficiency.

Collaborate with development to integrate operational considerations.

Conduct post-incident reviews.

Contribute to disaster recovery plans.

Develop and support full-stack applications.

Collaborate on system infrastructure.

Increase system resilience and the capacity to handle larger volumes.

Improve automation and self-healing.

Collect OS data and report performance metrics.

Manage cloud and database maintenance, and debug production issues.

Design and implement highly available and scalable systems.

Collaborate on SLOs and SLAs.

Proactively monitor and resolve performance/availability issues.

Develop and maintain monitoring tools and dashboards.

Conduct post-incident analyses and implement preventive measures.

Create and maintain system documentation.

Perform capacity planning.

Collaborate on implementing new features with reliability in mind.

Stay updated on SRE best practices.

Required Skills & Qualifications:

Bachelor’s degree in Computer Science or a related field.

Programming skills (Python, Java, C/C++, Ruby, JavaScript).

Experience with distributed storage (NFS, HDFS, Ceph, S3) and resource management (Mesos, Kubernetes, YARN).

Familiarity with DevOps and CI/CD toolchains.

Cloud, networking, or systems administration certifications.

Proactive problem identification and improvement skills.

Soft Skills:

Communication (technical and non-technical).

Problem-solving (effective, long-term solutions under pressure).

Adaptability (evolving technologies and needs).

Hard Skills:

Systems architecture (scalable and reliable infrastructure).

Networking and security (protocols, best practices, secure solutions).

Cloud platforms (AWS, GCP, Azure).

Technical Skills:

Scripting (Python, Bash) and coding (Go, Java).

Containerization (Docker, Kubernetes).

Networking fundamentals (TCP/IP, HTTP, DNS, load balancing, firewalls).

Strong Linux/Unix and command-line skills.

Experience with configuration management (Ansible, Puppet, Chef).

Knowledge of monitoring/logging tools (Prometheus, Grafana, ELK, Splunk).

Strong problem-solving and troubleshooting.

Excellent communication and collaboration.

Attention to detail and ability to work in a fast-paced environment.
