Site Reliability Engineer | Remote (US Shift)

KMC Solutions | Manila | Full-time | 5/5/26

Site Reliability Engineer

Remote; Night Shift

Job Description:

The SRE will support product launches in production with a focus on AWS environments and leveraging our existing AWS Organizations. This role will concentrate on implementing and driving robust reliability practices for Infrastructure as Code (IaC), optimizing our cloud infrastructure, and overseeing the creation and configuration of new AWS accounts at an organizational level. The ideal candidate will shape cross-functional reliability strategies, ensure strict compliance and security, and promote high availability across multiple teams and services.

Key Responsibilities:

Reliability and Security Integration

Embed reliability and compliance best practices into all phases of the software development lifecycle, emphasizing shift-left security principles.
Advocate for early vulnerability detection, resilient design, thorough testing, and automated rollbacks, and drive organizational adoption of these practices.

Automation & Infrastructure Management

Develop, maintain, and enhance automation tools and scripts (e.g., Terraform, Ansible) for enterprise-level cloud environments and AWS Organizations.
Ensure IaC scripts incorporate best practices for reliability, availability, regulatory compliance, and security checks using tools like Prisma Cloud or Wiz.
Provide technical leadership and mentorship to SRE and DevOps teams to elevate overall automation and reliability standards.

AWS Organizations & Account Management

Create and provision new AWS accounts within AWS Organizations using IaC at scale.
Automate the configuration of services, networking, and security settings across these accounts to ensure consistent, secure, and compliant environments.
Lead collaboration efforts across different teams to ensure organizational standards are met in all new and existing AWS accounts.

Monitoring & Incident Response

Implement and manage sophisticated monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK) within our cloud contexts.
Respond promptly to incidents, conduct comprehensive root cause analyses, and drive post-mortem processes to prevent future outages, integrating reliability and security considerations.
Oversee and continuously improve on-call processes, ensuring efficient incident triaging and stakeholder communication.

Performance & Security Optimization

Analyze complex system metrics, perform deep performance tuning, and optimize resource utilization to enhance system reliability and efficiency.
Ensure performance and capacity measures align with business objectives and compliance mandates.
Collaborate with security teams to uphold rigorous security standards, recommending architectural and process improvements.

Disaster Recovery & DR Testing

Design, implement, and conduct thorough disaster recovery testing to minimize downtime and data loss.
Continuously refine strategies to meet or exceed compliance and security standards.

Collaboration

Work closely with cross-functional teams—including senior development, operations, security, and product leaders—to prioritize and address reliability, compliance, and security issues.
Provide technical mentorship, best practice guidelines, and thought leadership in reliability engineering across the organization.

Continuous Improvement

Stay current with industry trends, tools, methodologies, and security practices, with a particular focus on AWS services, AWS Organizations, and advanced shift-left strategies.
Champion initiatives that drive organizational change, fostering a culture of innovation and continuous improvement in reliability practices.

Qualifications:

At least 5-7 years of experience in a similar role.
Proven track record of leading complex reliability initiatives across multiple teams or business units.
Experience with regulated or high-security environments is a plus.
Proficiency in one or more programming/scripting languages (e.g., Python, Go, Ruby, Bash).
Strong familiarity with CI/CD pipelines, configuration management tools (e.g., Ansible, Puppet, Chef), and Infrastructure as Code (Terraform, CloudFormation) in AWS Organizations contexts.
Hands-on experience implementing shift-left security practices, including integrating security scanning tools into CI/CD pipelines.
