Site Reliability Engineer | Remote (US Shift)
KMC Solutions Manila Full-time
Site Reliability Engineer
Remote; Night Shift
Job Description:
- The SRE will support product launches in production with a focus on AWS environments and leveraging our existing AWS Organizations. This role will concentrate on implementing and driving robust reliability practices for Infrastructure as Code (IaC), optimizing our cloud infrastructure, and overseeing the creation and configuration of new AWS accounts at an organizational level. The ideal candidate will shape cross-functional reliability strategies, ensure strict compliance and security, and promote high availability across multiple teams and services.
Key Responsibilities:
Reliability and Security Integration
- Embed reliability and compliance best practices into all phases of the software development lifecycle, emphasizing shift-left security principles.
- Advocate for early vulnerability detection, resilient design, thorough testing, and automated rollbacks, and drive organizational adoption of these practices.
- Develop, maintain, and enhance automation tools and scripts (e.g., Terraform, Ansible) for enterprise-level cloud environments and AWS Organizations.
- Ensure IaC scripts incorporate best practices for reliability, availability, regulatory compliance, and security checks using tools like Prisma Cloud or Wiz.
- Provide technical leadership and mentorship to SRE and DevOps teams to elevate overall automation and reliability standards.
- Create and provision new AWS accounts within AWS Organizations using IaC at scale.
- Automate the configuration of services, networking, and security settings across these accounts to ensure consistent, secure, and compliant environments.
- Lead collaboration efforts across different teams to ensure organizational standards are met in all new and existing AWS accounts.
- Implement and manage sophisticated monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK) within our cloud contexts.
- Respond promptly to incidents, conduct comprehensive root cause analyses, and drive post-mortem processes to prevent future outages, integrating reliability and security considerations.
- Oversee and continuously improve on-call processes, ensuring efficient incident triaging and stakeholder communication.
- Analyze complex system metrics, perform deep performance tuning, and optimize resource utilization to enhance system reliability and efficiency.
- Ensure performance and capacity measures align with business objectives and compliance mandates.
- Collaborate with security teams to uphold rigorous security standards, recommending architectural and process improvements.
- Design, implement, and conduct thorough disaster recovery testing to minimize downtime and data loss.
- Continuously refine strategies to meet or exceed compliance and security standards.
- Work closely with cross-functional teams—including senior development, operations, security, and product leaders—to prioritize and address reliability, compliance, and security issues.
- Provide technical mentorship, best practice guidelines, and thought leadership in reliability engineering across the organization.
- Stay current with industry trends, tools, methodologies, and security practices, with a particular focus on AWS services, AWS Organizations, and advanced shift-left strategies.
- Champion initiatives that drive organizational change, fostering a culture of innovation and continuous improvement in reliability practices.
Qualifications:
- At least 5-7 years of experience in a similar role.
- Proven track record of leading complex reliability initiatives across multiple teams or business units.
- Experience with regulated or high-security environments is a plus.
- Proficiency in one or more programming/scripting languages (e.g., Python, Go, Ruby, Bash).
- Strong familiarity with CI/CD pipelines, configuration management tools (e.g., Ansible, Puppet, Chef), and Infrastructure as Code (Terraform, CloudFormation) in AWS Organizations contexts.
- Hands-on experience implementing shift-left security practices, including integrating security scanning tools into CI/CD pipelines.