Senior Specialist, Site Reliability Engineering
We are seeking a highly motivated and experienced Senior Specialist to join the Shared Site Reliability Engineering (SRE) team supporting Risk Intelligence Services within the Markets and Risk Intelligence division. This role is essential to maintaining uninterrupted business operations across multiple applications while adhering to defined SLAs.
The successful candidate will be responsible for day-to-day operational support, collaborating closely with product SMEs to ensure smooth change deliveries, and contributing to observability improvements across the environment. As a subject matter expert for Risk Intelligence applications, the candidate will play a key role in incident resolution, system reliability, and mentoring junior team members in domain and process expertise.
Flexibility is required, including availability outside office hours and on public holidays, to support critical production services and on-call responsibilities. The ideal candidate will demonstrate strong ownership, technical depth, and a proactive approach to problem-solving in high-pressure situations.
Key Responsibilities
- Ensure uninterrupted business operations by managing production support activities in alignment with defined SLAs
- Take ownership of incident calls and lead resolution efforts until SMEs are engaged
- Collaborate with product SMEs to ensure smooth and timely change deliveries
- Participate in out-of-hours and on-call support, including overnight monitoring and weekend release activities
- Contribute to observability analysis and drive improvements in monitoring, alerting, and telemetry
- Maintain and enhance support documentation and runbooks for supported applications
- Act as a subject matter expert for Risk Intelligence applications, providing deep technical and domain knowledge
- Mentor junior team members to build domain and process expertise
- Support continuous improvement initiatives across the SRE function
- Communicate effectively with stakeholders and provide timely updates on incidents and operational status
Qualifications and Skills
- Bachelor’s degree or equivalent, preferably in a technical discipline
- Experience with Linux (Amazon Linux AMI) and Windows Server 2019 in cloud environments
- Proficient in MySQL, PostgreSQL, MongoDB, and Amazon Aurora (RDS)
- Familiarity with Amazon DocumentDB, DynamoDB, and SQLite
- Knowledge of MS SQL Always On Availability Groups and migration to Azure SQL Managed Instance
- Hands-on experience with Amazon SQS and Amazon SES
- Exposure to Amazon MSK, Coviant, and Cerberus
- Strong understanding of AWS S3 and EFS, including frontend integration
- Experience with Synapse Analytics and D365
- Skilled in development using Spring Boot, Node.js, Python (Django, Flask, Apache Airflow), Java (Java 11, Lambdas), React, Angular, JavaScript, C# (.NET Framework), and PHP
- Proficient in containerization and orchestration using Docker, Amazon ECS, EKS, and EC2
- 5+ years in production operations, SRE, or DevOps roles
- Strong understanding of incident management and operational support in complex environments
- Experience working in investment banking or financial services is preferred
- Excellent analytical and problem-solving skills
- Effective communicator with technical and business stakeholders
- Self-motivated with strong prioritization and ownership skills
- Collaborative mindset with a focus on mentoring and knowledge sharing