Senior Site Reliability Engineer
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run production systems. SRE ensures that August 99’s services—both our internally critical and our externally-visible systems, e.g. GitLab/developer tooling and hosted client sites for Agent Image—have reliability and uptime appropriate to users' needs and a fast rate of improvement while keeping an ever-watchful eye on capacity and performance.
General Responsibilities:
Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation, and refinement.
Install new servers or rebuild existing ones, and configure hardware, services, settings, directories, storage, etc., following standards and project/operational requirements.
Perform daily system monitoring, verifying the integrity and availability of all hardware, server resources, systems, and key processes, reviewing system and application logs, and verifying completion of scheduled jobs such as backups.
Perform regular security monitoring to identify any possible intrusions.
Regularly work on improving August 99’s security practices, including:
a. Recommending new technologies to improve threat assessment and mitigation.
b. Assisting in the migration to new technologies.
c. Assisting coworkers with infosec best practices to ensure cross-coverage within the team.
Practice sustainable incident response and blameless postmortems.
Perform ongoing performance tuning and resource optimization as required.
Apply OS patches and upgrades on a regular basis, and upgrade administrative tools and utilities. Configure or add new services as necessary.
Develop and maintain installation and configuration procedures, especially related to automation.
Design, develop, troubleshoot, and debug software programs for databases, applications, tools, networks, etc.
As a member of the site reliability and IT team, you will assist in defining and developing software used to design, develop, or debug software applications or operating systems.
Provide technical leadership to other software developers.
Specify, design, and implement modest changes to existing software architecture to meet changing needs.
Analyze system and software security and change procedures or code when necessary.
Stay informed about new and relevant CVEs, potential bugs, viruses, worms, and similar threats, and how to take preventive or corrective measures for each.
Duties and tasks are varied and complex, requiring independent judgment. Candidates should be fully competent in their own areas of expertise. May take a project lead role and/or supervise lower-level personnel.
BS or MS degree or equivalent experience relevant to the functional area.
5 years of software engineering or related experience.
A degree in computer science or a similar IT degree program.
Working experience with multiple POSIX operating systems (e.g., CentOS, Ubuntu, macOS).
Advanced knowledge of at least one server-grade GNU/Linux distribution (e.g., CentOS, Ubuntu).
Advanced knowledge of database optimization and SQL queries (specifically MySQL/MariaDB).
Good scripting skills using POSIX scripting toolkits (Bash, sed, awk, Python, Perl, etc.). Knowledge of general-purpose programming languages such as PHP, C, C++, and Java is a plus.
Expertise in, or advanced knowledge of, WordPress setup and configuration.
Demonstrated experience working with monitoring and analytics tools (e.g., Sysdig, Papertrail, Nagios, Cacti, Splunk).
Knowledge of best practices regarding security/encryption and service configuration (SSL/TLS, SFTP, password management, access restrictions, firewalls, ports, etc.).
Basic knowledge of AWS, Rackspace, or Google Cloud services and tools.
RHCSA/RHCE, or a comparable certification, is required.
Willingness to work a night shift schedule.