Site Reliability Engineer (keep things running!)
O'Reilly Auto Parts
The Site Reliability Engineer is responsible for the availability and performance of the platforms and services of O’Reilly Auto Parts, and creates and defines monitoring and incident response tools and processes.
The Site Reliability Engineer will create a bridge between development and operations by applying a software engineering mindset to system administration. Time will be split between operations/on-call duties and developing systems and software that help increase site reliability and performance.
ESSENTIAL JOB FUNCTIONS:
- Deploy methodologies for building and operating highly available and scalable services.
- Work closely with the Network Operations Center to develop monitoring tools, analyze the root cause of incidents, and improve the Network Operations Center’s ability to independently resolve issues.
- Evaluate, build, and modify automation for deploying and operating production services.
- Provide leadership in reducing and resolving production incidents.
- Proactively monitor and review application performance. Monitor specific metrics, set thresholds, and trigger alerts based on those thresholds.
- Collect and analyze logging and diagnostic information.
- Identify opportunities to improve all operations processes.
- Facilitate the effective transition of services into production, ensuring that all requirements have been met in accordance with O’Reilly’s Change Management standards.
- Properly document all incident responses.
- Provide updates and documentation to runbooks and operational manuals.
- Document mean time to recover (MTTR) and mean time to failure (MTTF).
- Participate in on-call rotations.
SKILLS/EDUCATION/KNOWLEDGE/EXPERIENCE/ABILITIES:
- Bachelor’s Degree or equivalent work experience.
- 5+ years of professional experience in Site Reliability, Linux Systems Administration, DevOps, or Infrastructure Engineering.
- Experience with scripting languages such as Bash, Python, or Ruby.
- Familiarity with automation and configuration management tools and frameworks.
- Excellent analytical and problem-solving skills.
- Strong written and verbal communication skills.
- Must be well organized, detail oriented, and able to self-prioritize work.
- Must exhibit a high degree of professionalism.
- Ability to act with composed urgency in stressful situations.
- ITIL Foundations Certification.
- CRE or CMRP Certifications.