Site Reliability Engineer - Managed Kubernetes - AWS/Azure
Rackspace via We Work Remotely
Jul 5th 2018
Headquarters: San Antonio, Texas
- Design, architect, and maintain operational solutions for managing our customer environments and infrastructure across data centers and technologies, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks.
- Implement and maintain monitoring and alerting solutions that help discover failures in a timely fashion, while working with engineers to identify root causes and fix issues.
- Provide basic to intermediate network administration and troubleshooting.
- Handle day-to-day operational management, including incident response and event and problem management activities, alongside our service delivery and engineering teams.
- Participate in on-call rotation duties.
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.
- Support services & deployments before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Experience in one or more of the following: Microsoft Azure, Amazon Web Services.
- Experience with Kubernetes and Docker/container runtimes is a must.
- Experience in one or more of the following is a must: Python, Go, and cross-platform scripting.
- Experience with algorithms, data structures, complexity analysis and software design.
- Experience with Linux systems administration and tuning.
- Experience with automation tools such as Docker, Jenkins, Ansible, and Terraform.
- Understanding of containerized systems and hands-on experience implementing them.
- Comfort with collaboration, open communication and remote teams.
- Interest in designing, analyzing and troubleshooting large-scale distributed systems.
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- Ability to debug and optimize code and automate routine tasks.
- Treat infrastructure and automation as code and as critical engineering tasks.
To apply: https://rackspace.jobs/remote-tx/site-reliability-engineer-managed-kubernetes-awsazure-remote/44743E4863614CB3BADFB5360BEB2F75/job/?utm_campaign=c1e0402ed4-EMAIL_CAMPAIGN_2018_June&utm_medium=email&utm_source=RackUSA%2B_%2Bupdated%2B04.24.18&utm_term=0_468f4acf2e-c1e0402ed4-