Site Reliability Engineer (DevOps)

GitLab via Stack Overflow
System Administration

Jun 4th 2018


Site Reliability Engineers are responsible for keeping GitLab.com and many other GitLab production systems running smoothly 24/7/365. They're developers specialising in systems, whether that means networking, the Linux kernel, or a specific interest in scaling, algorithms, or distributed systems. GitLab.com is a unique site and it brings unique challenges: it's the biggest GitLab instance in existence; in fact, it's one of the largest single-tenancy open-source SaaS sites on the internet. The experience of our production engineers feeds back into other engineering groups within the company, as well as to GitLab customers running on-premises installations.

Responsibilities: 

  • Be on a PagerDuty rotation to respond to GitLab.com availability incidents and provide support for service engineers with customer incidents.
  • Use your on-call shift to prevent incidents from ever happening.
  • Manage our infrastructure with Chef, Terraform and Kubernetes.
  • Make monitoring and alerting trigger on symptoms, not on outages.
  • Document every action so your learnings turn into repeatable actions and then into automation.
  • Improve the deployment process to make it as boring as possible.
  • Design, build and maintain core infrastructure pieces that allow GitLab to scale to support hundreds of thousands of concurrent users.
  • Debug production issues across services and levels of the stack.
  • Plan the growth of GitLab's infrastructure.

Requirements: 

  • Think about systems - edge cases, failure modes, behaviors, specific implementations.
  • Know your way around Linux and the Unix Shell.
  • Know what config management systems like Chef (the one we use) are for.
  • Have strong programming skills - Ruby and/or Go.
  • Have an urge to collaborate and communicate asynchronously.
  • Have an urge to document all the things so you don't need to learn the same thing twice.
  • Have a proactive, go-for-it attitude. When you see something broken, you can't help but fix it.
  • Have an urge to deliver quickly and iterate fast.
  • Share our values, and work in accordance with those values.

Projects you might work on: 

  • Coding infrastructure automation with Chef.
  • Improving our Prometheus monitoring or building new metrics.
  • Helping release managers deploy and troubleshoot new versions of GitLab-EE.
  • Migrating GitLab.com from its current home on Azure Cloud to Google Cloud Platform.
  • Migrating GitLab.com to Kubernetes.