Description:
Hybrid schedule (3 days on-site, 2 days remote) in Foster City
We’re currently looking for a Site Reliability Engineer who will be responsible for measuring and maintaining the uptime of our distributed build service, which is critical to the development process for autonomous vehicles. In this role, you will be heavily involved in the operational tasks of keeping a critical service up and running, and in enhancing the service so that it is fault-tolerant and easy to maintain through deployment, operation, and continual improvement. Our client is a robotics company, and our ethos of automation extends throughout the infrastructure components we build.
Responsibilities:
Keeping a large production service up and running, including host OS upgrades, Docker image upgrades, and SSL certificate upgrades
Defining and refining metrics to track service health and performance
Automating software releases and service failovers
Qualifications:
Bachelor’s degree in engineering, mathematics, or a related field and 2+ years of relevant experience
Experience supporting multiple production services
Experience utilizing tools like Ansible, Terraform, or Salt effectively
Ability to extract and report useful performance or service metrics
Experience with Linux, no matter the flavor
Familiarity with Python and Bash scripting
6+ years of experience
Bonus Qualifications:
AWS, ECS, Kubernetes
Master’s degree in computer science or a related field
Benefits:
Health Care Plan (Medical, Dental & Vision)
Life Insurance (Basic, Voluntary & AD&D)
Paid Time Off (Vacation, Sick & Public Holidays)
Training & Development
Retirement Plan (401k, IRA)