Senior Site Reliability Engineer

Location: San Francisco, California
Date Posted: July 31, 2022
Category: Engineering
Job Type: Not Specified


The Team You'll Work With

The Site Reliability Engineering (SRE) team at Carta is responsible for ensuring the availability, reliability, and resiliency of the Carta app and other production systems in various environments. The team has expertise in systems architecture and design, and in infrastructure automation using Terraform, AWS, and Kubernetes. In addition, the SRE team collaborates closely with the Information Security team on defining secure network boundaries and implementing security policies.

The Problems You'll Solve

  • Develop and maintain Terraform configs, Jenkins pipelines, and Kubernetes manifest files as infrastructure as code (IaC), and extend these configurations to support new services, features, and multiple environments.
  • Resolve complex dependencies among critical services across business units and build automation to prevent future problems. Develop automation scripts to streamline system upgrades, and pipelines to improve the deployment cycle.
  • Maximize and maintain high availability of systems and services while ensuring critical business functions are meeting their SLOs.
  • Influence new designs and architecture, best practices and standards in supporting and improving technology platforms.
  • Establish monitoring and alerting of production systems and critical applications.
  • Participate in our on-call rotation to resolve site incidents and document your findings in repeatable runbooks as part of improving site availability.
  • Work cross-functionally with a passion for improving developer productivity.

About You

You will be part of a cross-functional team of engineers and product managers, and successful candidates will have extremely high EQ and IQ, with a strong bias towards collaboration. We're optimizing for strong senior engineers with at least 4 years of relevant experience who are excited about the opportunity to work with a fast-moving team, and who have previous experience working with:

  • Hosting distributed systems on public cloud providers (GCP or AWS)
  • Containerization technologies (specifically Docker, Kubernetes, and Helm)
  • Building and working with scalable infrastructures using Linux and Docker containers
  • Automation via "infrastructure as code" (using tools like Terraform, Ansible, etc.) and writing scripts in Python and Bash
  • GitHub and advanced understanding of CI/CD tooling (Jenkins, CircleCI)
  • Production systems monitoring using tools such as Datadog, Grafana, etc.

You'll build reliable infrastructure via code for the Carta app to run on Kubernetes, serving sensitive financial data. You will provide visibility into the performance metrics of systems and applications via Datadog monitoring. You will leverage your prior experience in designing, building, and maintaining infrastructure with reliability as a core principle to reduce service failures affecting site performance and availability. You will lead by example, demonstrating team collaboration in the timely execution of planned projects to enable swifter delivery of software. You are pragmatic in making tradeoffs between different designs to optimize overall business value, and you are passionate about elevating the team by sharing knowledge and teaching. You have a desire to understand and solve people's problems instead of simply fulfilling requests.

We are an equal opportunity employer and are committed to providing a positive interview experience for every candidate. If accommodations due to a disability or medical condition are needed, connect with us via email at . As a company, we value fairness, helpfulness, transparency, and leadership, and we build our teams around these values. Check out our to get to know us better as you think about your next step at Carta.
