Sr. Product Reliability Engineer (100% Remote)

at Precision Resources Company Inc
Location Dallas, Texas
Date Posted August 2, 2022
Category Engineering


Location(s): Remote, US

  • Product Observability: PREs embed with development teams to build and refine metrics, logging, monitoring, and alerting for our products, enabling proactive (and increasingly automated) issue identification, prevention, and remediation. This improves the performance and uptime of our products and provides detailed telemetry that aids in debugging complex issues.
  • Product Reliability Systems and Process: PREs are among the most seasoned users of our deployment and stability infrastructure, which means they often identify opportunities for additional functionality that would improve operational efficiency. An important part of the PRE role is partnering with other infrastructure and operational teams to provide this input and, where appropriate, directly deliver features that will benefit product operations (e.g., developing a system that automatically de-duplicates product alerts and enables teams to prioritize and document critical information).
  • Responsible for deploying, automating, maintaining, troubleshooting and improving the systems that keep the backend infrastructure running smoothly.
  • Responsible for the application maintenance, availability, and performance.
  • Dive deep to resolve problems at their root and troubleshoot services related to our platform
  • Develop automation tools for managing Aerial on-prem and cloud infrastructure.
  • Improve engineering standards, tooling, and processes
  • Develop a deep understanding of Aerial products and processes.
  • Collaborate with customer-facing, product, and infrastructure teams on the development and deployment of scalable, reliable software for our customers.
  • Diagnose, resolve, and prevent issues encountered in the field
  • Reduce the operational overhead of products and leverage data to understand the largest sources of reliability risk.
  • Deliver end-to-end improvements to stability by proactively preventing issues via telemetry and automation and directly reducing the need for reactive support.
  • Make data-driven decisions about investments in stability and reliability.
  • Take part in a 24/7 on-call rotation responsible for coordinating response to mission-critical incidents, ensuring efficient resolution with minimal customer impact.
  • Automating infrastructure and services in Aerial's Cloud
  • Working with client support to solve technical issues, perform upgrades, and migrate to better/faster/stronger gear
  • Supporting legacy and new architecture components to maximize uptime, availability, and security
  • Working support tickets as part of a high output team in a fast-paced environment
  • Managing and optimizing build pipelines and CI/CD tooling to streamline new product launches/features
  • Ensuring continuous delivery of technical and support services to end users, and monitoring systems and software performance
  • Providing technical input and leadership to formulate long-term objectives and standards of performance for staff, technical releases, production support and projects


  • You should have 6-8 years of experience managing and troubleshooting large-scale distributed systems, with a start-up mentality.
  • Excellent Linux and troubleshooting skills
  • You are an expert in Python/Bash and you are proficient in Linux.
  • Strong experience working in AWS/GCP environments and with other server virtualization technologies
  • Experience working with a monitoring stack such as Splunk
  • Bachelor's degree in computer science
  • Familiarity with infrastructure provisioning tools such as Kubernetes and Terraform
  • Familiarity with Release Orchestration tools such as Octopus Deploy
  • Knowledge of CI/CD tools such as GitLab
  • Comfortable building/proving new services for improvement/replacement of existing tech
  • Professional technical certifications from leading industry vendors such as AWS, GCP
  • Expertise in Python, SQL
  • Experience with distributed systems and technologies
  • Proven ability to deliver production software, including the use of GitLab, Jenkins, and/or other CI/CD tools
  • Technical expertise in industry-leading products such as AWS, GCP, Terraform, Splunk, and GitLab
  • Expertise in at least one programming language used to build/launch cloud-based infrastructure
  • Technical knowledge including cloud services, virtualization, containerization, PostgreSQL/MySQL, and security
  • Knowledge of RESTful APIs.
  • Remove ambiguity through documentation, making teams more efficient and effective
  • Convert tacit knowledge to explicit knowledge
  • Experience in US Healthcare
