Director – Site Reliability Engineering (SRE)

at Fannie Mae
Location Reston
Date Posted August 29, 2021
Category Engineering
Job Type Full Time


Company Description

At Fannie Mae, futures are made. The inspiring work we do makes an affordable home a reality and a difference in the lives of Americans. Every day offers compelling opportunities to modernize the nation's housing finance system while being part of an inclusive team using new, emerging technologies. Here, you will help lead our industry forward, enhance your technical expertise, and make your career.

Job Description

In this compelling leadership position, you will scale up and mature a team of Site Reliability Engineers (SREs). You will be responsible for building and executing a vision to implement best practices and best-of-breed tooling for monitoring, deployments, and automated remediations that development teams will rely on to build and run highly available, resilient services in the cloud. You will evangelize and socialize the Site Reliability Engineering discipline across the enterprise, serve as a change agent for driving service prioritization, and help promote a culture of continuous improvement measured by operational metrics and KPIs. You will apply your expertise in software and systems engineering to ensure that our mission-critical systems meet the performance needs of our users. In this role, you will be expected to set strategy for portfolio and program reliability by working with cross-functional IT organizations, build roadmaps that drive reliability into the product, and enable the enterprise to standardize and adopt application reliability metrics and improve application health.


The Site Reliability Engineering (SRE) Director role will offer you the flexibility to make each day your own, while working alongside people who care so that you can deliver on the following responsibilities:

  • Collaborate with key stakeholders across Engineering, Architecture and InfoSec teams on initiatives and capabilities related to the operational health, security, growth, usability, and design of our applications.
  • Set strategy and develop a roadmap for the team aimed at reducing the operational overhead of keeping Fannie Mae applications healthy, secure, and available for our customers.
  • Collaborate across domains to drive ownership of production systems and to enable faster decision making and transparent observability into system health.
  • Drive service reliability by developing tooling that enables metric visibility using SLIs, SLOs, and SLAs.
  • Advocate for and drive the implementation of reliable design patterns.
  • Promote simplicity in solving complex problems across our technology footprint.
  • Lead and focus teams on root cause analysis, pattern identification and continuous improvement in order to optimize application performance, resiliency and reliability.


Required Experience:

  • 8-10+ years of relevant professional experience
  • Experience setting strategic vision for an enterprise-wide practice or capability, and communicating and selling the vision to leadership, stakeholders, and the team
  • Exemplary leadership and communication abilities (both verbal and written) are a must; this role will partner closely with business and technology executives in a highly matrixed structure
  • Influencing skills, including negotiation, persuasion, meeting facilitation, and conflict resolution
  • Experience as a leader managing other leaders/managers
  • Hands-on experience with top-level systems reliability engineering and with providing senior-level technical direction on enterprise-level projects
  • Experience collaborating cross-functionally on availability / performance issues in order to identify root cause, determine areas for improvement, and drive those actions to closure through effective solutions
  • Collective capabilities for leadership, including leading teams, giving feedback, facilitating meetings, coaching & mentoring, promoting collaboration and knowledge sharing.
  • Experience managing teams
  • Adept at managing project plans, resources, and people to ensure successful project completion in an Agile / Scrum environment
  • Demonstrated experience leading engineering and operational teams responsible for supporting and deploying enterprise-scale cloud services and products
  • Proven track record of improving reliability, availability, incident management and performance of cloud services
  • Proven experience managing software development lifecycle platforms and tools and/or designing, building, servicing, and driving ongoing improvement of service infrastructure systems
  • Experience in activities like architecture reviews, code reviews, creating platforms and frameworks, capacity planning, etc.
  • Experience designing and developing highly available systems that use load balancing and horizontal scalability
  • Strong understanding and knowledge of Java / J2EE technologies and frameworks including UI / JavaScript frameworks, Spring Boot / Spring Cloud Frameworks, REST, Microservices, server-side frameworks
  • Understanding of containerization concepts including Docker & Kubernetes.
  • Understanding of Continuous Integration and Continuous Delivery frameworks, including deployment automation and configuration management components, and familiarity with DevOps / CI/CD tools like Jenkins, Jules, etc.
  • Familiarity with process automation practices and tools such as Blue Prism, Selenium, Ansible playbooks and Python or PowerShell scripting
  • Deep understanding of and experience in implementing resiliency design patterns, frameworks, and validations
  • Experience in implementing Chaos Engineering concepts and familiarity with tools such as Gremlin and Chaos Monkey
  • Experience defining, measuring, and improving Reliability Metrics (SLO/SLI), Observability (Monitoring, Logging, and Tracing solutions), Operations Processes (Incident, Problem Management), and Operations Toil Reduction through Automation
  • Experience driving the development of dashboards from application and infrastructure health perspectives using tools such as Splunk, Dynatrace, Datadog, SignalFx, etc. to provide a single-pane view of all critical business and operational information to relevant stakeholders

Desired Experience:

  • Bachelor's degree or equivalent; Master's degree preferred
  • Candidates located in or around Reston, VA or Plano, TX preferred
  • Relevant certifications such as AWS Certified Solutions Architect, AWS Certified SysOps Administrator, Splunk Certified Developer, Dynatrace, Sun Certified Java Programmer, etc.

Additional Information

Job REF ID: REF5873A

The future is what you make it to be. Discover compelling opportunities at

Fannie Mae is an Equal Opportunity Employer, which means we are committed to fostering a diverse and inclusive workplace. All qualified applicants will receive consideration for employment without regard to race, religion, national origin, gender, gender identity, sexual orientation, personal appearance, protected veteran status, disability, age, or other legally protected status. For individuals with disabilities who would like to request an accommodation in the application process, email us at