Hi. For those who don’t know me, my name is Amar Kalsi and I’m responsible for the Site Reliability Engineering (SRE) team in the Netcool org. I wanted to write a series of blog articles about our adoption of SRE.
Earlier this month I attended SRECon 2019. I met SREs from a variety of companies around the globe, and it was interesting to understand where they were in their SRE transformations. It got me thinking about our own Netcool team’s evolution towards SRE. What problems were we solving, and why did we make the organisational choices we made?
The Google perspective
Google published two very interesting articles earlier this year:
- Do you have an SRE team yet? How to start and assess your journey
- How SRE teams are organized, and how to get started
The first describes a set of practices that all SRE teams should aspire to. These practices apply irrespective of team organisation or maturity level. The second delves into the specifics of team structures. It highlights the pros and cons of various approaches and offers some adoption guidelines.
The articles do not advocate a single, hard definition of how an SRE team must structure itself. Instead, they offer a pragmatic framework to help teams decide on the structure best suited to the problems at hand.
The Netcool perspective
As I said, this got me thinking about our own history and SRE journey. Our Development teams had a strong track record of developing high-quality, on-premises software. They had evolved from Waterfall to Agile and were now delivering cloud-native offerings. In 2017 we started to manage Cloud Event Management as a public cloud offering. So what was our starting point? What were our goals back then?
The development org looked like this:
- A large number (hundreds) of developers
- Multiple squads producing multiple applications
- A mix of on-prem and cloud-native development
- Loosely coupled, tightly integrated software
- Geographically dispersed in multiple time zones
This led us to the following thought process:
- We knew that we couldn’t simply react and restore
- We knew there were lessons to learn from operational experience
- We wanted a continuous feedback loop into our application Development teams
- We wanted improved reliability, operability, scalability and performance of our own software
- We wanted to minimise operational toil
- We wanted to maximise bandwidth for developing and delivering value to our customers
SRE to the rescue
A traditional IT Operations model would not help us with toil reduction. So SRE to the rescue! But how best to apply it to our business model? We faced the following dilemma:
- Should we adopt a purist DevOps definition of “you build it, you run it”? We’d obviously have to replicate this in multiple squads. There are a couple of risks with this approach: it might prove challenging to maintain standards across so many diverse teams, and there may be duplication of effort.
- Or should we adopt a looser definition of DevOps and centralise the operations function? This runs the risk that teams deprioritise operability in favour of feature development.
To help us make the best choice for our business model, we asked one more question: which portions of our DevOps pipeline carried the most risk?
We conducted an analysis of the existing Continuous Integration and Delivery pipeline. We discovered there was little standardisation between squads, especially in configuration management, deployment processes and environment management. There was also scarce networking experience in the teams.
We concluded the SRE team would have the most impact acting as a centralised function. We set a goal of spending 50% of our time developing solutions for the areas identified above. This would free the application squads to concentrate on feature development.
In this capacity our SRE team worked in close collaboration with Development. Across teams we standardised pipelines and developed a secure, automated, common deployment process. We also developed solutions to integrate with downstream services as necessary.
Referencing the second Google article above, we created a hybrid organisational structure spanning the infrastructure- and tools-focused teams. We also adopted a loosely coupled implementation of the embedded structure, so SREs would drift into application scrum teams as needed. Sometimes this was proactive, for example to help drive operability measures into code before delivery to production. At other times it was the result of an incident post-mortem analysis, where the aim was to drive continuous improvement.
Our SRE team had now completed its scoped tasks, and our focus began to shift into a more proactive mode: problem identification and development.
Iterating on SRE
Two big areas were demanding our attention. We wanted to improve upon what we had. And we needed to react to market changes.
What were the improvements we could make? Three areas came to mind:
- How could we integrate more of our own monitoring software into the pipeline?
- How could we further improve the DevOps feedback loop?
- Which emerging technologies could we utilise?
But what market changes were in play? Change is inevitable. As the market evolves, software developers must adapt to meet market needs. Our teams were beginning to have a greater focus on hybrid cloud. And they had begun to pivot to deliver content on private cloud.
We started to re-analyse requirements and shift emphasis back towards infrastructure and tooling. We felt this was where we could provide the most value in current projects. But we needed to scale out what we had already done.
SRE is a specialist skill and mindset. Not all developers have operations experience, and not all operations engineers are developers. Unable to recruit a ready-made team of specialist SREs, how could we curate an SRE team from existing skill sets? And how could we leverage our own management products to help with the skills transition?
This is the subject of my next post: how we drove improvements whilst tackling the skills gap.
In the meantime, I’d love to hear from you. Which models have your SRE teams adopted and evolved to?