Debug School

Rajesh Kumar

What is SRE? Let's share our knowledge!

Please go through the following questions and share your inputs and experience on each of them:

  1. What is SRE?
  2. Why is SRE important?
  3. How to transform from Ops to SRE?
  4. Difference between Ops vs SRE vs DevOps?
  5. Roles in SRE?
  6. Activities in SRE?
  7. Functions in SRE?
  8. What is toil and how to reduce it?
  9. What are SLI, SLA, and SLO, and how to manage them?
  10. What are incidents and postmortems, and how to handle them?

Which tools should we know to transform into SREs?

Here is my list of tools:

  • Many tools -
  • Web server - Apache HTTP Server & Nginx
  • Multi-cluster Kubernetes orchestration platform - Rancher
  • Service mesh data plane & control plane - Envoy & Istio
  • Network configurations and Service Discovery - Consul
  • Securing credentials - HashiCorp Vault, AWS Secrets Manager, Azure Key Vault, AWS KMS, Kubernetes Secrets
  • Infrastructure Monitoring Tool - Prometheus with Grafana
  • Log Monitoring Tool - Elasticsearch, Logstash, Kibana (ELK Stack)
  • Incident Response using PagerDuty & Opsgenie
  • Production environment job scheduler and runbook automation - Rundeck
  • Application Performance Monitoring - AppDynamics

Top comments (2)

chandrasekharbeluduru_70

What is SRE?
SRE is the result of a transformation in how applications are looked after, with a constant eye on their availability, latency, performance, and capacity. The SRE team's goal is to safeguard, support, and advance the software and systems.

Why is SRE important?
SRE enables teams to balance delivering new features against maintaining application stability, which keeps users happy.

How to transform from Ops to SRE?
Below are a few actions that help with the transformation:
1. Practice coding (not application features, but code that automates processes, analyzes logs, etc.)
2. Adopt observability
3. Stop the blame game when issues are raised
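
The first action above (writing code to analyze logs) could start as small as the following sketch. It assumes a hypothetical log format of "<date> <time> <LEVEL> <message>"; the function name `count_levels` and the sample lines are made up for illustration and are not taken from any real system.

```python
from collections import Counter

# Hypothetical sample log, assumed format: "<date> <time> <LEVEL> <message>"
SAMPLE_LOG = """\
2024-01-01 10:00:00 INFO request served
2024-01-01 10:00:01 ERROR database timeout
2024-01-01 10:00:02 WARN slow response
2024-01-01 10:00:03 ERROR database timeout
"""


def count_levels(log_text: str) -> Counter:
    """Count how many log lines were emitted at each level."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split(maxsplit=3)
        if len(parts) >= 3:  # skip blank or malformed lines
            counts[parts[2]] += 1
    return counts


if __name__ == "__main__":
    # Print levels from most to least frequent.
    for level, n in count_levels(SAMPLE_LOG).most_common():
        print(level, n)
```

A script like this is usually the starting point; real logs would need a parser that matches their actual format.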

Difference between Ops vs SRE vs DevOps?
Ops - A separate team that maintains the infrastructure. The dev team hands over complete responsibility for deploying and managing the code, and the Ops team also takes care of infrastructure-level changes and monitoring.

DevOps - The next step in the SDLC, aimed at faster continuous delivery of features, where Dev and Ops teams work together during releases.

SRE - The SRE team plays a critical role from development through deployment and has the authority to stop a release when there is a valid reason. The SRE team has the right to review code, analyze gaps, and suggest performance and stability improvements for dev teams to make. SRE follows a blameless approach throughout the SDLC.

Roles in SRE?
Work collaboratively with the dev team.
Investigate user incidents.
Analyze existing processes and avoid toil.
Introduce automation to reduce toil wherever required.
Define SLIs, agree on SLAs with the business, and set SLOs with internal teams to keep customers happy.

Activities in SRE?
Same answer as "How to transform from Ops to SRE?"

Functions in SRE?
Define error budgets to leave breathing room for expected downtime and planned activities.
Define SLAs and SLOs to keep the application healthy and performing well.
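
As a rough illustration of the error budget idea, the sketch below derives the allowed downtime from an availability SLO over a time window. The function names, the 30-day default window, and the example numbers are assumptions for illustration only.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for the window, given an availability SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the budget is overspent."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(99.9), 1))
    # After 10 minutes of downtime, roughly 33.2 minutes remain.
    print(round(budget_remaining(99.9, 10), 1))
```

When the remaining budget approaches zero, a team would typically slow down releases and planned changes until the window resets.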

What is toil and how to reduce it?
Repetitive work is called toil.
Introducing automation reduces toil and frees up time to concentrate on planned strategic work.

What are SLI, SLA, and SLO, and how to manage them?
SLI stands for Service Level Indicator. SLIs are used to evaluate and observe an application's stability (errors/exceptions), performance, and infrastructure health parameters such as CPU and memory utilization.
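
Two common ways to turn such indicators into numbers are an availability ratio and a latency percentile. The sketch below is one possible way to compute them, assuming request counts and latency samples are already collected; the function names are hypothetical, and the nearest-rank percentile is only one of several percentile definitions.

```python
import math


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Percentage of requests that succeeded."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return 100.0 * good_requests / total_requests


def latency_percentile(latencies_ms: list, pct: float) -> float:
    """Nearest-rank percentile of the latency samples."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    print(availability_sli(999, 1000))
    print(latency_percentile([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 90))
```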

What are incidents and postmortems, and how to handle them?
Incidents are raised either by users about functionality gaps or by the respective teams about infrastructure-related issues.
User issues need to be analyzed by looking at the logs or debugging the code, and then fixed.
Infrastructure-related incidents need to be investigated: identify the issue using the logs or APM tools, and then fix it.

Saju Dev

What is SRE?
SRE is a way of keeping a watchful eye on a system's availability, latency, performance, and capacity. This engineering practice bridges the gap between developers and operations by providing a hand-in-hand way of working rather than a linear one.

Why is SRE important?
It helps the system by being more proactive, measuring SLAs, and avoiding issues that can arise when only an operations team is involved.

How to transform from Ops to SRE?
By having the operations team work alongside developers and by bringing in monitoring and toil-removal methods. Performance and stability issues are blocked at the root rather than after deployment.

Difference between Ops vs SRE vs DevOps?
The Ops team is more of a reactive team: it works only in production and handles issues after seeing them.

SRE and DevOps are two sides of the same coin but differ in the way they operate.
SRE involves principles that focus on removing toil, automating, monitoring, and enhancing systems, that is, on how to make the system more reliable.
DevOps involves principles related to rapid delivery of software products and leans more towards how developers and operations can collaborate.

Roles in SRE?
-Automating toil
-Building software that helps IT Ops and developers work better
-Incident management

Activities in SRE?
Checking availability and performance, monitoring, incident response, and preparation.

What is toil and how to reduce it?
Repetitive, manual, automatable tasks are toil.
Toil can be reduced by automating those tasks.
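
As one small example of automating a repetitive manual task, the sketch below replaces a hand-run disk space check with a function that a scheduler could call. The 90% threshold, the default path, and the function names are assumptions chosen for illustration; `shutil.disk_usage` is part of the Python standard library.

```python
import shutil


def usage_percent(used: int, total: int) -> float:
    """Used space as a percentage of total space."""
    if total == 0:
        return 0.0  # assumption: treat an empty or virtual filesystem as unused
    return 100.0 * used / total


def disk_needs_attention(path: str = ".", threshold_pct: float = 90.0) -> bool:
    """Return True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage_percent(usage.used, usage.total) >= threshold_pct


if __name__ == "__main__":
    if disk_needs_attention():
        print("Disk usage is above the threshold, cleanup or resize needed")
    else:
        print("Disk usage is fine")
```

Run from a job scheduler, a check like this could raise an alert instead of relying on someone to log in and look.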

What are SLI, SLA, and SLO, and how to manage them?
SLI - Service Level Indicator - current system metrics
SLA - Service Level Agreement - metrics agreed with the business
SLO - Service Level Objective - internal goal within the team to avoid breaching the SLA
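
The relationship between the three terms can be sketched as a simple comparison: the measured SLI is checked first against the stricter internal SLO and then against the SLA agreed with the business. The function name, the status labels, and the example percentages below are hypothetical.

```python
def classify(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI against the internal SLO and the contractual SLA.

    All values are availability percentages. The SLO is assumed to be at
    least as strict as the SLA, so missing it serves as an early warning.
    """
    if slo < sla:
        raise ValueError("SLO is expected to be at least as strict as the SLA")
    if sli >= slo:
        return "ok"
    if sli >= sla:
        return "slo_missed"  # internal goal missed, SLA still met
    return "sla_breached"


if __name__ == "__main__":
    # Hypothetical targets: SLO 99.9%, SLA 99.5%
    for measured in (99.95, 99.7, 99.0):
        print(measured, classify(measured, slo=99.9, sla=99.5))
```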

What are incidents and postmortems, and how to handle them?
Incidents are support tickets raised by end users when the system has an issue.

They are resolved by the support team, which analyzes the issue and either fixes it or provides a workaround to end users.