I compiled this thread on Twitter, and all of a sudden it got quite a bit of attention. So here, I’ll try to elaborate on the topic a bit more. Maybe it will help someone who is trying to make a career decision or just wants a better general understanding of the most hyped titles in the industry.
While DevOps is all about the “what” of the matter, SRE addresses the “how”. Nevertheless, there are a few other differences between the two.
I’ve found Dickerson’s Hierarchy of Reliability to be a useful model for achieving reliability in your engineering endeavours. The pyramid below shows the different stages and aspects of software delivery that need to be addressed.
The idea behind the SRE pyramid is that we can categorize the health of a service in a similar way to how Abraham Maslow categorized human needs. The most basic elements required by a service are at the bottom of the pyramid, and the elements get more advanced as we move further up the pyramid.
The purpose of software reliability metrics is to help you find and remove bugs in the program so you don’t end up with a failing product. Without reliability metrics, it would be extremely hard to identify where exactly the issue is and how to solve it.
SLAs, SLOs and SLIs are fundamental to site reliability engineering (SRE), but what are they and why are they important for delivering services?
Chaos testing is a way to test the integrity of a system. Its purpose is to simulate, in a controlled environment, the failures that could crash a production system. This helps to identify failures before they cause unplanned downtime that disrupts the user experience.
Toil is seemingly unavoidable for any team that manages a production service. System maintenance inevitably demands a certain amount of rollouts, upgrades, restarts, alert triage, and so forth. These activities can quickly consume a team if left unchecked and unaccounted for.
The techniques for alerting on significant events range from alerting when your error rate goes above your SLO threshold to using multiple levels of burn rate and window sizes. In most cases, we believe that the multiwindow, multi-burn-rate alerting technique is the most appropriate approach to defending your application’s SLOs.
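As a rough illustration of the multiwindow, multi-burn-rate idea, here is a minimal Python sketch. It assumes you already have the measured error ratio for each window; the names `BurnRule`, `RULES`, `burn_rate` and `evaluate` are hypothetical, and the window/threshold pairs are the commonly cited starting points for a 30-day SLO, not values taken from this text. An alert fires only when both the long and the short window burn the budget faster than the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_window: str   # window that shows the burn is significant
    short_window: str  # window that shows the burn is still happening
    threshold: float   # burn-rate multiple that triggers the rule
    severity: str


# Assumed starting points for a 30-day SLO; tune them for your own service.
RULES = [
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def evaluate(error_ratios: dict, slo_target: float) -> list:
    """Return the rules for which BOTH windows exceed the burn-rate threshold."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(rule)
    return fired


# Example: a sharp, ongoing spike against a 99.9% SLO.
ratios = {"5m": 0.02, "30m": 0.004, "1h": 0.02, "6h": 0.002, "3d": 0.0005}
for rule in evaluate(ratios, slo_target=0.999):
    print(rule.severity, rule.long_window, rule.short_window)
```

Requiring the short window as well means the alert stops firing soon after the errors stop, instead of lingering until the long window drains.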
Toil is a term coined by Google which describes the repetitive and tedious tasks associated with running a production service. Toil tends to be manual and devoid of any long-term value.
Reliability engineering can be applied across the entire lifecycle of software development. It is designed to increase the dependability of a product by detecting potential reliability issues early in the software development cycle, and correcting causes of failure that do occur.
An error budget is the margin of error that the customer is informed about beforehand, so that system failure is tolerated for an agreed number of hours. The error budget is a critical requirement since it protects the service provider from inevitable system failures that are unforeseen and can rarely be mitigated during system design.
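To make the arithmetic concrete, here is a minimal Python sketch of a time-based error budget. It assumes the budget is expressed as allowed downtime over a rolling window; the function names and the 30-day default are illustrative assumptions, not something the text above prescribes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget still unspent; negative means it is overspent."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# After 21.6 minutes of downtime, about half of the budget is left.
print(round(budget_remaining(0.999, 21.6), 2))
```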
While the adoption of machine learning in DevOps is relatively slow compared to other industries, the potential is huge. To start understanding what DevOps has to gain from this rapidly developing field, one needs only to look at the world of monitoring and log analysis.
Most will know what each of these items is, hereafter referred to as tenets, so let’s focus on what they mean for observability and what you should be thinking about to reach a higher state of observability.
This is the idea behind proactive monitoring – the switching of context from “reactive” monitoring to something that allows you to act before the problem arises. Here are some guidelines to help you get started with your customized solution.
Prometheus and Grafana are two monitoring tools that, in combination, provide all of the information DevOps and Dev teams need to build and maintain applications. Prometheus collects many types of metrics from almost every variety of service written in any development language, and open source Grafana effectively queries, visualizes, and processes these metrics.
Even with years of experience, every new gig expects you to answer different kinds of DevOps interview questions. You’re in luck: Logz.io can help a bit on logging and network performance monitoring interview questions.
This blog post will pit Grafana against Graphite, two of the most popular observability tools on the market today. R&D organizations typically implement a wide technology stack. They include varying services, systems, or tools to support their production and development environments.