Building an SLO-Driven Culture at Salesforce

CRM software company Salesforce has described its approach to service reliability, built on service level indicators (SLIs) and service level objectives (SLOs). After building a platform to monitor SLOs, the company saw rapid adoption, with 1,200 services integrated in the first year. The platform gives service owners deep, actionable insight into how to improve or maintain the health of their services: it helps them spot dips in SLIs, find dependent services that are not meeting their own SLOs and, overall, better understand customers' experience with their services.

Creating a platform to monitor service reliability removes organizational complexity and manual labor, allowing teams to focus on creating business value. Tripty Sheth discusses how crucial it was for Salesforce to agree on a definition of “highly trusted” across a range of technology stacks and across the many individual products, services, and supporting products within the organization. This allowed them to define reliability in terms of SLIs and SLOs.

As documented by Google Cloud, Site Reliability Engineering (SRE) starts from the idea that availability is a prerequisite for success. A Service Level Objective (SLO) is a precise numerical target for service availability. A Service Level Agreement (SLA) is a promise to a service user that the SLO will be met over a specific period of time, and Service Level Indicators (SLIs) are direct measures of service performance. These generally accepted definitions are often used to express customer experience in a clear, quantitative, and actionable way.
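To make the relationship between these terms concrete, here is a minimal sketch of an availability SLI compared against an SLO target. The request counts, the 99.9% target, and the function names are illustrative assumptions, not figures or code from Salesforce.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above its numerical target."""
    return sli >= slo_target


# Hypothetical measurement window: 1,000,000 requests, 500 of which failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
slo_ok = meets_slo(sli, slo_target=0.999)  # assumed 99.9% availability objective
```

In this reading, the SLI is what is measured, the SLO is the target it is checked against, and an SLA would be the external promise that the check keeps passing over an agreed period.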

In the past, Salesforce teams had assembled SLOs manually, which meant that updating these metrics and reporting on them was a time-consuming and error-prone task. Additionally, different teams were calculating and storing these values in different ways, preventing the company from having a clear picture of the customer experience.

Forming a standardized view of service availability was crucial, and Salesforce addressed it in three areas:

Standardized measurements: Salesforce used a previously established SLO framework based on the five READS indicators (request rate, errors, availability, duration/latency, and saturation) to define a standardized measure of the health of products and services.
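One way to picture a standardized measurement is a single record type that every service reports in the same shape. The sketch below is an assumption about how the five indicators could be modeled. The field names, units, and thresholds are hypothetical and do not come from Salesforce's framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReadsSnapshot:
    request_rate: float   # requests per second
    errors: float         # fraction of failed requests, 0..1
    availability: float   # fraction of successful requests, 0..1
    duration_ms: float    # latency in milliseconds (for example, a p95 value)
    saturation: float     # fraction of capacity in use, 0..1


@dataclass(frozen=True)
class HealthThresholds:
    max_errors: float
    min_availability: float
    max_duration_ms: float
    max_saturation: float


def breaches(snapshot: ReadsSnapshot, limits: HealthThresholds) -> List[str]:
    """Return the names of the indicators that are outside their thresholds."""
    failed = []
    if snapshot.errors > limits.max_errors:
        failed.append("errors")
    if snapshot.availability < limits.min_availability:
        failed.append("availability")
    if snapshot.duration_ms > limits.max_duration_ms:
        failed.append("duration")
    if snapshot.saturation > limits.max_saturation:
        failed.append("saturation")
    return failed


# Hypothetical values for one service at one point in time.
limits = HealthThresholds(max_errors=0.01, min_availability=0.999,
                          max_duration_ms=300.0, max_saturation=0.8)
snapshot = ReadsSnapshot(request_rate=120.0, errors=0.02, availability=0.9995,
                         duration_ms=180.0, saturation=0.55)
unhealthy = breaches(snapshot, limits)
```

Request rate carries no threshold here because it describes load rather than health. That is a design choice of this sketch, not a statement about the framework.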

Standardized tools: a dedicated SLO platform to house definitions of SLIs, SLOs, and services, including ownership, health thresholds, and alert configurations. This metadata is maintained in a single data store, with long-term storage and retention to provide visibility into historical health trends. Automated alerts can be set up based on collected data.
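The idea of keeping definitions, ownership, and alert thresholds in one place can be sketched as a small in-memory registry. Everything here is an illustrative assumption: the class names, the "alert when the value drops below a threshold" rule, and the example service are not taken from Salesforce's platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SloDefinition:
    sli_name: str        # which SLI this objective applies to
    target: float        # the objective, e.g. 0.999
    alert_below: float   # alert threshold (assumes lower values are worse)


@dataclass
class ServiceRecord:
    name: str
    owner: str
    slos: List[SloDefinition] = field(default_factory=list)


class SloRegistry:
    """In-memory stand-in for a single metadata store."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceRecord] = {}

    def register(self, record: ServiceRecord) -> None:
        self._services[record.name] = record

    def alerts(self, service: str, measured: Dict[str, float]) -> List[str]:
        """Compare measured SLI values with the stored alert thresholds."""
        record = self._services[service]
        messages = []
        for slo in record.slos:
            value = measured.get(slo.sli_name)
            if value is not None and value < slo.alert_below:
                messages.append(
                    f"{record.owner}: {service} {slo.sli_name}={value} "
                    f"is below {slo.alert_below}"
                )
        return messages


# Hypothetical service and measurement.
registry = SloRegistry()
registry.register(ServiceRecord(
    name="checkout",
    owner="team-payments",
    slos=[SloDefinition(sli_name="availability", target=0.999, alert_below=0.998)],
))
fired = registry.alerts("checkout", {"availability": 0.99})
```

Because the owner is stored next to the thresholds, an alert can be routed without consulting a second system.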

Standardized visualization: as soon as a new service is added to the platform, a standard out-of-the-box view of its metrics is generated, covering the standard READS SLIs and any custom SLIs added for that specific service. The visualization includes a dedicated, automatically generated Grafana dashboard fed with real-time data for live monitoring. Additionally, the service is added to the Service Analytics Dashboard, which is regularly reviewed to spark conversations about service health and availability.

The combination of these three areas creates many benefits:

  • Confidence that SLOs are calculated in a standardized way
  • Insights from visualized SLI and SLO metrics
  • Granular SLO targets for determining whether a service is meeting expectations
  • Alerting on SLI and SLO metrics
  • Correlation of violations with incidents
  • Identification of service dependencies

The SLO platform architecture includes several components. It is centered on a service registry and configuration store, which holds service ownership information, service statuses, and service-specific configuration, as well as data on the SLIs, SLOs, and thresholds required to trigger alerts. Around this core are data stores for change and version information, collected for future use in correlating changes with SLO violations, along with a time-series monitoring platform and pipelines for collecting and aggregating metrics.
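Correlating changes with SLO violations could, in its simplest form, mean listing the changes that landed shortly before a violation. The sketch below assumes a fixed lookback window and hypothetical change identifiers. The article does not describe how Salesforce plans to perform this correlation.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def changes_near_violation(
    changes: Dict[str, datetime],
    violation_at: datetime,
    lookback: timedelta = timedelta(hours=1),
) -> List[str]:
    """Return IDs of changes made within `lookback` before the violation, newest first."""
    window_start = violation_at - lookback
    candidates = [
        (when, change_id)
        for change_id, when in changes.items()
        if window_start <= when <= violation_at
    ]
    return [change_id for _, change_id in sorted(candidates, reverse=True)]


# Hypothetical change log and violation time.
changes = {
    "deploy-1": datetime(2024, 1, 10, 8, 0),
    "deploy-2": datetime(2024, 1, 10, 11, 30),
    "config-3": datetime(2024, 1, 10, 11, 50),
    "deploy-4": datetime(2024, 1, 10, 12, 30),
}
suspects = changes_near_violation(changes, violation_at=datetime(2024, 1, 10, 12, 0))
```

A time window only narrows the list of candidates. It does not show that any of those changes caused the violation.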

The unified service health dashboard has become a focal point for operational reviews. The team used these metrics to trigger architectural reviews and stimulate discussions around strategic investments and tactical improvements.

Future work will enable a more comprehensive view of a service's dependencies, with the goal of pinpointing exactly where a failure is occurring and minimizing recovery times. Additionally, after collecting this data per service and with a realistic view of its dependent services, Salesforce will be able to set realistic SLIs across the stack.
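Pinpointing where a failure originates can be thought of as walking a dependency graph and collecting every dependency that is missing its own SLO. The following is a minimal sketch under that assumption. The graph, the service names, and the health map are hypothetical.

```python
from typing import Dict, List


def failing_dependencies(
    service: str,
    depends_on: Dict[str, List[str]],
    meets_slo: Dict[str, bool],
) -> List[str]:
    """Return every direct or transitive dependency that is not meeting its SLO."""
    seen = set()
    stack = list(depends_on.get(service, []))
    failing = []
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # already visited; this also guards against cycles
        seen.add(dep)
        # Assumption: a dependency with no recorded status is treated as healthy.
        if not meets_slo.get(dep, True):
            failing.append(dep)
        stack.extend(depends_on.get(dep, []))
    return sorted(failing)


# Hypothetical dependency graph and SLO status per service.
depends_on = {
    "storefront": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "payments": ["ledger-db"],
}
meets_slo = {"checkout": True, "search": True, "payments": False,
             "inventory": True, "ledger-db": False}
culprits = failing_dependencies("storefront", depends_on, meets_slo)
```

Here the storefront's direct dependencies look healthy, while the traversal surfaces two unhealthy services further down the stack.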

The full article with more details is available on Medium.

Helen D. Jessen