Get familiar with these SLA maintenance tips

SLA management sits at the intersection of business goals and what is technically possible. IT teams must learn how to strike a balance between specific and conditional promises.

Static service-level agreements no longer suit service obligations as IT environments continue to grow in complexity. Fixed performance measures make little sense across a hybrid platform. As applications and IT infrastructure become more dynamic, IT organizations that draw up such a contract need to learn to adapt to change -- so just what should be set in stone for proper SLA maintenance?

Set and manage SLA expectations

Work with business leaders to balance goals with reality during SLA maintenance -- or, better yet, when the SLA is first created.

Business leaders often have impractical desires for technology: immediate response with zero latency, 100% uptime and unbreakable security. Ensure that the business understands its viable options. For example, if availability has been maintained at 99%, explain what the remaining 1% of downtime represents -- what is the mix of planned and unplanned time spent correcting errors or pushing live updates? Define what actions the organization can take to further minimize those offline periods, such as extra system redundancy or a migration to a container-based, hosted cloud platform, on which new instances of applications or functions can be spun up rapidly whenever necessary. Once IT has laid out the cost of each option to improve availability, the final decision rests with the business leaders: They must weigh the risks against the price tag to determine whether to invest in a new platform or to settle for the performance constraints of the one in place.
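To make the 99% figure concrete for a business audience, it can help to convert an availability percentage into hours of downtime. The following is a minimal Python sketch, assuming a 365-day year and a simple annual measurement window; the function name and the list of availability tiers are illustrative, not taken from any particular SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Return the hours of downtime per year implied by an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (100.0 - availability_pct) / 100.0


# Illustrative tiers only; a real discussion would use the organization's own targets.
for pct in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% availability allows about {hours:.2f} hours of downtime per year")
```

At 99% availability this works out to roughly 87.6 hours a year, which is a more tangible starting point for discussing how much of that time is planned versus unplanned.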

Define performance and costs

SLA discussions should include the hard constraints on IT performance. For instance, employees located in the same building as the corporate data center experience response times in the milliseconds on productivity apps, while those who use the same apps from other locations won't see the same performance. Latency has a hard floor set by how fast data can travel across a WAN, and bandwidth fluctuations can limit overall performance further. Spell out the options to improve the experience for remote workers, such as increasing bandwidth, moving to a better service provider, addressing data sawtoothing or adopting fully managed networks with priority and quality of service contracts.

What is sawtoothing?

Sawtoothing describes the traffic pattern that emerges when a network has a fixed amount of bandwidth and a degree of built-in intelligence. As the network approaches full capacity, that intelligence constrains the traffic, which lowers the amount of data on the network at any given time. Devices sending data see that there is extra bandwidth available and send more, which fills up the network again. The network then constrains traffic once more to bring volume back within limits, and the cycle repeats. If this trend were graphed, it would show a build-up of data traffic followed by a sharp cutback, followed by growth and another sharp cutback -- like the teeth on a saw.
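The pattern can be illustrated with a toy simulation. The sketch below is only a rough model under stated assumptions: senders add a fixed amount of traffic each step, and the network halves the load once it reaches capacity. Those rules, the function name and every parameter value are hypothetical choices made for illustration (they loosely resemble additive-increase, multiplicative-decrease congestion control); real networks behave in more complicated ways.

```python
from typing import List


def simulate_sawtooth(capacity: float, steps: int, increase: float = 1.0,
                      cutback: float = 0.5, start: float = 1.0) -> List[float]:
    """Return the network load at each step of a simple build-up-and-cutback cycle."""
    load = start
    history = []
    for _ in range(steps):
        history.append(load)
        if load >= capacity:
            # The network constrains traffic: load drops sharply.
            load = load * cutback
        else:
            # Senders see spare bandwidth and send more, up to capacity.
            load = min(load + increase, capacity)
    return history


# Print a crude text chart: each row is one time step.
for step, load in enumerate(simulate_sawtooth(capacity=10.0, steps=20)):
    print(f"{step:2d} {'#' * int(load)}")
```

Run as written, the chart climbs steadily to the capacity limit, falls to half, and climbs again, which is the saw-tooth shape described above.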

Do not agree to any specific figures in SLA discussions without first testing the adjustment in question; otherwise, maintaining the SLA could prove impossible. Offer degrees of improvement rather than absolute perfection: Doubling bandwidth could give significant performance benefits to remote workers. Additional system redundancy should decrease unplanned downtime. Business leaders want hard numbers and specific promises -- they want to know what they're paying for -- but real-world production IT environments are often unpredictable. IT can supply specific figures only after changes have been implemented and their effect shows up in infrastructure or application performance statistics.

Collaborate with business leaders to determine the specific parameters and figures that are the most important for SLA maintenance. Even these figures should not be absolute: A flat promise of response times under 100 milliseconds is essentially meaningless given the reality of an IT environment. Instead, use ranges or percentiles: Ninety-five percent of remote users who access a system over a managed WAN will have latency of less than 250 milliseconds, and 99% of all local users will have latency under 50 milliseconds.
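A percentile-style target of this kind is straightforward to check against measured latencies. Here is a minimal Python sketch, assuming latency samples are already collected per user group as lists of milliseconds; the function names and the sample values are made up for illustration, and only the thresholds (250 ms at 95%, 50 ms at 99%) come from the example above.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Return the fraction of samples strictly below the latency threshold."""
    if not latencies_ms:
        raise ValueError("no latency samples supplied")
    under = sum(1 for value in latencies_ms if value < threshold_ms)
    return under / len(latencies_ms)


def meets_target(latencies_ms, threshold_ms, required_fraction):
    """Return True if enough samples fall below the threshold to satisfy the target."""
    return fraction_within(latencies_ms, threshold_ms) >= required_fraction


# Hypothetical measurements, in milliseconds.
remote_samples = [120, 180, 200, 210, 230, 240, 190, 170, 160, 300]
local_samples = [10] * 99 + [70]

print("Remote target met:", meets_target(remote_samples, 250, 0.95))
print("Local target met:", meets_target(local_samples, 50, 0.99))
```

In this made-up data, only nine of the 10 remote samples fall under 250 milliseconds, so the remote target is missed even though most users have acceptable latency; the local group meets its target with 99 of 100 samples under 50 milliseconds.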

Agreement upkeep

After an SLA is created, thorough monitoring practices are imperative for SLA maintenance. Look for creep: Are metrics trending in the wrong direction? Identify root causes and provide a plan to address them and return to compliance with the SLA. When problems lie outside the control of the IT platform -- such as an increase in user workloads -- consult the business side and explain the problem before it becomes a larger issue. Once again, executives must decide whether the cost-effective response is to leave things as they are and accept poorer application performance or to invest in additional resources to prevent metrics from breaching agreed-upon SLA limits. If the business chooses the former approach, then modify the SLA appropriately.
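One simple way to spot a metric trending in the wrong direction is to fit a straight line to recent readings and estimate when it would cross the SLA limit. The sketch below is a naive illustration, assuming evenly spaced readings and a metric where higher is worse (such as latency); the function names, the weekly readings and the 250-millisecond limit are hypothetical, and a production monitoring tool would typically use more robust methods.

```python
def trend_slope(values):
    """Return the least-squares slope of evenly spaced readings (change per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def periods_until_breach(values, limit):
    """Estimate periods until the limit is reached; None if the trend is not rising."""
    latest = values[-1]
    if latest >= limit:
        return 0.0  # already at or over the limit
    slope = trend_slope(values)
    if slope <= 0:
        return None  # flat or improving: no breach projected
    # Naive extrapolation from the latest reading along the fitted slope.
    return (limit - latest) / slope


# Hypothetical weekly 95th-percentile latency readings, in milliseconds.
weekly_p95_ms = [200, 205, 210, 215, 220]
print("Slope per week:", trend_slope(weekly_p95_ms))
print("Weeks until 250 ms limit:", periods_until_breach(weekly_p95_ms, 250))
```

An estimate like this is mainly useful as an early prompt for the conversation with the business side, before the limit is actually breached.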

The key to modern SLA maintenance is flexibility from the IT service provider. Keep the agreement simple and adaptable. And ensure that the right tools are in place to monitor, measure and manage what happens.
