The Role of Orchestration in Service Level Management

Brad Stone
President, Aspirin Software

As businesses move to better align resource usage with corporate objectives, IT departments are increasingly being measured on their ability to meet specific service level objectives associated with the computing infrastructure. Such objectives define goals for the availability and performance of services that leverage IT server, storage and network resources. Services include support for internal business applications as well as customer-facing services such as an e-commerce application. Service level management vendors such as InfoVista and Mercury Interactive have delivered products in the last few years to directly track IT service levels and report on compliance. However, solutions that automate the remediation of service level violations remain elusive.

Automated response to IT infrastructure issues is difficult because of the complexity of today's datacenter environments. Multi-tier applications, multi-vendor environments, and dynamic and unpredictable user transactions all make diagnostics and root cause analysis difficult. While event management tools such as SMARTS InCharge® and Micromuse Netcool® make it easier to diagnose the underlying cause of problems, they lack the low-level interfaces and control mechanisms to support autonomic, or self-healing, capabilities. As a result, responses are limited to problem escalation and notification, and simple script-based actions.

The capabilities of existing tools for automated remediation are insufficient because the necessary actions are often more complex than can be addressed by launching a script. A corrective action may require the coordinated response of a distributed set of diverse devices. Based on policy, the response may need to be scheduled in order to not disrupt other production activities. In addition, some responses may be intended as temporary solutions, and thus a mechanism must be in place to reset the environment to a prior state at a later time. Event management tools weren't designed for this, and don't have this level of sophistication. Once an action is triggered, its behavior and results are no longer tracked.

Thus, what is needed is an orchestrator, or workflow engine, that can interact with event management systems to handle the complex task of service level control. Workflow engines are designed to support complex, policy-driven, multi-step actions. Workflow engines can be combined with schedulers and event management systems and integrated with an agent architecture to provide a powerful tool for service level management.
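As a minimal sketch of the idea, and not a description of any vendor's product, the following Python fragment shows a workflow that runs steps in order, tracks the outcome of each one, and restores the prior state if a step fails. The names Step, Workflow, action, and undo are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One unit of work in a multi-step response (hypothetical structure)."""
    name: str
    action: Callable[[], None]  # performs the change
    undo: Callable[[], None]    # restores the state that existed before the change


@dataclass
class Workflow:
    steps: List[Step]
    log: List[str] = field(default_factory=list)  # progress is tracked, not fire-and-forget

    def run(self) -> bool:
        """Run the steps in order; on failure, undo completed steps in reverse order."""
        done: List[Step] = []
        for step in self.steps:
            try:
                step.action()
            except Exception as exc:
                self.log.append(f"failed: {step.name} ({exc})")
                for prev in reversed(done):
                    prev.undo()
                    self.log.append(f"undone: {prev.name}")
                return False
            done.append(step)
            self.log.append(f"done: {step.name}")
        return True
```

Unlike a launched script, the workflow keeps a record of what was done and can return the environment to its earlier state, which is the tracking and reset behavior described above. Scheduling and policy checks are left out of this sketch.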

The benefits of an orchestrator can be seen when it is applied to various service level remediation scenarios. Here we provide a few availability and performance management examples.

To meet availability objectives, IT will sometimes turn to proprietary hardware and high availability middleware solutions such as Hewlett-Packard's MC/ServiceGuard®. However, these solutions are expensive to set up and complex to operate and maintain, and as a result companies are now looking at both commodity hardware and open source high availability solutions to address these objectives. In fact, open operating systems such as Linux® contain some high availability features, and systems can be paired with fault detection tools to provide an inexpensive HA solution. In such a model, a backup system maintains a copy of the active system's operating system and application stack, and typically has access to a replicated database of the kind that Oracle and Solid provide. This approach can work, but it requires twice the number of servers (i.e., an extra standby for each production system). With an orchestrator handling the high availability task, however, a more powerful and less expensive solution can be provided. The orchestrator can dynamically coordinate the multiple, complex steps of provisioning the backup system with an OS, the application, and any additional settings needed to fail over the service. This allows a single server to function as the backup for multiple active systems, since it can be provisioned with the necessary software when needed.
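The shared-standby model can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a real provisioning tool: ServiceProfile, SharedStandby, and the step descriptions are hypothetical, and the steps are returned as text instead of being executed against real systems.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ServiceProfile:
    """What must be installed to take over one active system (hypothetical)."""
    os_image: str
    application: str
    settings: Dict[str, str]


class SharedStandby:
    """A single spare server that can back up several active systems."""

    def __init__(self, profiles: Dict[str, ServiceProfile]) -> None:
        self.profiles = profiles
        self.serving: Optional[str] = None  # host currently being covered, if any

    def fail_over(self, failed_host: str) -> List[str]:
        """Return the ordered provisioning steps for taking over failed_host."""
        if self.serving is not None:
            raise RuntimeError(f"standby already covering {self.serving}")
        profile = self.profiles[failed_host]
        steps = [
            f"provision OS image {profile.os_image}",
            f"install application {profile.application}",
        ]
        steps += [f"apply setting {key}={value}"
                  for key, value in sorted(profile.settings.items())]
        steps.append(f"take over service for {failed_host}")
        self.serving = failed_host
        return steps
```

The sketch also makes the trade-off visible: one standby can cover many systems, but only one at a time, so a second simultaneous failure is rejected here and would need another spare.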

Comparable issues are faced in performance management. In environments with unpredictable spikes in user demand, performance goals can be difficult to achieve. Designing for overcapacity is an expensive approach. An alternative is to add capacity on demand. But this typically involves not only provisioning a spare server with an operating system and appropriate applications, but also adjusting the configurations of load balancers and potentially firewalls as well. This can be difficult, time-consuming, and error-prone when done under duress. By contrast, a workflow engine can address this by first detecting the performance problem, and then coordinating the multi-step and multi-server response.
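The detect-then-coordinate pattern might look like the following Python sketch. The function names, the rule of three consecutive slow samples, and the order of the steps are all assumptions made for illustration; a real deployment would take them from policy.

```python
from typing import List, Sequence, Tuple


def violates_objective(samples_ms: Sequence[float], objective_ms: float,
                       min_consecutive: int = 3) -> bool:
    """True when the most recent min_consecutive samples all exceed the objective."""
    recent = samples_ms[-min_consecutive:]
    return len(recent) == min_consecutive and all(s > objective_ms for s in recent)


def plan_scale_out(samples_ms: Sequence[float], objective_ms: float,
                   spare_servers: Sequence[str]) -> List[Tuple[str, str]]:
    """Return the ordered (task, server) steps for adding capacity, or [] if none is needed."""
    if not violates_objective(samples_ms, objective_ms) or not spare_servers:
        return []
    server = spare_servers[0]
    return [
        ("provision_os", server),
        ("install_application", server),
        ("open_firewall", server),         # assumed to precede load balancer changes
        ("add_to_load_balancer", server),  # traffic arrives only once the server is ready
    ]
```

Requiring several consecutive violations is one simple way to avoid reacting to a single slow measurement; the plan itself would then be handed to a workflow engine for execution and tracking.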

Addressing response time issues can present similar challenges. Applications can be running with different priorities while competing for the same shared resources, such as network access. When goals are not being met, the corrective action may involve changes both on the server and on network devices. For example, the IT administrator might need to change priorities for the applications on network quality-of-service (QoS) devices to give the lower priority application a smaller share of the network bandwidth. This may lead to delays in the lower priority application, so an additional change may be needed to reconfigure timeout values for the application and then restart it. This multi-step and multi-server response is again best handled by a workflow engine, which can track the progress of the tasks and address any dependency issues.
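Dependency handling of this kind can be sketched with Python's standard graphlib module (Python 3.9 or later). The task names and the helper run_in_dependency_order are hypothetical, and the ordering shown follows the sequence in the example above as an assumption, not a recommendation.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_in_dependency_order(depends_on: Dict[str, Set[str]],
                            actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each task only after the tasks it depends on; return the completed tasks in order."""
    completed: List[str] = []
    for task in TopologicalSorter(depends_on).static_order():
        actions[task]()
        completed.append(task)  # progress record for the response
    return completed


# Each task maps to the set of tasks that must finish before it starts.
QOS_RESPONSE: Dict[str, Set[str]] = {
    "reduce_qos_share": set(),
    "raise_app_timeout": {"reduce_qos_share"},
    "restart_app": {"raise_app_timeout"},
}
```

TopologicalSorter raises CycleError if the dependencies are circular, so a badly specified response is rejected before any device is touched.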

Workflow technology is not new to the datacenter. Trouble ticketing is one area where workflow concepts are already being applied. IT departments may have service level objectives associated with their responsiveness to trouble tickets. In this case it may be important to have those metrics integrated with service level metrics for performance and availability. Many datacenter automation vendors recognized the value of this approach early on. IBM, Cassatt, Optinuity, and UXComm are all examples of vendors using workflow techniques in their automation products.

The movement to workflow-based approaches will continue. Current solutions can't scale to address the complexities of modular computing, virtualization, and on-demand requirements. Existing automation vendors such as Opsware and BladeLogic recognize that they need to add workflow capabilities to their product suites. A workflow engine has now become an expected and required component of a datacenter's management automation architecture.


Brad has delivered service level management products for Hewlett-Packard and Resonate, and has provided consulting on service level management through his company, Aspirin Software.