Autonomic Computing and Self-Healing Systems

William Chipman
Spring 2011
Colorado State University
Dr. France / Dr. Georg
Introduction

Self-adapting systems have been developed to address the needs of modern software and are an important part of many critical systems. While the ideas and models for self-adapting systems have been around for quite a while, current research and development has made many strides toward true self-healing, self-modifying systems. These systems can be designed to adapt to the needs of the underlying system at run-time, based on changes in the system as a whole. Self-adapting systems have both advantages and disadvantages, some of which are general to all self-adapting systems and others that are specific to particular implementations. This paper includes a thorough description of several systems as well as their specific advantages, disadvantages and known issues, along with suggested solutions. There are four characteristics of self-managing systems: self-configuration, self-optimization, self-healing and self-protection [1]. For a system to be complete in its implementation, all four characteristics must be satisfied. In the current environment this has proven to be a difficult endeavor. Addressing all four points can make a system grow to an unwieldy size and level of complication. There have been many advances in the design of tools and implementations in recent years that have brought these autonomic, self-healing systems close to reality instead of a pipe dream. The development of these self-managing systems is built around Model Driven Engineering (MDE). "In MDE, a model is an abstraction or reduced representation of a system that is built for specific purposes" [2]. This reduced abstraction simplifies the system design so that the overall behavior and interactions can be mapped out. Once this overview is completed, appropriate self-representations become important
to the continuation of the development. "It is critical that such representations be causally connected" [2]. This is important because (1) "the model as interrogated should provide up-to-date and exact information about the system to drive subsequent adaptation decisions; and (2) if the model is causally connected, then adaptations can be made at the model level rather than at the system level" [2]. To achieve these systems, developers have had to both develop the tools for designing the systems and use those tools to build the actual systems. This paper begins with a thorough description of autonomic computing and self-healing systems. It then describes the tools utilized and follows with detailed descriptions of several past and current systems that incorporate the self-managing ideals.

Autonomic Computing

"The term autonomic computing was first used by IBM in 2001 to describe computing systems that are said to be self-managing" [10]. Autonomic computing is centered on the idea of removing human intervention from the system. The main goal is to design and develop systems that adapt to changes in their environment on their own. IBM's description of autonomic computing compared it to the human body's autonomic nervous system because of its self-managing ability. "A system with autonomic capabilities installs, configures, tunes and maintains its own components at runtime" [3]. IBM describes the four main properties of self-management as self-configuration, self-optimization, self-healing and self-protection. Self-configuration means that a system can reconfigure itself based on high-level goals and
modeling. Self-optimization means that a system makes the best use of its resources. Self-healing is the ability to detect and diagnose issues or problems, and self-protection means that a system can protect itself from malicious attacks and from unintended or inadvertent changes. There are five levels of autonomicity proposed by IBM as the Autonomic Computing Adoption Model. Level 1 is the Basic level: system elements are managed by highly skilled staff and changes are made manually. Level 2 is the Managed level: the system monitors itself and is intelligent enough to reduce some of the burden on the system administrators. Level 3, the Predictive level, is characterized by the system using behavioral modeling to recognize system-wide patterns and suggest fixes to the IT staff. Level 4 is the Adaptive level: human interaction is minimized and the tools used at level 3 are automated further so that the burden on the IT staff is minimal. Level 5 is the fully Autonomic level: systems are able to self-manage almost all functionality related to the needs of the system. The basic building block in an autonomic system is the Autonomic Element (AE). An autonomic element is a software-based component that is responsible for managing sub-systems. "Autonomic elements may cooperate to achieve a common goal ... [such as] servers in a cluster optimizing the allocation of resources to applications to minimize the overall response time or execution time of the applications" [10]. Autonomic elements implement the MAPE-K loop as the control loop for managing the sub-system. A complete description of the MAPE-K loop appears in the Tools section of this paper.
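The cluster example quoted from [10] can be illustrated with a minimal sketch of cooperating autonomic elements sharing a resource pool. All names here are hypothetical; this is an illustration, not code from any of the surveyed systems.

```python
# Sketch: autonomic elements each manage one application (a sub-system)
# and cooperate toward a common goal by splitting a shared resource pool
# in proportion to the demand each element observes.

class AutonomicElement:
    """Manages one application and reports its observed demand."""
    def __init__(self, app_name, observed_load):
        self.app_name = app_name
        self.observed_load = observed_load  # e.g. requests per second

def allocate(elements, capacity):
    """Cooperative goal: divide capacity in proportion to observed load."""
    total = sum(e.observed_load for e in elements)
    return {e.app_name: capacity * e.observed_load / total
            for e in elements}

cluster = [AutonomicElement("web", 300), AutonomicElement("db", 100)]
shares = allocate(cluster, capacity=8)  # e.g. 8 CPU cores to divide
```

In a real autonomic system each element would re-run this negotiation continuously as its observed load changes, rather than once.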
Variability models can be used to build autonomic elements and systems at runtime. "The use of variability models at runtime brings new opportunities for autonomic capabilities by reutilizing the efforts invested at design time" [3]. Systems leveraging the variability model can use knowledge of the design to attain autonomic modifications at compile-time and can further use system modeling to self-modify at runtime. Standards such as meta-data exchange allow the models that were used at design-time to also be used at run-time. The negative aspect of variability models is the potential for exponential explosion of the possible state transitions. "In order to manage variability and avoid the combinatorial explosion of artifacts needed to support this variability, [software tools] focus on variation points and variants instead of focusing on whole configurations" [16]. A very specific type of autonomic system is the self-healing system. These systems are often designed using the variability model.

Self-healing Systems

"A self-healing system is one that: replaces traditional error messages with robust error detection, handling and correction that produces telemetry for automated diagnosis, provides automated diagnosis and response from the error telemetry for hardware and software entities, provides recursive fine-grained restart of services based upon knowledge of their dependencies, presents simplified administrative interactions for diagnosed problems and their effects on services and resources" [17]. There are several ways that self-healing systems can be implemented. The simplest implementation is through redundancy. Adding duplicate components for critical systems, all the way up to duplication of the entire system, can allow for fail-safe operation. The issue is that this approach is inherently wasteful of resources. The redundant components could be used
productively with a more innovative implementation. In addition to its wastefulness, redundancy only addresses total failures of components in the system; it does not address degradation. To heal these types of issues, a more robust solution is needed. In many implementations of self-healing systems, a multi-faceted approach is required. "Two distinct elements are required for the development of self-healing systems. First, an automated or semi-automated agent must be present to make the decision of when and how to affect repair on a system. Second, an infrastructure for actually executing the repair strategy must be available to that agent" [7]. The use of managers is the favored approach. Solaris 10 used managers to address faults and service issues. The fault manager uses a system-level model to determine when a failure or degradation has occurred and searches through a dynamic list of solutions to determine the most opportune one. The service manager allows restarting of services and applications that have failed or degraded below a pre-determined threshold. This pre-determined threshold is given by the application to the service manager when it enters use on the system. A very popular approach to developing self-healing systems is architecture-centric. "An architectural style is a collection of constraints on components, connectors and their configurations targeted towards a family of systems with shared characteristics" [13]. In an architecture-based self-healing system, to repair a running system, the changes have to be machine readable by the underlying system as well as the describing systems. This machine-readable change instruction is referred to as an architectural difference or
diff. A diff describes the difference in the system before and after the repair and is comprised of components, links, connectors and interfaces. According to Mikic-Rakic et al., "self-healing systems should satisfy: adaptability, dynamicity, awareness, autonomy, robustness, distributability, mobility and traceability" [13].

1. Adaptability: The system must allow changes to both the static and dynamic portions of the system.
2. Dynamicity: Addresses the run-time changes that a system is able to make.
3. Awareness: The architectural style must allow for self-monitoring in the system.
4. Autonomy: The system must be able to address the anomalies it discovers on its own.
5. Robustness: The architectural style should allow the system to respond to unforeseen conditions.
6. Distributability: The system must perform well under different distributions.
7. Mobility: The architecture should allow for modifications to the location of components in the system.
8. Traceability: A system should allow for a direct correlation between the model and the run-time execution. [13]
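The idea of an architectural diff can be sketched as a comparison of two configurations, each described by its sets of components and links. This is a simplified illustration with hypothetical names, not the diff notation used by [7] or [13].

```python
# Sketch: an architectural "diff" expressed as the components and links a
# repair plan must add or remove to move the running system from its
# current configuration to the repaired one.

def arch_diff(before, after):
    """Each configuration is a dict with 'components' and 'links' sets."""
    return {
        "add_components": after["components"] - before["components"],
        "remove_components": before["components"] - after["components"],
        "add_links": after["links"] - before["links"],
        "remove_links": before["links"] - after["links"],
    }

# Example repair: a cache component has failed, so the plan removes it
# and wires the UI directly to the database.
running = {"components": {"ui", "cache", "db"},
           "links": {("ui", "cache"), ("cache", "db")}}
repaired = {"components": {"ui", "db"},
            "links": {("ui", "db")}}

plan = arch_diff(running, repaired)
```

A repair infrastructure of the kind described in [7] would then execute such a plan against the running system without restarting it.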
Some of these requirements are basic building blocks for systems and are used to enforce a system-level hierarchy on the data flow and basic structure of the system, while others administer the dynamic changes to the system based on the data flows. These dynamic indicators analyze specific aspects of the system and implement the needed changes. "The ability to dynamically repair a system at runtime based on its architecture requires several capabilities: the ability to describe the current architecture of the system; the ability to express an arbitrary change to that architecture that will serve as a repair plan; the ability to analyze the result of the repair to gain confidence that the change is valid; and the ability to execute the repair plan on a running system without restarting the system" [7]. While self-healing systems are an innovative idea, there are still many issues in their design and implementation that must be overcome for them to develop into an integral part of a computer system. "Self-healing functionality for users and administrators of a modern operating system [must] provide fine-grained fault isolation and restart where possible of any component – hardware or software – that experiences a problem" [17]. Without this fine-grained fault isolation, fixing the problem becomes overkill in most situations. Any general fault will be addressed with the same blunt approach: restart the component or application. This is not always an optimal or applicable solution. With real-time and critical systems, the downtime required for such an overzealous solution is often not available. With a finer level of fault isolation, small problems can be fixed in a way that does not cripple the system even for a short period. A second issue is tool integration. Seamless integration is "especially important in the context of self-healing systems since no human can be involved in manually
transforming tool outputs or invoking tools" [7]. This integration has to be performed with multiple tools; current self-healing systems are so complex that no single tool encompasses all the needs. With multiple tools added to an already complex system, the system can become unwieldy. To ameliorate this complexity, using tools that have been thoroughly tested is of the utmost importance. In addition, the fix models themselves can grow exponentially in complexity as the system grows. Predetermined solutions and exhaustive solution searches can also lead to problems: the solution space can grow exponentially, and determining the best solution to a problem can be subjective, while computing systems are only capable of making objective decisions. The third issue facing self-healing systems follows from a potential solution to the previous issue. Building solutions to problems at runtime, or at component/application integration time, based on detailed models of the system is a potential answer, but the main issue is that "in an open system, upfront system analysis is at best of limited, heuristic usefulness" [8]. Because open systems tend to be dynamic and are molded by their environment, design-time models become less relevant as the system grows and changes during runtime. While these design-time models are important, a combination of all types of solutions is needed to build robust systems that can grow and morph into what is truly needed.

Tools

There are many tools designed for building autonomic and self-healing systems. The first and most widely used is the MAPE-K loop, which stands for monitor, analyze, plan, execute and knowledge. The monitor "collates and aggregates
information received into the system and attempts to characterize any symptoms relating to the way the system is running" [12]. The analyze phase processes the symptoms to determine whether the issues at hand need to be addressed. At the plan step, the system decides what changes to make and how they can be handled for a successful implementation. The execute stage is where the plan is implemented and activated. After a completed execute phase, the knowledge component of the cycle records whether the implementation of the plan was successful in instigating the necessary changes.

Figure 1. MAPE-K Loop Design [12]
The MAPE-K loop design was first proposed by IBM as a solution to autonomic computing design. Systems built with the MAPE-K ideal tend to be robust. "The MAPE-K loop is controlled by a manager, an embedded part of the autonomic element that coordinates the individual activities" [1]. This manager is an integrated part of the underlying system. Often systems will have several managers running the MAPE-K loop simultaneously, autonomously and in a distributed fashion. The way the autonomic managers interact with each other is determined by the autonomic computing architecture. This integration of several managers leads to the next logical step: developing autonomic software product lines (ASPLs). ASPLs can self-manage a large and complex system and interact with other systems, both local and distributed, in order to deal with product variations and dynamic system changes. There has been a good deal of research into the ASPL concept, and the Software Engineering Institute (SEI) has developed a framework. This framework divides systems into three general categories: core assets development, product development and management. The core assets are the basic components in the software product line (SPL); they can range from business artifacts to reference architectures. The product development is what is built with the core assets; it comprises the larger portions of the system that perform the objectives of the design. The final portion, management, provides maintenance to all product developments and monitors the system to determine where potential problems are likely to occur based on models. "Fractal is an advanced component model and associated on-growing(sic) programming and management support devised initially by France Telecom and INRIA since 2001" [5]. The Fractal model is based on component-based software engineering. It
makes use of components; interfaces, which are interaction points between those components; and bindings, which are the communication channels between components. Fractal also uses the concepts of membranes and contents. "The membrane exercises an arbitrary reflexive control over its content" [5]. A membrane is made up of a set of controllers. "The model is recursive with sharing at arbitrary levels" [5]. The model is programming-language independent and extensible, and bindings are controlled through the specific programming. "The Fractal project targets the development of a reflective component technology for the construction of highly adaptable and reconfigurable distributed systems" [5]. Fractal enforces a limited number of architectural structures. This allows systems to be more robust, as a specified component need not exist at runtime for successful management. The final tool analyzed is SmartAdapters. SmartAdapters are used to decrease the complexity of dynamically adaptive systems. The first step in the use of SmartAdapters is the maintenance of a high-level representation model of the running system. Maintaining the high-level model allows for a quicker and more thorough response to issues as they arise. "SmartAdapters automatically generate an extensible Aspect-Oriented Modeling framework specific to [the] metamodel" [16]. These metamodels are used to control the potentially exponential growth of solutions. "Using Aspect-Oriented weavers, whole configurations can be built on-demand by selecting a set of aspects in practice [using] SmartAdapters" [16]. Components that occur in all configurations are the base models and are used to weave the aspects of the system. "In SmartAdapters, an aspect is composed of three parts: i) an advice model, representing what [to] weave, ii) a pointcut model, representing where [to] weave the
aspect and iii) weaving directives specifying how to weave the advice model at the join points matching the pointcut model" [16]. The advice model is a portion of the model that is potentially having an issue. The pointcut model also represents a portion of the model, but it is described by the roles in the system. The weaving directives specify how to weave an aspect, building from the advice model to the pointcut model, using the domain-specific language of the system.

Implementations

Early self-managing projects were funded by DARPA for the military. The first was the Situational Awareness System (SAS), created to aid communication between soldiers on the battlefield. The communication devices had to be durable and able to deal with harsh conditions and potential jamming of the communication channels. The design was a distributed peer-to-peer system with self-healing communication channels. "The DARPA Self-Regenerative Systems program started in 2004 is a project that aims to develop technology for building military computing systems that provide critical functionality at all times, in spite of damage caused by unintentional errors or attacks" [10]. The four aspects of this project are (1) software made resistant to errors and attacks, (2) binary code that is modifiable to make attacks harder when trying to exploit vulnerabilities, (3) a scalable architecture that is intrusion tolerant and (4) the ability to build systems that can attempt to detect malicious inside users and block attempts to attack the system.
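The advice/pointcut/directive split described for SmartAdapters in the Tools section can be sketched roughly as follows. The names and data structures are hypothetical illustrations of the concept, not the actual SmartAdapters API from [16].

```python
# Sketch: an aspect carries an advice model (what to weave), a pointcut
# (where: which join points match) and a directive (how to weave). A
# whole configuration is built on demand from a base model plus aspects.

from dataclasses import dataclass

@dataclass
class Aspect:
    advice: set          # what to weave: model elements to add
    pointcut: callable   # where to weave: predicate matching join points
    directive: str       # how to weave the advice at each join point

def weave(base_model, aspects):
    """Build a whole configuration from base components plus aspects."""
    config = {"components": set(base_model), "attachments": []}
    for aspect in aspects:
        join_points = [c for c in config["components"] if aspect.pointcut(c)]
        for jp in join_points:
            config["components"] |= aspect.advice
            config["attachments"].append(
                (jp, tuple(sorted(aspect.advice)), aspect.directive))
    return config

# Example: weave a logger onto every component whose role is "service".
logging_aspect = Aspect(advice={"logger"},
                        pointcut=lambda c: c.startswith("service"),
                        directive="attach-after")
config = weave({"service-a", "service-b", "db"}, [logging_aspect])
```

Because only the base model and the aspects are stored, the set of reachable configurations does not have to be enumerated up front, which is how this style keeps the combinatorial explosion in check.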
In 2005, NASA began work on the Autonomous NanoTechnology Swarm (ANTS). The project was designed to launch a swarm of 1000 small spacecraft into an asteroid belt and use the information gathered to determine which asteroids were deemed interesting for further investigation. The ships would be required to use autonomic techniques to continually elect a leader and rebuild communication channels. MAPE-K implementations include the Autonomic toolkit, ABLE, Kinesthetics eXtreme and Self-Management Tightly Coupled with Application [10]. The Autonomic toolkit is a prototype implementation of the MAPE-K loop built in Java but able to communicate with other applications through XML. ABLE is also a toolkit designed by IBM, but it is designated for use in multi-agent systems that need self-management implementations. Kinesthetics eXtreme is a complete autonomic loop designed mainly in Java that is focused on adding autonomic abilities to legacy systems that may not have been designed with autonomic capabilities. Finally, Self-Management Tightly Coupled with Application is a project with the goal of developing middleware frameworks that offer self-management functionality to applications. There are currently eight platforms that support Fractal components in multiple programming languages. "Julia was historically (2002) the first Fractal implementation" [5]. It was developed and used by France Telecom. It is based in Java and was developed to prove that component-based systems did not have to perform inefficiently. "Think is a C implementation of Fractal" [5]. Think is also available through France Telecom but has development assistance from STMicroelectronics. Think is used to build kernels of all sorts, ranging from exo-kernels and micro-kernels to low-memory complete operating systems. "ProActive is a distributed and asynchronous
implementation of Fractal targeting grid computing" [5]. France Telecom developed ProActive as middleware for parallel, concurrent and distributed computing grids. It is object-based and allows for asynchronous deployment and management. "AOKell is a Java implementation by INRIA Jacquard" [5]. It is similar to Julia but based on AOP, using membranes for load-time weaving. Its performance is similar to that of Julia. "FractNet is a .Net implementation of the Fractal component model developed by the LSR laboratory" [5]. It is a port of AOKell to the .Net platform and is similar in design and performance as well. "Flone is a Java implementation of the Fractal component model developed by INRIA Sardes for teaching purposes" [5]. It is not a full implementation but instead a group of APIs that simplify the Fractal model so that it is more easily understood by students. "FracTalk is an experimental Smalltalk implementation of the Fractal component model developed at Ecole des Mines de Douai" [5]. FracTalk focuses on dynamic elements in component-based programming. "Plasma is a C++ experimental implementation of Fractal developed at INRIA Sardes" [5]. It is dedicated to building multimedia applications that are self-adaptive. Fractal also has a complete repertoire of open components for middleware and operating systems. The smart home feature model is an autonomic computing solution described by Cetina et al. [3]. The design is such that a smart home can be fully automated and yet still dynamically adjust to the changing patterns of the residents and the influx and removal of components. The goal of autonomic computing for smart homes is "to reduce … configuration effort, [so that] smart homes can provide the following autonomic capabilities: Self-configuration. New kinds of devices can be incorporated into the system; Self-healing. When a device is removed or fails, the system should adapt itself to
offer its services using alternative components; Self-adaptation. Users' needs differ and change over time. The system should adjust its services to fulfill user preferences" [3]. This behavior is similar to context adaptation. The Model-Based Reconfiguration Engine (MoRE) is used to implement the management of the models used in the system. The operations of the engine are used to determine how to evolve the system to meet future needs and reconfigurations. Reconfiguration actions fall into three categories:

1. Component actions: components that must be installed, uninstalled or reconfigured
2. Channel actions: communications for active components
3. Model actions: updates to the MoRE model after the component and channel actions occur [3]

Figure 2. Smart Home Model [3]
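The ordering of the three reconfiguration-action categories can be sketched as below: component actions first, then channel actions, and only then the model update. The function and data names are hypothetical, not the actual MoRE operations from [3].

```python
# Sketch: a reconfiguration applied in the order described above --
# component actions, channel actions, then a model action that brings
# the runtime model back in sync with the running system.

def reconfigure(system, model, plan):
    # 1. Component actions: install/uninstall components.
    for comp in plan.get("install", []):
        system["components"].add(comp)
    for comp in plan.get("uninstall", []):
        system["components"].discard(comp)
    # 2. Channel actions: rebuild communication channels, keeping only
    #    those whose endpoints are still active components.
    system["channels"] = {(a, b) for (a, b) in plan.get("channels", [])
                          if a in system["components"]
                          and b in system["components"]}
    # 3. Model actions: update the runtime model only after the
    #    component and channel actions have taken effect.
    model["components"] = set(system["components"])
    model["channels"] = set(system["channels"])

# Smart-home example: the main light fails, so the system falls back to
# a lamp to keep offering the lighting service (self-healing).
home = {"components": {"switch", "main_light"}, "channels": set()}
runtime_model = {}
reconfigure(home, runtime_model,
            {"uninstall": ["main_light"], "install": ["lamp"],
             "channels": [("switch", "lamp")]})
```

Updating the model last keeps it causally connected to the system, so the next reconfiguration can be planned against an accurate self-representation.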
Because any change to the system can trigger a need for a change to the model, the high-level models must be maintained and updated. This allows the system to continually gather and process information about its own dynamicity and affords MoRE the ability to develop and implement solutions.

Conclusion

Self-healing and autonomic systems have begun to integrate themselves into many more generalized computing systems. The tools used to build these systems have been developed and optimized to make the best use of the underlying models and component architectures. The MAPE-K loop and the Fractal modeling system are two of the most accepted and widely used tools for developing these autonomic systems. The MAPE-K loop, proposed by IBM, is used to develop manager-driven systems that interact to build solutions, which are then integrated into the running system. Many active self-healing systems are based on MAPE-K, including the Autonomic toolkit, ABLE, Kinesthetics eXtreme and Self-Management Tightly Coupled with Application, all of which have well-documented success. The Fractal modeling system, based on component models, was designed and implemented by France Telecom and has been used to implement multiple systems across many different languages. The Fractal modeling system is much more complicated than the MAPE-K loop, but the resulting systems appear to be more straightforward to implement. These and other tools have been used to build multiple systems ranging from smart homes to deep-space multi-object space probes. The unifying aspect of all the systems built to be autonomic and self-healing is that they tend to be complicated. This
level of complexity can grow exponentially as the size of the system grows. To overcome this complexity, models and metamodels are often used to describe the runtime systems. These models are built to speed the processing of solutions, but as often as not they just add more complexity and sprawl to the system. While all of the tools described and used to build the implementations of self-healing systems are useful and well-developed, they would likely work best in combination. By merging the best aspects of the tools and design models, functionality will grow at a faster rate than the complexity of the implementation. Managers of the kind used in the MAPE-K loop could be utilized in MoRE to build more cohesive systems. These systems and managers, using the Fractal model-management ideals, would be more likely to have shorter reaction times to changes in the environment and would suffer from less model creep when describing the potential solution space. In addition to this merger of the best portions of multiple tools, there is potential to use database-style storage of known-good configurations as they are implemented, along with new, advanced and streamlined searching technologies, to expedite the implementation of changes as needed. Autonomic and self-healing systems are going to become a part of most systems as the computing industry embraces the ideas and realizes the inherent good in the design. These systems will become more complex and all-encompassing as they become more commonplace. New tools will have to be developed to aid in the implementation of these new systems, and a better understanding of system modeling will be needed by all individuals involved in the development of these systems. Industry will embrace the usefulness of these systems, but in order for there to be success in the
implementations, steps must be taken to merge the best aspects of the tools and to train the engineers on modeling and best practices.
References

[1] Abbas, N.; Andersson, J.; Loewe, W.; "Autonomic Software Product Lines (ASPL)." Proceedings of the 7th International Conference on Autonomic Computing (ICAC '10). ACM, New York, NY, USA, pp.324-331. 2010
[2] Blair, G.; Bencomo, N.; France, R.B.; "Models@ run.time." Computer, vol.42, no.10, pp.22-27, Oct. 2009
[3] Cetina, C.; Giner, P.; Fons, J.; Pelechano, V.; "Autonomic computing through reuse of variability models at runtime: the case of smart homes." Computer, vol.42, no.10, pp.37-43, Oct. 2009
[4] Cheung-Foo-Wo, D.; Tigli, J.; Lavirotte, S.; Riveill, M.; "Self-adaptation of event-driven component-oriented middleware using aspects of assembly." Proceedings of the 5th International Workshop on Middleware for Pervasive and Ad-hoc Computing. ACM, New York, NY, USA, pp.31-36. 2006
[5] Coupaye, T.; Stefani, J.-B.; "Fractal component-based software engineering." Proceedings of the 2006 Conference on Object-Oriented Technology: ECOOP 2006 Workshop Reader (ECOOP '06), Springer-Verlag, Berlin, Heidelberg, pp.117-129. 2006
[6] Dabrowski, C.; Mills, K.; "Understanding self-healing in service-discovery systems." Proceedings of the First Workshop on Self-Healing Systems (WOSS '02), ACM, New York, NY, USA, pp.15-20. 2002
[7] Dashofy, E.; Hoek, A.; Taylor, R.; "Towards architecture-based self-healing systems." Proceedings of the First Workshop on Self-Healing Systems (WOSS '02), ACM, New York, NY, USA, pp.21-26. 2002
[8] Fickas, S.; Hall, R.; "Self-healing open systems." Proceedings of the First Workshop on Self-Healing Systems (WOSS '02), ACM, New York, NY, USA, pp.99-101. 2002
[9] George, S.; Evans, D.; Davidson, L.; "A biologically inspired programming model for self-healing systems." Proceedings of the First Workshop on Self-Healing Systems (WOSS '02), ACM, New York, NY, USA, pp.102-104. 2002
[10] Huebscher, M.; McCann, J.; "A survey of autonomic computing—degrees, models, and applications." ACM Computing Surveys 40, 3, Article 7 (August 2008), 28 pages. 2008
[11] Maoz, S.; "Using Model-Based Traces as Runtime Models." Computer, vol.42, no.10, pp.28-36, Oct. 2009
[12] Mengusoglu, E.; Pickering, B.; "Automated management and service provisioning model for distributed devices." Proceedings of the 2007 Workshop on Automating Service Quality, held at the International Conference on Automated Software Engineering (ASE). ACM, New York, NY, USA, pp.38-41. 2007
[13] Mikic-Rakic, M.; Mehta, N.; Medvidovic, N.; "Architectural style requirements for self-healing systems." Proceedings of the First Workshop on Self-Healing Systems (WOSS '02), ACM, New York, NY, USA, pp.49-54. 2002
[14] Morin, B.; Barais, O.; Jezequel, J.-M.; Fleurey, F.; Solberg, A.; "Models@ Run.time to Support Dynamic Adaptation." Computer, vol.42, no.10, pp.44-51, Oct. 2009
[15] Morin, B.; Fleurey, F.; Bencomo, N.; Jezequel, J.-M.; Solberg, A.; Dehlen, V.; Blair, G.; "An Aspect-Oriented and Model-Driven Approach for Managing Dynamic Variability." Proceedings of the 11th International Conference on Model Driven Engineering Languages and Systems (MoDELS '08), Springer-Verlag, Berlin, Heidelberg, pp.782-796. 2008
[16] Morin, B.; Barais, O.; Nain, G.; Jezequel, J.; "Taming Dynamically Adaptive Systems using models and aspects." Proceedings of the 31st International Conference on Software Engineering (ICSE '09). IEEE Computer Society, Washington, DC, USA, pp.122-132. 2009
[17] Shapiro, M.; "Self-Healing in Modern Operating Systems." ACM Queue 2, 9, pp.66-75. 2008
[18] Weyns, D.; Malek, S.; Andersson, J.; "FORMS: a formal reference model for self-adaptation." Proceedings of the 7th International Conference on Autonomic Computing (ICAC '10). ACM, New York, NY, USA, pp.205-214. 2010