Integrated Reliability
Utility model for software products
Introduction
In the era of digital industry, many processes are becoming automated, making them more autonomous and less dependent on human intervention. Consequently, the demands for software reliability are constantly increasing. Ignoring these requirements can lead to undesirable consequences for people and dangerous man-made phenomena.
To reduce these negative outcomes and enhance the reliability of software products and equipment, Implemica has developed a specialized software development technology called EDC (Expertise, Duplication, Critical Settings) and implemented it in its production processes.
Concept
The concept behind this technology is Integrated Reliability.
Concept | Integrated Reliability is the development of a software product as an element of a system in a way that improves the overall reliability of the system. |
That is, any software product that operates as a component of the System should be developed so that it brings an increase in overall reliability to the System.
The Concept is implemented as follows. When a Software Product (SW) is developed as part of the System (Fig. 1), it additionally and purposefully implements:
- Protection of critical elements and functions of the SW code from failures.
- Additional SW self-monitoring features.
- Control of direct or indirect information received from the System about the operation of elements that are critical for the System function and/or the prevention of man-made failures.
- Real-time analysis of markers of possible alarms and/or System failures.
- Protection and control of data storage and data transmission channels where data loss is unacceptable.
Fig. 1. Interaction of the System and the Software Product
Additional elements of Integrated Reliability can be implemented through:
- Automatic self-testing functions of the SW;
- Duplication of critical code elements;
- Control of boundary and allowable parameters (values) of variables received from the System;
- Control of direct or indirect sensor readings of the System against permissible values;
- Control of time parameters/responses and their limits for available elements of the System;
- Operations that are not executed without confirmation, etc.
The functional and economic feasibility of introducing additional redundancy into the development process and the software product under Integrated Reliability can be optimized by evaluating the following:
Optimization | Elements and parameters of the System and the Software Product that are critical for operation and for technogenic consequences. |
That is, development efforts are prioritized for the elements (components, nodes, modules) of the System that determine functional reliability and the prevention of man-made failures, i.e. those critical for reliability:
Figure 2. Critical elements of the System
EDC Technology Components
EDC technology includes 3 basic components:
- Expertise - expert examination of critical points of the System and the Software Product.
- Duplication - identification of modules and elements of the System and the Software Product that require duplication to improve fault tolerance and reliability.
- Critical settings - definition of elements and parameters that require constant monitoring for permissible states and values.
These technology components define:
- Architecture and code configuration;
- Features and operating algorithms;
- Design and development procedures, tools and techniques;
- Depth, methods and means of testing.
Component 1 - Expertise
Algorithm for Examination of Critical Points of the System and Software Product
The examination of critical points of the System is based primarily on the accumulated experience and operating statistics of the System and Software Product developers. Industry experts are also involved.
The algorithm includes:
- Requirements gathering. Obtaining the main purpose, requirements, key business functionality, and business-critical nodes, points, parameters of the project from the customer (with the help of a special questionnaire: Crash and Risk Assessment Form).
- Failure assessment. Analysis of statistical and "historical" data on the operation of the System or, in the absence of such data, of its analogs. Determination of the conditions and probability of risks of critical failures and loss of performance.
- Critical module identification. Identification of the Software Product modules whose failure will lead to complete failure of the business functionality or make it impossible to use, i.e. the critical ones:
- Critical sensors and the parameters, events, and states they detect and/or measure,
- Time critical parameters/responses,
- Operations that are not executed without confirmation,
- Reliability issues of communication and communication channels,
- Data storage and transmission channels whose loss is not acceptable,
- Computation modules,
- Analytical modules,
- Internet access/autonomous mode,
- Power loss actions and procedures,
- Actions and procedures for application unavailability or hangs,
- Emergency shutdown procedures.
Based on this analysis, a decision is made and agreed upon regarding the modules and elements of the Software Product that require duplication to increase fault tolerance and reliability.
In addition, changes and adjustments are made (if necessary) to the Software Product architecture and to the development and testing methods and tools.
Component 2 - Duplication
Methods of Duplicating Software Modules
Considering that the sources of errors and failures in the System's operation are:
- Software bugs,
- Third-party system malfunctions,
- Hardware failures and malfunctions,
- User actions when working with the system,
the principle of functional and economic feasibility formulated earlier should be applied under Integrated Reliability to elements, nodes, and activities that are critical to the operation of the System.
Practically speaking, that means:
- The System must have the properties that are mandatory for its use.
- These must not be confused with properties that are merely desirable.
The use of EDC technology sets the following priority task in the design of the Software Product:
- Fault prevention - prevention of a failure or malfunction.
Solving this task requires mechanisms that detect (indicate) errors before control commands are generated. These mechanisms are, in turn, the main precondition for solving, if necessary, the other reliability tasks of the System:
- Fault removal - elimination of a fault,
- Fault tolerance - the ability of the System to operate in the presence of faults or failures,
- Fault forecasting - estimation of the possibility of a failure and its consequences.
Critical elements of the Code generally include:
- Computation (math) modules.
- Analyzing Modules.
- Logic Modules.
- Modules for exchanging information with Databases.
- Internet access modules.
- Offline Modules.
- Emergency shutdown module.
The fault prevention problem for critical Code modules is solved by means of different redundancy methods. The choice of duplication method takes into account the module function, the System operating conditions, and the possible causes of failures and errors.
The following methods are recommended in the practice of duplication:
Functional Duplication Method
This method is implemented in Code by duplicating the critical function f(x) with a duplicating function fD(x) and comparing their results.
Figure 3. Model of Functional Duplication method
The code element implementing the function fD(x) should be:
- Developed by another engineer or development team,
- Developed using alternative algorithms,
- Tested with tests built on an alternative approach.
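As an illustration only, the comparison of f(x) with fD(x) can be sketched in Python as follows. The wrapper name `with_functional_duplicate`, the exception `DuplicationMismatch`, the tolerance, and the example functions are assumptions made for this sketch and are not part of EDC itself.

```python
import math
from typing import Callable


class DuplicationMismatch(Exception):
    """Raised when f(x) and fD(x) disagree beyond the tolerance."""


def with_functional_duplicate(
    f: Callable[[float], float],
    f_d: Callable[[float], float],
    rel_tol: float = 1e-9,
) -> Callable[[float], float]:
    """Wrap f so every result is cross-checked against the duplicate f_d."""

    def checked(x: float) -> float:
        primary = f(x)
        duplicate = f_d(x)
        if not math.isclose(primary, duplicate, rel_tol=rel_tol):
            # The error is detected before the value reaches a control command.
            raise DuplicationMismatch(f"f({x})={primary!r}, fD({x})={duplicate!r}")
        return primary

    return checked


# Hypothetical critical function: length of the vector (x, 1).
def f(x: float) -> float:
    return math.hypot(x, 1.0)  # primary: library routine


def f_d(x: float) -> float:
    return math.sqrt(x * x + 1.0)  # duplicate: alternative algorithm


safe_f = with_functional_duplicate(f, f_d)
```

In a real project the two implementations would come from different engineers, as the list above requires; here they differ only in the algorithm used.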
Inverse Function Duplication Method
This method is implemented only for:
- Functions having an inverse function,
- Functions for which an inverse function can be defined within a strictly defined range of operating values of the System.
Duplicating the critical function f(x) using the inverse function finv is shown below:
Figure 4. Model of Inverse Function duplication method
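A minimal sketch of this check, assuming the result of f(x) is passed through finv and compared with the original argument. The names `with_inverse_check` and `InverseCheckFailed`, the tolerances, and the example function (squaring, which is invertible only for non-negative x) are hypothetical choices for illustration.

```python
import math
from typing import Callable


class InverseCheckFailed(Exception):
    """Raised when finv(f(x)) does not reproduce x."""


def with_inverse_check(
    f: Callable[[float], float],
    f_inv: Callable[[float], float],
    low: float,
    high: float,
    rel_tol: float = 1e-9,
    abs_tol: float = 1e-12,
) -> Callable[[float], float]:
    """Wrap f so each result is verified through the inverse function."""

    def checked(x: float) -> float:
        # The inverse is only trusted inside the declared operating range.
        if not low <= x <= high:
            raise ValueError(f"x={x} is outside the operating range [{low}, {high}]")
        y = f(x)
        recovered = f_inv(y)
        if not math.isclose(recovered, x, rel_tol=rel_tol, abs_tol=abs_tol):
            raise InverseCheckFailed(f"finv(f({x}))={recovered!r}")
        return y

    return checked


def square(x: float) -> float:
    return x * x


# math.sqrt inverts square() only for x >= 0, hence the restricted range.
safe_square = with_inverse_check(square, math.sqrt, low=0.0, high=1000.0)
```

The range restriction corresponds to the second case in the list above: the inverse exists only within the operating values of the System.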
Random Component Duplication Method
This method is implemented in Code only for functions that satisfy the condition:
f(x) = f(x + Random_x) - fR(Random_x)
where Random_x is a random offset of the argument and fR(Random_x) is a function that compensates for the random component, either in general or within a strictly defined range of operating values of the System.
Duplicating the critical function f(x) using the Random Component method is shown below:
Figure 5. Model of Random Component duplication method
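The condition above holds, for example, for a linear function f(x) = a*x + b with the compensating function fR(r) = a*r. The sketch below uses that case; the coefficients, the offset range, and the names `with_random_component` and `RandomComponentMismatch` are assumptions chosen for illustration.

```python
import math
import random
from typing import Callable, Optional, Tuple


class RandomComponentMismatch(Exception):
    """Raised when f(x) differs from f(x + r) - fR(r)."""


def with_random_component(
    f: Callable[[float], float],
    f_r: Callable[[float], float],
    offset_range: Tuple[float, float],
    rng: Optional[random.Random] = None,
    rel_tol: float = 1e-9,
    abs_tol: float = 1e-9,
) -> Callable[[float], float]:
    """Wrap f so each result is re-derived from a randomly shifted argument."""
    generator = rng or random.Random()

    def checked(x: float) -> float:
        direct = f(x)
        offset = generator.uniform(*offset_range)
        # Second evaluation uses a different argument, so a stuck or cached
        # output cannot satisfy both computations at once.
        shifted = f(x + offset) - f_r(offset)
        if not math.isclose(direct, shifted, rel_tol=rel_tol, abs_tol=abs_tol):
            raise RandomComponentMismatch(f"f({x})={direct!r}, shifted={shifted!r}")
        return direct

    return checked


# Hypothetical critical function f(x) = 3x + 7 and its compensator fR(r) = 3r.
def linear(x: float) -> float:
    return 3.0 * x + 7.0


def linear_compensator(r: float) -> float:
    return 3.0 * r


safe_linear = with_random_component(linear, linear_compensator, (1.0, 10.0))
```

Because the second evaluation runs on a shifted argument, this variant can expose faults that two identical calls with the same x would both reproduce.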
Duplication Over Time Method
The essence of this method is to repeatedly calculate the function f(x) with a time delay and compare the results.
Example: If the operating conditions of the System are such that the main causes of failures and errors are electrical and electromagnetic disturbances of various kinds, duplication over time (optionally combined with the methods above) implements fault prevention.
Figure 6. Model of Duplication Over Time method
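A minimal sketch of repeated calculation with a delay, assuming that a transient disturbance affects at most one of the evaluations. The number of repeats, the delay, and the names `with_time_duplication` and `TimeDuplicationMismatch` are hypothetical.

```python
import math
import time
from typing import Callable


class TimeDuplicationMismatch(Exception):
    """Raised when repeated evaluations of f(x) disagree."""


def with_time_duplication(
    f: Callable[[float], float],
    repeats: int = 2,
    delay_s: float = 0.01,
    rel_tol: float = 1e-9,
    abs_tol: float = 1e-12,
) -> Callable[[float], float]:
    """Wrap f so it is evaluated several times, separated by a delay."""
    if repeats < 2:
        raise ValueError("at least two evaluations are needed for a comparison")

    def checked(x: float) -> float:
        results = []
        for attempt in range(repeats):
            if attempt > 0:
                time.sleep(delay_s)  # let a transient disturbance pass
            results.append(f(x))
        first = results[0]
        for later in results[1:]:
            if not math.isclose(first, later, rel_tol=rel_tol, abs_tol=abs_tol):
                raise TimeDuplicationMismatch(f"f({x}) gave {results!r}")
        return first

    return checked


# Hypothetical critical function.
def double(x: float) -> float:
    return 2.0 * x


safe_double = with_time_duplication(double, repeats=3, delay_s=0.001)
```

Whether a mismatch should raise, trigger a retry, or fall back to a safe state depends on the System; raising an exception here is only the simplest choice for the sketch.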
The methods of duplication are not limited to those mentioned. The key consideration when selecting duplication methods and their combinations is to take into account the function of the module, the operating conditions of the System, and the potential causes and sources of failures and errors.
Component 3 - Critical Settings
Parameters and Elements Requiring Constant Monitoring
When the Software Product is developed, EDC technology addresses the following important tasks of implementing the Integrated Reliability concept in the System:
- Control of direct or indirect information received from the System about the operation of elements that are critical for the System function and/or the prevention of man-made failures, including direct, indirect, or duplicate control of:
- Boundary and allowable parameters (values) of variables obtained from the System;
- Direct or indirect readings of the System sensors against the permissible values and/or deviations from the prescribed values.
Variable value data or sensor readings are stored in the Database as a matrix of feature descriptions of objects or numerical tables of values. In this case, all information received by the Software Product from the System is constantly monitored for compliance with the attribute and/or allowable values.
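The monitoring of received values against allowable limits can be sketched as below. The sensor names, the limit values, and the in-memory table standing in for the Database are all hypothetical; a real System would load the limits from its own storage.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Allowable range for one monitored value (bounds inclusive)."""
    low: float
    high: float


# Hypothetical table of allowable values; stands in for the Database.
LIMITS: dict[str, Limit] = {
    "temperature_c": Limit(-20.0, 85.0),
    "pressure_kpa": Limit(90.0, 250.0),
}


def check_readings(
    readings: dict[str, float],
    limits: dict[str, Limit] = LIMITS,
) -> list[str]:
    """Return a description of every missing or out-of-range reading."""
    violations = []
    for name, limit in limits.items():
        value = readings.get(name)
        if value is None:
            violations.append(f"{name}: missing reading")
        elif not limit.low <= value <= limit.high:
            # NaN also lands here, because every comparison with NaN is False.
            violations.append(f"{name}: {value} outside [{limit.low}, {limit.high}]")
    return violations
```

For example, `check_readings({"temperature_c": 120.0})` reports two violations: the temperature is out of range and the pressure reading is missing.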
Duplication of System control functions in the Software Product is an important element of Integrated Reliability.
- Control of time parameters/responses and their limits for available elements of the System.
- Introduction of mandatory confirmation for critical executive operations, i.e. operations that are not executed without confirmation.
- Creation of a Statistical-Analyzing module of the Software Product that accumulates and systematizes statistical information on the System operation in order to predict and prevent emergency situations or the loss of mandatory properties of the System.
Data from the Statistical-Analyzing module is important for developing the System and creating its improved modifications, as well as for integrating and using Artificial Intelligence to improve System performance.
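The control of response time limits and the mandatory confirmation of critical operations listed above can be illustrated with two small helpers. Their names, the injected clock, and the boolean confirmation flag are assumptions made for this sketch; the time check is performed after the operation returns and does not interrupt a hung call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ResponseTimeout(Exception):
    """Raised when an operation exceeds its allowed response time."""


class ConfirmationRequired(Exception):
    """Raised when a critical operation is requested without confirmation."""


def call_with_time_limit(
    operation: Callable[[], T],
    limit_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Run the operation and reject its result if it took longer than limit_s."""
    start = clock()
    result = operation()
    elapsed = clock() - start
    if elapsed > limit_s:
        raise ResponseTimeout(f"response took {elapsed:.3f}s, limit is {limit_s:.3f}s")
    return result


def execute_critical(operation: Callable[[], T], confirmed: bool) -> T:
    """Run a critical operation only after explicit confirmation."""
    if not confirmed:
        raise ConfirmationRequired("critical operation was not confirmed")
    return operation()
```

Passing the clock as a parameter keeps the time check testable without real delays; a production version would more likely need a preemptive timeout or a watchdog.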
Case Studies
- Reliable communication and storage components in unstable internet connectivity conditions, for leading Japanese hardware manufacturer Sodick.
- Reliable communication with third-party services that may not always operate, for Texas-based e-learning company World Education.
- Reliable alarm notifications for emergency and safety-related scenarios, for a Swiss-based watch vendor Aidwatch.