Integrated Reliability
Utility model for software products
Introduction
In the era of digital industry, many processes are becoming automated, making them more autonomous and less dependent on human intervention. Consequently, the demands on software reliability are constantly increasing. Ignoring these requirements can lead to harmful consequences for people and to dangerous man-made (technogenic) incidents.
To reduce these negative outcomes and enhance the reliability of software products and equipment, Implemica has developed a specialized software development technology called EDC (Expertise, Duplication, Critical Settings) and implemented it in its production processes.
Concept
The concept behind this technology is integrated reliability.
Concept | Integrated reliability is the development of a software product as an element of a system in a way that improves the overall reliability of the system. |
That is, any software product that operates as a component of the System should be developed so that it brings (integrates) an increase in overall reliability into the System.
This Concept is realized as follows. When a Software Product (Program Product, PP) is developed as a part of the System (Fig. 1), the following are additionally and purposefully implemented:
- Protection of critical elements and functions of the PP code from failures.
- Additional PP self-monitoring features.
- Control of direct or indirect information received from the System on the operation of elements that are critical for the realization of the System function and/or for man-made failures.
- Real-time analysis of markers of possible alarms and/or System failures.
- Protection and control of data storage and data transmission channels whose loss is unacceptable.
Fig. 1. Interaction of the System and the Software Product
Additional elements of Integrated Reliability can be implemented through:
- Automatic self-testing functions of the PP;
- Duplication of critical code elements;
- Control of boundary and allowable values of variables received from the System;
- Control of direct or indirect System sensor readings against permissible values;
- Control of time parameters/responses and their limits for available elements of the System;
- Operations that are not executed without confirmation, etc.
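As a minimal sketch of the last two items in the list above, assuming Python and hypothetical helper names (`run_confirmed`, `call_with_time_limit`) that are not part of the EDC description, a confirmation gate and a response-time check might look like this:

```python
import time
from typing import Callable


class ConfirmationRequired(Exception):
    """Raised when a critical operation is attempted without confirmation."""


class ResponseTimeout(Exception):
    """Raised when a System element answers later than its allowed limit."""


def run_confirmed(operation: Callable[[], object], confirmed: bool) -> object:
    # Hypothetical gate: the critical operation runs only after explicit confirmation.
    if not confirmed:
        raise ConfirmationRequired("critical operation was not confirmed")
    return operation()


def call_with_time_limit(request: Callable[[], object], limit_s: float) -> object:
    # Measure how long the System element took to respond and reject late answers.
    started = time.monotonic()
    result = request()
    elapsed = time.monotonic() - started
    if elapsed > limit_s:
        raise ResponseTimeout(f"response took {elapsed:.3f}s, limit is {limit_s:.3f}s")
    return result
```

Note that this time check only rejects a late answer after it arrives; it does not interrupt a request that never returns, which would need a separate watchdog.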
The functional and economic feasibility of introducing additional redundancy into the development process and the Software Product under Integrated Reliability can be optimized as follows.
Optimization | Evaluate the elements and parameters of the System and the Software Product that are critical for operation and for technogenic consequences. |
That is, development efforts are prioritized toward the elements (components, nodes, modules) of the System that determine functional reliability and the prevention of man-made accidents, i.e. those critical for reliability:
Figure 2. Critical elements of the System
EDC Technology Components
EDC technology includes 3 basic components:
- Expertise - expert examination of critical points of the System and the Software Product.
- Duplication - identification of modules and elements of the System and the Software Product that require duplication to improve fault tolerance and reliability.
- Critical settings - definition of elements and parameters that require constant monitoring for permissible states and values.
These technology components define:
- Architecture and code configuration;
- Features and operating algorithms;
- Design and development procedures, tools and techniques;
- Depth, methods and means of testing.
Component 1 - Expertise
Algorithm for Examination of Critical Points of the System and Software Product
The examination of the System's critical points is based primarily on the accumulated experience and operating statistics of the System and Software Product developers. Industry experts are also involved.
The algorithm includes:
- Obtaining the main purpose and key business functionality of the project from the customer/developer by means of a special questionnaire.
- Requesting the critical nodes, points, and parameters from the Customer (Crash and Risk Assessment Form).
- Analyzing statistical and "historical" data on the System's operation or, in the absence of such data, on its analogs. Determining the conditions and probability of the risks of critical failures and loss of performance.
- Determining the requirements for the Software Product.
- Identifying the Software Product modules whose failure leads to complete failure or makes the business functionality unusable, i.e. the critical modules.
- Critical sensors and the parameters, events, and states they detect and/or measure.
- Time-critical parameters/responses.
- Operations that are not executed without confirmation.
- Reliability of communication and communication channels.
- Data storage and transmission channels whose loss is not acceptable.
- Computational (math) modules.
- Analyzing modules.
- Internet access/autonomous mode.
- Actions on loss of power.
- Behavior when the program "hangs" or is unavailable.
- Emergency shutdowns.
On the basis of the received information, the decision on the modules and elements of the Software Product requiring duplication to increase fault tolerance and reliability is made and agreed upon. In addition, changes and adjustments are made to the Software Product architecture, development, and testing methods and tools.
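The outcome of the examination can be recorded in a simple register. The sketch below is only an illustration: the `CriticalPoint` fields, the `select_for_duplication` helper, and the 0.01 probability threshold are assumptions made here, not the contents of the Crash and Risk Assessment Form.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CriticalPoint:
    name: str                       # module, sensor, or channel under examination
    failure_probability: float      # estimated from operating statistics, 0..1
    blocks_business_function: bool  # True if failure makes key functionality unusable


def select_for_duplication(
    points: Iterable[CriticalPoint], probability_threshold: float = 0.01
) -> List[CriticalPoint]:
    # Assumed rule: propose duplication for points that are critical AND whose
    # estimated failure probability is not negligible, most likely failures first.
    selected = [
        p for p in points
        if p.blocks_business_function and p.failure_probability >= probability_threshold
    ]
    return sorted(selected, key=lambda p: p.failure_probability, reverse=True)
```

The resulting list is a proposal that still has to be agreed with the Customer, as described above.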
Component 2 - Duplication
Methods of Duplicating Program Modules
The sources of errors and failures in the System's operation are:
- Software bugs.
- Hardware failures and malfunctions.
- User actions when working with the system.
The principle of functional and economic feasibility formulated earlier should be applied under Integrated Reliability to elements, nodes, and activities that are critical to the operation of the System.
Practically speaking, this means ensuring that the System retains the properties that are mandatory for its use, as distinct from those that are merely desirable.
EDC technology sets the following priority task in the design of the Software Product:
- Fault prevention - prevention of failure or malfunction.
Solving this task requires mechanisms that detect (indicate) errors before control commands are generated. Fault prevention is therefore the main precondition for solving, where necessary, the other reliability tasks for the System:
- Fault removal - remedying a fault or failure
- Fault tolerance - the ability of the system to operate in the presence of faults or failures
- Fault forecasting - estimates of the possibility of failure and its consequences.
Critical elements of the Code generally include:
- Computational (math) modules.
- Analyzing Modules.
- Logic Modules.
- Modules for exchanging information with Databases.
- Internet access modules.
- Offline Modules.
- Emergency shutdown module.
The fault prevention task for critical Code modules is solved by means of various redundancy methods. The choice of duplication method takes into account the module function, the System operating conditions, and the possible causes of failures and errors.
The following duplication methods are recommended in practice:
Functional Duplication Method
This method is implemented in Code by duplicating the critical function f(x) with a duplicating function fD(x).
Figure 3. Model of functional duplication
The code element implementing the function fD(x) should be:
- Developed by an alternative developer/development team.
- Developed using alternative algorithms.
- Tested with tests developed using alternative approaches.
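A minimal Python sketch of functional duplication follows. The mean calculation, the names `mean_primary`, `mean_duplicate`, and `run_duplicated`, and the tolerances are assumptions chosen for illustration; the point is only that f(x) and fD(x) use different algorithms and that a disagreement is reported before the result is used.

```python
import math
from typing import Callable, Sequence


class DuplicationMismatch(Exception):
    """Raised when f(x) and fD(x) disagree beyond the tolerance."""


def mean_primary(values: Sequence[float]) -> float:
    # f(x): straightforward sum divided by count.
    return sum(values) / len(values)


def mean_duplicate(values: Sequence[float]) -> float:
    # fD(x): alternative algorithm, an incrementally updated running mean.
    mean = 0.0
    for count, value in enumerate(values, start=1):
        mean += (value - mean) / count
    return mean


def run_duplicated(
    f: Callable, f_dup: Callable, x, rel_tol: float = 1e-9, abs_tol: float = 1e-12
) -> float:
    # Compute both results and compare them before anything acts on the value.
    primary = f(x)
    duplicate = f_dup(x)
    if not math.isclose(primary, duplicate, rel_tol=rel_tol, abs_tol=abs_tol):
        raise DuplicationMismatch(f"f(x)={primary!r} but fD(x)={duplicate!r}")
    return primary
```

A tolerance is used instead of exact equality because two different floating-point algorithms may legitimately differ in the last digits.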
Inverse Function Duplication Method
This method is implemented in the Code only for:
- Functions having an inverse function
- Functions allowing to define the inverse function in a strictly defined range of operating values of the System.
Duplication of the critical function f(x) using the inverse function finv is shown in Figure 4.
Figure 4. Model of inverse function duplication
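The idea can be sketched in Python as follows, assuming a simple temperature conversion as the critical function. The names `celsius_to_fahrenheit`, `fahrenheit_to_celsius`, and `checked_by_inverse` are hypothetical; the check applies finv to the result of f(x) and verifies that the original argument is recovered.

```python
import math
from typing import Callable


class InverseCheckFailed(Exception):
    """Raised when finv(f(x)) does not reproduce x."""


def celsius_to_fahrenheit(c: float) -> float:
    # f(x): the critical function in this example.
    return c * 9.0 / 5.0 + 32.0


def fahrenheit_to_celsius(f: float) -> float:
    # finv(y): the inverse function used for the check.
    return (f - 32.0) * 5.0 / 9.0


def checked_by_inverse(
    f: Callable[[float], float],
    f_inv: Callable[[float], float],
    x: float,
    rel_tol: float = 1e-9,
    abs_tol: float = 1e-9,
) -> float:
    # Compute y = f(x), map it back with finv, and compare with the input.
    y = f(x)
    x_back = f_inv(y)
    if not math.isclose(x_back, x, rel_tol=rel_tol, abs_tol=abs_tol):
        raise InverseCheckFailed(f"finv(f({x!r})) returned {x_back!r}")
    return y
```

For functions that are invertible only on part of their domain, the caller would additionally need to confirm that x lies inside the operating range before relying on this check.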
Random Component Duplication Method
This method is implemented in Code only for functions that satisfy the condition:
f(x) = f(x + Random) - fR(Random)
where Random is a random increment of the argument, and fR(Random) is a function that compensates for the random component, either in general or within a strictly defined range of operating values of the System.
Duplication of the critical function f(x) using the "random component" method is shown in Figure 5.
Figure 5. Model of duplication with random component
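As an illustration, a linear function f(x) = k*x + b satisfies the condition with fR(Random) = k*Random. The Python sketch below assumes such a function; `checked_with_random_component`, the `spread` of the random increment, and the tolerances are hypothetical choices, not values from the EDC description.

```python
import math
import random
from typing import Callable, Optional


class RandomComponentMismatch(Exception):
    """Raised when f(x) differs from f(x + Random) - fR(Random)."""


def checked_with_random_component(
    f: Callable[[float], float],
    f_r: Callable[[float], float],
    x: float,
    rng: Optional[random.Random] = None,
    spread: float = 100.0,
    rel_tol: float = 1e-9,
    abs_tol: float = 1e-9,
) -> float:
    rng = rng or random.Random()
    r = rng.uniform(-spread, spread)   # the random component added to the argument
    direct = f(x)                      # f(x)
    shifted = f(x + r) - f_r(r)        # f(x + Random) - fR(Random)
    if not math.isclose(direct, shifted, rel_tol=rel_tol, abs_tol=abs_tol):
        raise RandomComponentMismatch(f"direct={direct!r}, shifted={shifted!r}")
    return direct
```

Because the second evaluation uses a different argument, a fault that affects only particular input values (for example, a corrupted table entry) is unlikely to distort both results in the same way.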
Duplication Over Time Method
If the operating conditions of the System are such that the main causes of failures and errors are various kinds of electrical and electromagnetic disturbances, fault prevention is achieved by duplication over time, alone or in combination with the methods described above.
The essence of this method is to repeatedly calculate the function f(x) with a time delay and compare the results.
Figure 6. Model of duplication over time
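A minimal Python sketch of duplication over time is given below. The helper name `checked_over_time`, the number of repeats, and the delay are assumptions; the results are compared for exact equality, which presumes that f(x) is deterministic.

```python
import time
from typing import Callable


class TransientFaultDetected(Exception):
    """Raised when repeated evaluations of f(x) disagree."""


def checked_over_time(
    f: Callable[[float], float], x: float, repeats: int = 3, delay_s: float = 0.01
) -> float:
    # Evaluate f(x) several times, pausing between evaluations so that a short
    # disturbance is unlikely to affect every run in the same way.
    results = []
    for attempt in range(repeats):
        if attempt > 0:
            time.sleep(delay_s)
        results.append(f(x))
    if any(result != results[0] for result in results[1:]):
        raise TransientFaultDetected(f"results differ: {results!r}")
    return results[0]
```

The delay should be chosen longer than the expected duration of a disturbance, at the cost of a slower response.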
The duplication methods are not limited to those listed. When selecting duplication methods and their combinations, the main considerations are the function of the module, the operating conditions of the System, and the possible causes and sources of failures and errors.
Component 3 - Critical Settings
Parameters and Elements Requiring Constant Monitoring
When developing the Software Product, EDC technology addresses the following important tasks in realizing the integrated reliability concept in the System:
- Control of direct or indirect information received from the System on the operation of elements that are critical for the realization of the System function and/or for man-made failures, including direct, indirect, or duplicate control of:
- Boundary and allowable values of variables obtained from the System;
- Direct or indirect System sensor readings against the permissible values and/or deviations from the prescribed values.
Variable values and sensor readings are stored in the Database as a matrix of feature descriptions of objects or as numerical tables of values. All information received by the Software Product from the System is constantly monitored for compliance with the expected attributes and/or allowable values.
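Such monitoring can be sketched in Python as a table of allowable ranges against which every incoming reading is checked. The parameter names and limits in `LIMITS` are invented for this example, as are `AllowedRange` and `find_violations`; in practice the ranges would come from the examination of critical points.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AllowedRange:
    low: float
    high: float


# Hypothetical parameters and limits, for illustration only.
LIMITS: Dict[str, AllowedRange] = {
    "coolant_temp_c": AllowedRange(5.0, 95.0),
    "line_pressure_bar": AllowedRange(1.0, 6.5),
}


def find_violations(readings: Dict[str, float], limits: Dict[str, AllowedRange]) -> List[str]:
    # Compare each reading received from the System with its allowable range.
    violations = []
    for name, value in readings.items():
        allowed = limits.get(name)
        if allowed is None:
            violations.append(f"{name}: no allowable range defined")
        elif not (allowed.low <= value <= allowed.high):
            # A NaN reading also lands here, because every comparison with NaN is False.
            violations.append(f"{name}: {value} outside [{allowed.low}, {allowed.high}]")
    return violations
```

A reading without a defined range is reported rather than silently accepted, so that unmonitored inputs are noticed.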
Duplication of System control functions in the Software Product is an important element of Integrated Reliability.
- Control of time parameters/responses and their limits for available elements of the System.
- Mandatory confirmation for critical executive operations, i.e. operations that are not executed without confirmation.
- Creation of a Statistical-Analyzing module of the Software Product, which accumulates and systematizes statistical information on the System's operation in order to predict and prevent emergency situations or the loss of mandatory properties of the System.
Data from the Statistical-Analyzing module are important for the development of the System and the creation of its improved modifications, as well as for integrating and utilizing Artificial Intelligence to improve System performance.
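One very simple form such a module could take is a sliding window over recent readings that raises a warning before a hard limit is reached. The Python sketch below is an assumption made for illustration: `TrendMonitor`, the window size, and the warning level are hypothetical, and a real module would use whatever statistics suit the System.

```python
from collections import deque
from statistics import mean


class TrendMonitor:
    """Accumulates recent readings and warns when their average gets too high."""

    def __init__(self, window: int, warning_level: float) -> None:
        self._values = deque(maxlen=window)  # keeps only the latest `window` readings
        self._warning_level = warning_level

    def add(self, value: float) -> bool:
        # Store the reading and report whether the window average has reached
        # the warning level. No warning is given until the window is full.
        self._values.append(value)
        window_full = len(self._values) == self._values.maxlen
        return window_full and mean(self._values) >= self._warning_level
```

Setting the warning level below the allowable limit gives time to react before the limit itself is violated.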
Case Studies
- Reliable ticket delivery for an online ticket store, where delivery time is crucial, for the Chicago-based ticket seller Score Tickets.
- Reliable data transfer from hardware devices to the cloud, even with an unstable internet connection, for the leading Japanese hardware manufacturer Sodick.
- Reliable third-party API integrations, even with the instability and occasional unavailability of those APIs, for the Texas-based e-learning company World Education.
- Reliable notifications for emergency and safety-related scenarios, for the Swiss-based watch vendor Aidwatch.