June 16, 2024

Integrated Reliability

Utility model for software products

Introduction

In the digital industry era, numerous processes are being automated, increasing autonomy and reducing dependence on human intervention. Consequently, the demands for software reliability are constantly increasing.
Neglecting these requirements can result in adverse outcomes for individuals and hazardous incidents.

To mitigate these negative outcomes and enhance the reliability of software products and hardware, Implemica has developed a specialized software development technology called EDC (Expertise, Duplication, Critical Settings) and introduced it into its production processes.

Concept

The concept behind this technology is Integrated Reliability.

Integrated Reliability means developing a software product as a system component that enhances the system's overall reliability.

That is, any software product functioning as a component of the system should be designed to contribute to the system's overall reliability.

This Concept is implemented as follows. When a Software Product (SW) is developed as part of the System (Figure 1), the following are additionally and purposefully implemented:

Protection of critical elements and functions of the Software Product and the System from failures.
Additional self-monitoring features of the Software Product.
Control of direct or indirect information received from the System on the operation of elements that are critical for performing the System function and/or for preventing man-made failures.
Real-time analysis of markers of possible alarms and/or System failures.
Protection of data storage and data transmission channels where data loss is unacceptable.

Figure 1. Interaction of the System and the Software Product

Additional elements of Integrated Reliability can be implemented through:

Automatic self-testing functions of the Software Product;
Duplication of critical code elements;
Control of boundary and allowable parameters (values) of variables received from the System;
Control of direct or indirect sensor readings of the System against permissible values;
Control of time parameters/responses and their limits for available elements of the System;
Operations that are not executed without confirmation, etc.
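As an illustration of the first item in the list, automatic self-testing can be approximated with a known-answer check that runs before a critical function is used. The sketch below is only one possible shape of such a check; `clamp_percent`, `KNOWN_ANSWERS`, and `run_self_test` are hypothetical names introduced for this example and are not part of EDC itself.

```python
def clamp_percent(value):
    """Hypothetical critical function: limit a setpoint to the range 0..100."""
    return max(0, min(100, value))


# Known-answer pairs (input, expected output) chosen by the developer.
KNOWN_ANSWERS = [(-5, 0), (50, 50), (150, 100)]


def run_self_test(func, known_answers):
    """Return (input, expected, actual) for every check that fails."""
    failures = []
    for arg, expected in known_answers:
        actual = func(arg)
        if actual != expected:
            failures.append((arg, expected, actual))
    return failures


if __name__ == "__main__":
    failures = run_self_test(clamp_percent, KNOWN_ANSWERS)
    print("self-test passed" if not failures else f"self-test failed: {failures}")
```

An empty result means every known answer was reproduced; a non-empty result can be used to block the critical function from being called.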

The functional and economic feasibility of introducing additional redundancy into the development process and software product under Integrated Reliability can be optimized by evaluating the elements and parameters of the System and the Software Product that are critical for operation and for technogenic consequences.

That is, development efforts are prioritized toward the elements (components, nodes, modules) of the System that determine functional reliability and the prevention of man-made failures, i.e. those critical for reliability:

Figure 2. Critical elements of the System

EDC Technology Components

EDC technology comprises 3 fundamental components:

Expertise - assessment of critical points within the System and the Software Product.
Duplication - identifying modules and elements within the System and Software Product that require duplication to enhance fault tolerance and reliability.
Critical settings - defining elements and parameters that require continuous monitoring to maintain acceptable states and values.

These technology components define:

Architecture and code configuration;
Features and operating algorithms;
Design and development procedures, tools and techniques;
Depth, methods and means of testing.

Component 1 - Expertise

Algorithm for Examination of Critical Points of the System and Software Product

Assessing critical points of the System relies primarily on the accumulated experience and operational data of the System and Software Product engineers. Industry experts are also involved.

The process includes:

Requirements gathering. Gathering information on the project's primary objectives, requirements, key business functionalities, and critical components from the customer, utilizing tools such as the Crash and Risk Assessment Form.
Failure assessment. Analysis of statistical and historical data on the System's operation or, in the absence of such data, on its analogs. Determination of the conditions and probability of critical failures and loss of performance.
Critical Module Identification. Identification of the Software Product modules whose failure would cause complete failure of the business functionality or make it unusable, i.e. the critical modules:
- Critical sensors and the parameters, events, and states they detect and/or measure,
- Time-critical parameters/responses,
- Operations that are not executed without confirmation,
- Reliability issues of communication and communication channels,
- Data storage and transmission channels where data loss is not acceptable,
- Computation modules,
- Analytical modules,
- Internet access/autonomous mode,
- Power loss actions and procedures,
- Application unavailability/hang actions and procedures,
- Emergency shutdown procedures.

Based on this analysis, a decision is made and agreed upon regarding the modules and elements of the Software Product that require duplication to increase fault tolerance and reliability.
In addition, the Software Product architecture and the development and testing methods and tools are adjusted where necessary.

Component 2 - Duplication

Methods of Duplicating Software Modules

Considering that the primary sources of errors and failures in a system's operation include:

software bugs,
hardware failures and malfunctions,
human errors,
third-party system malfunctions,

the principle of functional and economic feasibility formulated earlier should be applied under Integrated Reliability to elements, nodes, and activities that are critical to the operation of the System.

Practically speaking, this means the System's ability to have the properties required for use (not to be confused with desirable features).

EDC technology sets the following priority task in the design of the Software Product:

Fault prevention - prevention of failure or malfunction.

Because solving this task requires mechanisms that detect (indicate) errors before control commands are issued, it establishes the main precondition for addressing the other reliability tasks in the System, if necessary:

Fault removal - repairing, replacing, or updating components,
Fault tolerance - the ability to operate in the presence of faults or failures,
Fault forecasting - estimates of the possibility of failure and its consequences.

Critical elements of a System are usually among:

Business logic modules,
Calculation and mathematical modules,
Analytical modules,
Third-party integration modules,
Modules for exchanging information with a database,
Internet access modules,
Offline modules,
Emergency shutdown modules,
other business-specific critical modules.

Fault prevention for critical modules is achieved through different redundancy methods. The choice of duplication method takes into account the module's function, the operating conditions of the System, and the possible causes of failures and errors.

The following methods, among others, are used in duplication practice, taking f(x) as an example/abstraction of any business operation:

Functional Duplication Method

This method is implemented in code by pairing the critical function f(x) with a duplicating function f_D(x).

Figure 3. Model of Functional Duplication method

The code element implementing the function f_D(x) has the following features:

It is developed by another engineer or development team,
It is developed using alternative algorithms,
Its tests are developed with an alternative approach.
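A minimal sketch of Functional Duplication, assuming the results of f(x) and f_D(x) are compared before the value is used. The mean calculation and the names `mean_primary`, `mean_duplicate`, `checked`, and `DuplicationMismatch` are hypothetical and chosen only for illustration; in practice f_D(x) would be written by a different engineer or team.

```python
import math


class DuplicationMismatch(Exception):
    """Raised when f(x) and f_D(x) disagree beyond the tolerance."""


def mean_primary(values):
    """f(x): arithmetic mean computed from the total sum."""
    return sum(values) / len(values)


def mean_duplicate(values):
    """f_D(x): the same quantity via an incremental (running) mean."""
    mean = 0.0
    for count, value in enumerate(values, start=1):
        mean += (value - mean) / count
    return mean


def checked(primary, duplicate, rel_tol=1e-9, abs_tol=1e-12):
    """Wrap two implementations so a result is returned only if they agree."""
    def wrapper(x):
        result = primary(x)
        control = duplicate(x)
        if not math.isclose(result, control, rel_tol=rel_tol, abs_tol=abs_tol):
            raise DuplicationMismatch(f"f(x)={result!r}, f_D(x)={control!r}")
        return result
    return wrapper


safe_mean = checked(mean_primary, mean_duplicate)

if __name__ == "__main__":
    print(safe_mean([1.0, 2.0, 3.0, 4.0]))
```

A tolerance is used instead of strict equality because two different algorithms may legitimately differ in the last floating-point digits.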

Inverse Function Duplication Method

This method is implemented only for:

Functions having an inverse function,
Functions for which an inverse function can be defined within a strictly defined range of operating values of the System.

Duplicating the critical function f(x) using the inverse function f^inv is shown below:

Figure 4. Model of Inverse Function duplication method
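One way to read this method in code, assuming the check is that applying the inverse function to the result recovers the original argument. The Celsius-to-Fahrenheit conversion and the names `f`, `f_inv`, and `f_checked` are hypothetical stand-ins for a real business operation.

```python
import math


def f(celsius):
    """f(x): convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def f_inv(fahrenheit):
    """f^inv(y): convert degrees Fahrenheit back to degrees Celsius."""
    return (fahrenheit - 32.0) * 5.0 / 9.0


def f_checked(x, rel_tol=1e-9, abs_tol=1e-9):
    """Return f(x) only if f_inv(f(x)) reproduces x within the tolerance."""
    y = f(x)
    x_back = f_inv(y)
    if not math.isclose(x, x_back, rel_tol=rel_tol, abs_tol=abs_tol):
        raise ArithmeticError(f"inverse check failed: x={x!r}, f_inv(f(x))={x_back!r}")
    return y


if __name__ == "__main__":
    print(f_checked(100.0))
```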

Random Component Duplication Method

This method is implemented in code only for functions that satisfy the condition:

f(x) = f(x + Random_x) - f_R(Random_x)

where Random_x is a random offset of the argument and f_R(Random_x) is a function that compensates for the random component, either in general or within a strictly defined range of operating values of the System.

Duplicating the critical function f(x) using the Random Component method is shown below:

Figure 5. Model of Random Component duplication method
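The condition above holds, for example, for a linear function f(x) = k·x with f_R(Random_x) = k·Random_x. The sketch below uses that case only as an illustration; `GAIN`, `f`, `f_r`, and `f_checked` are hypothetical names, and the offset range is an arbitrary assumption.

```python
import math
import random

GAIN = 2.5  # hypothetical scaling coefficient k


def f(x):
    """f(x) = k * x, a linear function that satisfies the condition."""
    return GAIN * x


def f_r(offset):
    """f_R(Random_x): compensates the contribution of the random offset."""
    return GAIN * offset


def f_checked(x, rng=random, rel_tol=1e-9, abs_tol=1e-9):
    """Return f(x) only if the shifted-and-compensated calculation agrees."""
    offset = rng.uniform(1.0, 10.0)          # Random_x
    direct = f(x)
    shifted = f(x + offset) - f_r(offset)    # f(x + Random_x) - f_R(Random_x)
    if not math.isclose(direct, shifted, rel_tol=rel_tol, abs_tol=abs_tol):
        raise ArithmeticError(f"random component check failed: {direct!r} vs {shifted!r}")
    return direct


if __name__ == "__main__":
    print(f_checked(4.0))
```

Because the second calculation runs on a different argument each time, a fault tied to one specific input value is less likely to produce the same wrong answer twice.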

Duplication Over Time Method

The essence of this method is to repeatedly calculate the function f(x) with a time delay and compare the results.

Example: if the main causes of failures and errors under the System's operating conditions include electrical and electromagnetic disturbances, Duplication Over Time (optionally combined with the methods above) implements fault prevention, because a transient disturbance is unlikely to corrupt two calculations separated in time in the same way.

Figure 6. Model of Duplication Over Time method
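A minimal sketch of this method, assuming a deterministic f(x) whose repeated results should be identical. The name `compute_over_time` and the default repeat count and delay are hypothetical; suitable values depend on the System's operating conditions.

```python
import time


def compute_over_time(func, x, repeats=3, delay_s=0.01):
    """Evaluate func(x) several times with a delay and compare the results."""
    results = []
    for attempt in range(repeats):
        if attempt > 0:
            time.sleep(delay_s)  # separate the calculations in time
        results.append(func(x))
    if any(result != results[0] for result in results[1:]):
        raise RuntimeError(f"results differ between runs: {results!r}")
    return results[0]


if __name__ == "__main__":
    print(compute_over_time(lambda x: x * x, 7))
```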

The methods of duplication are not limited to those mentioned. The key consideration when selecting duplication methods and their combinations is to take into account the function of the module, the operating conditions of the System, and the potential causes and sources of failures and errors.

Component 3 - Critical Settings

Parameters and Elements Requiring Constant Monitoring

When developing a software product, EDC technology addresses the following important tasks of implementing the Integrated Reliability concept in the System:

Control of direct or indirect information received from the System on the operation of elements that are critical for performing the System function and/or for preventing man-made failures. This includes direct, indirect, or duplicate control of:

Boundary and allowable parameters (values) of variables obtained from the System;
Direct or indirect readings of the System sensors against the permissible values and/or deviations from the prescribed values.

Variable value data or sensor readings are stored in the Database as a matrix of feature descriptions of objects or numerical tables of values. In this case, all information received by the Software Product from the System is constantly monitored for compliance with the attribute and/or allowable values.

Figure 7. Initial data matrix

Figure 8. Matrix of absolute deviations
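The monitoring described above can be sketched as follows, assuming each row of the initial data matrix is one observation and each column is one monitored parameter with a prescribed value and an allowable deviation. This layout and the names `absolute_deviations` and `violations` are assumptions made for illustration, since the figures themselves are not reproduced here.

```python
def absolute_deviations(readings, prescribed):
    """Build the matrix |x_ij - p_j| from readings and prescribed values."""
    return [[abs(value - target) for value, target in zip(row, prescribed)]
            for row in readings]


def violations(readings, prescribed, allowed):
    """Return (row, column, deviation) for every deviation above its limit."""
    found = []
    deviations = absolute_deviations(readings, prescribed)
    for i, row in enumerate(deviations):
        for j, (deviation, limit) in enumerate(zip(row, allowed)):
            if deviation > limit:
                found.append((i, j, deviation))
    return found


if __name__ == "__main__":
    readings = [[20.0, 101.0], [26.0, 100.0]]   # hypothetical sensor data
    prescribed = [21.0, 100.0]                  # prescribed value per parameter
    allowed = [2.0, 1.5]                        # allowable deviation per parameter
    print(absolute_deviations(readings, prescribed))
    print(violations(readings, prescribed, allowed))
```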

Duplication of System control functions in the Software Product is an important element of Integrated Reliability.

Control of time parameters/responses and their limits for available elements of the System.
Mandatory confirmation for critical executive operations, i.e. operations that are not executed without confirmation.
Creation of a Statistical and Analyzing Module of the Software Product, which accumulates and systematizes statistical information on the System's operation in order to predict and prevent emergency situations or the loss of mandatory properties of the System.
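The first two items in this list, time limits and mandatory confirmation, can be sketched together. The example assumes the confirmation arrives as a boolean flag and that the response time is checked after the operation returns, so it detects an overrun but does not interrupt a hung operation. All names (`execute_critical`, `ConfirmationRequired`, `ResponseTimeout`) are hypothetical.

```python
import time


class ConfirmationRequired(Exception):
    """Raised when a critical operation is requested without confirmation."""


class ResponseTimeout(Exception):
    """Raised when an operation exceeds its allowed response time."""


def execute_critical(operation, confirmed, time_limit_s, clock=time.monotonic):
    """Run operation() only if confirmed, and verify its response time."""
    if not confirmed:
        raise ConfirmationRequired("critical operation was not confirmed")
    started = clock()
    result = operation()
    elapsed = clock() - started
    if elapsed > time_limit_s:
        raise ResponseTimeout(f"took {elapsed:.3f}s, limit is {time_limit_s:.3f}s")
    return result


if __name__ == "__main__":
    print(execute_critical(lambda: "valve closed", confirmed=True, time_limit_s=1.0))
```

The `clock` parameter exists only so the time source can be replaced in tests.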

Data from the Statistical and Analyzing Module are important for developing the System and creating its improved modifications, and for integrating and utilizing Artificial Intelligence to improve System performance.

Case Studies for EDC Technology

Reliable delivery for an online ticket marketplace, where delivery time is crucial, for Chicago-based ticket seller Score Tickets.
Reliable communication and storage components in unstable internet connectivity conditions, for leading Japanese hardware manufacturer Sodick.
Reliable transaction handling and autonomous operations for a widely-used POS system in Germany, developed by the German-based POS vendor iPOS (later rebranded as piOS).
Reliable communication with third-party services that may not always operate, for Texas-based e-learning company World Education.
Reliable alarm notifications for emergency and safety-related scenarios, for a Swiss-based watch vendor Aidwatch.
