Risk-Based Validation of Commercial Off-the-Shelf Computer Systems 4

Risk management overview
Risk management of a commercial computer system starts when the system is specified and purchased, continues with installation and operation, and ends when the system is taken out of service and all critical data have been successfully migrated to a new system.

Figure 2: Risk management (used with permission).
The approach we take is to divide risk management into four phases, as illustrated in Figure 2. The phases include risk analysis, risk evaluation and assessment, and ongoing evaluation and control.
Risk analysis. Define computer system components and software functions. Identify potential hazards and harms using inputs from system specifications, system administrators, system users, and audit reports.
Risk evaluation and assessment. Define the severity, probability, and risk of each hazard, for example, by using past experience from the same or similar systems. Determine acceptable levels of risk and identify the hazards that would need mitigation to reach those levels. Identify and implement steps to mitigate risks.
Ongoing (re)evaluation and control. On an ongoing basis, evaluate the system for new hazards and changes in risk levels. Adjust the risk and mitigation strategy as necessary.
These activities should follow a risk management plan and the results should be documented in a risk management report.
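As a rough illustration of the evaluation and assessment phase, the sketch below scores each hazard and flags those above an acceptable level. The 1-to-3 scales, the use of severity times probability as the risk score, the threshold of 4, and all names are assumptions made for this example; the approach described above does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class Hazard:
    description: str
    severity: int      # assumed scale: 1 (low) to 3 (high)
    probability: int   # assumed scale: 1 (low) to 3 (high)

    @property
    def risk(self) -> int:
        # Assumed scoring rule: risk is the product of severity and probability.
        return self.severity * self.probability


def needs_mitigation(hazards, acceptable_risk=4):
    """Return the hazards whose risk score exceeds the assumed acceptable level."""
    return [h for h in hazards if h.risk > acceptable_risk]


hazards = [
    Hazard("Server hard drive failure", severity=3, probability=2),
    Hazard("System overload slows operations", severity=1, probability=2),
]

for h in needs_mitigation(hazards):
    print(f"{h.description}: risk {h.risk}")
```

With these assumed values, only the hard drive failure exceeds the threshold and would be carried forward for mitigation. Rerunning the same check with updated severity or probability values is one way to support the ongoing (re)evaluation step.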
Risk analysis
The first step in the risk management process is the risk analysis, sometimes called risk identification or Preliminary Hazard Analysis (PHA). The output of this phase is the input for risk evaluation. Inputs for risk analysis include:
  • specifications of equipment including hardware and software;
  • user experience with the same system already installed;
  • user experience with similar systems;
  • IT staff experience with the same or similar network equipment;
  • experience with the vendor of the system;
  • failure rates of the same or similar system (mean time between failures) and resulting system downtime;
  • trends of failures;
  • service records and trends;
  • internal and external audit results.

Inputs, for example, can come from operators, the validation group, Information Technology (IT) administrators, or from Quality Assurance (QA) personnel as the result of findings from internal or external audits.

Table I: Template for the identification of risks.
The project manager collects input on potential hazards, including possible harm. For consistent and complete documentation, forms should be used. The forms should have entry fields for relevant data on the individual who made the entry, the risk description, possible hazards and harms, probability of occurrence, and possible methods of mitigation. An example is shown in Table I. Occasional problems and harms with computer systems include, but are not limited to, the following:
  • Hard drive failure on local personal computers (PCs) or on the server computer can cause severe system downtime and loss of data.
  • Loss of network connectivity due to hardware failure, for example, of the network interface card, can cause system downtime.
  • System overloads can cause a slow-down of operations and system downtime.
  • Inadequate vendor qualification or absent specifications on vendor support in purchasing agreements can result in reduced uptime because support is missing when hardware, firmware, or software problems occur.
  • Inadequate or absent documentation of installation can make it difficult to diagnose a problem.
  • Inadequate or absent verification of security access functions can result in unauthorized access to the system.
  • An insufficient or absent plan for system backup can result in data loss in case of system failure.
  • Poor or absent documentation of hardware and software changes can make it difficult to diagnose a problem.
  • Inadequate quality assurance policies and procedures or inadequate reviews can lead to poor system quality.
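
The identification form described above could also be kept as a simple electronic record. The following is a minimal sketch, assuming one free-text field per entry field named in the text; the class name, field names, and the completeness check are hypothetical and are not taken from Table I.

```python
from dataclasses import dataclass, asdict


@dataclass
class RiskEntry:
    # One field per entry field listed in the text; the names are hypothetical.
    entered_by: str
    risk_description: str
    hazards_and_harms: str
    probability: str       # free text here, e.g. "low", "medium", "high"
    mitigation: str

    def missing_fields(self):
        """Return the names of fields left blank, so incomplete forms can be sent back."""
        return [name for name, value in asdict(self).items() if not value.strip()]


entry = RiskEntry(
    entered_by="System administrator (hypothetical)",
    risk_description="Server hard drive failure",
    hazards_and_harms="System downtime and loss of data",
    probability="",
    mitigation="Plan for system backup",
)

print(entry.missing_fields())  # ['probability']
```

Checking for blank fields at entry time is one way to support the consistent and complete documentation that the forms are meant to provide.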
