Design of Experiments for Process Validation

Used as one of the statistical tools for validation, design of experiments can help identify which factors need to be controlled in order for a system or product to pass the ruggedness test.

Mark J. Anderson and Paul J. Anderson

Design of experiments (DOE) has become an essential tool for the validation of medical manufacturing processes. A good description of why this statistical technique should be used is the assertion that processes "should be challenged to discover how outputs change as variables fluctuate within allowable limits."1 As an example of the benefits that such a validation tool can provide, this article describes a DOE that was run on a particular durable medical device known as a paraffin heat-therapy bath.

Figure 1. One of the paraffin therapy bath test units, which holds a gallon of molten wax.

The Therabath paraffin therapy unit (WR Medical Electronics Co.; Stillwater, MN) holds one gallon of molten paraffin wax and is used by osteoarthritis patients during physical therapy (Figure 1). To help loosen their joints, patients dip their hands into the heated paraffin, which is then allowed to slowly solidify into a wax glove. Oils in the wax help keep the heat at a comfortable level, facilitate removal of the glove, and moisturize the skin. To enhance the perceived benefit to the skin, vitamin E and various scents and colors are added for this application.


Six factors, identified by letter, were tested at low and high levels: the ratio of two component waxes (A); the ratio of wax to oil (B); the supplier of wax (C); the amount of dye (D); the amount of perfume (E); and the amount of vitamin E (F). The amounts of vitamin E, dye, and perfume were very small in relation to the amounts of wax and oil.

If every combination of factors had been tested in a full two-level factorial design, a total of 64 experiments (2^6) would have been conducted. The investigators instead chose a highly fractionated (1/8), two-level factorial design, for a total of eight runs (2^(6-3)). All six factors were given ratings for every test run.
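To make the run count concrete, the following minimal sketch builds an eight-run 2^(6-3) design. The article does not state which generators were used, so the choice D = AB, E = AC, F = BC is an assumption (a common textbook choice), and the function name is hypothetical.

```python
from itertools import product

# Assumed generators (NOT given in the article): D = AB, E = AC, F = BC.
# Levels are coded -1 (low) and +1 (high).
def fractional_factorial_2_6_3():
    """Return the eight runs of a 2^(6-3) design as dicts of coded levels."""
    runs = []
    for a, b, c in product((-1, 1), repeat=3):  # full factorial in A, B, C
        runs.append({"A": a, "B": b, "C": c,
                     "D": a * b, "E": a * c, "F": b * c})
    return runs

design = fractional_factorial_2_6_3()
for number, run in enumerate(design, start=1):
    print(number, " ".join(f"{name}={level:+d}" for name, level in run.items()))
```

Each factor appears four times at its low level and four times at its high level, which is what allows each main effect to be estimated from only eight baths.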

Running a highly fractionated design has its drawbacks. Although the testing is completed faster, the amount of data generated is proportionately reduced as well. And while main effects may be obvious, they will be aliased (perfectly correlated) with interactions between two or more of the other factors included in the design.
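The aliasing described above can be checked directly: two effects are aliased when their contrast columns are identical across all runs. The sketch below again assumes the generators D = AB, E = AC, F = BC, which the article does not specify, and lists the two-factor interactions that share a column with each main effect.

```python
from itertools import combinations, product

FACTORS = "ABCDEF"

# Assumed generators (NOT given in the article): D = AB, E = AC, F = BC.
runs = [dict(zip(FACTORS, (a, b, c, a * b, a * c, b * c)))
        for a, b, c in product((-1, 1), repeat=3)]

def column(term):
    """Contrast column for a term such as 'A' or 'BC' (product of its levels)."""
    values = []
    for run in runs:
        value = 1
        for factor in term:
            value *= run[factor]
        values.append(value)
    return values

# A main effect is aliased with every two-factor interaction that has the same column.
aliases = {factor: [] for factor in FACTORS}
for factor in FACTORS:
    for pair in combinations(FACTORS, 2):
        term = "".join(pair)
        if column(term) == column(factor):
            aliases[factor].append(term)

for factor, terms in aliases.items():
    print(factor, "is aliased with", ", ".join(terms))
```

Under these assumed generators every main effect shares its column with two two-factor interactions, so the eight runs alone cannot tell them apart.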

Ten employees made up the subject panel. In order to reduce subject bias and to counteract reduced sensitivity to heat over time, the experiments were blind and in random order. Also, rather than dipping an entire hand into the paraffin, as patients would, subjects dipped only one finger into each bath and used a different finger for each test.

Following each dip, subjects noted their sensory evaluations of color, scent, heating, oiliness, and quality of the wax glove from one (worst) to nine (best). The results were analyzed by individual and then averaged. This method provides a powerful tool for discriminating changes in performance of the device, assuming that each individual's ratings are fairly consistent from one combination to the next. If needed, analysis could also reveal the relative differences between individuals.

To fulfill the purpose of validation, the experiment should reveal no significant change in response when any one of the factors is varied. Such an absence of significant effects would demonstrate the system's ruggedness. Conversely, a significant result (referred to as a failure in this context) usually requires more experimentation to reveal the true cause or causes.


Analysis of the DOE with a commercially available statistics package revealed significant effects on user perceptions, which means that the paraffin formula did not pass the ruggedness test. Figure 2 shows a half-normal probability plot for color. Half-normal plots show the absolute value of an effect on the x-axis as square points; estimates of error are displayed as triangles. The biggest effects, those to the right, are most likely to be real (significant). The effects grouped near the zero effect level presumably occur by chance and thus represent experimental error. The y-axis is constructed to be linear in the normal scale, so the near-zero (insignificant) effects fall on the line emanating from the origin (0, 0).
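As a rough illustration of how the points on a half-normal plot are obtained, the sketch below computes the seven effects available from eight runs and pairs each absolute effect with a half-normal quantile. The ratings are invented for illustration and are not the study's data; the generators D = AB, E = AC, F = BC and the plotting position (i - 0.5)/m are likewise assumptions, and the commercial package used by the authors may differ in detail.

```python
from itertools import product
from statistics import NormalDist, mean

# Hypothetical average color ratings for the eight baths (invented, not study data).
ratings = [6.2, 6.0, 3.8, 4.0, 3.9, 4.1, 6.1, 5.9]

# Contrast columns under the assumed generators D = AB, E = AC, F = BC.
base = list(product((-1, 1), repeat=3))
columns = {
    "A": [a for a, b, c in base],
    "B": [b for a, b, c in base],
    "C": [c for a, b, c in base],
    "D": [a * b for a, b, c in base],
    "E": [a * c for a, b, c in base],
    "F": [b * c for a, b, c in base],
    "ABC": [a * b * c for a, b, c in base],  # leftover column: interactions only
}

def effect(signs, y):
    """Average response at the high level minus average at the low level."""
    high = mean(value for sign, value in zip(signs, y) if sign > 0)
    low = mean(value for sign, value in zip(signs, y) if sign < 0)
    return high - low

effects = {name: effect(signs, ratings) for name, signs in columns.items()}

# Rank absolute effects and attach half-normal quantiles (the plot's y-axis).
ranked = sorted(effects.items(), key=lambda item: abs(item[1]))
count = len(ranked)
half_normal = []
for rank, (name, value) in enumerate(ranked, start=1):
    probability = (rank - 0.5) / count
    quantile = NormalDist().inv_cdf(0.5 + 0.5 * probability)
    half_normal.append((name, abs(value), quantile))

for name, size, quantile in half_normal:
    print(f"{name:>3}  |effect| = {size:.2f}  half-normal quantile = {quantile:.2f}")
```

With these made-up ratings the dye column produces by far the largest absolute effect, so it would sit alone at the right of the plot while the remaining effects cluster near zero.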

Figure 2. A half-normal probability plot for color shows that 99% of the effects are expected to be less than 1.6. Effects below this level are lined up in normal fashion, but the change to color registered an abnormally large effect of 2.39.

One effect stands out in Figure 2: D, the level of dye. Standard statistical analysis of variance (ANOVA) reveals a probability of less than 0.1% that an effect this large could have been caused by chance. Although there may have been aliased interactions, we made the assumption that the responses for color would be affected only by the amount of dye (D). Obviously, the panel preferred higher levels of dye.

Figure 3. A half-normal probability plot for scent shows one significant value, although not as big an effect as the one for color (0.86 versus 2.39).

For scent, factor E—the amount of perfume—was the most important effect (Figure 3), but it did not stand out as strongly as did color. Also, a standard deviation value of 4.09 from bath 1 signals an unusual occurrence (outlier). When plotted on a graph that displays the t-value of each run—how many standard deviations apart a result is from what was expected—this outlier falls outside the recommended limits of ±3.5 (Figure 4).

Figure 4. A graph of t-values that shows an outlier detected in the scent category.

Figure 4 resembles a control chart, with the 4.09 value falling outside the upper control limit. (If graphed on a bell-shaped curve, commonly used in statistics, the value would be near the positive value end where the curve has flattened out.) Outcomes within the upper and lower control limits (or standard deviation boundaries) represent common cause variability; those values outside the limits are likely due to special causes.
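One way to arrive at per-run t-values of this kind is the externally studentized residual, in which each run is compared with a fit whose error estimate excludes that run. The article does not say how its software computed the values, so this is an assumption, and the scent ratings below are invented for illustration. The sketch fits a model with only the perfume effect and flags runs beyond the ±3.5 limits.

```python
from math import sqrt
from statistics import mean

# Coded perfume level per bath (assumed generator E = AC) and
# hypothetical average scent ratings (invented, not study data).
perfume = [+1, -1, +1, -1, -1, +1, -1, +1]
scent = [6.6, 4.6, 5.4, 4.5, 4.7, 5.3, 4.6, 5.5]
LIMIT = 3.5  # recommended limits quoted in the article

def outlier_t_values(signs, y):
    """Externally studentized residuals for a model with a mean and one factor."""
    n = len(y)
    terms = 2  # intercept plus one main effect
    fitted = {level: mean(v for s, v in zip(signs, y) if s == level)
              for level in (-1, 1)}
    residuals = [v - fitted[s] for s, v in zip(signs, y)]
    leverage = terms / n  # identical for every run in a balanced two-level design
    sse = sum(e * e for e in residuals)
    dof = n - terms
    t_values = []
    for e in residuals:
        # Error variance re-estimated with this run left out.
        variance_without = (sse - e * e / (1 - leverage)) / (dof - 1)
        t_values.append(e / sqrt(variance_without * (1 - leverage)))
    return t_values

t_values = outlier_t_values(perfume, scent)
flagged = [bath for bath, t in enumerate(t_values, start=1) if abs(t) > LIMIT]
print("baths outside the limits:", flagged)
```

With these invented ratings only bath 1 falls outside the limits, which mirrors the pattern in Figure 4 without reproducing its numbers.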

The result from bath 1 was assumed to be an outlier, since a deviation this large is unlikely to have happened by chance. Further investigation revealed that the temperature in bath 1 was significantly high, generating more scent than usual.

After removing the outlier, perfume stood out even more clearly as the most likely cause of the effect on scent. ANOVA shows the probability of this effect happening by chance to be less than 1%. Again, it seems reasonable to make a leap of faith that factor E (perfume) was the cause for perceived changes in scent and not any aliased interactions. In other words, the panel preferred higher levels of perfume in the formula.

Figure 5. A half-normal probability plot shows there were no significant effects for the perception of heat.

Figure 6. A half-normal plot of effects for oiliness shows that a main effect and an interaction provided significant results. However, due to aliasing caused by fractionation of the factorial, further experimentation would be necessary to accurately determine the actual cause.

The statistical analysis revealed nothing significant for perception of heat (Figure 5). Perceptions of oiliness were significantly affected (Figure 6), but the aliasing of main effects made it impossible to draw any definite conclusions. For example, it makes no sense that factor E, the dye, would affect oiliness, but one of its aliased effects might. Unlike the results for color and scent, there was no obvious explanation for these effects. At this stage, the capability of the low-resolution design was exhausted. More experimentation was needed to uncover the true causes for the failure of the ruggedness test.


Some aliasing of main effects can be eliminated by adding a second block of experiments with all variable levels reversed (for example, high versus low amounts).2 This technique is called a foldover. The combined results normally remain somewhat aliased.

Before doing the foldover, dye (D) and perfume (E) were eliminated as factors based on the assumption that they affected only color and scent, respectively. Dye and perfume were set at their midpoint levels, and color and scent were dropped from further consideration. The addition of the eight foldover runs resulted in a full (i.e., no aliases) 16-run factorial for the remaining four factors, which avoided further aliasing.
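The claim that the combined runs form a full factorial in the four remaining factors can be verified with a short sketch. It assumes, as before, the generators D = AB, E = AC, F = BC for the first block, which the article does not state. The foldover block reverses every level; D and E are then ignored, because in the study they were held at their midpoints.

```python
from itertools import product

FACTORS = "ABCDEF"

# First block: assumed generators (NOT given in the article) D = AB, E = AC, F = BC.
block1 = [dict(zip(FACTORS, (a, b, c, a * b, a * c, b * c)))
          for a, b, c in product((-1, 1), repeat=3)]

# Foldover block: every level reversed.
block2 = [{name: -level for name, level in run.items()} for run in block1]

# Keep only the four factors still under study after dropping dye (D) and perfume (E).
remaining = "ABCF"
combined = {tuple(run[name] for name in remaining) for run in block1 + block2}
full_factorial = set(product((-1, 1), repeat=len(remaining)))

print("distinct combinations:", len(combined))
print("covers full 2^4 factorial:", combined == full_factorial)
```

In the first block F always equals BC, while in the reversed block F equals the negative of BC, so together the two blocks supply every combination of A, B, C, and F exactly once.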

Analysis of the combined data continued to show no significant impact on perceptions of heat. This was an important finding because, prior to the DOE, the manufacturer was concerned that users would be sensitive to variations in melt point caused by changes in the ratio of wax and oil. For this attribute, the formula passed the challenge of validation, since it was robust to expected variations.

Figure 7. A plot of combined results shows that effects for the glove are not significant, meaning that this attribute passed the ruggedness test.

Figure 8. A plot of effects for oiliness using combined data shows an unusual three-factor effect on perception (wax to wax ratio, wax to oil ratio, and amount of vitamin E).

Regarding the perception of the quality of the wax glove, the first experiment seemed to indicate some effects, but after the data were reviewed for the entire series of runs, including the foldover, none of the factors was shown to significantly affect user perceptions (Figure 7). Therefore, this attribute also passed the ruggedness test.

Figure 9. Three interaction graphs show the wax/oil relationship at low, medium, and high levels of vitamin E (left to right, respectively). Squares represent the high (+) level of the wax-to-oil ratio; triangles represent the low (-) level.

The final results on perception of oiliness (Figure 8) indicate a dependence on the combination of three factors: the ratio of the two component waxes, W1 and W2 (A), the ratio of total wax to oil (B), and the amount of vitamin E (labeled D in the plot). Although such three-factor interactions are very unusual, they are more likely in experiments that involve mixtures. The series of interaction graphs shown in Figure 9 demonstrates the complex behavior governing the perception of oiliness. In order to provide the best formula of paraffin wax, the highest-rated factor settings, as determined by the results of the experiment, should be combined in the product. This combination is most readily identified using a cube plot (Figure 10).

Figure 10. The cube plot shows the best combination (upper right front) for the three factors that affect oiliness. The determining factor is the highest numerical value, not the plus or minus symbols.


Based on the results from the two-step DOE, several product recommendations were made in order to arrive at a cheaper, improved paraffin blend.

  • The cheapest supply of raw wax material can be used, as this variable did not significantly affect any of the tested perceptions.

  • More color and scent should be added, which may also help mask the variability of native colors and scents.

  • The amount of vitamin E should be reduced, and the ratios of W1 to W2 wax and of wax to oil should be increased.


This study provides an example of how to apply a two-level factorial DOE to validation testing, and demonstrates the flexibility of the approach should the validation fail. In this application, the use of foldover runs offers an insight into how variations in factors can affect processes or products.

DOE is just one of the statistical tools used in validation to challenge a system and identify which factors to control. Other tools, such as statistical process control, should also be employed to show that the system can produce consistent outputs over time and meet specifications with a high level of confidence and reliability.


The authors would like to thank Dave Sletten of WR Medical (Stillwater, MN) for doing the experimental work and Patrick Whitcomb of Stat-Ease (Minneapolis) for providing valuable advice on the setup and analysis of the DOE.


1. JS Kim and JW Kalb, "Design of Experiments: An Overview and Application Example," Medical Device & Diagnostic Industry 18, no. 3 (1996): 78–88.

2. DC Montgomery, Design and Analysis of Experiments, 4th ed. (New York: Wiley, 1997), 413.

Mark J. Anderson is a principal of Stat-Ease Inc. (Minneapolis) and WR Medical Electronics Co. (Stillwater, MN), and Paul J. Anderson is vice president of research and development and a principal of WR Medical Electronics Co.
