# September 28, 2009

## Does van Wijk et al. 2006 really prove the memory of heavy water?

In a word, no. Quite the contrary, as we shall see below.

This post was inspired by apgaylard's epic fisking of a paper by Lionel Milgrom (Milgrom 2009), in which Milgrom discusses the scientific basis of homoeopathy. While perusing Part 3, one of the papers apgaylard discussed caught my eye: van Wijk R, Bosman S, van Wijk EP. (2006) Thermoluminescence in ultra-high dilution research. apgaylard judges it to be one of the "better" papers; it contains a statistical analysis of experimental results. After I left a comment on his blog, apgaylard very kindly sent me the paper itself. I set to work, looking at the statistical analysis and reanalysing the complete data set, which the authors kindly provide in their Table 1. What follows are comments from a statistical point of view – they are independent of the method of thermoluminescence itself. I am not a statistician, but I have an interest in statistical analysis and experimental design.

Van Wijk et al. report some promising p-values, including one that is borderline (p=0.056) and several that are striking (p=0.0001). But what is the effect behind the p-values? After all, significance levels are essentially arbitrary guide posts. With enough data, any relationship can become significant; with too little data, significant relationships can be obscured. Typically, to ensure replicability, researchers specify not only p-values, but also the test statistics from which those p-values were computed. Journals also often require authors to report the size of the observed effect relative to the variation in the data.

For quantities of interest, such as the difference between the emissions of the second thermoluminescence peak for two substances, authors also often specify confidence intervals. This takes into account that the experimental values are based on just one of many possible samples. The difference measured in the experiment is unlikely to be the "true" underlying difference, but we can get an idea of the range of plausible values for that difference. This is the confidence interval. A 95% confidence interval is constructed so that, if the experiment were repeated many times, 95% of the intervals computed this way would contain the actual difference.
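To make these two notions concrete, here is a minimal sketch in Python with scipy (rather than the R used for the reanalysis below) of an effect size and a 95% confidence interval for a difference of two means. The data are simulated stand-ins, not the paper's measurements.

```python
# Illustration only: Cohen's d and a 95% confidence interval for the
# difference between two group means. The data are simulated and merely
# stand in for hypothetical peak2 counts; they are not the paper's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(60, 50, size=9)   # hypothetical group A
b = rng.normal(30, 50, size=9)   # hypothetical group B

diff = a.mean() - b.mean()

# Pooled standard deviation, used both for Cohen's d and for the CI
dof = len(a) + len(b) - 2
sp = np.sqrt(((len(a) - 1) * a.var(ddof=1)
              + (len(b) - 1) * b.var(ddof=1)) / dof)

d = diff / sp                    # effect size in standard-deviation units

# 95% CI for the difference of means, based on Student's t-distribution
se = sp * np.sqrt(1 / len(a) + 1 / len(b))
lo, hi = stats.t.interval(0.95, dof, loc=diff, scale=se)
print(f"Cohen's d = {d:.2f}, 95% CI for the difference: [{lo:.1f}, {hi:.1f}]")
```

With only nine samples per group and a standard deviation of the same order as the mean, the interval comes out wide – exactly the situation in the paper's data.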

## The Experiment

The authors compare the behaviour of three substances, D2O, C15 D2O, and C15 LiCl, in three experimental conditions, which differ in the time that elapses between various stages of the experiment. In condition A, the delays are shortest; in condition B, they are longer; and in condition C, they are longest.

D2O is the untreated carrier substance, deuterium oxide, C15 D2O is deuterium oxide which has been diluted 15 times in deuterium oxide. C15 LiCl comes from a solution of 0.1 g lithium chloride (LiCl) in 9.9 ml of deuterium oxide. The LiCl solution was diluted 15 times. In each dilution step, the experimenters took 1 part of the original solution, added it to 99 parts deuterium oxide, and succussed it mechanically 100 times. Although the C15 LiCl solution is unlikely to contain any LiCl molecules, homoeopathic theory assumes that the substance retains a memory of the LiCl.

If the authors had decided to include the original LiCl solution that formed the basis of C15 LiCl in their study, we would have had a nice 2 x 2 pattern of two factors, succussion, and presence / memory of LiCl:

- D2O: no presence or memory of LiCl, no succussion
- C15 D2O: no presence or memory of LiCl, succussion
- LiCl: presence of LiCl, no succussion
- C15 LiCl: memory of LiCl, succussion

(The authors acknowledge this issue in their discussion section.)

Since the LiCl condition is missing, the experiment becomes somewhat tricky to analyse. Ideally, D2O and C15 D2O should show similar thermoluminescence peaks, while the peaks of C15 LiCl should differ significantly from either. This would support the contention that there is something qualitatively different about C15 LiCl, namely a memory of lithium chloride. In order to establish this, we need pairwise analyses. This is an important point; we will return to it later.

So, we have three substances and three sets of experimental delays, which gives us a 3 x 3 design. The authors also intend to look at the interaction between substances and delays. This is a complex design. If we assume that the difference between peaks is at least equal to one standard deviation, then three samples per combination of substance and experimental delay should be enough to achieve a power of 90% for a significance level of 0.05, giving us a total of 3 x 3 x 3 = 27 samples. (Never mind that for something like homoeopathy, which runs counter to the laws of physics, the significance level should be set to something like 0.001 or better. This could have been achieved by doubling the number of samples.) In their experiment, the authors use 27 samples.

There are two target variables, peak2, the number of photons (in millions) emitted during the second thermoluminescence peak, and p2/p1, the ratio between the first and the second peaks, which normalises peak height per sample and controls for overall levels of photon emissions.

## The Analysis

So, what do the numbers say? I entered all the data into a spreadsheet and imported it into the free statistical analysis software R. In the following discussion, I will give the R code and its output. All functions are in the R basic stats package. None of the output is edited unless otherwise stated – WYSIWYG.

Here's the summary of the distribution of peak2 and p2p1:

> summary(peak2)

Min. 1st Qu. Median Mean 3rd Qu. Max.

8.55 26.09 46.40 60.98 67.54 183.70

Standard deviation:

> sqrt(var(peak2))

[1] 50.13459

> summary(p2p1)

Min. 1st Qu. Median Mean 3rd Qu. Max.

2.250 4.565 9.180 11.480 15.070 49.630

Standard deviation:

> sqrt(var(p2p1))

[1] 10.00678

A quick glossary of the summary output:

- Min. and Max. are the minimum and maximum.
- 1st Qu. is the first quartile, the value below which 25% of the data lie.
- 3rd Qu. is the third quartile, the value above which 25% of the data lie.
- Median is the middle data point when all data points are sorted by value.
- Mean is the arithmetic mean.

The high maxima strongly suggest that the data contain outliers, i.e. highly irregular data points that should be excluded from the analysis. The authors insist that these are not due to experimental error, so more data are needed either to detect outliers reliably or to model the substantial intrinsic variation in the results. The standard deviation is almost as large as the mean for both variables.
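One common rule of thumb for flagging outliers is Tukey's fences: any point more than 1.5 times the interquartile range above the third quartile is suspect. Applied to the peak2 summary statistics quoted above (a quick sketch in Python; the quartiles are the ones from the summary, not recomputed from raw data), it marks the maximum as a clear outlier:

```python
# Tukey's 1.5 * IQR rule applied to the quoted peak2 summary statistics
q1, q3 = 26.09, 67.54            # 1st and 3rd quartiles of peak2 (from the summary)
iqr = q3 - q1                    # interquartile range
upper_fence = q3 + 1.5 * iqr     # points above this are flagged as outliers
print(upper_fence)               # about 129.7; the observed maximum of 183.7 exceeds it
```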

Neither target variable is normally distributed, as the Shapiro-Wilk test shows (low p-values indicate that it is highly unlikely that the underlying variables follow a normal distribution):

> shapiro.test(peak2)

Shapiro-Wilk normality test

data: peak2

W = 0.8214, p-value = 0.0003288

> shapiro.test(p2p1)

Shapiro-Wilk normality test

data: p2p1

W = 0.7643, p-value = 3.503e-05


A normal distribution is highly desirable, since this means that the data can be described by a Gaussian bell curve, which is well understood and easy to model. In order to achieve a normal distribution, the authors transform both variables by taking the logarithm. This is completely acceptable and common practice, as long as the transformation is documented and taken into account when interpreting the results. The resulting variables are now normally distributed:

> shapiro.test(log(peak2))

Shapiro-Wilk normality test

data: log(peak2)

W = 0.972, p-value = 0.6557

> shapiro.test(log(p2p1))

Shapiro-Wilk normality test

data: log(p2p1)

W = 0.968, p-value = 0.55

Now, to the pièce de résistance, the actual statistical analysis. The authors state that they used a two-way ANOVA, which is entirely appropriate. But their terminology is strangely muddled. They say on page 11 of the PDF that they "fitted a two-way analysis of variances [sic] (ANOVAs [sic]) to the data …". One normally fits regression models; ANOVA, or analysis of variance, can then be used to compare the amount of variation in the data explained by each factor. In the next paragraph, they start talking about a strange hybrid beast called an ANOVA t-test. ANOVA seeks to determine the amount of variation in a given data set that can be explained by the variation in a factor, while the t-test compares the means of two normally distributed variables. As a result of these conceptual differences, an ANOVA and a t-test use different statistics for computing statistical significance (ANOVA: typically the F-statistic; t-test: a statistic that follows Student's t-distribution).
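The two tests are not unrelated, though: in the special case of exactly two groups, a one-way ANOVA and an equal-variance two-sample t-test are equivalent, with F = t². A small sketch in Python with scipy (simulated data, for illustration only):

```python
# With two groups, one-way ANOVA and the pooled-variance t-test agree: F == t**2
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=9)   # simulated group 1
y = rng.normal(0.5, 1.0, size=9)   # simulated group 2

t, p_t = stats.ttest_ind(x, y)     # Student's t-test (equal variances assumed)
f, p_f = stats.f_oneway(x, y)      # one-way ANOVA on the same two groups

print(np.isclose(f, t**2), np.isclose(p_t, p_f))   # both True
```

This is why an "ANOVA t-test" is a confusing label: the two procedures coincide only in the two-group case, and even then they report different statistics.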

Normally, psychologists and quantitative social scientists, who can be very particular about these things, require authors to give not just the p-value, but also the associated statistic. Van Wijk et al. present only the p-value for selected pairwise comparisons, which is far from the complete report usually required in the journals I frequent. So without further ado, here are the ANOVAs:

> peak2.aov <- aov(log(peak2) ~ experiment * substance,data=dat)

> summary(peak2.aov)

Df Sum Sq Mean Sq F value Pr(>F)

experiment 2 2.0142 1.0071 3.6186 0.047761 *

substance 2 4.4287 2.2144 7.9564 0.003343 **

experiment:substance 4 6.1870 1.5467 5.5576 0.004292 **

Residuals 18 5.0096 0.2783

—

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

> p2p1.aov <- aov(log(p2p1) ~ experiment * substance,data=dat)

> summary(p2p1.aov)

Df Sum Sq Mean Sq F value Pr(>F)

experiment 2 5.0236 2.5118 12.0894 0.0004695 ***

substance 2 2.6455 1.3228 6.3665 0.0081097 **

experiment:substance 4 4.3319 1.0830 5.2124 0.0057364 **

Residuals 18 3.7398 0.2078

—

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

"experiment" stands for the three experimental conditions, "substance" for the underlying substance (D2O, C15 D2O, C15 LiCl), and "experiment:substance" for the interaction of the two. Both factors and their interaction are significant, hooray! But what does this mean? In order to determine this, we need to find out which experimental condition was different from the others, and which substances performed differently. In other words, we need to look for significant differences between factor values.

Now, I suspect that the authors used pairwise significance tests to check for significant differences. However, pairwise tests can inflate the apparent significance, since they fail to correct for the fact that we are making multiple comparisons. Basically, the more tests we run, the more likely some are to come out significant by chance, and we want to counter this bias. There are several well-established methods for doing this. Here, I'll use Tukey's Honest Significant Differences test. As implemented in R, it gives us not just the significance level of each difference, but also the 95% confidence interval for the actual difference. I've edited out those differences in the experiment:substance factor that compare substances across different experimental conditions – none of them were significant. In the following, "diff" stands for the difference observed in the data, "lwr" for the lower bound of the confidence interval, and "upr" for the upper bound. All observed differences are between logarithms of counts or ratios, not between the counts or ratios themselves. Significant comparisons (at p<0.05) are bolded and rough overall significance levels have been added.
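The size of this bias is easy to quantify: with m independent tests at per-test level α, the chance of at least one false positive is 1 − (1 − α)^m. A back-of-the-envelope sketch in Python (the figure of nine tests is just an illustration, roughly the number of pairwise comparisons per block here):

```python
# Familywise error rate for m independent tests at per-test level alpha,
# and the Bonferroni-corrected per-test threshold
alpha, m = 0.05, 9
familywise = 1 - (1 - alpha) ** m     # P(at least one false positive)
bonferroni = alpha / m                # corrected per-test significance level
print(round(familywise, 3), round(bonferroni, 4))   # 0.37 0.0056
```

So nine uncorrected tests at the 0.05 level have roughly a one-in-three chance of producing at least one spurious "significant" result; Tukey's HSD controls this familywise rate in a less conservative way than Bonferroni.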

> TukeyHSD(peak2.aov)

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = log(peak2) ~ experiment * substance, data = dat)

$experiment

diff lwr upr p adj

**b-a -0.6423134 -1.2770138 -0.007613058 0.0470476 (p<0.05)**

c-a -0.1590533 -0.7937536 0.475647080 0.8004902

c-b 0.4832601 -0.1514402 1.117960500 0.1554948

$substance

diff lwr upr p adj

c15licl-c15d2o -0.4113469 -1.046047 0.22335343 0.2496580

**d2o-c15d2o -0.9874758 -1.622176 -0.35277546 0.0024556 (p<0.005)**

d2o-c15licl -0.5761289 -1.210829 0.05857147 0.0789938

$`experiment:substance`

diff lwr upr p adj

a:c15licl-a:c15d2o -0.867528277 -2.3768018 0.64174520 0.5533167

b:c15licl-b:c15d2o -0.156263216 -1.6655367 1.35301026 0.9999856

c:c15licl-c:c15d2o -0.210249307 -1.7195228 1.29902417 0.9998634

a:d2o-a:c15d2o -0.143477910 -1.6527514 1.36579557 0.9999926

b:d2o-b:c15d2o -0.642910927 -2.1521844 0.86636255 0.8450816

**c:d2o-c:c15d2o -2.176038628 -3.6853121 -0.66676515 0.0021067 (p<0.005)**

a:d2o-a:c15licl 0.724050366 -0.7852231 2.23332384 0.7500002

b:d2o-b:c15licl -0.486647711 -1.9959212 1.02262577 0.9613905

**c:d2o-c:c15licl -1.965789321 -3.4750628 -0.45651584 0.0057900 (p<0.01)**

> TukeyHSD(p2p1.aov)

Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = log(p2p1) ~ experiment * substance, data = dat)

$experiment

diff lwr upr p adj

**b-a -1.0534873 -1.60187954 -0.5050951 0.0003219 (p < 0.0005)**

**c-a -0.5966675 -1.14505978 -0.0482753 0.0317790 (p < 0.05)**

c-b 0.4568198 -0.09157249 1.0052120 0.1125449

$substance

diff lwr upr p adj

c15licl-c15d2o -0.4531218 -1.0015140 0.09527044 0.1161054

**d2o-c15d2o -0.7622188 -1.3106111 -0.21382660 0.0061931 (p < 0.01)**

d2o-c15licl -0.3090970 -0.8574893 0.23929520 0.3430453

$`experiment:substance`

diff lwr upr p adj

a:c15licl-a:c15d2o -1.0284670 -2.33250571 0.27557179 0.1936033

b:c15licl-b:c15d2o -0.1983622 -1.50240092 1.10567658 0.9997378

c:c15licl-c:c15d2o -0.1325363 -1.43657501 1.17150249 0.9999876

a:d2o-a:c15d2o -0.2003015 -1.50434022 1.10373728 0.9997183

b:d2o-b:c15d2o -0.4641427 -1.76818142 0.83989608 0.9342452

**c:d2o-c:c15d2o -1.6222124 -2.92625113 -0.31817363 0.0088512 (p < 0.01)**

a:d2o-a:c15licl 0.8281655 -0.47587326 2.13220424 0.4308325

b:d2o-b:c15licl -0.2657805 -1.56981925 1.03825825 0.9978724

**c:d2o-c:c15licl -1.4896761 -2.79371487 -0.18563737 0.0184103 (p < 0.05)**

There are three blocks in each output. The first block tells us whether there are significant differences between each pair of the experimental conditions, the second block tells us whether there are significant differences between each pair of substances, and the third block tells us whether there are significant differences between substances in each of the three experimental conditions.
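For readers who want to reproduce this kind of table outside R: statsmodels provides pairwise_tukeyhsd, which reports the same diff, lower/upper confidence bounds, and adjusted p-value columns. A minimal sketch on simulated data (the group labels mimic the paper's substances, but the values are invented):

```python
# Tukey's HSD outside R: statsmodels' pairwise_tukeyhsd on simulated data.
# Group labels mimic the paper's substances; the values are made up.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
values = np.concatenate([rng.normal(m, 0.5, size=9) for m in (4.0, 3.9, 3.2)])
groups = ["c15d2o"] * 9 + ["c15licl"] * 9 + ["d2o"] * 9

result = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(result)   # table of pairwise diffs, confidence bounds, and reject flags
```

Each row of the printed table corresponds to one pairwise comparison, just like the $substance block in the R output.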

Looking at this analysis, it is clear where the significant differences come from:

- Peaks from experimental condition B are lower than peaks from experimental condition A.
- Peaks from non-succussed D2O are lower than peaks from succussed D2O.
- In experimental condition C, D2O peaks are lower than peaks from either of the succussed substances, but not in conditions A and B.

Van Wijk and coauthors report the last pattern, but they overestimate the significance of results from experimental condition A.

The three experimental conditions A, B, and C differ in the delays introduced into the experimental paradigm. The delays are shortest in A, longer in B, and longest in C. If there were a straightforward effect of the time delays, then differences between B and A should entail differences between C and A as well, but this is not what happens.

There is no difference whatsoever between C15 LiCl, which is supposed to have the memory of LiCl imprinted on it, and C15 D2O – both in the overall analysis and in the analysis of each experimental condition. This flatly contradicts the predictions of homoeopathic theory.

In their paper, the authors show a graph (Fig. 3a) where the outcome for experimental condition A is as desired. The thermoluminescence curve of C15 LiCl is well below the curves for D2O and C15 D2O, and the curves of the latter two substances overlap substantially. However, this graph represents means of the relevant curves, and does not show the substantial variation in results which is evident in their table. Therefore, I seriously doubt that the actual qualitative patterns are as neat as the authors suggest.

## The Conclusion

So, in conclusion, the results of van Wijk et al. 2006 thoroughly refute the assumption that C15 LiCl has significantly different thermoluminescence patterns from a comparable D2O solution. It is not possible to cite this paper as support for homoeopathy. Not only did it fail to show the results expected by homoeopathic theory, it is also seriously underpowered for the medium-sized effects seen in the data. The statistical analysis and reporting do not follow the standards that I am familiar with from experimental phonetics, linguistics, and psychology. But since the authors very thoughtfully provided their data, reanalysis was possible. This reanalysis broadly confirms most of the findings reported in the original article, albeit at considerably weaker significance levels.

As the authors suggest, the experiment needs to be replicated with two to three times the number of samples in each cell before we can draw firm conclusions about the significant patterns we do see. Therefore, this experiment is by no means the definitive replication and confirmation of Rey 2003. Instead, it has all the characteristics of a pilot study. It is quite common to perform a pilot study first in order to get an idea of the size of the expected effect. Once the effect size is known, the experimenters can calculate the number of samples needed to assess significance. But even if the experiment is replicated with 6, 9, or even 12 samples in each condition, the confidence intervals suggest that no significant memory of LiCl will be found.

## Edits

Any corrections are more than welcome. This is a fairly vanilla statistical analysis, and I’m sure it could be refined further.

No post-publication edits yet.

## References

Milgrom LR. Under Pressure: Homeopathy UK and Its Detractors. *Forsch Komplementmed*. 2009 September;16(4):256–261. Available from: http://dx.doi.org/10.1159/000228916

van Wijk R, Bosman S, van Wijk EP. Thermoluminescence in ultra-high dilution research. *Journal of Alternative and Complementary Medicine*. 2006 June;12(5):437–443. Available from: http://dx.doi.org/10.1089/acm.2006.12.437

For statistics info, see (apart from Wikipedia) Statsoft’s online statistics textbook, one of the best resources of its kind.
