Langmuir’s second test was stated as:

The effect is of a magnitude that remains close to the limit of detectability; or, many measurements are necessary because of the very low statistical significance of the results.

The first part of this statement tracks closely with the first test – the effect, or output, is barely measurable. This can be seen as a difference between a baseline, or control, and the effect. The question is one of having the ability to discern between signal and noise. Statistical analysis of the data is required to tell if indeed a signal exists or if the data is simply noise.

Statisticians love to talk about significance. But what makes a result statistically significant? In essence, significance is like a bar one must jump over: the lower the bar, the less it means to clear it. If the bar is set at knee height, it is relatively simple to step over; if the bar is over your head and you clear it, that is a significant result.

We attach probability values to the level of significance. If a random attempt clears the bar 50% of the time, clearing it tells us little. If, however, a random attempt clears the bar only 5% of the time, the level of significance is 5%; we say that we are 95% sure (the level of confidence) that this is not some random result. The level of significance and the level of confidence are complementary and sum to 100%.
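The bar analogy can be made concrete with a quick simulation. This is my own illustration, not from the original text: a "jumper" whose attempts are random clears a knee-high bar about half the time, but clears an overhead bar only about 5% of the time, so the overhead clearance is the significant result.

```python
import random

random.seed(42)

def clears_bar(bar_height: float) -> bool:
    """One attempt: jump height is uniform on [0, 1)."""
    return random.random() > bar_height

# A knee-high bar is cleared often; an overhead bar rarely.
attempts = 100_000
low_bar = sum(clears_bar(0.50) for _ in range(attempts)) / attempts
high_bar = sum(clears_bar(0.95) for _ in range(attempts)) / attempts

print(f"low bar cleared:  {low_bar:.1%}")   # roughly 50% -- unremarkable
print(f"high bar cleared: {high_bar:.1%}")  # roughly 5%  -- a significant feat
```

The 5% clearance rate plays the role of the significance level; the 95% of attempts that fail correspond to the level of confidence.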

Scientists rarely accept anything with a level of confidence below 95%, and frequently demand higher levels. Many diagnostic tests, for example, are required to have a 98% or higher level of confidence: you want a low chance of a “false positive” from your home pregnancy test. We want drugs to be safe and effective, so we demand a higher level of confidence still, say 99+%. I would hope that the decision to make a nuclear counterstrike would require a level of confidence approaching metaphysical certainty, 99.999+%, because the effects of making an error can be devastating.

One major problem Langmuir identified is found in the second half of the test: “*many measurements are necessary because of the very low statistical significance of the results*.” Low significance means low confidence. Many researchers say the results are inconclusive and they need more data. Why?

Many standard statistical tests can be performed with minimal data – this is the whole basis of Statistical Design of Experiments (maximizing the useful information from the minimum number of experiments). In the real world of industrial R&D, each experiment costs resources (money, time, manpower), and we want to conserve resources.

What does additional data buy you? Simply stated, it buys statistical significance. It is a function of the mathematics. One of the basic statistical tests is the t-test, which compares two populations. The two-population t-statistic is based on the difference between the means of the two populations and their standard deviations. This measures the overlap between the two populations: the more they overlap, the lower the level of confidence that the populations are different. If the error bars overlap, the variability is high enough that any difference may be a random occurrence. This is a standard test, covered in most statistics textbooks.
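As a sketch of the mechanics, here is the t-statistic computed by hand (my own illustration with made-up numbers; I use Welch's form of the two-sample statistic, which does not assume equal variances):

```python
import math
from statistics import mean, stdev

def two_sample_t(a, b):
    """Welch's two-sample t-statistic: difference of means scaled by
    the combined standard error of the two samples."""
    se = math.sqrt(stdev(a)**2 / len(a) + stdev(b)**2 / len(b))
    return (mean(a) - mean(b)) / se

# Two small samples with overlapping spreads (invented numbers):
control   = [10.1, 9.8, 10.3, 9.9, 10.0]
treatment = [10.4, 10.6, 10.2, 10.5, 10.3]

t = two_sample_t(treatment, control)
print(f"t = {t:.2f}")  # the larger |t|, the less the populations overlap
```

A larger |t| means less overlap between the populations, and hence a higher level of confidence that they really differ.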

The problem arises when one has too much data to analyze. As the number of data points rises, the mean changes very little, and the standard deviation settles toward the population’s true spread. What shrinks is the standard error of the mean, which falls as the square root of the sample size; think of this as getting an ever more precise estimate of the mean. At some point you have enough data that any two populations, however similar, will test as statistically different.
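A quick simulation makes the point. Assuming two populations whose true means differ by a trivial 0.05% relative to their spread (all numbers invented for illustration), the same t-statistic that looks like noise at small sample sizes balloons into a “significant” difference once enough data is collected:

```python
import math
import random
from statistics import mean, stdev

random.seed(0)

def t_stat(a, b):
    """Absolute Welch's t-statistic for two samples (stdlib only)."""
    se = math.sqrt(stdev(a)**2 / len(a) + stdev(b)**2 / len(b))
    return abs(mean(a) - mean(b)) / se

# True means differ by 0.05 on a scale of 100, with spread 1.0 --
# a difference far smaller than the variability of either population.
results = {}
for n in (10, 100_000):
    a = [random.gauss(100.00, 1.0) for _ in range(n)]
    b = [random.gauss(100.05, 1.0) for _ in range(n)]
    results[n] = t_stat(a, b)
    print(f"n = {n:>7}: |t| = {results[n]:.2f}")
```

Nothing about the populations changed between the two runs; only the sample size did, yet the large-sample comparison reports an enormous t-statistic.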

I saw this first-hand about 20 years ago in an industrial R&D department. A customer was complaining about our product – it simply was not performing as it always had. Tech service investigated and found a very subtle difference between lots that worked well and those that did not – it came down to the variability of a raw material, even though every lot of raw material was near the middle of the spec. The problem was that they had so much data that the “good” and “bad” lots tested as significantly different, *even though they were well within the raw material spec.* Further investigation showed the problem was actually with the customer’s equipment.

The lesson is that a request for more data is often an admission that the results were not statistically significant, and that the “problem” can be made to go away by inflating the number of data points until a significant result appears.

That is not good science; that is fraud.