P values are not as reliable as many scientists assume (2014) (nature.com)
83 points by e0m on Aug 30, 2015 | 87 comments



From my experience, scientists (at least in biology, where, as in sociology, you have a lot of noise to deal with) have an internal intuition that a single paper with a significant result does not mean that we have found the truth. The recent study which reported a reproducibility rate of about 36% in psychology strikes me as pretty accurate.

I think the scientific system can work with that. It means that if you build follow-up experiments on a single paper, there is a good chance the experiment will fail. In a way, the publishing system is self-correcting in this regard, because you can then cast doubt on the previous paper, which is easier to publish than a fresh negative result on its own (p-value > threshold).


There is no way to know how many people built a follow-up experiment that failed and went unpublished, because a failure to replicate is usually assumed to be due to some mistake, and even a carefully obtained p > threshold is not very publishable.


Luckily some disciplines have a journal of negative results :-)

E.g. http://www.jnr-eeb.org/index.php/jnr


A large number of published results being wrong is definitely something science can live with: we have to trade off Type I against Type II errors. But we should value accuracy: if we report something as being very, very unlikely if chance were at play, and it turns out that in fact (1) it'd be very likely even if the null hypothesis holds and (2) even if P(D|H0) is low, P(H0|D) might be high... then what's the point in writing up all those fancy statistical analyses anyway? At that point significance testing becomes more of a religious ritual and should either be discarded entirely or be amended.


If the p-values were accurate and averaged around 0.05, ~95% of results should be reproducible.

That only 36% were points to deep, fundamental errors.


No. P-values don't work that way and don't mean what you think they mean. Read the OP or, heck, any of the classics like "Why most published research findings are false": http://dx.plos.org/10.1371/journal.pmed.0020124

(36% may or may not be bad, but you can't know without additional stuff like power or prior probability of hypotheses being true; p-values have no intuitive meaning and aren't an answer to any question that people are asking, which is a major reason why Bayesian approaches can be useful. And from a Bayesian perspective, I find 36% totally unsurprising - if anything, substantially better than I had expected given the gross underpowering of most psych studies, the statistical-significance publication filter, and the dubiousness of most hypotheses.)
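
To make that base-rate point concrete, here is a back-of-the-envelope sketch in R (the prior, power, and publication-filter numbers are illustrative assumptions, not estimates from the study):

  # Assume 10% of tested hypotheses are true, 35% power, and a p<0.05
  # publication filter; what fraction of published findings should replicate?
  prior <- 0.10; power <- 0.35; alpha <- 0.05
  true_pos  <- prior * power                 # true effects reaching significance
  false_pos <- (1 - prior) * alpha           # false positives reaching significance
  ppv <- true_pos / (true_pos + false_pos)   # ~0.44: share of findings that are real
  ppv * power + (1 - ppv) * alpha            # expected replication rate: ~0.18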


I agree, 36% ain't too bad. But it requires that any literature you use in your research has been reproduced a few times by other researchers.


A proper rebuttal would show what a p-value actually is and how it differs from what I claimed. Now, since a p-value is exactly what I previously claimed, you obviously can't do that. I'm not even sure what you are arguing against here.


The p-value is the chance of a false positive. But you don't know what the rate of true positives is, or the rate of false negatives.

In a world where there are only false positives and true negatives, and people publish all positive and negative results, then reproduction of a paper should be 95%.

But the reproduction rate when there actually is an effect is not 95%. Depending on sample size, I might get a true positive 20% of the time and a false negative 80% of the time, or I might get a true positive 99.8% of the time and a false negative .2% of the time.

So the average reproduction rate, where an effect actually exists, can be almost any number between 5% and 100%. There is no reason to assume it will be 95%.

So the average reproduction rate, where some effects are real and some are imaginary, will almost certainly not be exactly 95%, and that is not a problem in and of itself.

(And when you talk about an average p-value of .05, that sounds like only publishing positive results, which is blatantly going to fail reproduction. 100 false hypotheses -> 5 publications, all false positives -> 5% reproduction rate)
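
A minimal simulation sketch (assuming a one-sided z-test with known unit variance; reject_rate is a made-up helper) shows how the replication rate tracks power, not 1 - p:

  # Rejection rate of a one-sided z-test: it is the power of the test and
  # depends on effect size (mu) and sample size (n), not on alpha alone.
  reject_rate <- function(mu, n, alpha = 0.05, trials = 10000) {
    crit <- qnorm(1 - alpha)
    z <- replicate(trials, sqrt(n) * mean(rnorm(n, mean = mu)))
    mean(z > crit)
  }
  reject_rate(mu = 0,   n = 25)   # null true: ~0.05
  reject_rate(mu = 0.2, n = 25)   # weak effect: ~0.26
  reject_rate(mu = 1,   n = 25)   # strong effect: ~1.00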


>In a world where there are only false positives and true negatives, and people publish all positive and negative results, then reproduction of a paper should be 95%.

This is the world p-value assumes and is therefore the only one worth considering in relation to my comment.

If an experiment is not well-formed then of course you won't see reproduction at the expected rate. This is what I'm referring to when I say that the low reproduction rate points to deep, fundamental flaws in the experiments.

I agree that the reproduction rate will never be exactly 95% (or 1 - p) due to the discrete nature of experimentation [that's why I used a ~ in front :)], but the reproduction rate of a well-formed experiment should very closely track 1 - p.


>This is the world p-value assumes and is therefore the only one worth considering in relation to my comment.

I'm not sure if that was clear enough. In that world, no one has ever had a hypothesis that was correct. The whole field is useless, measuring things that are wrong and getting the occasional false positive.

You can talk about that world if you want, but it has no connection to reality. It's not p-values that assume that world, it's your misunderstanding of p-values.

>If an experiment is not well-formed then of course you won't see reproduction at the expected rate. This is what I'm referring to when I say that the low reproduction rate points to deep, fundamental flaws in the experiments.

Experiments don't have to have enormous sample sizes to be well-formed. That's the whole point of having a cutoff value.

It's not like an experiment that reproduces 80% of the time disproves the result the rest of the time; it just doesn't quite reach .05 on those trials

>the reproduction rate of a well-formed experiment should very closely track 1 - p

I'm suspicious of this. I don't have time to do the math right now, but an experiment that averages .01 might clear a .05 hurdle far more than 99% of the time, and would definitely be well-formed. And if you set a hurdle at .01 it would only clear it half the time, but it would still be well-formed.


Hypotheses can never be proven to be correct. I don't want to be in any world where it is believed that a hypothesis is or could be correct.

This is a fundamental tenet of science. All that can be done is to reject hypotheses.

You (along with Gwern) have now claimed that I don't understand p-values, but you present no alternative understanding. The reason, of course, is that when you look at the mathematics behind p-value, it is obvious that it is exactly as I claim.

Edit to address your edit:

>I'm suspicious of this. I don't have time to do the math right now, but an experiment that averages .01 might clear a .05 hurdle far more than 99% of the time, and would definitely be well-formed. And if you set a hurdle at .01 it would only clear it half the time, but it would still be well-formed.

You are right that you need to be careful here about what you are comparing across instances. There will be variability since you are only sampling a distribution (most likely at a very low rate) and not observing the entire distribution (which, for continuous distributions, is impossible).


On a certain philosophical level you can never be absolutely sure of anything, and p-values are meaningless.

On a practical level, p-values are the chance that a correlation is reported where 'reality' does not have a correlation. This is not the same number as the chance that the result agrees with 'reality'.

You can reject the concept of objectivity, but you cannot reject that logic. So I have explained the alternative understanding fine, just go back and replace 'true' and 'false' and 'correct' with a philosophically-hedged version.


On a practical level, people may not be able to execute a well-formed experiment. I completely agree with that.

However, that doesn't change the meaning of the mathematics, only that your reality has diverged from what you originally intended/believed.

What is the meaning of the number that people call 'p-value' when it is not calculated on a well-formed experiment? I'm not sure if there is a general formula, but you may be able to find some meaning in a particular instance.


You're either defining "well-formed" as there being no such thing as a true hypothesis, or you have completely lost me. Either way I don't think there's anything more I can say.

p does not tell you how likely a result is to be true.


A well formed experiment tests only a null hypothesis.

p-value is exactly the probability that you observed X given that the previously stated null hypothesis was true at the time of observation. The value (1 - p-value) is exactly the probability that you will make an observation consistent with your hypothesis (ie. expected replication rate).

Wikipedia has a decent treatment that might help: https://en.wikipedia.org/wiki/P-value#Definition_and_interpr...


But the importance of a p-value is in showing when the null hypothesis doesn't hold.

The only time you get 95% reproduction is a result that says the null hypothesis is true.

You're entirely right about that specific case.

But this only happens when nothing correlates. (And almost no science has been done, because most things in fact don't correlate.)

A result that disagrees with the null hypothesis at .05 does not imply any particular chance of another result that also disagrees with the null hypothesis at .05

If there is no correlation, then replication will happen 5% of the time. If there is correlation, it will be somewhere over 5%, but no particular value.

When people talk about reproduction, they talk about that chance. It will only be 95% by coincidence.


>You're entirely right about that specific case.

In fact, this is the only case that matters. All other (valid) cases can be reduced to a single null-hypothesis design.

p-value is undefined for hypotheses that are not a null hypothesis. It is also undefined for hypotheses which do not hold.

Sure, you can walk through the motions, put some numbers together, and eventually produce a number between 0 and 1. However that does not mean you have computed a p-value. If you are testing a non-null hypothesis you have not computed a p-value. If you are testing a null hypothesis that doesn't hold, you have not computed a p-value.


The null hypothesis is where nothing happens. You're supposed to be showing evidence against it. If you redefine things so your "null hypothesis" is where something happens, and you're showing evidence for it, you have done something very very wrong, and you should not be using a .05 threshold either.


"The p-value is the chance of a false positive."

Nope. It's the chance of getting a result as extreme as or more extreme than this, under the assumption of the null hypothesis.
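
For a concrete (made-up) example in R: take 9 heads in 10 flips, with a fair coin as the null hypothesis. The one-sided p-value is the probability of a result at least that extreme under the null:

  # P(at least 9 heads in 10 fair flips)
  pbinom(8, size = 10, prob = 0.5, lower.tail = FALSE)   # ~0.011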


Let me join the club of people claiming that you don't understand p-values.

It's not clear if you are talking about the rate of reproduction for a subset of the possible experimental outcomes (those rejecting the null at the alpha=5% level) or for the whole set.

When the null hypothesis is true (remember that there are fields where this is the norm, e.g. ESP), you would only reproduce (reject again) 5% of the rejections.

Of course you would reproduce (non-reject for the second time) 95% of the non-rejections. The global reproduction rate would be 0.95 x 0.95 + 0.05 x 0.05 = 0.905 (90.5% doesn't look ~95% either).

When the null hypothesis is not true, the probability of reproducing (in either sense) the result of a test depends on the effect size.

If the effect is huge, the test will be rejected with probability ~100% and the result will be reproduced with probability ~100%.

Or maybe you mean by reproducing "getting a lower p-value" in the second trial? If the null is true, the probability of getting p-value2<p-value1 is precisely p-value1. If the null is not true, it will depend on the effect size. If you assume the effect size is the observed one, you expect p-value2 to be smaller than p-value1 with probability 50%.
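
The first of those facts is easy to check by simulation (a sketch using a one-sample t-test on pure-noise data, so the null is true):

  # Under a true null, p-values are Uniform(0,1), so a replication attempt
  # beats an original p-value of 0.05 only about 5% of the time.
  set.seed(1)
  p2 <- replicate(10000, t.test(rnorm(20))$p.value)
  mean(p2 < 0.05)   # ~0.05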


>If the null is true, the probability of getting p-value2<p-value1 is precisely p-value1.

Precisely. P-value is only defined when the null hypothesis is true.

Apparently everyone else is overlooking this fact.


I don't see how it contradicts anything that I (and everyone else) wrote.

You seem to agree that when the null is true and the original result was p-value=0.05, the probability of reproducing the result (getting p-value<0.05 on a second trial) is 5%.

This seems incompatible with your original claim: "If the p-values were accurate and averaged around 0.05, ~95% of results should be reproducible."

Could you explain exactly what the following mean:

p-values were accurate (that the null is true?)

p-values averaged around 0.05 (that you're taking the subset of outcomes with p-value 0.05?)

~95% of results should be reproducible (that if you take the previous subset you will get p-value<0.05 always in 95% of them? or exactly 95% of the time in all of them?)


95% of the repeated observations that you make (in the same manner as the observations used to calculate a valid p-value of 0.05) will be consistent with the relevant null hypothesis.

What other meaning could there be? The result of an experiment is not a p-value, but a series of observations. Those are what need to be compared.


I guess the bit "results should be reproducible" made us think that you were talking about reproducing the previous results (i.e. if the null hypothesis was rejected in the first trial, obtaining again a rejection if the trial was repeated).

If I understand your point, you're saying: "If the null hypothesis is true then with 95% probability it won't be rejected. And, independently of the result of the first trial, if we do a second trial and the null hypothesis is true then with probability 95% it won't be rejected".

Which seems correct, but you might be overlooking the fact that it's not very interesting and unrelated to the discussion.


It's highly relevant to the discussion (interestingly titled: "P-value not as reliable as many scientists assume"), which is entirely about p-value, as I am describing to you the limits of p-value analysis.

That some perform calculations that are not p-value and call them p-value is not exactly my problem to solve. That others perform meta-analyses with numbers that others call p-values, but which aren't actually p-values isn't really my problem either.

I'll say it again. If you correctly measure (exercise left to the reader) a p-value of 0.05, that measurement explicitly means that you expect 95% of your future observations to be consistent with the hypothesis which you used to determine that p-value of 0.05.

Making future observations that are consistent with a known hypothesis is exactly what reproducibility refers to within the context of science.

If you expect 95% of observations (p = 0.05) to be consistent with previous findings, but only 36% are...you did not calculate a valid p-value (or are now testing something other than your hypothesis).


> I'll say it again. If you correctly measure (exercise left to the reader) a p-value of 0.05, that measurement explicitly means that you expect 95% of your future observations to be consistent with the hypothesis which you used to determine that p-value of 0.05.

What do you mean with "measure a p-value"? You make your observation, calculate a statistic (a function of the observation), and look at the distribution of that statistic under the null. The p-value is, by definition, the percentile of the value you got in that distribution (which might or might not be the actual distribution).

You want to check if a die is loaded to yield 6 more often than it should. The null hypothesis is that the die is fair. You can calculate the distribution for the number of 6's in 3 rolls (0: 58%, 1:35%, 2: 7%, 3: 0.5%). You roll the die three times, you get three 6's. The p-value is 0.005. Do you agree? The p-value is 0.005 whether the die is fair (the null hypothesis is true) or loaded. Do you agree?
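
The numbers in the die example check out in R:

  # Distribution of the number of 6s in 3 fair rolls, and the p-value
  # for observing three 6s under the fair-die null hypothesis.
  round(dbinom(0:3, size = 3, prob = 1/6), 3)          # 0.579 0.347 0.069 0.005
  pbinom(2, size = 3, prob = 1/6, lower.tail = FALSE)  # (1/6)^3 ~ 0.0046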

> Making future observations that are consistent with a known hypothesis is exactly what reproducibility refers to within the context of science.

Scientific experiments are usually about rejecting the null hypothesis. For example, the null hypothesis might be that there is no Higgs boson and the peak observed in the LHC data is just noise. They made their analysis and rejected the null hypothesis (p-value less than 0.000001, do you think they calculated it properly?). In this context, reproducibility means "finding the Higgs boson again if the experiment is repeated" and not "repeat the experiment and get a result consistent with the null hypothesis".

According to your description of the limits of p-value analysis, the only conclusion that physicists should get out of the experiment is that if they do it again they should expect to get results consistent with the null hypothesis (i.e. no Higgs boson) with 95% probability. But they see it as evidence that the null hypothesis is false and the Higgs boson real.


Measuring a p-value is equivalent to calculating a p-value (i.e. calculating the conditional probability P(X|H)).

I don't really agree that your die experiment is well-formed. For one, you are grossly under-sampling. It's known a priori that there are at least six possible outcomes, yet you are only considering three rolls, so you don't even have the possibility of observing each distinct value even once.

The p-value of a well-formed experiment should converge towards a fixed value as more observations are made. You will experience variance in the computed value due to the inherently discrete nature of experimentation. This will be especially pronounced for the first observations that are made.

I do not know if the Higgs boson experiment is well-formed. If it is well-formed and their null-hypothesis is true, their p-values will trend towards 1.

If their null-hypothesis is not true then the p-values do not mean much and will trend towards 0.

>In this context, reproducibility means "finding the Higgs boson again if the experiment is repeated"

"The Higgs boson exists" is not a valid hypothesis. Usually the null-hypothesis is "the explanation is measurement/background noise". Since that is really the only valid null-hypothesis, it is most likely what they are using.


It's clear that you have your own concept of a p-value, which is quite different from the one used by all the other people (including the proper interpretation and the usual misinterpretations).

You disagree with all the provided examples, but you have not given any concrete example of how the p-value would be used in a "well-formed experiment" (another concept that seems unique to you).

Of course you're free to redefine concepts as you please, if it makes you happy or it is useful to you in any other way.


I have redefined nothing. Go pull the Higgs data; it will be as I say. Go read how to form experiments and calculate p-values; nothing will be substantially different from what I have said here.


I already explained a few messages ago that the null hypothesis was indeed that there is no Higgs boson and the peak observed in the LHC data is just noise. They made their analysis and rejected the null hypothesis (p-value less than 0.000001).

"If you correctly measure a p-value of 0.05, that measurement explicitly means that you expect that 95% of your future observations to be consistent with the hypothesis which you used to determine that p-value of 0.05."

Did they correctly measure a p-value of 0.000001? They (and everyone else, apart from maybe you) think that they did.

Do you think they expect 95% of their future observations to be consistent with the null hypothesis (that they used to determine that p-value)?

I would say they were quite confident that the observed peak was not noise, and therefore they expected the signal to be there again if the experiment was to be repeated, rejecting again the null hypothesis. Which is why they announced they had discovered the Higgs boson. But maybe you can convince the Swedish Academy of Science to take Higgs' prize back...


Here is one paper: http://arxiv.org/pdf/1207.7214v2.pdf

Look at Figures 8 and 9. They show the p-value (at whatever level of data was collected when the paper was written) over the parameter space that is being searched. You can see that the values observed have a clear separation -- most are close to 1 (null-hypothesis holds) with just one significant dip towards 0 (null-hypothesis doesn't hold). If you were to animate this graph with the p-values over time (as more observations are made), you would see the trend towards 0 or 1 much clearer.

The Boson experimenters would expect a (1 - p) reproduction rate for the next observation made (if the null-hypothesis holds). That is, the next observation has a p probability of fitting within the parameters of the null-hypothesis and (1-p) probability that it is inconsistent with the null-hypothesis. Why would they expect that? Because the math involved in telling you whether or not that is what you should expect is exactly what p-value calculates (again, assuming a well-formed experiment -- which the Higgs experiments probably are).

But again, when the null-hypothesis doesn't hold, p-value tells you very little (it's actually undefined in the math).


> But again, when the null-hypothesis doesn't hold, p-value tells you very little (it's actually undefined in the math).

The p-value is well defined whether the null hypothesis holds or not. You calculate it assuming it does. There you go, you have a properly calculated p-value. That's what physicists do:

"Taking into account the entire mass range of the search, 110– 600 GeV, the global significance of the excess is 5.1 σ, which corresponds to p0 = 1.7 × 10−7."

You see, they have calculated a p-value. Does the null hypothesis hold? I don't think they had any expectations consistent with the null hypothesis being true before the experiment. After the experiment they clearly think that the null hypothesis is false:

"These results provide conclusive evidence for the discovery of a new particle with mass 126.0 ± 0.4 (stat) ± 0.4 (sys) GeV."

They don't see any problem in stating a p-value and rejecting the null hypothesis at the same time (in fact, it's because the p-value that they calculated is very small that they conclude that the null hypothesis doesn't hold). Apparently you see a problem, because if the Higgs boson exists and produces the signal in the experiment then the null hypothesis is false and all the p-value calculations they did to get to that conclusion are "wrong".

Anyway, I have no need to convince you of anything. I can live with people being wrong on the internet.


Well, before you go, I implore you to look into the actual computation and theory of 'p-value'.

A p-value is simply P(X|H). P(X|H) only means something when H is true. If H is false, P(X|H) tells you nothing. Since H is your null-hypothesis, if it does not actually hold in the real-world, P(X|H) is meaningless.

If you read the paper I linked, they never explicitly call out the null hypothesis (nor do, I believe, they show the work for their calculations). There should be another paper somewhere that describes exactly what it is, in the terms I am using. So, phrases like, "[t]hey don't see any problem in stating a p-value and rejecting the null hypothesis" make me think you have no idea what you're talking about.

The null hypothesis can never be 'rejected' (ie. p-value can never reach 0). I don't think you will find anyone working on the Higgs boson that will claim otherwise.


I think we agree that their null hypothesis is "there is a background, with events coming from all the known particles". I think we agree that their conclusion is "these results provide conclusive evidence for the discovery of a new particle". I don't see how can they say that there is a new particle without rejecting the hypothesis that there is no such new particle. Of course you can say that the null hypothesis can never be rejected (relevant Dilbert strip: http://dilbert.com/strip/2001-10-25) but then they can never discover a new particle either.

Regarding p-values in general, your definition is the same I've been using all along. But I don't think it is meaningless when the null hypothesis does not hold. The meaning is clear: "the probability of getting a value for the statistic as high as the observed one if the null hypothesis was true". For example, there would be one chance in several millions of observing the kind of data they found at the LHC if the Higgs boson didn't exist.

You might want to look into the theory yourself, because the notion of p-values trending towards 1 if the null hypothesis is true is nonsense. By definition, if the null hypothesis is true the p-value is uniformly distributed between 0 and 1. If you have at some point a p-value close to one (or to any other number for that matter) and keep adding data, in the long run it will still be uniformly distributed between 0 and 1.
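
That uniform distribution is easy to see by simulation (a sketch with a one-sample t-test on data where the null holds):

  # Histogram of p-values across many experiments with a true null:
  # flat over [0,1], with no trend towards 1.
  set.seed(42)
  pvals <- replicate(10000, t.test(rnorm(30))$p.value)
  hist(pvals, breaks = 20)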


If the null-hypothesis is true, every observation made should be consistent with it. This will result in P(X|H) trending to 1 (since there will be experimental variance). No other experimental design makes sense.

>The meaning [of p-value when H is false] is clear: "the probability of getting a value for the statistic as high as the observed one if the null hypothesis was true"

This is a logical fallacy. It is counterfactual to consider a world where the null hypothesis is true, when it is not.

In fact, this is precisely the feature of the universe that p-value based experimentation exploits and is essentially the only way for us to gain any information about 'reality'.

>By definition, if the null hypothesis is true the p-value is uniformly distributed between 0 and 1.

I don't think so. If that were true, p-value would be entirely useless.


You definitely do not know what a p-value is. When you wrote "P(X|H)" I thought X was shorthand for T>T(X) where T is the statistic, not that you were referring to the actual data X.

P(X|H) doesn't have the properties you claim, anyway. P(X|H)=1 corresponds to the case where only one outcome is possible. In non-trivial cases, the more data you add the lower this number will be.

Assume H="you have a fair coin". You throw it once: heads. P(X|H)=P(h|faircoin)=1/2. You throw it again: tails. P(X|H)=P(ht|faircoin)=1/4. You throw it again: tails. P(X|H)=P(htt|faircoin)=1/8. I guess the experiment is not well-formed...
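
In R, the probability of the exact sequence halves with every flip, even though the null is true:

  cumprod(rep(0.5, 3))   # 0.500 0.250 0.125, heading to 0 as flips accumulate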


X can be any random variable that satisfies the requirements of the null-hypothesis.

A more appropriate variable for your experiment would probably be the ratio of heads to tails (may need to add a bias to avoid division by 0).

"you have a fair coin" is not a hypothesis, at least not a well-defined one.


Ok, so you're thinking about a random variable which converges to some value when the null hypothesis is true. This is fine, but it has nothing to do whatsoever with p-values.

Let me say that your notation is not very appropriate. It makes no sense to say that P(X|H) converges to 1. If you expect X to converge to C if the null hypothesis is true, you can simply say X->C. A proper notation involving probabilities would be P(|X-C|>epsilon)->0 for any positive epsilon (convergence in probability) or maybe P(X->C)=1 (convergence almost surely).

Taking as you suggest X=(#tails/#heads), you expect that X->1 if the coin is fair (I'm not sure why you find this is not a well-defined null hypothesis, but I don't really care). However, P(X)<1 for every X. In fact, P(X=1)->0 as the number of trials increases (X will get closer to 1 on average, but getting exactly 1 will get more and more unlikely).

As I said, you're free to prefer your converging statistics and your well-defined null hypothesis. But you should be aware that people are talking about something completely different when discussing things like the 1e-7 p-value in the Higgs boson discovery or the reproducibility of statistically significant results.

EDIT: Another example, maybe better-defined: a random variable distributed (under the null hypothesis) x~Normal(mu=0,sigma=1). Let's say you take N samples (I let you choose the number, so I don't pick one which is not good enough).The statistic is the mean X=(x_1+x_2+..+x_N)/N. If the null hypothesis is true, X->mu=0. You get X=1/sqrt(N). What's your "p-value" in that case?
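
For reference, the textbook calculation for that last example (assuming the usual one-sided test): the mean of N standard-normal samples has standard deviation 1/sqrt(N), so an observed mean of 1/sqrt(N) is a one-sigma event at every N:

  # p-value of observing mean = 1/sqrt(N) when x ~ Normal(0, 1):
  N <- c(10, 100, 10000)
  pnorm(1 / sqrt(N), mean = 0, sd = 1 / sqrt(N), lower.tail = FALSE)
  # 0.159 0.159 0.159: the p-value converges neither to 0 nor to 1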


> If the null-hypothesis is true, every observation made should be consistent with it. This will result in P(X|H) trending to 1

Every observation being consistent with H doesn't mean that for each event X that occurs, the conditional probability of X given H will be, or "trend toward", 1. Assuming a perfectly deterministic universe, the P(X|everything else that is true) will be 1 for every X that occurs, but that doesn't mean P(X|H) for any particular true proposition H will be anything like that.


>Every observation being consistent with H doesn't mean that for each event X that occurs, the conditional probability of X given H will be, or "trend toward", 1.

Agreed. If you plot P(X|H) (computed over all observations) over time, in a well-formed experiment the line will trend to 1 if the null-hypothesis is true and to 0 if the null-hypothesis is false.

It really is that simple.


> A p-value is simply P(X|H). P(X|H) only means something when H is true. If H is false, P(X|H) tells you nothing.

If you know H is false, P(X|H) tells you nothing. But then H wouldn't be a hypothesis, null or otherwise.

If you don't know whether H is true, but you do know something about X, P(X|H) tells you something useful about whether the positive hypothesis to which H is the alternative has an effect apparent in the world to explain.

> The null hypothesis can never be 'rejected' (ie. p-value can never reach 0).

Rejection of the null hypothesis does not mean p-value = 0. Scientific progress is not based on logical certainty, but rather practical utility.

Necessary truths are the domain of pure logic, not empirical science.


What is the interpretation of a p-value = 0, then? Empirical science can never reject any theory; it is not powerful enough. At best it can provide a selection of least-worst explanations.

The H in P(X|H) does not mean 'assumed to be true'; it means 'is in fact true'. If H is in fact false, it is counterfactual to assume it is true, and therefore any conclusions drawn from the assumption are invalid. This is independent of belief in H.

>Necessary truths are the domain of pure logic, not empirical science.

Science can never deliver truth, which is why it can never truly reject anything (including null-hypotheses).

More generally, this is referred to as the problem of induction.


It's not that p-values are bad by definition; it's that they are often wrongly interpreted. Putting too much confidence in p-values alone can lead to wrong conclusions, and this is what some meta-analyses discover. Many scientists try hard just to reach the "golden" <0.05 in order to claim a discovery and publish it. This is why so many papers mysteriously cluster around 0.05...


There's also the systemic effect of prioritizing particular p-values: negative results are omitted, leading to publication bias across the community.


Scientists have to do their work in a system that incentivizes bad science. How many people actually get to do their work in an environment that isn't hostile to them?


a) Are you serious about that second question and b) if so, can we discount thermodynamics in our answer? Otherwise it's kind of boring.


We can restrict ourselves to social factors. Nature isn't hostile; it doesn't have human intentionality like that. Seems to me we make work unpleasant for everyone in the misguided belief that people work harder for it.


Isn't a main problem with p-values that you don't know whether significance (a low p-value) is the result of a big effect and a small sample or a big sample and a small effect? This is why you also need a measure of the effect, for example the distance between the two measurements in terms of standard deviations.
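
A quick illustration with made-up numbers: a tiny effect with a huge sample can look just as "significant" as a big effect with a small sample, and only the effect estimate reveals the difference:

  # Both tests typically reject the null at the 5% level, yet the
  # estimated effects differ by a factor of ~30.
  set.seed(7)
  t.test(rnorm(20, mean = 1))        # big effect, n = 20
  t.test(rnorm(20000, mean = 0.03))  # tiny effect, n = 20000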


I agree with TFA that p-hacking is a bigger problem.

Low p-value <=> null hypothesis is unlikely.

Choose a shitty null hypothesis ("aliens did it!", "everything is Gaussian", etc.) and you trivially get a low p. Peer review checks this to some extent (you won't get away with "aliens did it"), but there's a large gray area of null hypotheses shitty enough to give a low p but not shitty enough to be rejected by peer review. Choosing the hypothesis after the fact is the most common strategy, because it's undetectable except by repeating the experiment, which is hard.
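
After-the-fact hypothesis choice is easy to simulate (a sketch: run 20 tests on pure noise and report the best one):

  # Every null here is true, yet the smallest of 20 p-values dips below
  # 0.05 about 64% of the time (1 - 0.95^20).
  set.seed(3)
  p <- replicate(20, t.test(rnorm(15))$p.value)
  min(p)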


That is a separate issue.

The main problem with p-values is that, without further information, one cannot infer from them how likely it is that a result is genuine.


I'm probably commenting too late to get my question answered, but here goes: the article has a pretty picture where they show how likely your p-values will mislead you depending on how likely the null hypothesis is. For instance, they say if you think that the null hypothesis has a 50% probability of being right, and you get p=5%, then there's still a 29% chance the null hypothesis is true. But according to my calculations, the right number should be 1/21 = 4.8%. What am I missing here? Or are they wrong? My calculations are below:

Curious George has 200 fascinating phenomena he wishes to investigate. In reality, 100 of those are real, and the other hundred are mere coincidences. The experiments for the 100 real phenomena all show that "yes, this is for real". (I'm assuming no false negatives.) Most of the 100 experiments that test bogus phenomena show that "this is bogus", but 5 of them achieve a significance of p=5%, as expected. George then runs off to tell the Man in the Yellow Hat about his 105 amazing discoveries. If Yellow Hat Man knows that half of the phenomena that capture George's attention are bogus, he knows that 5/105 = 1/21 = 4.8% of George's discoveries are likely bogus, even though he doesn't know which ones.


Assume that you're sampling from a normal distribution with known standard deviation sigma (1 for simplicity) and unknown mean mu. To test if the mean is larger than (the null hypothesis) mu=0 you can check if the observed value is larger than 1.64 sigma (for the 95% confidence test). So if your observation is larger than 1.64 you reject the null hypothesis.

Your calculation would be correct only if the assumption "no false negatives" is approximately valid. This is the case when the true value is large in terms of sigma (say mu=6). Then for the 100 cases with mu=0 you'll reject the null 5 times on average, and for each one of the 100 cases with mu=6 you will reject the null (unless you're unlucky: there will be a false negative around once in 150000 trials).

But you're conditioning on p<0.05, not on p~0.05. It's easy to see that it's much easier to get p=0.05 if mu=0 (this is a 1.64 sigma event) than if mu=6 (it's a 4.36 sigma event). If mu=0, you will get on average 1 (out of 100) observation with 0.04<p<0.05 (i.e. 1.64<x<1.75). The probability of obtaining an observation on that range when mu=6 is very small (0.0004 out of 100). Almost 100% of the "discoveries" with p~0.05 will be false (when mu=6 you will get p-values around 1e-9).

When the true value of mu gets closer to 0, you cannot ignore the false negatives. For example if mu=0.1 the rejection rate will be quite similar to the mu=0 case (the probability of getting 0.04<p<0.05 is 1.2% and 1% respectively) and almost 50% of the "discoveries" with p~0.05 will be false.

Somewhere between the two extreme cases, there is a lower bound for this "false discovery rate".

See http://faculty.washington.edu/jonno/SISG-2011/lectures/sellk... and in particular figure 2.


I think I'm still missing something here. In particular, I'm still not getting 29% as a lower bound. I'm getting around 20%. If we compare mu=0 with mu=1.64, the probability density at x=1.64 is roughly 0.1 and 0.4, respectively, so the lower bound should be .1/(.1+.4)=1/5. No? Unless they were assuming something other than "two normal distributions with the same variance"?
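
(The one-line check in R:)

  dnorm(1.64) / (dnorm(1.64) + dnorm(0))   # ~0.21, i.e. about 1/5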


You're absolutely right! My example was mainly for illustration, I was not sure that it would give exactly the same lower bound (but I was indeed surprised that it's below the 29% in those papers, which I thought a "hard" bound).

It seems the bound that you are calculating (that I have reproduced, R code below) was already published more than 50 years ago for this specific case of a normal distribution. See slide 10 in http://www.biostat.uzh.ch/teaching/master/previous/seminarba...

I have not really read the paper of Sellke et al. entirely, but it seems that the "calibration" they propose is more general; it makes some assumptions about the distribution of the p-value and is therefore approximate.

  # Probability (in %) that the null is true given p ~ 0.05, as a function of
  # the alternative mean mu1, with a 50% prior on the null (one-sided z-test)
  p=0.05
  null=0.5
  c0=qnorm(1-p)                            # rejection cutoff, ~1.64 sigma
  x=seq(0,5,0.01)                          # grid of alternative means mu1
  y=100*dnorm(c0)/(dnorm(c0)+dnorm(x-c0))  # %null at the cutoff for each mu1
  calib=100/(1-1/(exp(1)*p*log(p)))        # Sellke et al. calibration: ~28.9
  actual=min(y)                            # minimum over mu1: ~20.7
  plot(x,y,type="l",ylim=c(0,100),ylab="%null",xlab="mu1",bty="l",xaxs="i",yaxs="i")
  title(paste(null*100,"% null  p = ",p,sep=""))
  legend("topleft",c(paste("Sellke, Bayarri, Berger (2001) =",format(calib,digits=3)),
                   paste("Edwards, Lindman, Savage (1963) =",format(actual,digits=3))),
       lwd=2,col=c("red","blue"),bty="n")
  abline(h=calib,col="red")
  abline(h=actual,col="blue")
  grid()


I don't know R, but I found a few sites that happily run R code for me. I find the shape of that curve somehow pretty. 29% clearly can't be a hard bound, since we can get 4.8% by assuming no false negatives. I just wish I understood whether there is anything particularly natural about the number 29, or did they make their distributional assumptions for the same reasons you did: "mainly for illustration". If so, then the Nature article was terribly misleading by presenting that number as some kind of "speed of light"-type limit, because that makes p-values look worse than they really are. It seems that p-values are bad enough without making up more bad stuff about them! :)

Anyway, thanks for all your help. Your ability to dig up references (and pump out R code) at a moment's notice makes me think you are someone who knows quite a bit of statistics. I'll happily look at anything else you care to point me to.


Assuming no false negatives is just not an option :-) I think that can only happen if the situation is such that there can be no false positives either (i.e. the p-value when the null hypothesis is not true is always zero). EDIT: What I wrote is true only if the distributions under the null and the alternative are completely disjoint. You can actually have very low false negative rates if the distributions are not symmetric, and if you allow different distributions you can do even better: imagine the null hypothesis is x~Normal(0,1) and the alternative is x=C0=1.64 (exactly the cutoff value for 0.05 significance). If we get exactly p=0.05 then the probability of the null being true is 0%. I mean, we get x in [C0-epsilon, C0+epsilon] with probability 1 under the alternative, but with probability -> 0 under the null as epsilon -> 0. Of course, this alternative is very unlikely, and mixing continuous and discrete distributions is always tricky. This is why it makes sense to take averages over prior distributions of the alternative.

As you can see in slide 14, there are multiple calibrations proposed under different assumptions. I agree it is misleading to give one as the "real" error rate, but it's interesting that all of them give rates well above the nominal alpha rate. EDIT: note as well that this is for the case where the null hypothesis is true in 50% of the cases (nowhere in the calculation of p-values do we consider how often the null hypothesis is true, but obviously if it's always true 100% of the significant results will be false positives, and if it's never true 0% of the significant results will be false positives).

In slide 11 there are other calculations for the normal case, a two-sided test this time. But instead of looking for the mu1 giving the lowest bound, they calculate the aggregate error rate making some assumptions about the distribution of the mu1. For example, if I understand the results correctly, assuming mu1 is normally distributed around mu0=0, if you get a p-value=0.05 (in the two-sided test; some modifications are required to the calculation we did) you should expect the null hypothesis to be true at least 32.1% of the time (if the distribution of mu1 is very concentrated around 0, the 50% rejection rate on the left side of the chart dominates; if the standard deviation is very high, the region of almost 100% rejection rate far from mu0 at the right of the chart dominates; for some intermediate standard deviation one will hopefully get the 32.1% lower bound).

Unfortunately, I think the assumption behind the nice result 1/(1-1/(e p log(p))) is that p-values follow a beta distribution when the null hypothesis is not true, and I don't think there is a clear interpretation of that.


Ok, so the pretty result is mostly arbitrary. Fair enough. Re: false negatives... you seem to be living in a world of bell curves, or at least a mostly continuous world. I can easily make (very contrived) experiments where false negatives just don't happen. For instance: I have two coins. One is a perfectly fair coin. The other is a two-headed coin. You see me flip one of them. The null hypothesis is that I flipped the fair coin. A false negative means deciding the coin is fair but it's really not. This will never happen, because you will only decide that if it lands tails, and then it must be fair. (If I only flip it once, the false positive rate is something like 1/3, not 0.) But this is probably much too contrived for your taste, and maybe even for mine. But it's almost 5am, and I must go to sleep now, or else it will get bright soon, and I never will. I now appreciate the value of the 20min procrastination setting.


I agree with your point: if we are sufficiently creative, we can get many extreme results. For example, I made an addition to the first paragraph of my previous comment, which you might have missed, giving an example where the probability of the null hypothesis being true when p=0.05 is zero (or arbitrarily small, if we replace the discrete probability lump under the alternative hypothesis by a continuous distribution which is concentrated enough). I also added a comment to the second paragraph, by the way.

One minor comment on your example. If H0: fair coin and H1: two-headed coin, and the statistic is the number of heads h, I cannot reject (at the 0.05 level) the null when n (the number of flips) is small, even if I'm only getting heads. For one flip, p[h=1|H0]=0.5. For two flips, p[h=2|H0]=0.25. For n>=5 you will of course reject the null hypothesis for every case where H1 is true (and for 1/2^n of the cases where H0 is true). There will be no false negatives. But I guess you have noticed that this doesn't help with the false discovery rate in this example: when H1 is true the p-value will be very small (1/2^n) so if the observed p-value is ~0.05 (or any other value larger than 1/2^n) then it's for sure a false positive (because there will be at least one occurrence of tails).

Ok, enough time wasted on this subject :-)


> so if the observed p-value is ~0.05 (or any other value larger than 1/2^n) then it's for sure a false positive (because there will be at least one occurrence of tails).

Good point! I didn't think of that.

> Ok, enough time wasted on this subject :-)

Even better point!


First of all, thank you for responding, I expected that no one would. Second of all, you point out that "no false negatives" is not always realistic. Fair enough. That means that 4.8% need not be the right answer. But at least it's clear how I got it. I still have no clue how on earth they got 29%, rather than 28% or 31%. Am I missing something?


1/(1-1/(e p log(p)))

This is formula (3) in the linked paper. I agree that the derivation is not obvious, but probably you should get the same result using the calculation I described for different values of mu and looking for the minimum.


thanks!


Great article. I'm not sure that replication itself will solve the problem, since the Type I error rate requires asymptotics. We'd have to run many replications and then show convergence. That'll be broadly cost-prohibitive for all but the most important conclusions. Lower thresholds probably won't do it either. Right now, the only solutions I see are:

a) Bayesian methods

b) Fisher's single H hypothesis method

c) Tukey's Exploratory Data Analysis method.

d) All of the above.


I don't see why (e) teaching scientists to be statistically literate so they don't abuse or misunderstand these tests, and/or (f) focusing on reproducible results and shaming researchers with sloppy methodology, wouldn't work. The hypothesis test has known limitations, but it's not clear that we should blame null hypothesis tests for people mis-using them, when researchers untrained in stats are just as likely to mis-use any method you give them.


Relying on effect measures and their confidence intervals, rather than on p-values as a "Yes/No" threshold, should likely be on your list, especially as an interim step that should be easy to swallow for those who still want p-values.



"Essentially, all models are wrong, but some are useful." --George E.P. Box


The p-value test isn't a model; it's a measure of the significance of an effect in data against random noise.


Which arises from a model (!) of random noise and of your effect.


I see - my mistake. That's a very broad definition of 'model' though, isn't it? Including 'random numbers'? You might as well say everything is a model, in which case the original quote says nothing :-)


It's perhaps a bit like "everything is a model" in the sense that all of these tests, even the model-free ones, arise from a coherent choice of assumptions and, if you for a moment take the Bayesian perspective very seriously, prior distributions over conditionals. The original quote should be taken to mean that any particular choice of assumptions is limiting, but making interesting choices can drive interesting questions which are thought provoking and meaningful even if they are wrong.


[deleted]


All of us?

Scientific experiments always involve some degree of assumption. I'm most skeptical of those who don't state their assumptions.


[deleted]


First off, there's no need for all-caps. This isn't 4chan.

> The fact that assumptions are considered as some unavoidable, forgivable, intricate part of science is part of what fuels anti-science and politics.

No one here, as far as I can tell, is saying, 'oh well, science is full of assumptions therefore science is invalid.' The problem is not with science in general being valid or invalid, but rather with the sorts of experiments being conducted right now. Studies that are not replicable most of the time are bad science, which is different than science itself being 'bad.'

I'm not saying such faulty experiments and studies are unavoidable and forgivable. Quite the opposite, they're flawed and need to be scrutinized more, not less.

Finally, this is not an incurable problem. Bayesian math is one potential solution. There are others.


> 'oh well, science is full of assumptions therefore science is invalid.'

No, they are saying "science is based on assumption, but in this case 'many scientists' were making the wrong assumptions." The title should have said "scientists find errors in p-values as premises." Assumptions are avoided, not depended upon. Bad assumptions would certainly lead to irreplicable experiments also.

The distinction you make between science and its experiments is not a common distinction. If science = experiments, then you totally agree that this kind of science is bad science. Which was precisely my point.


>First off, there's no need for all-caps. This isn't 4chan.

Oh man, it looks like I missed something priceless.


I agree that science strives to remove bias from its body of knowledge, but it's absolutely unavoidable for humans to paint their subjective experiences onto it. Humans have to bring their own preconceptions to any scientific experiment; even the medium by which we convey the knowledge is bathed in assumptions about what those words mean. Every facet of human knowledge is premised on how humans experience the universe. Given your definition, I don't think anything we know would be considered a fact.

https://www.youtube.com/watch?v=cG3sfrK5B4E


> Every facet of human knowledge is premised on how humans experience the universe

Absolutely. But that is why this is where we start. Before science, we had no way to invalidate illusions and validate what was real, because assumptions on their own are neither. They are naked intuitions pending validation. For the longest time we were unable to validate them, and we ended up with the mess we had before science. Basically, no one would ever have made it to Mars.

But with scientific validation, knowledge becomes more than just an assumption or an intuition based on an experience, or a theory we came up with that we find ingenious because, well, we came up with it. By overcoming our assumptions we achieve objectivity, universality, and factuality. We discover knowledge that has rigid practical persistence. In this process something transcends from our subjective personal ideas to become objective impersonal facts. There is no self in science. And it is from this arduous feat that technology is born. There is nothing in this monitor or the components of this phone that is based on assumptions. These devices are selfless.

"Assumption" is as evil a word as "metaphysics" and "subjective" in science. Yet, there are still people who use the word as a synonym for axiom. This is simply bad word-choice. The correct term here would be "premise" and you used it yourself. Theories can have premises, but not assumptions. Are the premises assumed? No. They are granted.

Since this is HN, here is an analogy to software. A program that assumes certain behavior of code or of external APIs will be riddled with bugs. Every aspect of its execution must be tested, and the assumptions of the programmer must be eliminated by production. Of course, being human, we start with assumptions - such as "this would be the perfect library for this project". The "assumptions" that we begin with, however, eventually manifest themselves as "premises". And in software, these are the dependencies of a program. It is only natural for software to be dependent on other software. What is unnatural and anti-software would be to make assumptions about other software, especially within its own execution.

The path from the assumptions of subjective raw experience to the subjective consumption of reliable technology is paved with the work of competent scientists (and analogously, by competent programmers).

> Given your definition, I don't think anything we know would be considered a fact.

If fact is to mean truth, then sure. But there is an abundance of statements that have been backed by evidence. And all these statements are truer than most. Measurably truthier, rather, and that is what counts, because it leaves room for progress. This is a better definition of "fact".


How about the assumption that the fundamental constants of the universe are not slowly changing day-to-day?


A better word for that would be "premise" or "axiom". Premises and axioms are objective exceptions with objective merits. Assumptions are too personal because they imply belief, which is purely subjective. Nature doesn't care about what anyone believes, and science should never be a democracy.

Assumptions also imply some independent existential entity as valid and are self-validating, whereas premises and axioms are highly self-deprecating. Hence, assumptions are dogmatic self-fulfilling prophecies that are an end unto themselves, whereas premises are unfortunate, unavoidable constraints as a means to an end. They are what couldn't be eliminated despite towering doubt and cynicism. Premises lead to science, axioms to logic, and assumptions to religion (and the like -- not saying it is good or bad).


You're making a distinction that I've never heard anyone make before, and I don't think you're making a convincing argument for it now.


The distinction between axiom and assumption is quite clear, so I'll assume you are referring to premise.

The distinction already exists, which is the beauty of words. I am not making this distinction up. I am merely enforcing it, as we all do as speakers, by the words we choose, based on the accuracy of our expressions.

From Popper:

> A theoretical system may be said to be axiomatized if ... (d) necessary, for the same purpose; which means that they should contain no superfluous assumptions [0].

So either Popper is wrong, or theories should not include unnecessary assumptions. But how do we know if an assumption is necessary without doing the science? And after we do it, are we still going to call necessary assumptions assumptions, even with their subjective implications? Only a person is capable of assuming. A theory with assumptions is still subjective.

A premise on the other hand is objective and specific. It's "a proposition supporting or helping to support a conclusion" [1]. It's a simple device in logic that asserts a dependency. Or in other words, a "necessary assumption".

So then would it not be safer to say axiomatized theories have premises, not assumptions? And, in Popper's words, yet-to-be-axiomatized theories have assumptions. That is what makes them hypotheses. And so all the words and their distinctions fall cleanly into place (I did not make anything up).

The original article in Nature was written on the premise that science is based on assumptions and that scientists doing the science rely on assumptions. This is not the premise of science, and is incorrect. Refining our word selection to reflect this understanding would be of great service, particularly to the students.

[0] https://books.google.com/books?id=cAKCAgAAQBAJ&pg=PA51&lpg=P...

[1] Just from the dictionary, so as to avoid my own words. http://dictionary.reference.com/browse/premise



