Bulletin of the American Mathematical Society

ISSN 1088-9485 (online), ISSN 0273-0979 (print)



Statistical proof? The problem of irreproducibility

Author: Susan Holmes
Journal: Bull. Amer. Math. Soc.
MSC (2010): Primary 62F03, 62F15
Published electronically: October 4, 2017

Abstract: Data currently generated in the fields of ecology, medicine, climatology, and neuroscience often contain tens of thousands of measured variables. If special care is not taken, the complexity associated with statistical analysis of such data can lead to publication of results that prove to be irreproducible.

Modern statistics has had to revisit the classical hypothesis testing paradigm to accommodate high-throughput settings. A first step is to correct for multiplicity: when many variables are candidates for being declared significant, a multiple hypothesis correction is applied to ensure false discovery rate (FDR) control (Benjamini and Hochberg, 1995). FDR adjustments do not, however, solve the problem of double dipping the data; recent work has developed a field known as post-selection inference, which enables inference when the same data are used both to choose and to evaluate models.
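
To make the FDR step concrete, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure, assuming independent (or positively dependent) p-values. The function name `benjamini_hochberg`, the level `alpha`, and the ten p-values are illustrative assumptions, not taken from the article.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Sketch of the Benjamini-Hochberg (1995) step-up rule: sort the m
    p-values, find the largest rank k with p_(k) <= (k / m) * alpha,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank whose p-value falls under its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

With these made-up values the procedure rejects the two smallest p-values at level 0.05, whereas a Bonferroni threshold of 0.05/10 = 0.005 would reject only the first.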

Even so, the complexity of software and the flexibility of choices in tuning parameters can bias the output toward an inflated number of significant results; neuroscientists recently revisited the problem and found that many fMRI studies have produced false positives.

Unfortunately, all formal correction methods are tailored to specific settings and do not take into account the flexibility available to today's statisticians. A constructive way forward is to be transparent about the analyses performed, to separate the exploratory and confirmatory phases of the analyses, and to provide open-access code; doing so will enhance both reproducibility and replicability.
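
One simple way to keep the exploratory and confirmatory phases apart is to split the samples once, up front, with a recorded random seed. The sketch below is only an illustration under that assumption; the function name `split_exploratory_confirmatory`, the 50/50 fraction, and the seed value are hypothetical choices, not a procedure prescribed by the article.

```python
import random


def split_exploratory_confirmatory(n_samples, frac_exploratory=0.5, seed=2017):
    """Partition sample indices into disjoint exploratory and confirmatory sets.

    The exploratory set may be used freely for model choice and tuning;
    the confirmatory set is held back for the final, pre-specified test.
    """
    if not 0.0 < frac_exploratory < 1.0:
        raise ValueError("frac_exploratory must lie strictly between 0 and 1")
    rng = random.Random(seed)  # fixed seed so others can reproduce the split
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(n_samples * frac_exploratory))
    exploratory = sorted(indices[:cut])
    confirmatory = sorted(indices[cut:])
    return exploratory, confirmatory


exploratory, confirmatory = split_exploratory_confirmatory(100)
print(len(exploratory), len(confirmatory))
```

Publishing the seed together with the analysis code lets a reader regenerate exactly the same split, which supports reproducibility; replicability still requires new data.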

References

  • [1] Robert J. Adler and Jonathan E. Taylor, Topological complexity of smooth random functions, Lecture Notes in Mathematics, vol. 2019, Springer, Heidelberg, 2011. Lectures from the 39th Probability Summer School held in Saint-Flour, 2009; École d’Été de Probabilités de Saint-Flour. [Saint-Flour Probability Summer School]. MR 2768175
  • [2] Ery Arias-Castro, Emmanuel J. Candès, and Yaniv Plan, Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism, Ann. Statist. 39 (2011), no. 5, 2533–2556. MR 2906877
  • [3] Ery Arias-Castro and Shiyun Chen, Distribution-free multiple testing, Electron. J. Stat. 11 (2017), no. 1, 1983–2001. MR 3651021
  • [4] Ery Arias-Castro and Nicolas Verzelen, Community detection in dense random networks, Ann. Statist. 42 (2014), no. 3, 940–969. MR 3210992
  • [5] Keith A. Baggerly and Kevin R. Coombes, Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology, Ann. Appl. Stat. 3 (2009), no. 4, 1309–1334. MR 2752136
  • [6] Rina Foygel Barber and Emmanuel J. Candès, Controlling the false discovery rate via knockoffs, Ann. Statist. 43 (2015), no. 5, 2055–2085. MR 3375876
  • [7] Rina Foygel Barber and Emmanuel J. Candès, A knockoff filter for high-dimensional selective inference, arXiv:1602.03574, 10 February 2016.
  • [8] Yoav Benjamini and Yosef Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. Roy. Statist. Soc. Ser. B 57 (1995), no. 1, 289–300. MR 1325392
  • [9] Yoav Benjamini and Yosef Hochberg, On the adaptive control of the false discovery rate in multiple testing with independent statistics, J. Educ. Behav. Stat. 25 (2000), no. 1, 60-83.
  • [10] Yoav Benjamini and Daniel Yekutieli, The control of the false discovery rate in multiple testing under dependency, Ann. Statist. 29 (2001), no. 4, 1165–1188. MR 1869245
  • [11] Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao, Valid post-selection inference, Ann. Statist. 41 (2013), no. 2, 802–837. MR 3099122
  • [12] Richard Berk, Lawrence Brown, and Linda Zhao, Statistical inference after model selection, J. Quant. Criminol. 26 (2009), no. 2, 217-236.
  • [13] Małgorzata Bogdan, Ewout van den Berg, Chiara Sabatti, Weijie Su, and Emmanuel J. Candès, SLOPE—adaptive variable selection via convex optimization, Ann. Appl. Stat. 9 (2015), no. 3, 1103–1140. MR 3418717
  • [14] Frank Bretz, Torsten Hothorn, and Peter Westfall, Multiple comparisons using R, CRC Press, 2016.
  • [15] Keith M. Briggs, Linlin Song, and Thomas Prellberg, A note on the distribution of the maximum of a set of Poisson random variables, arXiv:0903.4373, 2009.
  • [16] Peter Bühlmann, Statistical significance in high-dimensional linear models, Bernoulli 19 (2013), no. 4, 1212–1242. MR 3102549
  • [17] A. Buja and L. Brown, Discussion: “A significance test for the lasso” [MR3210970], Ann. Statist. 42 (2014), no. 2, 509–517. MR 3210976
  • [18] Katherine S. Button, John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafò, Power failure: why small sample size undermines the reliability of neuroscience, Nature Reviews Neuroscience 14 (2013), no. 5, 365-376.
  • [19] Benjamin Callahan, Diana Proctor, David Relman, Julia Fukuyama, and Susan Holmes, Reproducible research workflow in R for the analysis of personalized human microbiome data, Pacific Symposium on Biocomputing, vol. 21, NIH Public Access, 2016, p. 183.
  • [20] David R. Cox, Discussion: Comment on a paper by Jager and Leek, Biostatistics 15 (2014), no. 1, 16-8; Discussion 39-45.
  • [21] Anthony D’Aristotile, Persi Diaconis, and David Freedman, On merging of probabilities, Sankhyā Ser. A 50 (1988), no. 3, 363–380. MR 1065549
  • [22] Persi Diaconis, Magical thinking in the analysis of scientific data, Annals of the New York Academy of Sciences 364 (1981), no. 1, 236-244.
  • [23] Daniel B. DiGiulio, Benjamin J. Callahan, Paul J. McMurdie, Elizabeth K. Costello, Deirdre J. Lyell, Anna Robaczewska, Christine L. Sun, Daniela S. A. Goltsman, Ronald J. Wong, Gary Shaw, David K. Stevenson, Susan P. Holmes, and David A. Relman, Temporal and spatial variation of the human microbiota during pregnancy, Proc. Natl. Acad. Sci. U. S. A. 112 (2015), no. 35, 11060-11065.
  • [24] David Donoho and Jiashun Jin, Higher criticism for detecting sparse heterogeneous mixtures, Ann. Statist. 32 (2004), no. 3, 962–994. MR 2065195
  • [25] David Donoho and Jiashun Jin, Higher criticism for large-scale inference, especially for rare and weak effects, Statist. Sci. 30 (2015), no. 1, 1–25. MR 3317751
  • [26] Bradley Efron, Large-scale inference, Institute of Mathematical Statistics (IMS) Monographs, vol. 1, Cambridge University Press, Cambridge, 2010. Empirical Bayes methods for estimation, testing, and prediction. MR 2724758
  • [27] Bradley Efron, Estimation and accuracy after model selection, J. Amer. Statist. Assoc. 109 (2014), no. 507, 991–1007. MR 3265671
  • [28] Bradley Efron and Robert Tibshirani, Empirical Bayes methods and false discovery rates for microarrays, Genetic Epidemiology 23 (2002), no. 1, 70-86.
  • [29] Bradley Efron, Robert Tibshirani, John D. Storey, and Virginia Tusher, Empirical Bayes analysis of a microarray experiment, J. Amer. Statist. Assoc. 96 (2001), no. 456, 1151–1160. MR 1946571
  • [30] Anders Eklund, Thomas E. Nichols, and Hans Knutsson, Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, Proc. Natl. Acad. Sci. U. S. A. 113 (2016), no. 28, 7900-7905.
  • [31] Ronald A. Fisher, The arrangement of field experiments, Breakthroughs in Statistics, Springer, 1992, Originally published 1926, pp. 82-91.
  • [32] William Fithian, Dennis Sun, and Jonathan Taylor, Optimal inference after model selection, 9 October 2014.
  • [33] Andrew Gelman and Keith O'Rourke, Discussion: Difficulties in making inferences about scientific truth from distributions of published p-values, Biostatistics 15 (2014), no. 1, 18-23; Discussion 39-45.
  • [34] Jeremy Goecks, Anton Nekrutenko, and James Taylor, Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, Genome Biology 11 (2010), no. 8, 1.
  • [35] Peter Hall and Jiashun Jin, Properties of higher criticism under strong dependence, Ann. Statist. 36 (2008), no. 1, 381–402. MR 2387976
  • [36] Peter Hall and Jiashun Jin, Innovated higher criticism for detecting sparse signals in correlated noise, Ann. Statist. 38 (2010), no. 3, 1686–1732. MR 2662357
  • [37] Xiaoying Tian Harris, Snigdha Panigrahi, Jelena Markovic, Nan Bi, and Jonathan Taylor, Selective sampling after solving a convex problem, arXiv:1609.05609, 19 September 2016.
  • [38] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The elements of statistical learning, 2nd ed., Springer Series in Statistics, Springer, New York, 2009. Data mining, inference, and prediction. MR 2722294
  • [39] David C. Hoaglin, Frederick Mosteller, and John W. Tukey, Exploring data tables, trends, and shapes, John Wiley & Sons, 2011.
  • [40] Susan Holmes and Wolfgang Huber, Modern statistics for modern biology, Cambridge University Press, 2017, to appear.
  • [41] Wolfgang Huber, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, Raphael Gottardo, Florian Hahne, Kasper D. Hansen, Rafael A. Irizarry, Michael Lawrence, Michael I. Love, James MacDonald, Valerie Obenchain, Andrzej K. Oleś, Hervé Pagès, Alejandro Reyes, Paul Shannon, Gordon K. Smyth, Dan Tenenbaum, Levi Waldron, and Martin Morgan, Orchestrating high-throughput genomic analysis with Bioconductor, Nat. Methods 12 (2015), no. 2, 115-121.
  • [42] Nikolaos Ignatiadis, Bernd Klaus, Judith B. Zaugg, and Wolfgang Huber, Data-driven hypothesis weighting increases detection power in genome-scale multiple testing, Nat. Methods 13 (2016), no. 7, 577-580.
  • [43] Yu. I. Ingster, Some problems of hypothesis testing leading to infinitely divisible distributions, Math. Methods Statist. 6 (1997), no. 1, 47–69. MR 1456646
  • [44] John P. A. Ioannidis, Why most published research findings are false, Chance 18 (2005), no. 4, 40–47. MR 2216666
  • [45] John P. A. Ioannidis, David B. Allison, Catherine A. Ball, Issa Coulibaly, Xiangqin Cui, Aedín C. Culhane, Mario Falchi, Cesare Furlanello, Laurence Game, Giuseppe Jurman, J. Mangion, T. Mehta, M. Nitzberg, G. P. Page, E. Petretto, and V. van Noort, Repeatability of published microarray gene expression analyses, Nature Genetics 41 (2009), no. 2, 149-155.
  • [46] Leah R. Jager and Jeffrey T. Leek, An estimate of the science-wise false discovery rate and application to the top medical literature, Biostatistics 15 (2014), no. 1, 1-12.
  • [47] Jana Janková and Sara van de Geer, Honest confidence regions and optimality in high-dimensional precision matrix estimation, arXiv:1507.02061, 8 July 2015.
  • [48] Lucas Janson, Rina Foygel Barber, and Emmanuel Candès, EigenPrism: Inference for high-dimensional signal-to-noise ratios, arXiv:1505.02097, 8 May 2015.
  • [49] Joshua R. Klein and Aaron Roodman, Blind analysis in nuclear and particle physics, Annu. Rev. Nucl. Part. Sci. 55 (2005), 141-163.
  • [50] Nikolaus Kriegeskorte, W. Kyle Simmons, Patrick S. F. Bellgowan, and Chris I. Baker, Circular analysis in systems neuroscience: the dangers of double dipping, Nat. Neurosci. 12 (2009), no. 5, 535-540.
  • [51] Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor, Exact post-selection inference, with application to the lasso, Ann. Statist. 44 (2016), no. 3, 907–927. MR 3485948
  • [52] Hannes Leeb, Benedikt M. Pötscher, and Karl Ewald, On various confidence intervals post-model-selection, Statist. Sci. 30 (2015), no. 2, 216–227. MR 3353104
  • [53] Jeffrey T. Leek and Roger D. Peng, What is the question?, Science 347 (2015), no. 6228, 1314-1315.
  • [54] Jeffrey T. Leek and Roger D. Peng, Opinion: Reproducible research can still be wrong: Adopting a prevention approach, Proceedings of the National Academy of Sciences 112 (2015), no. 6, 1645-1646.
  • [55] E. L. Lehmann and Joseph P. Romano, Testing statistical hypotheses, 3rd ed., Springer Texts in Statistics, Springer, New York, 2005. MR 2135927
  • [56] Jonah Lehrer, The truth wears off, The New Yorker 13 (2010), no. 52, 229.
  • [57] Ang Li and Rina Foygel Barber, Accumulation tests for FDR control in ordered hypothesis testing, J. Amer. Statist. Assoc. 112 (2017), no. 518, 837–849. MR 3671774
  • [58] Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, and Robert Tibshirani, A significance test for the lasso, Ann. Statist. 42 (2014), no. 2, 413–468. MR 3210970
  • [59] Colin L. Mallows and John W. Tukey, An overview of techniques of data analysis, emphasizing its exploratory aspects, Some recent advances in statistics, Academic Press, London, 1982, pp. 111–172. MR 773678
  • [60] Brendan McKay, Dror Bar-Natan, Maya Bar-Hillel, and Gil Kalai, Solving the bible code puzzle, Statistical Science (1999), 150-173.
  • [61] K. Jarrod Millman and Fernando Pérez, Developing open-source scientific practice, Implementing Reproducible Research. CRC Press, Boca Raton, FL (2014), 149-183.
  • [62] David Mumford, Intelligent design found in the sky with $p<0.001$, Newsletter of the Swedish Mathematical Society (2009).
  • [63] Michael A. Newton, Christina M. Kendziorski, Craig S. Richmond, Frederick R. Blattner, and Kam-Wah Tsui, On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data, Journal of Computational Biology 8 (2001), no. 1, 37-52.
  • [64] Snigdha Panigrahi, Jonathan Taylor, and Asaf Weinstein, Bayesian post-selection inference in the linear model, arXiv:1605.08824, 28 May 2016.
  • [65] Prasad Patil, Roger D. Peng, and Jeffrey Leek, A statistical definition for reproducibility and replicability, bioRxiv:066803, 2016.
  • [66] Roger Peng, The reproducibility crisis in science: A statistical counterattack, Significance 12 (2015), no. 3, 30-32.
  • [67] Russell A. Poldrack and Krzysztof J. Gorgolewski, Making big data open: data sharing in neuroimaging, Nat. Neurosci. 17 (2014), no. 11, 1510-1517.
  • [68] Florian Prinz, Thomas Schlange, and Khusru Asadullah, Believe it or not: how much can we rely on published data on potential drug targets?, Nature Reviews Drug Discovery 10 (2011), no. 9, 712-712.
  • [69] Soo-Yon Rhee, Matthew J. Gonzales, Rami Kantor, Bradley J. Betts, Jaideep Ravela, and Robert W. Shafer, Human immunodeficiency virus reverse transcriptase and protease sequence database, Nucleic Acids Research 31 (2003), no. 1, 298-303.
  • [70] Robert Rosenthal, The file drawer problem and tolerance for null results, Psychological Bulletin 86 (1979), no. 3, 638.
  • [71] Henry Scheffé, A method for judging all contrasts in the analysis of variance, Biometrika 40 (1953), 87–104. MR 0057504
  • [72] Juliet Popper Shaffer, Multiple hypothesis testing, Annu. Rev. Psychol. 46 (1995), 561.
  • [73] R. J. Simes, An improved Bonferroni procedure for multiple tests of significance, Biometrika 73 (1986), no. 3, 751–754. MR 897872
  • [74] Uri Simonsohn, Leif D. Nelson, and Joseph P. Simmons, P-curve: a key to the file-drawer, Journal of Experimental Psychology: General 143 (2014), no. 2, 534.
  • [75] Matthew Stephens, False discovery rates: a new deal, Biostatistics, 17 October 2016.
  • [76] John D. Storey, A direct approach to false discovery rates, J. R. Stat. Soc. Ser. B Stat. Methodol. 64 (2002), no. 3, 479–498. MR 1924302
  • [77] John D. Storey, The positive false discovery rate: a Bayesian interpretation and the 𝑞-value, Ann. Statist. 31 (2003), no. 6, 2013–2035. MR 2036398
  • [78] John D. Storey and Robert Tibshirani, Statistical significance for genomewide studies, Proc. Natl. Acad. Sci. USA 100 (2003), no. 16, 9440–9445. MR 1994856
  • [79] Nassim Nicholas Taleb, The black swan: The impact of the highly improbable, Random House, 2007.
  • [80] Jonathan Taylor and Robert Tibshirani, Post-selection inference for L1-penalized likelihood models, arXiv:1602.07358, 24 February 2016.
  • [81] Jonathan Taylor and Robert J. Tibshirani, Statistical learning and selective inference, Proc. Natl. Acad. Sci. USA 112 (2015), no. 25, 7629–7634. MR 3371123
  • [82] Jonathan E. Taylor, Joshua R. Loftus, and Ryan J. Tibshirani, Inference in adaptive regression via the Kac-Rice formula, Ann. Statist. 44 (2016), no. 2, 743–770. MR 3476616
  • [83] Xiaoying Tian, Nan Bi, and Jonathan Taylor, MAGIC: a general, powerful and tractable method for selective inference, arXiv:1607.02630, 9 July 2016.
  • [84] Xiaoying Tian and Jonathan E. Taylor, Selective inference with a randomized response, arXiv:1507.06739, 24 July 2015.
  • [85] Ryan J. Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani, Exact post-selection inference for sequential regression procedures, J. Amer. Statist. Assoc. 111 (2016), no. 514, 600–620. MR 3538689
  • [86] John W. Tukey, The problem of multiple comparisons: Introduction and Parts A, B, and C, Princeton University, 1953.
  • [87] John W. Tukey, The philosophy of multiple comparisons, Statistical Science (1991), 100-116.
  • [88] John W. Tukey, The collected works of John W. Tukey. Vol. VIII, Chapman & Hall, New York, 1994. Multiple comparisons: 1948–1983; With a preface by William S. Cleveland; With a biography by Frederick Mosteller; Edited and with an introduction and comments by Henry I. Braun. MR 1263027
  • [89] Sara van de Geer, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure, On asymptotically optimal confidence regions and tests for high-dimensional models, Ann. Statist. 42 (2014), no. 3, 1166–1202. MR 3224285
  • [90] Stefan Wager, Wenfei Du, Jonathan Taylor, and Robert J. Tibshirani, High-dimensional regression adjustments in randomized experiments, Proc. Natl. Acad. Sci. USA 113 (2016), no. 45, 12673–12678. MR 3576188
  • [91] Peter H. Westfall, A. Krishen, and Stanley S. Young, Using prior information to allocate significance levels for multiple endpoints, Stat. Med. 17 (1998), no. 18, 2107-2119.
  • [92] Doron Witztum, Eliyahu Rips, and Yoav Rosenberg, Equidistant letter sequences in the book of genesis, Statistical Science 9 (1994), no. 3, 429-438.
  • [93] Daniel Yekutieli, Adjusted Bayesian inference for selected parameters, J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 (2012), no. 3, 515–541. MR 2925372
  • [94] Hong Zhang and Zheyang Wu, SetTest: Group testing procedures for signal detection and goodness-of-fit, 2017, R package version 0.1.0.

Additional Information

Susan Holmes
Affiliation: Statistics Department, Sequoia Hall, Stanford, California 94305

Received by editor(s): June 15, 2017
Published electronically: October 4, 2017
Additional Notes: This work was supported by a Stanford Gabilan fellowship.
Article copyright: © Copyright 2017 by the author under Creative Commons Attribution 4.0 License (CC BY 4.0)