Statistical proof? The problem of irreproducibility

by Susan Holmes
Bull. Amer. Math. Soc. 55 (2018), 31-55

Abstract:

Data currently generated in the fields of ecology, medicine, climatology, and neuroscience often contain tens of thousands of measured variables. If special care is not taken, the complexity associated with statistical analysis of such data can lead to publication of results that prove to be irreproducible.

The field of modern statistics has had to revisit the classical hypothesis testing paradigm to accommodate modern high-throughput settings. A first step is correction for multiplicity in the number of possible variables selected as significant, using multiple hypothesis testing adjustments that ensure false discovery rate (FDR) control (Benjamini and Hochberg, 1995). FDR adjustments do not solve the problem of double dipping the data, and recent work develops a field known as post-selection inference that enables valid inference when the same data are used both to choose and to evaluate models.
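To make the multiplicity correction concrete, here is a minimal sketch, in Python with NumPy (illustrative code, not taken from the article; the function name and the toy data are assumptions), of the Benjamini–Hochberg step-up procedure cited above: sort the m p-values, find the largest rank k with p_(k) <= k*alpha/m, and reject the k hypotheses with the smallest p-values.

import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    # Benjamini-Hochberg step-up procedure: returns a boolean mask of rejected
    # null hypotheses, controlling the false discovery rate at level alpha.
    pvals = np.asarray(pvals)
    m = pvals.size
    order = np.argsort(pvals)                      # sort p-values in ascending order
    ranked = pvals[order]
    thresholds = alpha * np.arange(1, m + 1) / m   # k * alpha / m for ranks k = 1..m
    passing = np.nonzero(ranked <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size:
        kmax = passing.max()                       # 0-based index of the largest passing rank
        reject[order[: kmax + 1]] = True           # reject all hypotheses up to that rank
    return reject

# Toy example: 10,000 null tests mixed with 50 strong signals.
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=10_000), rng.uniform(0, 1e-4, size=50)])
print(benjamini_hochberg(p, alpha=0.05).sum(), "discoveries at FDR 0.05")

In practice one would rely on a vetted implementation (for example, p.adjust with method "BH" in R, or statsmodels' multipletests with method "fdr_bh" in Python) rather than a hand-rolled version, but the sketch shows why the correction scales the significance threshold with the rank of each p-value.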

Even so, the complexity of the software and the flexibility in the choice of tuning parameters can bias output toward an inflation of significant results; neuroscientists recently revisited the problem and found that many fMRI studies have resulted in false positives.

Unfortunately, all formal correction methods are tailored for specific settings and do not take into account the flexibility available to today’s statisticians. A constructive way forward is to be transparent about the analyses performed, separate the exploratory and confirmatory phases of the analyses, and provide open access code; this will result in both enhanced reproducibility and replicability.

References
  • Robert J. Adler and Jonathan E. Taylor, Topological complexity of smooth random functions, Lecture Notes in Mathematics, vol. 2019, Springer, Heidelberg, 2011. Lectures from the 39th Probability Summer School held in Saint-Flour, 2009; École d’Été de Probabilités de Saint-Flour. [Saint-Flour Probability Summer School]. MR 2768175, DOI 10.1007/978-3-642-19580-8
  • Ery Arias-Castro, Emmanuel J. Candès, and Yaniv Plan, Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism, Ann. Statist. 39 (2011), no. 5, 2533–2556. MR 2906877, DOI 10.1214/11-AOS910
  • Ery Arias-Castro and Shiyun Chen, Distribution-free multiple testing, Electron. J. Stat. 11 (2017), no. 1, 1983–2001. MR 3651021, DOI 10.1214/17-EJS1277
  • Ery Arias-Castro and Nicolas Verzelen, Community detection in dense random networks, Ann. Statist. 42 (2014), no. 3, 940–969. MR 3210992, DOI 10.1214/14-AOS1208
  • Keith A. Baggerly and Kevin R. Coombes, Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology, Ann. Appl. Stat. 3 (2009), no. 4, 1309–1334. MR 2752136, DOI 10.1214/09-AOAS291
  • Rina Foygel Barber and Emmanuel J. Candès, Controlling the false discovery rate via knockoffs, Ann. Statist. 43 (2015), no. 5, 2055–2085. MR 3375876, DOI 10.1214/15-AOS1337
  • Rina Foygel Barber and Emmanuel J Candès, A knockoff filter for high-dimensional selective inference, arXiv:1602.03574, 10 February 2016.
  • Yoav Benjamini and Yosef Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. Roy. Statist. Soc. Ser. B 57 (1995), no. 1, 289–300. MR 1325392
  • Yoav Benjamini and Yosef Hochberg, On the adaptive control of the false discovery rate in multiple testing with independent statistics, J. Educ. Behav. Stat. 25 (2000), no. 1, 60–83.
  • Yoav Benjamini and Daniel Yekutieli, The control of the false discovery rate in multiple testing under dependency, Ann. Statist. 29 (2001), no. 4, 1165–1188. MR 1869245, DOI 10.1214/aos/1013699998
  • Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao, Valid post-selection inference, Ann. Statist. 41 (2013), no. 2, 802–837. MR 3099122, DOI 10.1214/12-AOS1077
  • Richard Berk, Lawrence Brown, and Linda Zhao, Statistical inference after model selection, J. Quant. Criminol. 26 (2009), no. 2, 217–236.
  • Małgorzata Bogdan, Ewout van den Berg, Chiara Sabatti, Weijie Su, and Emmanuel J. Candès, SLOPE—adaptive variable selection via convex optimization, Ann. Appl. Stat. 9 (2015), no. 3, 1103–1140. MR 3418717, DOI 10.1214/15-AOAS842
  • Frank Bretz, Torsten Hothorn, and Peter Westfall, Multiple comparisons using R, CRC Press, 2016.
  • Keith M Briggs, Linlin Song, and Thomas Prellberg, A note on the distribution of the maximum of a set of Poisson random variables, arXiv:0903.4373, 2009.
  • Peter Bühlmann, Statistical significance in high-dimensional linear models, Bernoulli 19 (2013), no. 4, 1212–1242. MR 3102549, DOI 10.3150/12-BEJSP11
  • A. Buja and L. Brown, Discussion: “A significance test for the lasso” [MR3210970], Ann. Statist. 42 (2014), no. 2, 509–517. MR 3210976, DOI 10.1214/14-AOS1175F
  • Katherine S. Button, John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafò, Power failure: why small sample size undermines the reliability of neuroscience, Nature Reviews Neuroscience 14 (2013), no. 5, 365–376.
  • Benjamin Callahan, Diana Proctor, David Relman, Julia Fukuyama, and Susan Holmes, Reproducible research workflow in R for the analysis of personalized human microbiome data, Pacific Symposium on Biocomputing, vol. 21, NIH Public Access, 2016, p. 183.
  • David R. Cox, Discussion: Comment on a paper by Jager and Leek, Biostatistics 15 (2014), no. 1, 16–18; Discussion 39–45.
  • Anthony D’Aristotile, Persi Diaconis, and David Freedman, On merging of probabilities, Sankhyā Ser. A 50 (1988), no. 3, 363–380. MR 1065549
  • Persi Diaconis, Magical thinking in the analysis of scientific data, Annals of the New York Academy of Sciences 364 (1981), no. 1, 236–244.
  • Daniel B. DiGiulio, Benjamin J. Callahan, Paul J. McMurdie, Elizabeth K. Costello, Deirdre J. Lyell, Anna Robaczewska, Christine L. Sun, Daniela S. A. Goltsman, Ronald J. Wong, Gary Shaw, David K. Stevenson, Susan P. Holmes, and David A. Relman, Temporal and spatial variation of the human microbiota during pregnancy, Proc. Natl. Acad. Sci. U. S. A. 112 (2015), no. 35, 11060–11065.
  • David Donoho and Jiashun Jin, Higher criticism for detecting sparse heterogeneous mixtures, Ann. Statist. 32 (2004), no. 3, 962–994. MR 2065195, DOI 10.1214/009053604000000265
  • David Donoho and Jiashun Jin, Higher criticism for large-scale inference, especially for rare and weak effects, Statist. Sci. 30 (2015), no. 1, 1–25. MR 3317751, DOI 10.1214/14-STS506
  • Bradley Efron, Large-scale inference, Institute of Mathematical Statistics (IMS) Monographs, vol. 1, Cambridge University Press, Cambridge, 2010. Empirical Bayes methods for estimation, testing, and prediction. MR 2724758, DOI 10.1017/CBO9780511761362
  • Bradley Efron, Estimation and accuracy after model selection, J. Amer. Statist. Assoc. 109 (2014), no. 507, 991–1007. MR 3265671, DOI 10.1080/01621459.2013.823775
  • Bradley Efron and Robert Tibshirani, Empirical Bayes methods and false discovery rates for microarrays, Genetic Epidemiology 23 (2002), no. 1, 70–86.
  • Bradley Efron, Robert Tibshirani, John D. Storey, and Virginia Tusher, Empirical Bayes analysis of a microarray experiment, J. Amer. Statist. Assoc. 96 (2001), no. 456, 1151–1160. MR 1946571, DOI 10.1198/016214501753382129
  • Anders Eklund, Thomas E. Nichols, and Hans Knutsson, Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates, Proc. Natl. Acad. Sci. U. S. A. 113 (2016), no. 28, 7900–7905.
  • Ronald A. Fisher, The arrangement of field experiments, Breakthroughs in Statistics, Springer, 1992, Originally published 1926, pp. 82–91.
  • William Fithian, Dennis Sun, and Jonathan Taylor, Optimal inference after model selection, arXiv preprint, 9 October 2014.
  • Andrew Gelman and Keith O’Rourke, Discussion: Difficulties in making inferences about scientific truth from distributions of published p-values, Biostatistics 15 (2014), no. 1, 18–23; Discussion 39–45.
  • Jeremy Goecks, Anton Nekrutenko, and James Taylor, Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, Genome Biology 11 (2010), no. 8, 1.
  • Peter Hall and Jiashun Jin, Properties of higher criticism under strong dependence, Ann. Statist. 36 (2008), no. 1, 381–402. MR 2387976, DOI 10.1214/009053607000000767
  • Peter Hall and Jiashun Jin, Innovated higher criticism for detecting sparse signals in correlated noise, Ann. Statist. 38 (2010), no. 3, 1686–1732. MR 2662357, DOI 10.1214/09-AOS764
  • Xiaoying Tian Harris, Snigdha Panigrahi, Jelena Markovic, Nan Bi, and Jonathan Taylor, Selective sampling after solving a convex problem, arXiv:1609.05609, 19 September 2016.
  • Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The elements of statistical learning, 2nd ed., Springer Series in Statistics, Springer, New York, 2009. Data mining, inference, and prediction. MR 2722294, DOI 10.1007/978-0-387-84858-7
  • David C. Hoaglin, Frederick Mosteller, and John W. Tukey, Exploring data tables, trends, and shapes, John Wiley & Sons, 2011.
  • Susan Holmes and Wolfgang Huber, Modern statistics for modern biology, Cambridge University Press, 2017, to appear.
  • Wolfgang Huber, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, Raphael Gottardo, Florian Hahne, Kasper D. Hansen, Rafael A. Irizarry, Michael Lawrence, Michael I. Love, James MacDonald, Valerie Obenchain, Andrzej K. Oleś, Hervé Pagès, Alejandro Reyes, Paul Shannon, Gordon K. Smyth, Dan Tenenbaum, Levi Waldron, and Martin Morgan, Orchestrating high-throughput genomic analysis with Bioconductor, Nat. Methods 12 (2015), no. 2, 115–121.
  • Nikolaos Ignatiadis, Bernd Klaus, Judith B. Zaugg, and Wolfgang Huber, Data-driven hypothesis weighting increases detection power in genome-scale multiple testing, Nat. Methods 13 (2016), no. 7, 577–580.
  • Yu. I. Ingster, Some problems of hypothesis testing leading to infinitely divisible distributions, Math. Methods Statist. 6 (1997), no. 1, 47–69. MR 1456646
  • John P. A. Ioannidis, Why most published research findings are false, Chance 18 (2005), no. 4, 40–47. MR 2216666, DOI 10.1080/09332480.2005.10722754
  • John P. A. Ioannidis, David B. Allison, Catherine A. Ball, Issa Coulibaly, Xiangqin Cui, Aedín C. Culhane, Mario Falchi, Cesare Furlanello, Laurence Game, Giuseppe Jurman, J. Mangion, T. Mehta, M. Nitzberg, G. P. Page, E. Petretto, and V. van Noort, Repeatability of published microarray gene expression analyses, Nature Genetics 41 (2009), no. 2, 149–155.
  • Leah R. Jager and Jeffrey T. Leek, An estimate of the science-wise false discovery rate and application to the top medical literature, Biostatistics 15 (2014), no. 1, 1–12.
  • Jana Janková and Sara van de Geer, Honest confidence regions and optimality in high-dimensional precision matrix estimation, arXiv:1507.02061, 8 July 2015.
  • Lucas Janson, Rina Foygel Barber, and Emmanuel Candès, EigenPrism: Inference for high-dimensional signal-to-noise ratios, arXiv:1505.02097, 8 May 2015.
  • Joshua R. Klein and Aaron Roodman, Blind analysis in nuclear and particle physics, Annu. Rev. Nucl. Part. Sci. 55 (2005), 141–163.
  • Nikolaus Kriegeskorte, W. Kyle Simmons, Patrick S. F. Bellgowan, and Chris I. Baker, Circular analysis in systems neuroscience: the dangers of double dipping, Nat. Neurosci. 12 (2009), no. 5, 535–540.
  • Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor, Exact post-selection inference, with application to the lasso, Ann. Statist. 44 (2016), no. 3, 907–927. MR 3485948, DOI 10.1214/15-AOS1371
  • Hannes Leeb, Benedikt M. Pötscher, and Karl Ewald, On various confidence intervals post-model-selection, Statist. Sci. 30 (2015), no. 2, 216–227. MR 3353104, DOI 10.1214/14-STS507
  • Jeffrey T. Leek and Roger D. Peng, What is the question?, Science 347 (2015), no. 6228, 1314–1315.
  • Jeffrey T. Leek and Roger D. Peng, Opinion: Reproducible research can still be wrong: Adopting a prevention approach, Proceedings of the National Academy of Sciences 112 (2015), no. 6, 1645–1646.
  • E. L. Lehmann and Joseph P. Romano, Testing statistical hypotheses, 3rd ed., Springer Texts in Statistics, Springer, New York, 2005. MR 2135927
  • Jonah Lehrer, The truth wears off, The New Yorker 13 (2010), no. 52, 229.
  • Ang Li and Rina Foygel Barber, Accumulation tests for FDR control in ordered hypothesis testing, J. Amer. Statist. Assoc. 112 (2017), no. 518, 837–849. MR 3671774, DOI 10.1080/01621459.2016.1180989
  • Richard Lockhart, Jonathan Taylor, Ryan J. Tibshirani, and Robert Tibshirani, A significance test for the lasso, Ann. Statist. 42 (2014), no. 2, 413–468. MR 3210970, DOI 10.1214/13-AOS1175
  • Colin L. Mallows and John W. Tukey, An overview of techniques of data analysis, emphasizing its exploratory aspects, Some recent advances in statistics, Academic Press, London, 1982, pp. 111–172. MR 773678
  • Brendan McKay, Dror Bar-Natan, Maya Bar-Hillel, and Gil Kalai, Solving the Bible code puzzle, Statistical Science (1999), 150–173.
  • K. Jarrod Millman and Fernando Pérez, Developing open-source scientific practice, Implementing Reproducible Research, CRC Press, Boca Raton, FL, 2014, pp. 149–183.
  • David Mumford, Intelligent design found in the sky with $p<0.001$, Newsletter of the Swedish Mathematical Society (2009), http://www.dam.brown.edu/people/mumford/beyond/papers/2009a–Orion-SMS.pdf.
  • Michael A. Newton, Christina M. Kendziorski, Craig S. Richmond, Frederick R. Blattner, and Kam-Wah Tsui, On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data, Journal of Computational Biology 8 (2001), no. 1, 37–52.
  • Snigdha Panigrahi, Jonathan Taylor, and Asaf Weinstein, Bayesian post-selection inference in the linear model, arXiv:1605.08824, 28 May 2016.
  • Prasad Patil, Roger D. Peng, and Jeffrey Leek, A statistical definition for reproducibility and replicability, bioRxiv:066803, 2016.
  • Roger Peng, The reproducibility crisis in science: A statistical counterattack, Significance 12 (2015), no. 3, 30–32.
  • Russell A. Poldrack and Krzysztof J. Gorgolewski, Making big data open: data sharing in neuroimaging, Nat. Neurosci. 17 (2014), no. 11, 1510–1517.
  • Florian Prinz, Thomas Schlange, and Khusru Asadullah, Believe it or not: how much can we rely on published data on potential drug targets?, Nature Reviews Drug Discovery 10 (2011), no. 9, 712–712.
  • Soo-Yon Rhee, Matthew J. Gonzales, Rami Kantor, Bradley J. Betts, Jaideep Ravela, and Robert W. Shafer, Human immunodeficiency virus reverse transcriptase and protease sequence database, Nucleic Acids Research 31 (2003), no. 1, 298–303.
  • Robert Rosenthal, The file drawer problem and tolerance for null results., Psychological Bulletin 86 (1979), no. 3, 638.
  • Henry Scheffé, A method for judging all contrasts in the analysis of variance, Biometrika 40 (1953), 87–104. MR 57504, DOI 10.2307/2333100
  • Juliet Popper Shaffer, Multiple hypothesis testing, Annu. Rev. Psychol. 46 (1995), 561.
  • R. J. Simes, An improved Bonferroni procedure for multiple tests of significance, Biometrika 73 (1986), no. 3, 751–754. MR 897872, DOI 10.1093/biomet/73.3.751
  • Uri Simonsohn, Leif D. Nelson, and Joseph P. Simmons, P-curve: a key to the file-drawer, Journal of Experimental Psychology: General 143 (2014), no. 2, 534.
  • Matthew Stephens, False discovery rates: a new deal, Biostatistics, 17 October 2016.
  • John D. Storey, A direct approach to false discovery rates, J. R. Stat. Soc. Ser. B Stat. Methodol. 64 (2002), no. 3, 479–498. MR 1924302, DOI 10.1111/1467-9868.00346
  • John D. Storey, The positive false discovery rate: a Bayesian interpretation and the $q$-value, Ann. Statist. 31 (2003), no. 6, 2013–2035. MR 2036398, DOI 10.1214/aos/1074290335
  • John D. Storey and Robert Tibshirani, Statistical significance for genomewide studies, Proc. Natl. Acad. Sci. USA 100 (2003), no. 16, 9440–9445. MR 1994856, DOI 10.1073/pnas.1530509100
  • Nassim Nicholas Taleb, The black swan: The impact of the highly improbable, Random House, 2007.
  • Jonathan Taylor and Robert Tibshirani, Post-selection inference for L1-penalized likelihood models, arXiv:1602.07358, 24 February 2016.
  • Jonathan Taylor and Robert J. Tibshirani, Statistical learning and selective inference, Proc. Natl. Acad. Sci. USA 112 (2015), no. 25, 7629–7634. MR 3371123, DOI 10.1073/pnas.1507583112
  • Jonathan E. Taylor, Joshua R. Loftus, and Ryan J. Tibshirani, Inference in adaptive regression via the Kac-Rice formula, Ann. Statist. 44 (2016), no. 2, 743–770. MR 3476616, DOI 10.1214/15-AOS1386
  • Xiaoying Tian, Nan Bi, and Jonathan Taylor, MAGIC: a general, powerful and tractable method for selective inference, arXiv:1607.02630, 9 July 2016.
  • Xiaoying Tian and Jonathan E. Taylor, Selective inference with a randomized response, arXiv:1507.06739, 24 July 2015.
  • Ryan J. Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani, Exact post-selection inference for sequential regression procedures, J. Amer. Statist. Assoc. 111 (2016), no. 514, 600–620. MR 3538689, DOI 10.1080/01621459.2015.1108848
  • John W. Tukey, The problem of multiple comparisons: Introduction and Parts A, B, and C, Princeton University, 1953.
  • John W. Tukey, The philosophy of multiple comparisons, Statistical Science (1991), 100–116.
  • John W. Tukey, The collected works of John W. Tukey. Vol. VIII, Chapman & Hall, New York, 1994. Multiple comparisons: 1948–1983; With a preface by William S. Cleveland; With a biography by Frederick Mosteller; Edited and with an introduction and comments by Henry I. Braun. MR 1263027
  • Sara van de Geer, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure, On asymptotically optimal confidence regions and tests for high-dimensional models, Ann. Statist. 42 (2014), no. 3, 1166–1202. MR 3224285, DOI 10.1214/14-AOS1221
  • Stefan Wager, Wenfei Du, Jonathan Taylor, and Robert J. Tibshirani, High-dimensional regression adjustments in randomized experiments, Proc. Natl. Acad. Sci. USA 113 (2016), no. 45, 12673–12678. MR 3576188, DOI 10.1073/pnas.1614732113
  • Peter H. Westfall, A. Krishen, and Stanley S. Young, Using prior information to allocate significance levels for multiple endpoints, Stat. Med. 17 (1998), no. 18, 2107–2119.
  • Doron Witztum, Eliyahu Rips, and Yoav Rosenberg, Equidistant letter sequences in the Book of Genesis, Statistical Science 9 (1994), no. 3, 429–438.
  • Daniel Yekutieli, Adjusted Bayesian inference for selected parameters, J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 (2012), no. 3, 515–541. MR 2925372, DOI 10.1111/j.1467-9868.2011.01016.x
  • Hong Zhang and Zheyang Wu, SetTest: Group testing procedures for signal detection and goodness-of-fit, 2017, R package version 0.1.0.
Additional Information
  • Susan Holmes
  • Affiliation: Statistics Department, Sequoia Hall, Stanford University, Stanford, California 94305
  • MR Author ID: 605043
  • Email: susan@stat.stanford.edu
  • Received by editor(s): June 15, 2017
  • Published electronically: October 4, 2017
  • Additional Notes: This work was supported by a Stanford Gabilan fellowship.
  • © Copyright 2017 by the author under Creative Commons Attribution 4.0 License (CC BY 4.0)
  • Journal: Bull. Amer. Math. Soc. 55 (2018), 31-55
  • MSC (2010): Primary 62F03, 62F15
  • DOI: https://doi.org/10.1090/bull/1597
  • MathSciNet review: 3737209