Data analysis, computation and mathematics

Author:
John W. Tukey

Journal:
Quart. Appl. Math. **30** (1972), 51-65

DOI:
https://doi.org/10.1090/qam/99740

MathSciNet review:
QAM99740

Abstract: "Data analysis" instead of "statistics" is a name that allows us to use probability where it is needed and avoid it when we should. Data analysis has to analyze real data. Most real data calls for data investigation, while almost all statistical theory is concerned with data processing. This can be borne, in part because large segments of data investigation are, by themselves, data processing. Summarizing a batch of 20 numbers is a convenient paradigm for more complex aims in data analysis. A particular summary, highly competitive among those known and known about in August 1971, is a hybrid between two moderately complex summaries. Data investigation comes in three stages: exploratory data analysis (no probability), rough confirmatory data analysis (sign test procedures and the like), mustering and borrowing strength (the best of modern robust techniques, and an art of knowing when to stop). Exploratory data analysis can be improved by being made more resistant, either with medians or with fancier summaries. Rough confirmatory data analysis can be improved by facing up to the issues surrounding the choice of what is to be confirmed or disaffirmed. Borrowing strength is imbedded in our classical procedures, though we often forget this. Mustering strength calls for the best in robust summaries we can supply. The sampling behavior of such a summary as the hybrid mentioned above is not going to be learned through the mathematics of certainty, at least as we know it today, especially if we are realistic about the diversity of non-Gaussian situations that are studied. The mathematics of simulation, inevitably involving the mathematically sound "swindles" of Monte Carlo, will be our trust and reliance. I illustrate results for a few summaries, including the hybrid mentioned above. Bayesian techniques are still a problem to the author, mainly because there seems to be no agreement on what their essence is. From my own point of view, some statements of their essence are wholly acceptable and others are equally unacceptable. The use of exogenous information in analyzing a given body of data is a very different thing (a) depending on sample size and (b) depending on just how the exogenous information is used. It would be a very fine thing if the questions that practical data analysis has to have answered could be answered by the mathematics of certainty. For my own part, I see no escape, for the next decade or so at least, from a dependence on the mathematics of simulation, in which we should heed von Neumann's aphorism as much as we can.
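The abstract's point that the sampling behavior of a summary is learned by simulation can be illustrated with a small sketch. It is not the paper's method: the hybrid summary is not specified in the abstract, so the mean, the median, and a 10% trimmed mean stand in for it, and the contaminated distribution (20% of draws at ten times the scale), the replication count, and all function names are assumptions chosen for illustration. No Monte Carlo "swindle" is implemented; the only device used is reusing the same seed so that every summary sees the same batches.

```python
import random
import statistics


def trimmed_mean(batch, trim=0.1):
    """Mean after dropping a fraction `trim` of the values from each end."""
    ordered = sorted(batch)
    k = int(len(ordered) * trim)
    return statistics.fmean(ordered[k:len(ordered) - k])


def gaussian(rng):
    """One draw from a standard Gaussian."""
    return rng.gauss(0.0, 1.0)


def contaminated(rng, p_wild=0.2, wild_scale=10.0):
    """Gaussian draw whose scale is `wild_scale` with probability `p_wild`.

    The values of p_wild and wild_scale are illustrative assumptions.
    """
    scale = wild_scale if rng.random() < p_wild else 1.0
    return rng.gauss(0.0, scale)


def sampling_variance(summary, draw, n=20, reps=2000, seed=1971):
    """Estimate the variance of `summary` over `reps` simulated batches of size n.

    The fixed seed means every summary is evaluated on the same batches,
    which makes comparisons between summaries less noisy.
    """
    rng = random.Random(seed)
    values = [summary([draw(rng) for _ in range(n)]) for _ in range(reps)]
    return statistics.pvariance(values)


SUMMARIES = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "trimmed mean": trimmed_mean,
}
SITUATIONS = {"gaussian": gaussian, "contaminated": contaminated}


def simulate():
    """Return {situation: {summary: estimated sampling variance}}."""
    return {
        situation: {
            name: sampling_variance(summary, draw)
            for name, summary in SUMMARIES.items()
        }
        for situation, draw in SITUATIONS.items()
    }


if __name__ == "__main__":
    for situation, row in simulate().items():
        for name, variance in row.items():
            print(f"{situation:>12}  {name:>12}  {variance:.4f}")
```

Under these assumptions the mean should show the smallest variance in the Gaussian situation and by far the largest in the contaminated one, which is the kind of contrast that motivates resistant summaries.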

Additional Information

Article copyright:
© Copyright 1972
American Mathematical Society