Statistical inference for high-dimension, low-sample-size data
- by Makoto Aoshima and Kazuyoshi Yata
Translated by: Makoto Aoshima and Kazuyoshi Yata - Sugaku Expositions 30 (2017), 137-158
- DOI: https://doi.org/10.1090/suga/421
- Published electronically: September 15, 2017
Abstract:
In this paper, we consider statistical inference for high-dimension, low-sample-size (HDLSS) data. We first show that HDLSS data have distinct geometric representations depending on whether or not the data meet a certain boundary condition. We clarify the limits of conventional principal component analysis (PCA) for HDLSS data. In order to overcome the curse of dimensionality, we introduce two effective PCAs called the noise-reduction methodology and the cross-data-matrix (CDM) methodology. We further introduce the extended CDM methodology, which offers an unbiased estimator having small asymptotic variance and low computational cost, for feature parameters appearing in high-dimensional data analysis. We give correlation tests and several inferences on multiclass mean vectors for HDLSS data, and discuss sample size determination to ensure prespecified high accuracy for inference. Finally, we introduce two effective discriminant procedures, the geometric classifier and the distance-based classifier, that can keep misclassification rates below a prespecified threshold.

References
- Jeongyoun Ahn, J. S. Marron, Keith M. Muller, and Yueh-Yun Chi, The high-dimension, low-sample-size geometric representation holds under mild conditions, Biometrika 94 (2007), no. 3, 760–766. MR 2410023, DOI 10.1093/biomet/asm050
- Makoto Aoshima and Kazuyoshi Yata, Two-stage procedures for high-dimensional data, Sequential Anal. 30 (2011), no. 4, 356–399. MR 2855952, DOI 10.1080/07474946.2011.619088
- Makoto Aoshima and Kazuyoshi Yata, Authors’ response [MR2855953; MR2855954; MR2855955; MR2855956; MR2855957; MR2855958; MR2855959; MR2855952], Sequential Anal. 30 (2011), no. 4, 432–440. MR 2855960, DOI 10.1080/07474946.2011.619102
- Makoto Aoshima and Kazuyoshi Yata, A distance-based, misclassification rate adjusted classifier for multiclass, high-dimensional data, Ann. Inst. Statist. Math. 66 (2014), no. 5, 983–1010. MR 3250825, DOI 10.1007/s10463-013-0435-8
- Makoto Aoshima and Kazuyoshi Yata, Asymptotic normality for inference on multisample, high-dimensional mean vectors under mild conditions, Methodol. Comput. Appl. Probab. 17 (2015), no. 2, 419–439. MR 3343414, DOI 10.1007/s11009-013-9370-7
- Makoto Aoshima and Kazuyoshi Yata, Effective methodologies for higher-dimensional data, J. Jpn. Stat. Soc. Jpn. Issue 43 (2013), no. 1, 123–150 (Japanese, with English and Japanese summaries). MR 3156012
- Makoto Aoshima and Kazuyoshi Yata, Geometric classifier for multiclass, high-dimensional data, Sequential Anal. 34 (2015), no. 3, 279–294. MR 3384057, DOI 10.1080/07474946.2015.1063256
- M. Aoshima and K. Yata, High-dimensional quadratic classifiers in non-sparse settings, submitted. arXiv:1503.04549
- Zhidong Bai and Hewa Saranadasa, Effect of high dimension: by an example of a two sample problem, Statist. Sinica 6 (1996), no. 2, 311–329. MR 1399305
- Jinho Baik, Gérard Ben Arous, and Sandrine Péché, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, Ann. Probab. 33 (2005), no. 5, 1643–1697. MR 2165575, DOI 10.1214/009117905000000233
- Jinho Baik and Jack W. Silverstein, Eigenvalues of large sample covariance matrices of spiked population models, J. Multivariate Anal. 97 (2006), no. 6, 1382–1408. MR 2279680, DOI 10.1016/j.jmva.2005.08.003
- Peter J. Bickel and Elizaveta Levina, Some theory of Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations, Bernoulli 10 (2004), no. 6, 989–1010. MR 2108040, DOI 10.3150/bj/1106314847
- Peter J. Bickel and Elizaveta Levina, Covariance regularization by thresholding, Ann. Statist. 36 (2008), no. 6, 2577–2604. MR 2485008, DOI 10.1214/08-AOS600
- Peter J. Bickel and Elizaveta Levina, Regularized estimation of large covariance matrices, Ann. Statist. 36 (2008), no. 1, 199–227. MR 2387969, DOI 10.1214/009053607000000758
- Christopher M. Bishop, Pattern recognition and machine learning, Information Science and Statistics, Springer, New York, 2006. MR 2247587, DOI 10.1007/978-0-387-45528-0
- Richard C. Bradley, Basic properties of strong mixing conditions. A survey and some open questions, Probab. Surv. 2 (2005), 107–144. Update of, and a supplement to, the 1986 original. MR 2178042, DOI 10.1214/154957805100000104
- Yao-Ban Chan and Peter Hall, Scale adjustments for classifiers in high-dimensional, low sample size settings, Biometrika 96 (2009), no. 2, 469–478. MR 2507156, DOI 10.1093/biomet/asp007
- Song Xi Chen and Ying-Li Qin, A two-sample test for high-dimensional data with applications to gene-set testing, Ann. Statist. 38 (2010), no. 2, 808–835. MR 2604697, DOI 10.1214/09-AOS716
- Song Xi Chen, Li-Xin Zhang, and Ping-Shou Zhong, Tests for high-dimensional covariance matrices, J. Amer. Statist. Assoc. 105 (2010), no. 490, 810–819. MR 2724863, DOI 10.1198/jasa.2010.tm09560
- A. P. Dempster, A high dimensional two sample significance test, Ann. Math. Statist. 29 (1958), 995–1010. MR 112207, DOI 10.1214/aoms/1177706437
- A. P. Dempster, A significance test for the separation of two highly multivariate small samples, Biometrics 16 (1960), 41–50. MR 112208, DOI 10.2307/2527954
- Mathias Drton and Michael D. Perlman, Multiple testing and error control in Gaussian graphical model selection, Statist. Sci. 22 (2007), no. 3, 430–449. MR 2416818, DOI 10.1214/088342307000000113
- Sandrine Dudoit, Jane Fridlyand, and Terence P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, J. Amer. Statist. Assoc. 97 (2002), no. 457, 77–87. MR 1963389, DOI 10.1198/016214502753479248
- Jianqing Fan and Yingying Fan, High-dimensional classification using features annealed independence rules, Ann. Statist. 36 (2008), no. 6, 2605–2637. MR 2485009, DOI 10.1214/07-AOS504
- Yasunori Fujikoshi, Vladimir V. Ulyanov, and Ryoichi Shimizu, Multivariate statistics, Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., Hoboken, NJ, 2010. High-dimensional and large-sample approximations. MR 2640807, DOI 10.1002/9780470539873
- Peter Hall, J. S. Marron, and Amnon Neeman, Geometric representation of high dimension, low sample size data, J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005), no. 3, 427–444. MR 2155347, DOI 10.1111/j.1467-9868.2005.00510.x
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The elements of statistical learning, 2nd ed., Springer Series in Statistics, Springer, New York, 2009. Data mining, inference, and prediction. MR 2722294, DOI 10.1007/978-0-387-84858-7
- Alfred Hero and Bala Rajaratnam, Large-scale correlation screening, J. Amer. Statist. Assoc. 106 (2011), no. 496, 1540–1552. MR 2896855, DOI 10.1198/jasa.2011.tm11015
- Iain M. Johnstone, On the distribution of the largest eigenvalue in principal components analysis, Ann. Statist. 29 (2001), no. 2, 295–327. MR 1863961, DOI 10.1214/aos/1009210544
- Sungkyu Jung and J. S. Marron, PCA consistency in high dimension, low sample size context, Ann. Statist. 37 (2009), no. 6B, 4104–4130. MR 2572454, DOI 10.1214/09-AOS709
- Seunggeun Lee, Fei Zou, and Fred A. Wright, Convergence and prediction of principal component scores in high-dimensional settings, Ann. Statist. 38 (2010), no. 6, 3605–3629. MR 2766862, DOI 10.1214/10-AOS821
- J. S. Marron, Michael J. Todd, and Jeongyoun Ahn, Distance-weighted discrimination, J. Amer. Statist. Assoc. 102 (2007), no. 480, 1267–1271. MR 2412548, DOI 10.1198/016214507000001120
- Debashis Paul, Asymptotics of sample eigenstructure for a large dimensional spiked covariance model, Statist. Sinica 17 (2007), no. 4, 1617–1642. MR 2399865
- Hewa Saranadasa, Asymptotic expansion of the misclassification probabilities of $D$- and $A$-criteria for discrimination from two high-dimensional populations using the theory of large-dimensional random matrices, J. Multivariate Anal. 46 (1993), no. 1, 154–174. MR 1231251, DOI 10.1006/jmva.1993.1054
- M. S. Srivastava, Multivariate theory for analyzing high dimensional data, J. Japan Statist. Soc. 37 (2007), no. 1, 53–86. MR 2392485, DOI 10.14490/jjss.37.53
- Vladimir N. Vapnik, The nature of statistical learning theory, Springer-Verlag, New York, 1995. MR 1367965, DOI 10.1007/978-1-4757-2440-0
- A. Wille, P. Zimmermann, E. Vranová, A. Fürholz, O. Laule, S. Bleuler, L. Hennig, A. Prelic, P. von Rohr, L. Thiele, E. Zitzler, W. Gruissem, and P. Bühlmann, Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana, Genome Biol. 5 (2004), R92.
- Kazuyoshi Yata and Makoto Aoshima, PCA consistency for non-Gaussian data in high dimension, low sample size context, Comm. Statist. Theory Methods 38 (2009), no. 16-17, 2634–2652. MR 2568176, DOI 10.1080/03610910902936083
- Kazuyoshi Yata and Makoto Aoshima, Effective PCA for high-dimension, low-sample-size data with singular value decomposition of cross data matrix, J. Multivariate Anal. 101 (2010), no. 9, 2060–2077. MR 2671201, DOI 10.1016/j.jmva.2010.04.006
- Kazuyoshi Yata and Makoto Aoshima, Intrinsic dimensionality estimation of high-dimension, low sample size data with $D$-asymptotics, Comm. Statist. Theory Methods 39 (2010), no. 8-9, 1511–1521. MR 2753523, DOI 10.1080/03610920903121999
- Kazuyoshi Yata and Makoto Aoshima, Effective PCA for high-dimension, low-sample-size data with noise reduction via geometric representations, J. Multivariate Anal. 105 (2012), 193–215. MR 2877512, DOI 10.1016/j.jmva.2011.09.002
- Kazuyoshi Yata and Makoto Aoshima, Inference on high-dimensional mean vectors with fewer observations than the dimension, Methodol. Comput. Appl. Probab. 14 (2012), no. 3, 459–476. MR 2966305, DOI 10.1007/s11009-011-9233-z
- Kazuyoshi Yata and Makoto Aoshima, Correlation tests for high-dimensional data using extended cross-data-matrix methodology, J. Multivariate Anal. 117 (2013), 313–331. MR 3053550, DOI 10.1016/j.jmva.2013.03.007
- Kazuyoshi Yata and Makoto Aoshima, PCA consistency for the power spiked model in high-dimensional settings, J. Multivariate Anal. 122 (2013), 334–354. MR 3189327, DOI 10.1016/j.jmva.2013.08.003
- Ping-Shou Zhong and Song Xi Chen, Tests for high-dimensional regression coefficients with factorial designs, J. Amer. Statist. Assoc. 106 (2011), no. 493, 260–274. MR 2816719, DOI 10.1198/jasa.2011.tm10284
Bibliographic Information
- Makoto Aoshima
- Affiliation: Institute of Mathematics, University of Tsukuba, Ibaraki 305-8571, Japan
- Email: aoshima@math.tsukuba.ac.jp
- Kazuyoshi Yata
- Affiliation: Institute of Mathematics, University of Tsukuba, Ibaraki 305-8571, Japan
- Email: yata@math.tsukuba.ac.jp
- Additional Notes: The research of the first author was partially supported by Grants-in-Aid for Scientific Research (B) and Challenging Exploratory Research, Japan Society for the Promotion of Science (JSPS), under Contract Numbers 22300094 and 23650142. The research of the second author was partially supported by Grant-in-Aid for Young Scientists (B), JSPS, under Contract Number 23740066.
- © Copyright 2017 American Mathematical Society
- MSC (2010): Primary 62H25, 62H12; Secondary 62H15, 62H30
- MathSciNet review: 3711762