ISSN 1088-6834(online) ISSN 0894-0347(print)

Testing the manifold hypothesis

Authors: Charles Fefferman, Sanjoy Mitter and Hariharan Narayanan
Journal: J. Amer. Math. Soc. 29 (2016), 983-1049
MSC (2010): Primary 62G08, 62H15; Secondary 55R10, 57R40
DOI: https://doi.org/10.1090/jams/852
Published electronically: February 9, 2016
MathSciNet review: 3522608
Full-text PDF Free Access

The hypothesis that high dimensional data tend to lie in the vicinity of a low dimensional manifold is the basis of manifold learning. The goal of this paper is to develop an algorithm (with accompanying complexity guarantees) for testing the existence of a manifold that fits a probability distribution supported in a separable Hilbert space, only using i.i.d. samples from that distribution. More precisely, our setting is the following. Suppose that data are drawn independently at random from a probability distribution $\mathcal {P}$ supported on the unit ball of a separable Hilbert space $\mathcal {H}$. Let $\mathcal {G}(d, V, \tau )$ be the set of submanifolds of the unit ball of $\mathcal {H}$ whose volume is at most $V$ and reach (which is the supremum of all $r$ such that any point at a distance less than $r$ has a unique nearest point on the manifold) is at least $\tau$. Let $\mathcal {L}(\mathcal {M}, \mathcal {P})$ denote the mean-squared distance of a random point from the probability distribution $\mathcal {P}$ to $\mathcal {M}$. We obtain an algorithm that tests the manifold hypothesis in the following sense.

The algorithm takes i.i.d. random samples from $\mathcal {P}$ as input and determines which of the following two is true (at least one must be):

1. There exists $\mathcal {M} \in \mathcal {G}(d, CV, \frac {\tau }{C})$ such that $\mathcal {L}(\mathcal {M}, \mathcal {P}) \leq C {\epsilon }.$

2. There exists no $\mathcal {M} \in \mathcal {G}(d, V/C, C\tau )$ such that $\mathcal {L}(\mathcal {M}, \mathcal {P}) \leq \frac {\epsilon }{C}.$

The answer is correct with probability at least $1-\delta$.

References [Enhancements On Off] (What's this?)

References

Retrieve articles in Journal of the American Mathematical Society with MSC (2010): 62G08, 62H15, 55R10, 57R40

Retrieve articles in all journals with MSC (2010): 62G08, 62H15, 55R10, 57R40

Charles Fefferman
Affiliation: Department of Mathematics, Princeton University, Princeton, New Jersey 08544
MR Author ID: 65640
Email: cf@math.princeton.edu

Sanjoy Mitter
Affiliation: Laboratory for Information and Decision Systems, MIT, Cambridge, Massachusetts 02139
Email: mitter@mit.edu

Hariharan Narayanan
Affiliation: Department of Statistics and Department of Mathematics, University of Washington, Seattle, Washington 98195
Email: harin@uw.edu

Received by editor(s): March 9, 2014
Received by editor(s) in revised form: February 3, 2015, and August 9, 2015
Published electronically: February 9, 2016
Additional Notes: The first author was supported by NSF grant DMS 1265524, AFOSR grant FA9550-12-1-0425 and U.S.-Israel Binational Science Foundation grant 2014055.
The second author was supported by NSF grant EECS-1135843.