Notices of the American Mathematical Society


The Many Faces of Information Geometry

Frank Nielsen

Communicated by Notices Associate Editor Richard Levine


Information geometry Ama16, AJLS17, Ama21 aims at unravelling the geometric structures of families of probability distributions and at studying their uses in the information sciences. Information sciences is an umbrella term regrouping statistics, information theory, signal processing, machine learning and AI, etc. Information geometry was born independently in the works of econometrician H. Hotelling (1930) and statistician C. R. Rao (1945), out of the mathematical curiosity of considering a parametric family of probability distributions, called the statistical model, as a Riemannian manifold equipped with the Fisher metric tensor Nie20. Information geometry tackles problems by using the concepts of differential geometry (like curvature) with tensor calculus. In his pioneering work, Rao considered the Riemannian geodesic distance and geodesic balls on the manifold to study classification and hypothesis testing problems in statistics.

Let $(\mathcal{X},\mathcal{F},\mu)$ denote a probability space Kee10 (with sample space $\mathcal{X}$, $\sigma$-algebra $\mathcal{F}$, and finite positive measure $\mu$, usually chosen as the Lebesgue measure $\mu_L$ or the counting measure $\mu_c$), and consider a parametric family $\{P_\theta\}_{\theta\in\Theta}$ of probability distributions, all dominated by $\mu$. Let $p_\theta=\frac{\mathrm{d}P_\theta}{\mathrm{d}\mu}$ denote the Radon-Nikodym derivative, the probability density function of the random variable $X_\theta\sim P_\theta$. By definition, the Fisher Riemannian metric expressed in the $\theta$-coordinate system is the Fisher information matrix (FIM) $I(\theta)=[I_{ij}(\theta)]$ of the random variable $X_\theta$: $g_{ij}(\theta)=I_{ij}(\theta)$ with
$$
I_{ij}(\theta)=E_{p_\theta}\!\left[\partial_i\log p_\theta(x)\,\partial_j\log p_\theta(x)\right],\qquad \partial_i:=\frac{\partial}{\partial\theta_i},
$$
where $s_\theta(x)=\nabla_\theta\log p_\theta(x)$ is called the score function Kee10. The Fisher metric is also referred to as the Shahshahani metric in mathematical biology. Because the FIM is the covariance matrix of the score (since $E_{p_\theta}[s_\theta(x)]=0$), it is necessarily positive semidefinite, and positive-definite for regular statistical models Ama16. The FIM is covariant under reparameterization: for any smooth invertible mapping $\eta=\eta(\theta)$ with invertible Jacobian matrix $J=\left[\frac{\partial\theta_i}{\partial\eta_j}\right]$, we have
$$
I_\eta(\eta)=J^\top\, I_\theta(\theta(\eta))\, J.
$$
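As a quick numerical illustration (not part of the original article), the following Python sketch estimates the FIM of the univariate normal model $N(\mu,\sigma^2)$ by Monte Carlo averaging of the outer products of the score, and checks the covariance rule under the illustrative reparameterization $(\mu,\sigma)\mapsto(\mu,\sigma^2)$; the model, sample size, and variable names are assumptions made for this demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
n = 2_000_000
x = rng.normal(mu, sigma, size=n)

# Score of N(mu, sigma) w.r.t. theta = (mu, sigma):
#   d/dmu    log p = (x - mu) / sigma^2
#   d/dsigma log p = ((x - mu)^2 - sigma^2) / sigma^3
score = np.stack([(x - mu) / sigma**2,
                  ((x - mu)**2 - sigma**2) / sigma**3])

# FIM as the covariance of the score (the score has zero mean).
I_theta = score @ score.T / n
print(I_theta)                      # approx diag(1/sigma^2, 2/sigma^2)

# Reparameterization eta = (mu, v) with v = sigma^2:
# Jacobian J = d theta / d eta = [[1, 0], [0, 1/(2 sigma)]].
J = np.array([[1.0, 0.0], [0.0, 1.0 / (2.0 * sigma)]])
I_eta = J.T @ I_theta @ J
print(I_eta)                        # approx diag(1/sigma^2, 1/(2 sigma^4))
```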

The Riemannian Fisher length element induced by the Fisher metric,
$$
\mathrm{d}s^2=\mathrm{d}\theta^\top I(\theta)\,\mathrm{d}\theta=\sum_{i,j} I_{ij}(\theta)\,\mathrm{d}\theta_i\,\mathrm{d}\theta_j,
$$
is invariant under any smooth invertible reparameterization: $\mathrm{d}s^2=\mathrm{d}\theta^\top I_\theta(\theta)\,\mathrm{d}\theta=\mathrm{d}\eta^\top I_\eta(\eta)\,\mathrm{d}\eta$ with $\eta=\eta(\theta)$. Nowadays, the Riemannian manifold $(\Theta,I(\theta))$ is commonly called the Fisher-Rao manifold Nie20, and its induced Riemannian geodesic length distance is called the Fisher-Rao distance with
$$
d_{\mathrm{FR}}(\theta_1,\theta_2)=\int_0^1\sqrt{\dot\gamma(t)^\top I(\gamma(t))\,\dot\gamma(t)}\;\mathrm{d}t,
$$
where $\gamma(t)$ denotes the Riemannian geodesic GN14 with the boundary conditions $\gamma(0)=\theta_1$ and $\gamma(1)=\theta_2$. Thus the Fisher-Rao distance used to evaluate the dissimilarity between probability distributions is invariant under reparameterization. For example, the Fisher-Rao distance remains the same whether the family of normal distributions is parameterized by $(\mu,\sigma)$, $(\mu,\sigma^2)$, or any other smooth invertible transform of these parameters: the distance between two given normal distributions does not depend on the chosen coordinate system.

This parameterization invariance property of statistical distances highlights the power of modeling statistical models geometrically. The Fisher-Rao manifolds may have different constant sectional curvatures $\kappa$: for example, the Fisher-Rao manifold of univariate normal distributions has constant negative curvature $\kappa=-\frac{1}{2}$, while the Fisher-Rao manifold of categorical distributions has constant positive curvature (a spherical geometry); zero-centered multivariate normal distributions provide another classical example. Information geometry elucidates the role played by curvature in statistics. Since any Riemannian manifold of dimension $D$ can be isometrically embedded in a Euclidean space of sufficiently large dimension by Nash's embedding theorem, we may visualize the Fisher-Rao manifold as a $D$-dimensional surface sitting in a Euclidean space. This extrinsic view of geometry is helpful to intuitively grasp the notion of tangent planes and tangent vectors at any point and allows one to visualize geodesics on surfaces. However, let us point out that differential geometry defines these notions intrinsically GN14.
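As an aside (not from the article), the Fisher-Rao distance between univariate normal distributions admits a closed form coming from hyperbolic geometry; the Python sketch below is a minimal illustration assuming that classical formula, and it checks the distance against the local length element for an infinitesimal displacement.

```python
import numpy as np

def fisher_rao_normal(mu1, sig1, mu2, sig2):
    """Fisher-Rao distance between N(mu1, sig1^2) and N(mu2, sig2^2),
    using the classical hyperbolic (Poincare upper half-plane) closed form."""
    num = (mu1 - mu2) ** 2 / 2.0 + (sig1 - sig2) ** 2
    return np.sqrt(2.0) * np.arccosh(1.0 + num / (2.0 * sig1 * sig2))

# Sanity check: for an infinitesimal displacement d_theta, the distance
# should match the length element sqrt(d_theta^T I(theta) d_theta) with
# I(mu, sigma) = diag(1/sigma^2, 2/sigma^2).
mu, sig = 0.3, 1.5
dmu, dsig = 1e-4, -2e-4
exact = fisher_rao_normal(mu, sig, mu + dmu, sig + dsig)
local = np.sqrt(dmu**2 / sig**2 + 2.0 * dsig**2 / sig**2)
print(exact, local)   # the two values agree to several digits
```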

Using the Fisher metric can be justified from several theoretical viewpoints Ama16:

First, the FIM occurs when locally approximating the Kullback-Leibler (KL) divergence Kee10. In statistics, estimating densities using the Maximum Likelihood Estimator (MLE) or the maximum entropy principle under moment constraints (MaxEnt) can be interpreted as KL divergence minimization problems Kee10 (to be detailed below). The KL divergence between densities $p$ and $q$ is defined by:
$$
D_{\mathrm{KL}}(p:q)=\int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}\mu(x).
$$
The delimiter ":" indicates that the divergence is oriented: $D_{\mathrm{KL}}(p:q)\neq D_{\mathrm{KL}}(q:p)$ in general. The KL divergence can be expressed as
$$
D_{\mathrm{KL}}(p:q)=H^\times(p:q)-H(p),
$$
where $H^\times(p:q)=-\int p(x)\log q(x)\,\mathrm{d}\mu(x)$ denotes the cross-entropy and $H(p)=-\int p(x)\log p(x)\,\mathrm{d}\mu(x)$ is Shannon entropy. Hence, the KL divergence is also called relative entropy in information theory. The second-order Taylor approximation of the KL divergence yields
$$
D_{\mathrm{KL}}(p_\theta:p_{\theta+\mathrm{d}\theta})\approx\tfrac{1}{2}\,\mathrm{d}\theta^\top I(\theta)\,\mathrm{d}\theta.
$$
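To make this approximation concrete, here is a small Python check (an illustrative addition, with the Bernoulli model and numbers chosen arbitrarily) comparing the exact KL divergence between two nearby Bernoulli distributions to the quadratic form built from the Fisher information $I(p)=\frac{1}{p(1-p)}$.

```python
import numpy as np

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence KL(Ber(p) : Ber(q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, dp = 0.3, 1e-3
fisher = 1.0 / (p * (1 - p))        # Fisher information of the Bernoulli model
print(kl_bernoulli(p, p + dp))      # ~ 2.38e-06
print(0.5 * fisher * dp**2)         # ~ 2.38e-06 (quadratic approximation)
```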

More generally, the FIM is used in the local approximations of $f$-divergences Ama16:
$$
I_f(p_\theta:p_{\theta+\mathrm{d}\theta})\approx\tfrac{f''(1)}{2}\,\mathrm{d}\theta^\top I(\theta)\,\mathrm{d}\theta,
$$
where
$$
I_f(p:q)=\int p(x)\,f\!\left(\frac{q(x)}{p(x)}\right)\mathrm{d}\mu(x)
$$
for a convex function $f$ satisfying $f(1)=0$, and strictly convex at $1$. The KL divergence is an $f$-divergence obtained for $f(u)=-\log u$ with $f''(1)=1$. The $f$-divergences are said to be separable because they can be written as integrals of scalar divergences:
$$
I_f(p:q)=\int i_f\big(p(x):q(x)\big)\,\mathrm{d}\mu(x),
$$
with $i_f(a:b)=a\,f\!\left(\frac{b}{a}\right)$. The $f$-divergences enjoy the following monotonicity property:
$$
I_f(p_{\mathcal{Y}}:q_{\mathcal{Y}})\leq I_f(p_{\mathcal{X}}:q_{\mathcal{X}}),
$$
where $p_{\mathcal{Y}}$ and $q_{\mathcal{Y}}$ are the densities induced from $p_{\mathcal{X}}$ and $q_{\mathcal{X}}$ by a Markov kernel from measurable space $\mathcal{X}$ to measurable space $\mathcal{Y}$ Ama21. To give a concrete example, consider the $f$-divergences between two (normalized) histograms $p=(p_1,\ldots,p_d)$ and $q=(q_1,\ldots,q_d)$ with $d$ bins representing multinomial probability laws. Then $I_f(p':q')\leq I_f(p:q)$, where $p'$ and $q'$ are reduced histograms obtained by merging consecutive bins (a very special deterministic Markov kernel from the measurable space of $d$ bins to the measurable space of merged bins). The only separable statistical divergences satisfying this monotonicity property are the $f$-divergences when the number of bins exceeds two Ama16.
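The following Python sketch (an added illustration with arbitrary random histograms) demonstrates this information monotonicity numerically for the KL divergence viewed as an $f$-divergence with $f(u)=-\log u$, using bin merging as the deterministic Markov kernel.

```python
import numpy as np

def f_divergence(p, q, f):
    """Separable f-divergence I_f(p:q) = sum_i p_i f(q_i / p_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * f(q / p)))

kl_generator = lambda u: -np.log(u)          # generator of the KL divergence (f(1) = 0)

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(8))
q = rng.dirichlet(np.ones(8))

# Deterministic Markov kernel: merge consecutive pairs of bins (8 -> 4 bins).
coarsen = lambda h: h.reshape(4, 2).sum(axis=1)

print(f_divergence(p, q, kl_generator))                      # finer histograms
print(f_divergence(coarsen(p), coarsen(q), kl_generator))    # coarser histograms: never larger
```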

A sufficient statistic Kee10 $T(X)$ for the parameter $\theta$ of a random variable $X\sim p_\theta$ is such that the conditional probability of $X$ given $T(X)$ does not depend on $\theta$. That is, all statistical information concerning parameter $\theta$ is contained in $T(X)$. For example, $T(X_1,\ldots,X_n)=\left(\sum_{i=1}^n X_i,\sum_{i=1}^n X_i^2\right)$ is a sufficient statistic for the parameter $(\mu,\sigma^2)$ of a random vector of independent and identically distributed (i.i.d.) random variables $X_1$, …, $X_n$ following a normal distribution $N(\mu,\sigma^2)$. In general, we have $I_{T(X)}(\theta)\preceq I_X(\theta)$ with equality if and only if (iff) $T$ is a sufficient statistic Kee10. Let us observe that for i.i.d. random variables $X_1$, …, $X_n$, we have $I_{X_1,\ldots,X_n}(\theta)=n\,I_{X_1}(\theta)$. Sufficiency also characterizes the equality in the monotonicity inequality of $f$-divergences: $I_f(p_{T(X)}:q_{T(X)})\leq I_f(p_X:q_X)$ with equality iff $T$ is a sufficient statistic.

Let $\hat\theta_n(X_1,\ldots,X_n)$ be an unbiased estimator of $\theta$ built from an i.i.d. random vector $(X_1,\ldots,X_n)$ with $X_i\sim p_\theta$. Then the following Cramér-Rao lower bound (CRLB) on the variance of $\hat\theta_n$ holds:
$$
\mathrm{Var}[\hat\theta_n]\succeq\frac{1}{n}\,I^{-1}(\theta),
$$
where $A\succeq B$ iff matrix $A-B$ is positive semidefinite. The notation $\succeq$ indicates the comparison with respect to (w.r.t.) the Loewner partial ordering of positive semidefinite matrices. Thus the inverse of the FIM provides a lower bound on the accuracy of any unbiased estimator. An estimator is said to be Fisher efficient when its variance asymptotically matches the CRLB when $n\to\infty$.
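As a numerical aside (not from the article), the Python sketch below estimates the variance of the maximum likelihood estimator of a Bernoulli parameter over many simulated trials and compares it to the Cramér-Rao lower bound $\frac{1}{nI(p)}=\frac{p(1-p)}{n}$; the parameter value, sample size, and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, n, trials = 0.3, 200, 20_000

# MLE of the Bernoulli parameter is the sample mean (an unbiased estimator).
samples = rng.binomial(1, p_true, size=(trials, n))
p_hat = samples.mean(axis=1)

fisher = 1.0 / (p_true * (1 - p_true))    # per-observation Fisher information
print(p_hat.var())                        # empirical variance of the estimator
print(1.0 / (n * fisher))                 # Cramer-Rao lower bound: p(1-p)/n
```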

The Fisher metric is the only invariant metric under Markovian morphisms of statistical models Ama16. However, let us point out that there are infinitely many counterparts of the FIM in quantum information geometry, and that alternative Riemannian information metrics can be explored (e.g., the Wasserstein information metric Li21).

In statistics, two special types of statistical models, called the exponential families and the mixture families, are often handled:

An exponential family Kee10 is a set of parametric densities such that
$$
p_\theta(x)=\exp\!\left(\sum_{i=1}^D\theta_i\,t_i(x)-F(\theta)+k(x)\right),
$$
where the functions $t_1(x),\ldots,t_D(x)$ form the minimal sufficient statistic. Function $F(\theta)$ is used to normalize the densities:
$$
F(\theta)=\log\int\exp\!\left(\sum_{i=1}^D\theta_i\,t_i(x)+k(x)\right)\mathrm{d}\mu(x),
$$
and is called the cumulant function (or log-partition function in statistical physics). $F$ is strictly convex for (full regular) exponential families Kee10. For example, the family of $d$-variate normal distributions is an exponential family of order $D=\frac{d(d+3)}{2}$ w.r.t. the Lebesgue measure $\mu_L$, and the family of Poisson distributions is a discrete exponential family of order $1$ w.r.t. the counting measure $\mu_c$. Exponential families admit finite-dimensional minimal sufficient statistics.

A mixture family Nie20 is a set of parametric densities such that $m_\theta(x)=\sum_{i=1}^D\theta_i\,p_i(x)+\left(1-\sum_{i=1}^D\theta_i\right)p_0(x)$, where the functions $p_0(x),p_1(x),\ldots,p_D(x)$ are linearly independent densities. Statistical mixtures such as Gaussian mixture models with prescribed component densities are examples of mixture families. Mixture families are closed under convex combinations. It can be shown that the negentropy of mixture densities is a strictly convex function Nie20: $F(\theta)=-H(m_\theta)$.

For these two types of statistical models, the Fisher metric is a Hessian metric Shi07 since the FIM is the Hessian of some strictly convex potential function $F(\theta)$: $I(\theta)=\nabla^2F(\theta)$. This is easily checked for exponential families as the FIM can be written under mild regularity conditions Ama16 as $I(\theta)=E_{p_\theta}\!\left[-\nabla^2_\theta\log p_\theta(x)\right]=\nabla^2F(\theta)$.
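A hands-on check of this Hessian structure (an added illustration; the Poisson model and sample size are arbitrary choices): for the Poisson family in its natural parameter $\theta=\log\lambda$, the cumulant function is $F(\theta)=e^\theta$, and a Monte Carlo estimate of the Fisher information should match $\nabla^2F(\theta)=e^\theta$.

```python
import numpy as np

# Poisson family in natural coordinates: p_theta(x) = exp(theta*x - F(theta) - log x!)
# with cumulant function F(theta) = exp(theta), i.e., rate lam = exp(theta).
theta = 0.7
lam = np.exp(theta)

rng = np.random.default_rng(3)
x = rng.poisson(lam, size=2_000_000)

# Score in the natural parameter: d/dtheta log p_theta(x) = x - F'(theta) = x - lam.
score = x - lam
fim_monte_carlo = np.mean(score**2)   # E[score^2]
fim_hessian = np.exp(theta)           # second derivative of F at theta
print(fim_monte_carlo, fim_hessian)   # both ~ lam = e^0.7 ~ 2.014
```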

In general, calculating the Fisher-Rao distances in closed form may be difficult since it requires solving the Riemannian geodesic equation with boundary conditions, and integrating the length elements along geodesics. For example, although the Fisher-Rao distance between univariate normal distributions is available in closed form, we do not have a closed-form formula for the Fisher-Rao distance between multivariate normals Nie20. Thus in practice, the Fisher-Rao distance between multivariate normals is numerically approximated. We shall now explain that the Fisher-Rao manifolds with Hessian metrics carry another beautiful dual geometric structure which is well suited for computation in applications: namely, these exponential/mixture families can be modeled as Hessian manifolds Shi07, and are commonly called dually flat spaces in information geometry Ama21.

In the second half of the 20th century, information geometry gained momentum with the pathbreaking work of N. Chentsov. Chentsov shared statistician A. Wald's viewpoint that all problems in statistics can be viewed as decision problems, and therefore investigated a theoretical framework for characterizing optimal decision rules in statistics using category theory and Markovian morphisms Čen82. A family of probability distributions should be invariant both under smooth one-to-one transformations of its parameter and under transformations of the corresponding random variables by sufficient statistics. This precisely defines the statistical invariance. Chentsov's breakthrough consisted in separating the metric tensor (used to measure angles between vectors and lengths of vectors in a tangent plane) from its induced Levi-Civita connection used by default on a Riemannian manifold for obtaining (locally) length-minimizing geodesics. This novel insight allowed Chentsov to model statistical models as differentiable manifolds equipped with affine connections more general than the Levi-Civita connection of Fisher-Rao manifolds. More precisely, Chentsov discovered the existence of a unique totally symmetric third-order tensor fulfilling the statistical invariance, nowadays called the Amari-Chentsov tensor or skewness tensor, and used that tensor to build invariant affine connections. A. Kolmogorov called Chentsov's field of research "geometrostatistics" in Russian (translated as "geometrical statistics" in the English monograph Čen82).

In short, an affine connection GN14 defines the following:

A way to differentiate a vector field $X$ on a manifold $M$ (or more generally a tensor field $T$) by another vector field $Y$: namely, the covariant derivative denoted by $\nabla_Y X$ (or $\nabla_Y T$).

The parallel transport $\Pi_\gamma$ of a tangent vector along any smooth curve $\gamma$. An affine connection allows one to parallel transport vectors of different tangent planes onto a common tangent plane in order to measure their subtended angles using the metric tensor in that common tangent plane.

$\nabla$-geodesics defined as autoparallel curves: $\nabla_{\dot\gamma}\dot\gamma=0$. Geodesics are calculated by solving the second-order non-linear ordinary differential equation (ODE):
$$
\ddot\gamma^k(t)+\Gamma^k_{ij}\big(\gamma(t)\big)\,\dot\gamma^i(t)\,\dot\gamma^j(t)=0,
$$
where the $\Gamma^k_{ij}$ are the Christoffel symbols GN14 (smooth functions) defining the affine connection. In physics, geodesics represent free particle trajectories.

The curvature tensor and the torsion tensor of a manifold are induced by the chosen connection GN14. The fundamental theorem of Riemannian geometry GN14 states that there exists a unique torsion-free affine connection which preserves the metric, meaning that for any two vectors $u$ and $v$ of the tangent plane at $\gamma(0)$ and any smooth curve $\gamma(t)$, the parallel-transported vectors satisfy for any $t$: $\langle\Pi^t_\gamma u,\Pi^t_\gamma v\rangle_{\gamma(t)}=\langle u,v\rangle_{\gamma(0)}$. This unique torsion-free affine connection is called the Levi-Civita metric connection. Historically, affine connections were studied by É. Cartan in the 1920s, and used in the Einstein-Cartan theory of gravity. Chentsov considered regular exponential families in his monograph, and by considering their invariance, discovered the so-called exponential connection $\nabla^e$.

Figure 1.

Fisher-Rao geometry vs. dual $\alpha$-geometry.

The field of information geometry was shaped by S.-i. Amari who dreamt of a mathematical theory of neuroscience. Amari pioneered the dualistic statistical structures of information geometry: that is, Amari showed that given any torsion-free affine connection $\nabla$, there exists a dual torsion-free affine connection $\nabla^*$ such that the mid-connection $\frac{\nabla+\nabla^*}{2}$ corresponds to the Levi-Civita connection. This duality ensures that the primal and dual parallel transports are metric compatible: $\langle\Pi_\gamma u,\Pi^*_\gamma v\rangle=\langle u,v\rangle$. Notice that the lengths of vectors dually parallel-transported by $\Pi_\gamma$ and $\Pi^*_\gamma$ may vary along $\gamma$ but their inner product is kept constant. For parametric statistical models $\{p_\theta\}$, Amari reported the $\alpha$-geometry (for $\alpha\in\mathbb{R}$) which is a manifold equipped with the Fisher metric and a pair of dual connections $(\nabla^{-\alpha},\nabla^{\alpha})$ coupled to the Fisher metric: $\frac{\nabla^{-\alpha}+\nabla^{\alpha}}{2}=\nabla^{\mathrm{LC}}$. Amari's $\alpha$-connections are defined by their Christoffel symbols $\Gamma^{(\alpha)}_{ij,k}$:
$$
\Gamma^{(\alpha)}_{ij,k}(\theta)=E_{p_\theta}\!\left[\left(\partial_i\partial_j\ell_\theta(x)+\frac{1-\alpha}{2}\,\partial_i\ell_\theta(x)\,\partial_j\ell_\theta(x)\right)\partial_k\ell_\theta(x)\right],
$$
where $\ell_\theta(x)=\log p_\theta(x)$ is the log-likelihood function and $\partial_i=\frac{\partial}{\partial\theta_i}$. Chentsov's exponential connection corresponds to Amari's $\alpha=1$ connection $\nabla^{e}=\nabla^{1}$. The exponential connection was also studied by Efron who defined geometrically the notion of statistical curvature to study the higher-order asymptotic theory of statistical estimators in a landmark paper Efr75 which has been recognized as one of the first successful applications of differential geometry to statistical inference. The dual connection of $\nabla^{e}$ is $\nabla^{m}=\nabla^{-1}$, called the mixture connection. This connection was proposed by P. Dawid, a discussant of Efron's paper Efr75. Thus the Fisher-Rao geometry can be interpreted as the self-dual $0$-geometry enhanced with the Fisher-Rao geodesic distance (Figure 1). The Riemannian geodesics are $\nabla^{0}$-autoparallel and have the property to locally minimize the geodesic lengths GN14. In general, the $\alpha$-geometry is not associated with a canonical statistical divergence. But the $\alpha$-geometry may be recovered from the divergence geometry of invariant $f$-divergences on the probability simplex Egu83 (to be detailed below).

A connection $\nabla$ is said to be flat GN14, Shi07 when it has zero torsion and when there exists a local coordinate system $\theta$ such that the Christoffel symbols defining $\nabla$ expressed in that coordinate system vanish: $\Gamma^k_{ij}(\theta)=0$. The coordinate system $\theta$ is called a $\nabla$-affine coordinate system. In general the parallel transport from a point $P$ to a point $Q$ is curve dependent: $\Pi_{\gamma_1}\neq\Pi_{\gamma_2}$ for smooth curves $\gamma_1$ and $\gamma_2$ with endpoints $P$ and $Q$. One can visualize locally the presence or absence of curvature of a connection at a point $P$ by considering the parallel transport of a vector $u$ along a closed infinitesimal loop $\gamma$ encircling $P$ (with $\gamma(0)=\gamma(1)=P$): if there is an angle deficiency between $u$ and $\Pi_\gamma u$, then the manifold has non-zero curvature at $P$ Nie20. However, the parallel transport is independent of the curves linking the point $P$ to the point $Q$ for flat connections. It is a fundamental result of information geometry that if $\nabla$ is flat, then so is its dual connection $\nabla^*$ (with a corresponding $\nabla^*$-affine coordinate system $\eta$). We get the so-called dually flat spaces of information geometry Ama16 which are special Hessian manifolds Shi07 admitting a single-chart atlas. Notice that in a dually flat space, the Levi-Civita connection is usually not flat.

In information theory, a statistical divergence like the KL divergence is, loosely speaking, a potentially asymmetric dissimilarity measure between probability distributions which may fail the triangle inequality of metric distances. In information geometry, a divergence (historically called a contrast function Egu83) is a smooth dissimilarity measure $D(\theta_1:\theta_2)$ between parameters $\theta_1$ and $\theta_2$ that satisfies the following conditions:

(1)

$D(\theta_1:\theta_2)\geq 0$ for all $\theta_1,\theta_2$ with equality iff $\theta_1=\theta_2$.

(2)

$\partial_i D(\theta_1:\theta_2)\big|_{\theta_2=\theta_1}=\partial'_j D(\theta_1:\theta_2)\big|_{\theta_2=\theta_1}=0$ for all $i,j$, where $\partial_i=\frac{\partial}{\partial\theta_{1,i}}$ and $\partial'_j=\frac{\partial}{\partial\theta_{2,j}}$.

(3)

$-\partial_i\partial'_j D(\theta_1:\theta_2)\big|_{\theta_2=\theta_1}$ is a positive-definite matrix.

The (parameter) divergence $D$ can also be interpreted as a function on the manifold defined by the single chart equipped with the $\theta$-coordinate system. Eguchi Egu83 reported a method to build a dualistic structure from any divergence $D$ as follows:
$$
g^D_{ij}(\theta)=-\partial_i\partial'_j D(\theta_1:\theta_2)\big|_{\theta_2=\theta_1=\theta},
$$
$$
\Gamma^D_{ij,k}(\theta)=-\partial_i\partial_j\partial'_k D(\theta_1:\theta_2)\big|_{\theta_2=\theta_1=\theta},\qquad
\Gamma^{D^*}_{ij,k}(\theta)=-\partial'_i\partial'_j\partial_k D(\theta_1:\theta_2)\big|_{\theta_2=\theta_1=\theta}.
$$
It can be shown that the connections $\nabla^D$ and $\nabla^{D^*}$ induced respectively by $\Gamma^D$ and $\Gamma^{D^*}$ are torsion-free and dual. This geometry is called the divergence geometry of $D$ Ama16. Let $D^*(\theta_1:\theta_2)=D(\theta_2:\theta_1)$ denote the dual or reverse divergence, and consider the information-geometric space induced by $D^*$. Then we have $g^{D^*}=g^{D}$ and $\nabla^{D^*}=(\nabla^{D})^*$. Thus symmetric divergences yield self-dual connections coinciding with the Levi-Civita connection. Many different divergences may yield the same divergence geometry. The divergence geometry of invariant $f$-divergences on the probability simplex corresponds to Amari's $\alpha$-geometry (with $\alpha$ determined by the generator $f$), and the divergence geometry of the KL divergence yields the $\pm1$-geometry.

Figure 2.

Dually flat space with a $\nabla$-affine coordinate system $\theta$ and a dual $\nabla^*$-affine coordinate system $\eta$. Primal geodesics and dual geodesics are linear when plotted in the $\theta$-coordinate system and $\eta$-coordinate system, respectively.

In a dually flat space induced by a strictly convex and smooth potential function $F(\theta)$, we can build a canonical Fenchel-Young (non-metric) divergence:
$$
A_{F,F^*}(\theta_1:\eta_2)=F(\theta_1)+F^*(\eta_2)-\theta_1^\top\eta_2,
$$
where $F^*$ denotes the convex conjugate obtained by the Legendre-Fenchel transform:
$$
F^*(\eta)=\sup_{\theta}\{\theta^\top\eta-F(\theta)\}.
$$
The Legendre-Fenchel transform yields a dual coordinate system $\eta=\nabla F(\theta)$, and the Fenchel-Young inequality ensures that $A_{F,F^*}(\theta_1:\eta_2)\geq0$ with equality iff $\eta_2=\nabla F(\theta_1)$. This divergence is shown to be equivalent to a Bregman divergence Ama16: $A_{F,F^*}(\theta_1:\eta_2)=B_F(\theta_1:\theta_2)$ with $\eta_2=\nabla F(\theta_2)$, where
$$
B_F(\theta_1:\theta_2)=F(\theta_1)-F(\theta_2)-(\theta_1-\theta_2)^\top\nabla F(\theta_2)
$$
is the Bregman divergence between parameters $\theta_1$ and $\theta_2$ induced by a smooth strictly convex function $F$, and $\nabla F$ denotes the gradient of $F$. Bregman divergences are widely used in machine learning and originated from mathematical programming. Reciprocally, given a Bregman divergence, we can build a dually flat space (the Bregman divergence geometry) with Hessian metric $g(\theta)=\nabla^2F(\theta)$ Shi07 (positive-definite because $F$ is strictly convex) from its divergence geometry. The dual affine connections are $\nabla$ and $\nabla^*$ with Christoffel symbols $\Gamma^k_{ij}(\theta)=0$ and $\Gamma^{*k}_{ij}(\eta)=0$, respectively. It follows that in the $\theta$-coordinate system, the primal geodesics are linear since the $\nabla$-geodesic ODE simplifies to $\ddot\gamma(t)=0$, and in the dual $\eta$-coordinate system, the dual geodesics are linear since the $\nabla^*$-geodesic ODE simplifies to $\ddot\gamma(t)=0$ (Figure 2). The $\theta$-coordinate system is said to be $\nabla$-affine. We have $\eta=\nabla F(\theta)$ and the dual Riemannian metric $g^*(\eta)=\nabla^2F^*(\eta)$. It follows that $g(\theta)\,g^*(\eta(\theta))=\mathrm{Id}$, the identity matrix. The Bregman divergence construction from a dually flat space is defined up to affine transformations of the dual coordinate systems.
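For concreteness (an added sketch, not from the article), the following Python snippet uses the toy one-dimensional generator $F(\theta)=\log(1+e^\theta)$, whose conjugate is $F^*(\eta)=\eta\log\eta+(1-\eta)\log(1-\eta)$, to verify numerically that the Fenchel-Young divergence coincides with the Bregman divergence.

```python
import numpy as np

# Bregman divergence B_F(theta1 : theta2) for a smooth strictly convex F,
# illustrated with F(theta) = log(1 + exp(theta)) (a 1D toy generator).
F     = lambda t: np.log1p(np.exp(t))
gradF = lambda t: 1.0 / (1.0 + np.exp(-t))             # eta = F'(theta)
# Convex conjugate F*(eta) = eta*log(eta) + (1-eta)*log(1-eta) for this F.
Fstar = lambda e: e * np.log(e) + (1 - e) * np.log(1 - e)

def bregman(t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

t1, t2 = 0.4, -1.1
e2 = gradF(t2)
# Fenchel-Young divergence A_{F,F*}(theta1 : eta2) equals B_F(theta1 : theta2).
print(bregman(t1, t2))
print(F(t1) + Fstar(e2) - t1 * e2)
```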

Amari’s -geometry for the exponential and mixture families yields dually flat spaces. Their corresponding Bregman divergences yield the following statistical divergences:

For an exponential family Kee10 with density $p_\theta(x)=\exp\!\left(\sum_i\theta_i t_i(x)-F(\theta)+k(x)\right)$, the Legendre-Fenchel conjugate $F^*$ of the cumulant function $F$ corresponds to Shannon negentropy, $F^*(\eta)=-H(p_{\theta(\eta)})$, and the Bregman divergence yields the reverse Kullback-Leibler divergence (or reverse relative entropy), as illustrated numerically after this list:
$$
B_F(\theta_1:\theta_2)=D_{\mathrm{KL}}(p_{\theta_2}:p_{\theta_1}).
$$

For a mixture family Ama16, Nie20 with density $m_\theta(x)=\sum_{i=1}^D\theta_i p_i(x)+\left(1-\sum_{i=1}^D\theta_i\right)p_0(x)$, the Bregman generator $F(\theta)=-H(m_\theta)$ is strictly convex with domain the open probability simplex of mixing weights. The canonical Bregman divergence amounts to calculating the Kullback-Leibler divergence Nie20: $B_F(\theta_1:\theta_2)=D_{\mathrm{KL}}(m_{\theta_1}:m_{\theta_2})$.
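The exponential-family case above can be checked numerically with the Bernoulli family (an added illustration with arbitrary parameter values): its cumulant function is $F(\theta)=\log(1+e^\theta)$, and the Bregman divergence of $F$ between natural parameters equals the reverse KL divergence between the corresponding Bernoulli distributions.

```python
import numpy as np

# For the Bernoulli exponential family p_theta(x) = exp(theta*x - F(theta)),
# F(theta) = log(1 + exp(theta)), the Bregman divergence of the cumulant F
# equals the *reverse* KL divergence between the corresponding distributions.
F   = lambda t: np.log1p(np.exp(t))
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
B_F = lambda t1, t2: F(t1) - F(t2) - (t1 - t2) * sig(t2)
kl  = lambda p, q: p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

t1, t2 = 0.4, -1.1
p1, p2 = sig(t1), sig(t2)
print(B_F(t1, t2))          # Bregman divergence in natural parameters
print(kl(p2, p1))           # reverse KL: KL(p_theta2 : p_theta1) -- same value
```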

In a dually flat space, a generalized Pythagorean theorem holds: Let $P$, $Q$, $R$ be three points. A primal geodesic $\gamma(P,Q)$ intersects a dual geodesic $\gamma^*(Q,R)$ orthogonally w.r.t. the metric $g$ at point $Q$, written as $\gamma(P,Q)\perp_g\gamma^*(Q,R)$, iff
$$
\big(\theta(P)-\theta(Q)\big)^\top\big(\eta(R)-\eta(Q)\big)=0.
$$
In that case, the following Pythagorean equality holds:
$$
D(P:R)=D(P:Q)+D(Q:R),
$$
where $D$ denotes the canonical divergence of the dually flat space. When $F(\theta)=\frac12\theta^\top\theta$, we recover the usual Euclidean geometry (a self-dual flat space) with $\eta=\theta$ (since $\nabla F(\theta)=\theta$ and $F^*(\eta)=\frac12\eta^\top\eta$), and we have $B_F(\theta_1:\theta_2)=\frac12\|\theta_1-\theta_2\|^2$, where $\|\cdot\|$ denotes the Euclidean norm. The canonical $\nabla$-divergence is $D(P:Q)=B_F(\theta(P):\theta(Q))$ and the dual $\nabla^*$-divergence is $D^*(P:Q)=D(Q:P)=B_{F^*}(\eta(P):\eta(Q))$.

A $\theta$-flat is a submanifold $S$ such that $\theta(S)$ is an affine subspace of the $\theta$-coordinate domain ($S$ is a $\nabla$-autoparallel submanifold). Similarly, an $\eta$-flat is a submanifold $S$ such that $\eta(S)$ is an affine subspace of the $\eta$-coordinate domain. Define the $\nabla$-projection of a point $P$ onto a submanifold $S$ as
$$
P^{\perp}_S=\operatorname*{arg\,min}_{Q\in S}\,D(P:Q).
$$
Then the $\nabla$-projection of $P$ is guaranteed to be a unique point when $S$ is an $\eta$-flat (and dually, the $\nabla^*$-projection $\operatorname{arg\,min}_{Q\in S}D(Q:P)$ is unique when $S$ is a $\theta$-flat). Moreover, when $S$ is an $\eta$-flat, we have the Pythagorean decomposition $D(P:Q)=D(P:P^{\perp}_S)+D(P^{\perp}_S:Q)$ for all $Q\in S$ in a dually flat space.

These information projections can be used in statistical inference as follows. Consider the probability space with a finite discrete sample space $\mathcal{X}=\{0,\ldots,D\}$ and the counting measure. The categorical distributions form both an exponential family and a mixture family Ama16. A categorical probability mass function can be viewed as a point lying on the $D$-dimensional open standard simplex $\Delta_D$.

The Maximum Likelihood Estimator (MLE) of $n$ i.i.d. observations $x_1$, …, $x_n$ sampled from an exponential family density $p_\theta$ is $\hat\theta=\arg\max_\theta\prod_{i=1}^n p_\theta(x_i)$. Since the MLE is equivariant Kee10, we have $\hat\eta=\frac1n\sum_{i=1}^n t(x_i)$, and it follows that $\hat\theta=\nabla F^*(\hat\eta)$ since $\eta=\nabla F(\theta)$. The MLE can be interpreted as a divergence minimization problem: $\hat\theta=\arg\min_\theta D_{\mathrm{KL}}(p_e:p_\theta)$, where $p_e=\frac1n\sum_{i=1}^n\delta_{x_i}$ is the empirical distribution with the $\delta_{x_i}$'s denoting the Dirac distributions ($\delta_{x_i}(x)=1$ iff $x=x_i$, and $0$ otherwise). The MLE can be geometrically interpreted as an $m$-projection (i.e., with respect to $\nabla^m$) of $p_e$ onto the $e$-flat exponential family: $p_{\hat\theta}=\arg\min_{p_\theta}D_{\mathrm{KL}}(p_e:p_\theta)$. Thus the MLE is unique since the exponential family is $e$-flat. This result holds more generally for estimations on curved exponential families Ama16: for example, the family of normal distributions with standard deviation equal to the mean is a 1D curved exponential family. By viewing the MLE as a KL divergence minimization problem, we may consider other divergence-based estimators. A divergence $D$ yields a $D$-estimator by asking to solve the minimization problem $\min_\theta D(p_e:p_\theta)$. The MLE is the $D_{\mathrm{KL}}$-estimator. Then we study the properties of various $D$-estimators. For example, the estimators induced by the $\alpha$-divergences (for suitable values of $\alpha$) are proven to be robust to noise contamination Ama16, unlike the MLE which is based on the KL divergence. The $\alpha$-divergences include the KL divergence as a limit case.
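As an added numerical illustration (the Poisson model, sample size, and optimizer choice are arbitrary), the sketch below checks the exponential-family MLE identity $\hat\eta=\frac1n\sum_i t(x_i)$ against a direct numerical minimization of the average negative log-likelihood, which differs from $D_{\mathrm{KL}}(p_e:p_\theta)$ only by a constant independent of $\theta$.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(4)
x = rng.poisson(3.5, size=10_000)

# For an exponential family, the MLE matches the average sufficient statistic:
# eta_hat = mean of t(x) (here t(x) = x), and theta_hat = log(eta_hat) for Poisson.
eta_hat = x.mean()
theta_hat_closed_form = np.log(eta_hat)

# Equivalent view: minimizing KL(empirical : p_theta) is the same as minimizing
# the average negative log-likelihood (the empirical entropy term is constant).
nll = lambda theta: -np.mean(poisson.logpmf(x, np.exp(theta)))
theta_hat_numeric = minimize_scalar(nll, bounds=(-5, 5), method="bounded").x

print(theta_hat_closed_form, theta_hat_numeric)   # essentially identical
```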

The Maximum Entropy (MaxEnt) principle of E. Jaynes Kee10 asks for the probability density $p$ which maximizes the Shannon entropy $H(p)$ under moment constraints $E_p[t_i(x)]=m_i$ for $i=1$, …, $D$, i.e., $\max_p H(p)$ subject to $\int p(x)\,\mathrm{d}\mu(x)=1$ and the $D$ moment constraints. It can be shown that the MaxEnt distribution is a density belonging to an exponential family with sufficient statistics $t_1(x),\ldots,t_D(x)$. Namely, we have $p(x)=\exp\!\left(\sum_{i=1}^D\theta_i t_i(x)-F(\theta)\right)$ with $\theta$ chosen so that the moment constraints are satisfied. The MaxEnt problem can be rewritten as the following minimization problem: $\min_{p\in S}D_{\mathrm{KL}}(p:u)$, where $u$ denotes the uniform distribution (the center of the probability simplex), and $S$ is an $m$-flat defined by the moment constraints. By introducing any other prior density $q$, we can thus generalize MaxEnt by the following minimization problem: $\min_p D_{\mathrm{KL}}(p:q)$ under the moment constraints $E_p[t_i(x)]=m_i$. The MaxEnt solution belongs to an exponential family with base measure $q$, and we have $p(x)\propto q(x)\exp\!\left(\sum_{i=1}^D\theta_i t_i(x)\right)$ such that the moment constraints are satisfied. We interpret the MaxEnt distribution as the unique $e$-projection point (with respect to $\nabla^e$) of $q$ onto the $m$-flat $S$: $p=\arg\min_{p\in S}D_{\mathrm{KL}}(p:q)$.
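To illustrate (an added toy example inspired by Jaynes's dice problem; the constraint value 4.5 and the solver choice are arbitrary), the sketch below computes the MaxEnt distribution over a six-sided die under a mean constraint by solving for the natural parameter of the induced exponential family.

```python
import numpy as np
from scipy.optimize import brentq

# MaxEnt over a six-sided die with a prescribed mean E[X] = 4.5 (a toy constraint).
# The solution lies in the exponential family p_theta(k) proportional to exp(theta * k).
faces = np.arange(1, 7)

def mean_of(theta):
    w = np.exp(theta * faces)
    p = w / w.sum()
    return p @ faces

theta_star = brentq(lambda t: mean_of(t) - 4.5, -5, 5)   # fit the moment constraint
w = np.exp(theta_star * faces)
p_maxent = w / w.sum()
print(p_maxent, p_maxent @ faces)   # tilted (non-uniform) law with mean 4.5
```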

Wong Won18 recently generalized the Legendre-Fenchel transformation used in dually flat spaces, and obtained a new kind of Pythagorean theorem expressed w.r.t. Rényi divergences.

Figure 3.

Genesis of the dual structure of information geometry.

Finally, let us mention that instead of using the invariant $f$-divergences of information geometry, we can use the theory of optimal transport PC19 to measure the distance between any two probability measures. Optimal transport requires defining a ground distance between elements of the sample space to model the elementary cost of mass transportation, and measures the deviation between two probability measures by pushing one measure forward to the other by a transportation plan. Although the optimal transport problems between discrete probability measures encountered in practice (i.e., finite weighted point sets) are computationally costly to solve (they amount to solving linear programs), fast entropic-regularized methods PC19 and various heuristics like the sliced Wasserstein distances have contributed to its huge success in machine learning and computer vision. Optimal transport does not require the probability measures to have coinciding supports, and can even measure the distance between a discrete measure and a continuous measure. Many fruitful interactions between information geometry and optimal transport are being investigated AKO18, and counterpart notions of the FIM and Bregman divergences have been proposed in the probability density space equipped with the $2$-Wasserstein metric Li21.
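A minimal sketch of the entropic-regularized approach (an added illustration; the histograms, ground cost, regularization strength, and iteration count are arbitrary choices): Sinkhorn iterations alternately rescale the rows and columns of a Gibbs kernel to approximate the optimal transport plan between two discrete measures.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, iters=500):
    """Entropic-regularized optimal transport (Sinkhorn iterations) between
    two discrete distributions a and b with ground-cost matrix C."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = np.diag(u) @ K @ np.diag(v)        # approximate optimal transport plan
    return np.sum(P * C)                    # approximate transport cost

# Toy example: two histograms supported on the points 0, 1, ..., 4 of the real line.
x = np.arange(5, dtype=float)
C = np.abs(x[:, None] - x[None, :])         # ground distance |x_i - x_j|
a = np.array([0.5, 0.2, 0.1, 0.1, 0.1])
b = np.array([0.1, 0.1, 0.1, 0.2, 0.5])
print(sinkhorn(a, b, C))                    # close to the exact 1-Wasserstein distance
```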

To summarize, the problem of geometrically modeling a family of probability distributions (the statistical model) is at the heart of information geometry. The Fisher-Rao geometry considers a Riemannian manifold equipped with the Fisher information metric, and uses the Riemannian geodesic length as a measure of dissimilarity between distributions: the Fisher-Rao distance. Amari's dual $\alpha$-geometry of information geometry has revealed the dualistic structure of affine connections coupled to the Fisher metric. This key dualistic structure is purely geometric and therefore can be used beyond the realm of statistics (for example, when studying optimization algorithms with convex barrier functions Ama16). Dually flat spaces are Hessian manifolds Shi07 with a single-chart atlas where the Legendre-Fenchel transformation plays an essential role to define dual coordinate systems and dual potential functions. Dually flat spaces generalize Euclidean geometry and enjoy a generalized Pythagorean theorem Ama16. Many pioneers have contributed to the now well-established classical dual structure of information geometry: Figure 3 displays a historical timeline of the main actors who contributed to the genesis of the dual structure of information geometry, together with the achieved milestones.

Recent advances in information geometry include the study of the geometry of deformed exponential families and their use in thermostatistics Nau11, the geometry of non-parametric models, quantum information geometry, Lie group thermodynamics, and the interactions of geometric mechanics with information geometry via symplectic and contact structures. Information geometry has found many applications beyond statistics. We refer to the textbook Ama16 for applications in signal processing, data science, and machine learning. To conclude with an application in machine learning, consider training a neural network parameterized by weights $\theta$. The neural network is typically trained by using the method of gradient descent to minimize a loss function $L(\theta)$ defined by a supervised training set of labeled pairs $(x_i,y_i)$, where $y_i$ denotes the label of $x_i$: initialize $\theta_0$ and iteratively update $\theta_{t+1}=\theta_t-\alpha_t\,\nabla_\theta L(\theta_t)$, where $\alpha_t$ denotes the step size. The ordinary gradient depends on the chosen parameterization. A better, parameterization-invariant update direction has been proposed in information geometry for optimization on Riemannian manifolds $(\Theta,g)$: the natural gradient Ama16 $\widetilde\nabla L(\theta)=g^{-1}(\theta)\,\nabla_\theta L(\theta)$. The natural gradient ensures that the induced parameter updates are invariant under smooth invertible reparameterizations. The natural gradient descent is used to train stochastic neural networks with parameter space modeled as a Fisher-Rao manifold, called a neuromanifold Ama16. Since 2018, an eponymous journal devoted to information geometry (INGE, https://www.springer.com/journal/41884) has been published by Springer, reporting the latest advances in the field.
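A toy sketch of natural-gradient descent (an added illustration, not the article's experiment; the Gaussian model, learning rate, and iteration count are arbitrary): fitting the mean and standard deviation of a normal model by preconditioning the gradient of the negative log-likelihood with the inverse Fisher information matrix $I(\mu,\sigma)=\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$.

```python
import numpy as np

# Natural-gradient sketch: fit the mean/std of a normal model to data by
# preconditioning the ordinary gradient of the negative log-likelihood with
# the inverse Fisher information matrix I(mu, sigma) = diag(1/sigma^2, 2/sigma^2).
rng = np.random.default_rng(5)
data = rng.normal(2.0, 0.5, size=5_000)

mu, sigma, lr = 0.0, 3.0, 0.1
for _ in range(200):
    grad_mu = -np.mean(data - mu) / sigma**2                          # d NLL / d mu
    grad_sigma = 1.0 / sigma - np.mean((data - mu) ** 2) / sigma**3   # d NLL / d sigma
    fim_inv = np.diag([sigma**2, sigma**2 / 2.0])                     # inverse FIM
    mu, sigma = np.array([mu, sigma]) - lr * fim_inv @ np.array([grad_mu, grad_sigma])
print(mu, sigma)   # approaches the sample mean and standard deviation
```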

References

[Ama16]
Shun-ichi Amari, Information geometry and its applications, Applied Mathematical Sciences, vol. 194, Springer, Tokyo, 2016, DOI 10.1007/978-4-431-55978-8. MR3495836
[Ama21]
Shun-ichi Amari, Information geometry, Jpn. J. Math. 16 (2021), no. 1, 1–48, DOI 10.1007/s11537-020-1920-5. MR4206647
[AKO18]
Shun-ichi Amari, Ryo Karakida, and Masafumi Oizumi, Information geometry connecting Wasserstein distance and Kullback-Leibler divergence via the entropy-relaxed transportation problem, Inf. Geom. 1 (2018), no. 1, 13–37, DOI 10.1007/s41884-018-0002-8. MR3974671
[AJLS17]
Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer, Information geometry, Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge. A Series of Modern Surveys in Mathematics, vol. 64, Springer, Cham, 2017, DOI 10.1007/978-3-319-56478-4. MR3701408
[Čen82]
Nikolai Nikolaevich Čencov, Statistical decision rules and optimal inference, Translations of Mathematical Monographs, vol. 53, American Mathematical Society, Providence, RI, 1982.
[Efr75]
Bradley Efron, Defining the curvature of a statistical problem (with applications to second order efficiency), Ann. Statist. 3 (1975), no. 6, 1189–1242. MR428531
[Egu83]
Shinto Eguchi, Second order efficiency of minimum contrast estimators in a curved exponential family, Ann. Statist. 11 (1983), no. 3, 793–803. MR707930
[GN14]
Leonor Godinho and José Natário, An introduction to Riemannian geometry: With applications to mechanics and relativity, Universitext, Springer, Cham, 2014, DOI 10.1007/978-3-319-08666-8. MR3289090
[Kee10]
Robert W. Keener, Theoretical statistics: Topics for a core course, Springer Texts in Statistics, Springer, New York, 2010, DOI 10.1007/978-0-387-93839-4. MR2683126
[Li21]
Wuchen Li, Transport information Bregman divergences, arXiv:2101.01162, 2021.
[Nau11]
Jan Naudts, Generalised thermostatistics, Springer-Verlag London, Ltd., London, 2011, DOI 10.1007/978-0-85729-355-8. MR2777415
[Nie20]
Frank Nielsen, An elementary introduction to information geometry, Entropy 22 (2020), no. 10, Paper No. 1100, DOI 10.3390/e22101100. MR4221069
[PC19]
Gabriel Peyré and Marco Cuturi, Computational optimal transport: With applications to data science, Foundations and Trends in Machine Learning 11 (2019), no. 5-6, 355–607.
[Shi07]
Hirohiko Shima, The geometry of Hessian structures, World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2007, DOI 10.1142/9789812707536. MR2293045
[Won18]
Ting-Kam Leonard Wong, Logarithmic divergences from optimal transport and Rényi geometry, Inf. Geom. 1 (2018), no. 1, 39–78, DOI 10.1007/s41884-018-0012-6. MR4010746

Frank Nielsen is a senior researcher (Fellow) of Sony Computer Science Laboratories, Inc., Tokyo, Japan. His email address is Frank.Nielsen@acm.org.

Article DOI: 10.1090/noti2403

Credits

The opening image is courtesy of ermess via Getty.

Figures 1–3 are courtesy of the author.

Photo of the author is courtesy of Maryse Beaumont.