A measure-theoretic version of the Dragomir-Jensen inequality

We extend Dragomir's refinement of Jensen's inequality from the discrete to the general case, identifying the equality conditions.


Introduction
Suppose we have a probability space $(\Omega, \mathcal{A}, P)$, an integrable function $X : \Omega \to \mathbb{R}$, and a real valued strictly convex function $\varphi$ defined in some interval containing the range of $X$. By Jensen's inequality,
$$\int_\Omega \varphi(X)\,dP - \varphi\left(\int_\Omega X\,dP\right) \ge 0, \tag{1}$$
with equality precisely when $X(\omega) = E(X)$ for $P$-a.e. $\omega$. Thus, the left hand side of (1) provides a measure of the spread of $X$ around its mean value. Of course, in the important special case $\varphi(t) = t^2$, the left hand side of (1) is just the variance of $X$. It is natural to study how these generalized variances change when the probability $P$ varies. The case of discrete probability measures with finite support was considered by S. S. Dragomir in [Dra].
Here we note that Dragomir's clever proof (refining Jensen's inequality by using Jensen's inequality itself) can be used to extend his result to general probability spaces. Additionally, we identify the cases of equality when $\varphi$ is strictly convex, and present some immediate consequences.

The Dragomir-Jensen Inequality
To motivate the inequality, consider the following simple example. Let $X$ and $Y$ be random variables on $(\Omega, \mathcal{A}, P)$, with densities $f_X$ and $f_Y$ respectively, such that the quotient $f_Y/f_X$ is well defined almost everywhere, or in other words, such that the distribution of $Y$, defined by the push-forward (or induced) probability $Y_*P(A) := P(Y \in A)$, is absolutely continuous with respect to $X_*P(A) := P(X \in A)$. If both $X$ and $Y$ have mean zero, then it is easy to bound the variance of $Y$ in terms of the variance of $X$. Since
$$\operatorname{Var}(Y) = \int_{-\infty}^{\infty} y^2 f_Y(y)\,dy = \int_{-\infty}^{\infty} y^2\,\frac{f_Y(y)}{f_X(y)}\,f_X(y)\,dy \tag{2}$$
and $\operatorname{Var}(X) = \int_{-\infty}^{\infty} y^2 f_X(y)\,dy$, replacing $f_Y(y)/f_X(y)$ in (2) by its essential supremum and by its essential infimum, we get
$$\operatorname{ess\,inf}\frac{f_Y}{f_X}\,\operatorname{Var}(X) \le \operatorname{Var}(Y) \le \operatorname{ess\,sup}\frac{f_Y}{f_X}\,\operatorname{Var}(X). \tag{3}$$
In fact, (3) holds whenever the expected values of $X$ and $Y$ are finite, and not just for $\varphi(t) = t^2$, but for arbitrary convex functions. This is the content of the (two sided) Dragomir-Jensen inequality (cf. Theorem 2.1 and Corollary 2.9), which generalizes (3) much in the same way as Jensen's inequality generalizes $\operatorname{Var}(X) \ge 0$.
Next we recall some basic notation and facts. Given measures $P \ll Q$ (that is, $P$ absolutely continuous with respect to $Q$), as usual $dP/dQ$ denotes the Radon-Nikodym derivative of $P$ with respect to $Q$, and $\|dP/dQ\|_\infty$ denotes its essential supremum (which could be infinite). We adopt the standard measure-theoretic convention $\infty \cdot 0 = 0$ (recall that under any other convention, the monotone convergence theorem would fail). Of course, integrals are used here in the Lebesgue sense. In particular, to have $\int X\,dP$ well defined, it is assumed that either $\int X^+\,dP < \infty$ or $\int X^-\,dP < \infty$, where $X^+$ and $X^-$ respectively denote the positive and negative parts of $X$. Thus, we do not consider principal values.
Since we will be dealing with positive measures, and in fact, probabilities, by taking any representative of the Radon-Nikodym derivative and redefining it (if needed) on a set of measure zero, we may assume that $0 \le dP/dQ < \infty$ (we use the same notation for the Radon-Nikodym derivative and its representative).
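Theorem 2.1. Let $(\Omega, \mathcal{A})$ be a measurable space, let $P$ and $Q$ be probability measures defined on $(\Omega, \mathcal{A})$ such that $P \ll Q$, let $X : \Omega \to \mathbb{R}$ be integrable both with respect to $P$ and $Q$, and let the real valued function $\varphi$ be convex in some (not necessarily bounded) interval containing the range of $X$. Then
$$\int_\Omega \varphi(X)\,dP - \varphi\left(\int_\Omega X\,dP\right) \le \left\|\frac{dP}{dQ}\right\|_\infty \left(\int_\Omega \varphi(X)\,dQ - \varphi\left(\int_\Omega X\,dQ\right)\right). \tag{4}$$
Regarding the equality conditions, to avoid trivialities we suppose that $P \ne Q$, and then we distinguish three cases.
1) Both sides of (4) take the value $\infty$ if and only if $\int_\Omega \varphi(X)\,dP = \infty$, and then either $\int_\Omega \varphi(X)\,dQ = \infty$ or $\|dP/dQ\|_\infty = \infty$.
Next, assume that $\varphi$ is strictly convex, and let $A := \{\omega \in \Omega : \frac{dP}{dQ}(\omega) = \|dP/dQ\|_\infty\}$.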
2) Both sides of (4) take the value 0 if and only if X is Q-a.e. constant.
3) Both sides of (4) take the same value $a$, with $0 < a < \infty$, if and only if the following three conditions hold: a) $\int_\Omega \varphi(X)\,dQ < \infty$ and $\|dP/dQ\|_\infty < \infty$.
b) There exists a constant $c$ such that $X \equiv c$, $Q$-a.e. on $\Omega \setminus A$, but $X$ is not $Q$-a.e. constant on $\Omega$.
c) $Q(A) > 0$, $P(A) > 0$, and $c = \int_\Omega X\,dQ = \frac{1}{Q(A)}\int_A X\,dQ = \int_\Omega X\,dP = \frac{1}{P(A)}\int_A X\,dP$.
Example 2.2. Given $(\Omega, \mathcal{A}, P)$, a measurable set $A \subset \Omega$ with $P(A) > 0$, and $X : \Omega \to \mathbb{R}$ integrable, it is intuitively obvious (and not difficult to prove directly) that the variance $\operatorname{Var}_A(X)$ of $X$ restricted to $A$, with respect to the conditional probability $P_A(B) := P(B|A)$, cannot be much larger than the original variance of $X$ on all of $\Omega$, with respect to $P$. This is now a special case of the preceding theorem: since $\|dP_A/dP\|_\infty = 1/P(A)$, setting $\varphi(t) = t^2$ in (4) we get
$$\operatorname{Var}_A(X) \le \frac{1}{P(A)}\operatorname{Var}(X).$$
Example 2.3. Given $X : \Omega \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$, suppose we know the distribution of $X$, but there is some uncertainty about the value of one or several parameters, and we want to estimate how the variance of $g(X)$ is affected by this uncertainty. The preceding theorem can be used to this end. Assume, for instance, that $X \sim N(0, \sigma)$, with $0 < a \le \sigma \le b$. The zero mean assumption is made so the resulting expressions are simple, but the case where there is uncertainty both about the mean and the variance can be treated in the same way. Call $\operatorname{Var}_1(g(X))$ and $\operatorname{Var}_2(g(X))$ the variances obtained by supposing that $X$ has standard deviations $\sigma_1$ and $\sigma_2$ respectively, with $a \le \sigma_1 \le \sigma_2 \le b$, and let $P$ and $Q$ be the corresponding laws for $X$. Then
$$\frac{dP}{dQ}(x) = \frac{\sigma_2}{\sigma_1}\exp\left(-\frac{x^2}{2}\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right)\right) \le \frac{\sigma_2}{\sigma_1} \le \frac{b}{a},$$
and hence $\operatorname{Var}_1(g(X)) \le \frac{b}{a}\operatorname{Var}_2(g(X))$.
If we additionally know that $g$ is compactly supported, we can present a simultaneous lower bound (in this regard, see also Corollary 2.9 below). Suppose, for simplicity, that the support of $g$ is contained in $[c, d]$, with $0 < c < d$. Then [...] $\operatorname{Var}_1(g(X))$, and hence $\operatorname{Var}_2(g(X)) \le$ [...] $\operatorname{Var}_1(g(X))$.
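The upper bound of Example 2.3 can be checked numerically. The following is only a sketch under illustrative assumptions that are not in the text: the choices $\sigma_1 = 1$, $\sigma_2 = 2$, $g = \sin$, and the midpoint-rule integration on $[-20, 20]$ are ours, as are the helper names `normal_pdf`, `variance_of_g` and `density_ratio_sup`.

```python
import math

def normal_pdf(x, sigma):
    """Density of N(0, sigma), where sigma is the standard deviation."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def variance_of_g(g, sigma, half_width=20.0, steps=40_000):
    """Midpoint-rule approximation of Var(g(X)) for X ~ N(0, sigma)."""
    h = 2.0 * half_width / steps
    mass = first = second = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        w = normal_pdf(x, sigma) * h      # approximate probability of the cell
        gx = g(x)
        mass += w
        first += w * gx
        second += w * gx * gx
    mean = first / mass
    return second / mass - mean * mean

def density_ratio_sup(sigma_1, sigma_2, half_width=20.0, steps=40_000):
    """Grid approximation of sup_x f_1(x) / f_2(x), i.e. of ||dP/dQ||_inf."""
    h = 2.0 * half_width / steps
    return max(
        normal_pdf(-half_width + i * h, sigma_1)
        / normal_pdf(-half_width + i * h, sigma_2)
        for i in range(steps + 1)
    )

# Illustrative parameters (assumptions, not taken from the text).
sigma_1, sigma_2 = 1.0, 2.0
var_1 = variance_of_g(math.sin, sigma_1)
var_2 = variance_of_g(math.sin, sigma_2)
ratio_sup = density_ratio_sup(sigma_1, sigma_2)

print(var_1, ratio_sup * var_2)  # the first number should not exceed the second
```

For $g = \sin$ the variance has the closed form $(1 - e^{-2\sigma^2})/2$, which gives an independent check on the quadrature.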
Remark 2.4. It is natural to ask whether the hypothesis $P \ll Q$ in the theorem must be imposed on all of $\Omega$, or whether it is sufficient to consider just the set $\{X \ne 0\}$. To see that this is not enough, let $\Omega = [0, 1]$, let $\mathcal{A}$ be the $\sigma$-algebra of Lebesgue measurable sets, and let $dP = dx$. Take $X = \chi_{[1/2,1]}$ and $dQ := 2\chi_{[1/2,1]}\,dx$. Then $dP/dQ = 1/2$ on $\{X \ne 0\}$, so restricted to $\{X \ne 0\}$, $P \ll Q$ and $dP/dQ$ is bounded. Let $\varphi(x) = x^2$. Since $X$ is constant a.e. with respect to $Q$, the right hand side of (4) is zero, while the left hand side is just $1/2 - 1/4 = 1/4$. Even if we extend $dP/dQ$ to all of $[0, 1]$ by setting $dP/dQ = \infty$ on $(0, 1/2)$, the $\infty \cdot 0 = 0$ convention entails that (4) does not extend to pairs $P, Q$ when $P$ has a singular part.
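The arithmetic in Remark 2.4 can be reproduced exactly. Since $X$ and both densities are constant on $[0, 1/2)$ and on $[1/2, 1]$, every integral reduces to a two-term sum over these cells. The sketch below uses exact rational arithmetic; the helper name `jensen_gap` is ours.

```python
from fractions import Fraction

def jensen_gap(values, weights, phi):
    """Return sum_i w_i*phi(x_i) - phi(sum_i w_i*x_i) for a finite measure."""
    mean = sum(w * x for x, w in zip(values, weights))
    return sum(w * phi(x) for x, w in zip(values, weights)) - phi(mean)

# Two cells: [0, 1/2) and [1/2, 1].
X = [Fraction(0), Fraction(1)]           # X = chi_[1/2,1]
P = [Fraction(1, 2), Fraction(1, 2)]     # Lebesgue measure of each cell
Q = [Fraction(0), Fraction(1)]           # dQ = 2*chi_[1/2,1] dx

def square(t):
    return t * t

lhs = jensen_gap(X, P, square)            # left hand side of (4)
ratio_on_support = P[1] / Q[1]            # dP/dQ on {X != 0}
rhs = ratio_on_support * jensen_gap(X, Q, square)

print(lhs, rhs)  # lhs is strictly larger, so (4) fails for this pair
```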
Remark 2.5. The equality conditions for Dragomir's original result were established in [Al]. In the more general measure-theoretic setting, dealing only with bounded functions would be too restrictive, so the possibility that some term equals $\infty$ must be considered. Related to this issue is the fact that, since we are working on a probability space, the $L^p$ norm of any function is monotone increasing in $p$. It is thus natural to ask whether the Dragomir-Jensen inequality can be improved by replacing $\|dP/dQ\|_\infty$ with $c_p \|dP/dQ\|_p$, for some [...] On the other hand, $\|dP/dQ\|_\infty = \infty$ and $0 < \int_\Omega \varphi(X)\,dQ - \varphi\left(\int_\Omega X\,dQ\right)$, so this is an instance where both sides of the Dragomir-Jensen inequality take the value $\infty$.
To make the structure of the proof of the theorem more transparent, we place some measure-theoretic details into two technical lemmas.
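Lemma 2.6. Let $(\Omega, \mathcal{A})$ be a measurable space, let $P$ and $Q$ be probability measures defined on $(\Omega, \mathcal{A})$ such that $P \ll Q$, let $dP/dQ$ be essentially bounded, and let [...] Then for every measurable set $B$ [...]
Since $\tilde{\mathcal{A}}$ contains $\emptyset$, is closed under complementation, and also under countable unions, it is a $\sigma$-algebra. Additionally, $\{y\}$ is measurable, for $\emptyset \in \mathcal{A}$, whence $\{y\} = \emptyset \cup \{y\} \in \tilde{\mathcal{A}}$. Let us check that $\tilde{P}$ is a probability. Clearly $\tilde{P}(\tilde{\Omega}) = 1$; to see that $\tilde{P}$ is non-negative, we use the change of variable formula $\int_\Omega X\,dP = \int_\Omega X \frac{dP}{dQ}\,dQ$, and we conclude that for every $B \in \tilde{\mathcal{A}}$, we have
$$\tilde{P}(B) = \int_{B \cap \Omega}\left(1 - \frac{1}{\|dP/dQ\|_\infty}\frac{dP}{dQ}\right)dQ + \frac{\delta_y(B)}{\|dP/dQ\|_\infty} \ge 0. \tag{5}$$
Note next that whenever $B \subset \Omega$ is measurable, (5) reduces to
$$\tilde{P}(B) = \int_{B}\left(1 - \frac{1}{\|dP/dQ\|_\infty}\frac{dP}{dQ}\right)dQ. \tag{6}$$
In particular, $A := \{w \in \Omega : \frac{dP}{dQ}(w) = \|dP/dQ\|_\infty\}$ has $\tilde{P}$-measure zero, since by (6), $\tilde{P}(A) = \int_A (1 - 1)\,dQ = 0$.
Proof of the Theorem. Note that if $dP/dQ$ is unbounded (i.e., if $\|dP/dQ\|_\infty = \infty$) then the right hand side of (4) is $\infty$ unless $\int_\Omega \varphi(X)\,dQ = \varphi\left(\int_\Omega X\,dQ\right)$, in which case its value is zero (by the standard convention). So for (4) to hold we need to have $\int_\Omega \varphi(X)\,dP = \varphi\left(\int_\Omega X\,dP\right)$. This would be trivial if $\varphi$ were strictly convex, for then $X$ would be constant $Q$-a.e., and thus also constant $P$-a.e. since $P \ll Q$. But for $\varphi$ not strictly convex, equality may occur for non-constant functions, and so a different argument is needed. What we do is to assume first that $\|dP/dQ\|_\infty < \infty$, and then handle the unbounded case via an approximation argument. Let us suppose also that both $\int_\Omega \varphi(X)\,dQ < \infty$ and $\int_\Omega \varphi(X)\,dP < \infty$. Then inequality (4) is equivalent to
$$\varphi\left(\int_\Omega X\,dQ\right) \le \int_\Omega \varphi(X)\,dQ - \frac{1}{\|dP/dQ\|_\infty}\int_\Omega \varphi(X)\,dP + \frac{1}{\|dP/dQ\|_\infty}\varphi\left(\int_\Omega X\,dP\right), \tag{7}$$
simply by rearranging terms. But (7) immediately follows from the usual Jensen's inequality, applied to the probability space $(\tilde{\Omega}, \tilde{\mathcal{A}}, \tilde{P})$ defined in Lemma 2.7, and to a suitable extension of $X$, given by $\tilde{X} := X$ on $\Omega$, and $\tilde{X}(y) := \int_\Omega X\,dP$. The function $\tilde{X} : \tilde{\Omega} \to \mathbb{R}$ is clearly $\tilde{\mathcal{A}}$-measurable. Furthermore,
$$\int_{\tilde{\Omega}} \tilde{X}\,d\tilde{P} = \int_\Omega X\,dQ - \frac{1}{\|dP/dQ\|_\infty}\int_\Omega X\,dP + \frac{1}{\|dP/dQ\|_\infty}\tilde{X}(y) = \int_\Omega X\,dQ, \tag{8}$$
so by Jensen's inequality on $(\tilde{\Omega}, \tilde{\mathcal{A}}, \tilde{P})$,
$$\varphi\left(\int_\Omega X\,dQ\right) \le \int_{\tilde{\Omega}} \varphi(\tilde{X})\,d\tilde{P} = \int_\Omega \varphi(X)\,dQ - \frac{1}{\|dP/dQ\|_\infty}\int_\Omega \varphi(X)\,dP + \frac{1}{\|dP/dQ\|_\infty}\int_{\{y\}} \varphi(\tilde{X})\,d\delta_y. \tag{9}$$
It follows from $\tilde{X}(y) = \int_\Omega X\,dP$ that $\int_{\{y\}} \varphi(\tilde{X})\,d\delta_y = \varphi(\tilde{X}(y)) = \varphi\left(\int_\Omega X\,dP\right)$, so (7) holds, and hence so does (4) when every term appearing there is finite. Suppose next that $\|dP/dQ\|_\infty = \infty$, and that $\int_\Omega \varphi(X)\,dQ = \varphi\left(\int_\Omega X\,dQ\right)$ (for otherwise (4) is trivial).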
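Let us emphasize that we do not a priori assume that $\int_\Omega \varphi(X)\,dP < \infty$; this will follow once $\int_\Omega \varphi(X)\,dP = \varphi\left(\int_\Omega X\,dP\right)$ is proven. Note however that $\int_\Omega (\varphi(X))^-\,dP < \infty$, since $\int_\Omega \varphi(X)\,dP \ge \varphi\left(\int_\Omega X\,dP\right) > -\infty$. Define $dP_n := \min\{dP/dQ, n\}\,dQ$, and observe that by the monotone convergence theorem, applied to $\min\{dP/dQ, n\}$, and separately to $X^+ \min\{dP/dQ, n\}$, $X^- \min\{dP/dQ, n\}$, $(\varphi(X))^+ \min\{dP/dQ, n\}$, and $(\varphi(X))^- \min\{dP/dQ, n\}$, we have
$$\lim_n P_n(\Omega) = 1, \qquad \lim_n \int_\Omega X\,dP_n = \int_\Omega X\,dP, \qquad \lim_n \int_\Omega \varphi(X)\,dP_n = \int_\Omega \varphi(X)\,dP.$$
Since $\|dP_n/dQ\|_\infty = n$, we know from Jensen's inequality and the previous case that [...] Thus, for every $n$ we have [...] Since all the limits involved exist, letting $n \to \infty$ and using the continuity of $\varphi$ in the interior of its domain $I$, we conclude that
$$\int_\Omega \varphi(X)\,dP = \varphi\left(\int_\Omega X\,dP\right) \tag{13}$$
(this is stated below as a Corollary, as it seems to be of independent interest). Note that the interval $I$ might contain one (or both) of its endpoints, say it contains the endpoint $a$, and $\varphi$ might be discontinuous there. But if $\int_\Omega X\,dP = a$, since the range of $X$ is contained in $I$, we have $X = a$ $P$-a.e., and thus (13) also holds in this case.
So far, we know that if $\int_\Omega \varphi(X)\,dP < \infty$ and $\int_\Omega \varphi(X)\,dQ < \infty$, then inequality (4) holds, regardless of whether or not $dP/dQ$ is essentially bounded, and additionally, that if $\int_\Omega \varphi(X)\,dQ = \varphi\left(\int_\Omega X\,dQ\right)$, then $\int_\Omega \varphi(X)\,dP = \varphi\left(\int_\Omega X\,dP\right)$. Furthermore, it is clear that if $\int_\Omega \varphi(X)\,dP = \infty$ and $\|dP/dQ\|_\infty < \infty$, we must have $\int_\Omega \varphi(X)\,dQ = \infty$, so all we need to show is that $\|dP/dQ\|_\infty = \infty$ whenever $\int_\Omega \varphi(X)\,dP = \infty$ and $\int_\Omega \varphi(X)\,dQ < \infty$. This latter inequality, together with $\int_\Omega \varphi(X)\,dQ \ge \varphi\left(\int_\Omega X\,dQ\right) > -\infty$, entails that both the positive and negative parts of $\varphi(X)$ are $Q$-integrable. Suppose, towards a contradiction, that $\|dP/dQ\|_\infty < \infty$. Then
$$\int_\Omega \varphi(X)\,dP \le \int_\Omega (\varphi(X))^+ \frac{dP}{dQ}\,dQ \le \left\|\frac{dP}{dQ}\right\|_\infty \int_\Omega (\varphi(X))^+\,dQ < \infty,$$
which is impossible. In view of the fact that inequality (4) holds under no restrictions, the first equality case, where both sides take the value $\infty$, follows immediately.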
Next, suppose that φ is strictly convex. By the equality case of Jensen's inequality the right hand side of (4) is zero (hence, so is the left hand side) if and only if X is constant Q-a.e.; thus, statement 2) of the theorem is true.
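Finally, suppose there exists a constant $a \in (0, \infty)$ such that
$$\int_\Omega \varphi(X)\,dP - \varphi\left(\int_\Omega X\,dP\right) = \left\|\frac{dP}{dQ}\right\|_\infty \left(\int_\Omega \varphi(X)\,dQ - \varphi\left(\int_\Omega X\,dQ\right)\right) = a.$$
Then a) is immediate; using (4) we conclude that $\int_\Omega \varphi(X)\,dP$ is also finite, so equality in (4) is equivalent to having equality in (7), which in turn is equivalent to $\varphi\left(\int_{\tilde{\Omega}} \tilde{X}\,d\tilde{P}\right) = \int_{\tilde{\Omega}} \varphi(\tilde{X})\,d\tilde{P}$. Now $X$ is not constant $Q$-a.e. on $\Omega$ (otherwise $a$ would be zero, by the equality case in Jensen's inequality), but $\tilde{X}$ is constant $\tilde{P}$-a.e., again by the equality case in Jensen's inequality; call this constant $c$. Since by Lemma 2.7, $\tilde{P}$ and $Q$ are mutually absolutely continuous over the set $\Omega \setminus A$, statement b) follows.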
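To see why c) holds, note first that if $Q(A) = 0$, then by b) $X \equiv c$ $Q$-a.e. and we are in the case $0 = 0$. But $a > 0$, so $Q(A) > 0$, and hence $P(A) > 0$ by Lemma 2.6. Since $P \ne Q$, we also have $\tilde{P}(\Omega \setminus A) > 0$. Using $\tilde{X} \equiv c$ $\tilde{P}$-a.e. together with (8)-(9), we see that for $\tilde{P}$-a.e. (or $Q$-a.e.) $w \in \Omega \setminus A$,
$$X(w) = c = \int_{\tilde{\Omega}} \tilde{X}\,d\tilde{P} = \int_\Omega X\,dQ.$$
Since $X = c$ a.e. (with respect to all measures under consideration) on $\Omega \setminus A$,
$$c = \int_\Omega X\,dQ = \frac{1}{Q(A)}\int_A X\,dQ. \tag{16}$$
The fact that
$$c = \int_\Omega X\,dP = \frac{1}{P(A)}\int_A X\,dP \tag{17}$$
is obtained in the same way. Alternatively, since $\int_\Omega X\,dQ = \int_\Omega X\,dP$ and $X = c$ a.e. on $\Omega \setminus A$, we see that (16) holds if and only if (17) does.
Suppose now that a), b) and c) hold. Using a) and b) we conclude that the right hand side of (4) is neither zero nor infinity. Since all terms involved are finite, equality in (4) is equivalent to $\varphi\left(\int_{\tilde{\Omega}} \tilde{X}\,d\tilde{P}\right) = \int_{\tilde{\Omega}} \varphi(\tilde{X})\,d\tilde{P}$, which holds if $\tilde{X}$ is $\tilde{P}$-a.e. constant on $\tilde{\Omega}$. We prove this next. First, $\tilde{X} \equiv c$ $\tilde{P}$-a.e. on $\Omega \setminus A$ by b) and Lemma 2.7, while by definition and by c), $\tilde{X}(y) = \int_\Omega X\,dP = c$. Since $\tilde{P}(A) = 0$ by Lemma 2.7, the result follows.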
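Conditions a), b) and c) of case 3) can be illustrated on a three-point space. The example below is a hypothetical one of our own choosing (it does not come from the text): $Q = (1/4, 1/4, 1/2)$, $P = (3/8, 3/8, 1/4)$ and $X = (-1, 1, 0)$, so that $A$ consists of the first two points and $c = 0$. The sketch checks, with exact rational arithmetic and $\varphi(t) = t^2$, that both sides of (4) agree.

```python
from fractions import Fraction as F

def jensen_gap(values, weights, phi):
    """Return sum_i w_i*phi(x_i) - phi(sum_i w_i*x_i)."""
    mean = sum(w * x for x, w in zip(values, weights))
    return sum(w * phi(x) for x, w in zip(values, weights)) - phi(mean)

# Hypothetical three-point example (an assumption made for illustration).
X = [F(-1), F(1), F(0)]
Q = [F(1, 4), F(1, 4), F(1, 2)]
P = [F(3, 8), F(3, 8), F(1, 4)]

ratios = [p / q for p, q in zip(P, Q)]            # dP/dQ at each point
sup_ratio = max(ratios)                           # ||dP/dQ||_inf
A = [i for i, r in enumerate(ratios) if r == sup_ratio]

def square(t):
    return t * t

lhs = jensen_gap(X, P, square)                    # left hand side of (4)
rhs = sup_ratio * jensen_gap(X, Q, square)        # right hand side of (4)

# The four averages appearing in condition c); all should equal c = 0.
mean_Q = sum(q * x for x, q in zip(X, Q))
mean_P = sum(p * x for x, p in zip(X, P))
mean_Q_on_A = sum(Q[i] * X[i] for i in A) / sum(Q[i] for i in A)
mean_P_on_A = sum(P[i] * X[i] for i in A) / sum(P[i] for i in A)

print(lhs, rhs)  # both sides coincide
```

In this example equality in fact holds for every convex $\varphi$, since both sides reduce to $\frac{3}{8}(\varphi(-1) + \varphi(1)) - \frac{3}{4}\varphi(0)$.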
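The next corollary was obtained as a step in the preceding proof, and of course, it also follows directly from the statement of Theorem 2.1.
Corollary 2.8. Let $(\Omega, \mathcal{A})$ be a measurable space, let $P$ and $Q$ be probability measures defined on $(\Omega, \mathcal{A})$ such that $P \ll Q$, let $X : \Omega \to \mathbb{R}$ be integrable both with respect to $P$ and $Q$, and let the real valued function $\varphi$ be convex in some interval containing the range of $X$. If $\int_\Omega \varphi(X)\,dQ = \varphi\left(\int_\Omega X\,dQ\right)$, then $\int_\Omega \varphi(X)\,dP = \varphi\left(\int_\Omega X\,dP\right)$.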
We state next the version of Theorem 2.1 (where for simplicity we omit the analogous equality conditions) that provides a lower bound (instead of an upper bound) using the essential infimum of the Radon-Nikodym derivative (instead of the essential supremum). Recall that if $P$ and $Q$ are mutually absolutely continuous and $h$ is any representative of $dP/dQ$, then $1/h$ is a representative of $dQ/dP$.
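Corollary 2.9. Let $(\Omega, \mathcal{A})$ be a measurable space, let $P$ and $Q$ be probability measures defined on $(\Omega, \mathcal{A})$ such that $P \ll Q$, let $X : \Omega \to \mathbb{R}$ be integrable both with respect to $P$ and $Q$, and let the real valued function $\varphi$ be convex in some interval containing the range of $X$. Then
$$\operatorname{ess\,inf}\frac{dP}{dQ}\left(\int_\Omega \varphi(X)\,dQ - \varphi\left(\int_\Omega X\,dQ\right)\right) \le \int_\Omega \varphi(X)\,dP - \varphi\left(\int_\Omega X\,dP\right). \tag{18}$$
Proof. If $\operatorname{ess\,inf} dP/dQ = 0$, then inequality (18) reduces to the usual Jensen inequality, so only when $\operatorname{ess\,inf} dP/dQ > 0$ does (18) say something new. But in this case we have $Q \ll P$, with $dQ/dP = 1/(dP/dQ)$ a bounded function, since $\|dQ/dP\|_\infty = 1/\operatorname{ess\,inf}(dP/dQ) < \infty$. Now inequality (18) follows from (4): multiply both sides of (18) by $\|dQ/dP\|_\infty$, and note that this is just (4) with the roles of $P$ and $Q$ interchanged.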
Since in the nontrivial case ess inf dP dQ > 0 the preceding corollary reduces to Theorem 2.1, the corresponding equality conditions follow automatically.
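In the discrete case with finite support, Theorem 2.1 and Corollary 2.9 together give the two sided bound $\min_i (p_i/q_i)\, J_Q \le J_P \le \max_i (p_i/q_i)\, J_Q$, where $J_P$ and $J_Q$ denote the Jensen gaps under $P$ and $Q$. The following sketch tests this on randomly generated data; the sampling scheme, the seed, the three convex test functions, and the floating point tolerance are all our own illustrative choices.

```python
import math
import random

def jensen_gap(values, weights, phi):
    """Return sum_i w_i*phi(x_i) - phi(sum_i w_i*x_i)."""
    mean = sum(w * x for x, w in zip(values, weights))
    return sum(w * phi(x) for x, w in zip(values, weights)) - phi(mean)

def two_sided_bounds(values, p, q, phi):
    """Return (lower bound, Jensen gap under p, upper bound) as in (18) and (4)."""
    ratios = [pi / qi for pi, qi in zip(p, q)]
    gap_q = jensen_gap(values, q, phi)
    return min(ratios) * gap_q, jensen_gap(values, p, phi), max(ratios) * gap_q

rng = random.Random(0)  # fixed seed so the run is reproducible

def random_probability(n):
    """Strictly positive weights summing to one, so that p << q and q << p."""
    raw = [rng.random() + 0.05 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

def square(t):
    return t * t

tolerance = 1e-9  # slack for floating point rounding
all_ok = True
for phi in (square, math.exp, abs):
    for _ in range(1000):
        n = rng.randint(2, 6)
        values = [rng.uniform(-2.0, 2.0) for _ in range(n)]
        p, q = random_probability(n), random_probability(n)
        lower, gap_p, upper = two_sided_bounds(values, p, q, phi)
        all_ok = all_ok and (lower - tolerance <= gap_p <= upper + tolerance)

print(all_ok)
```

A random search of this kind can only fail to find a counterexample; it is a sanity check and not a substitute for the proof.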
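If $\varphi$ is concave, then applying Theorem 2.1 and Corollary 2.9 to $-\varphi$ we obtain the following
Corollary 2.10. Let $(\Omega, \mathcal{A})$ be a measurable space, let $P$ and $Q$ be probability measures defined on $(\Omega, \mathcal{A})$ such that $P \ll Q$, let $X : \Omega \to \mathbb{R}$ be integrable both with respect to $P$ and $Q$, and let the real valued function $\varphi$ be concave in some interval containing the range of $X$. Then
$$\operatorname{ess\,inf}\frac{dP}{dQ}\left(\varphi\left(\int_\Omega X\,dQ\right) - \int_\Omega \varphi(X)\,dQ\right) \le \varphi\left(\int_\Omega X\,dP\right) - \int_\Omega \varphi(X)\,dP \le \left\|\frac{dP}{dQ}\right\|_\infty \left(\varphi\left(\int_\Omega X\,dQ\right) - \int_\Omega \varphi(X)\,dQ\right).$$
Next we state the corresponding refinement of the measure-theoretic version of the arithmetic-geometric mean inequality $\exp\left(\int_\Omega \log(X)\,dP\right) \le \int_\Omega X\,dP$, thereby generalizing [Al, Theorem 2.1].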
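Corollary 2.11. Let $(\Omega, \mathcal{A})$ be a measurable space, let $P \ll Q$ be probability measures defined on $(\Omega, \mathcal{A})$, and let $X : \Omega \to [0, \infty)$ be such that $\log X$ is integrable both with respect to $P$ and $Q$. Then
$$\operatorname{ess\,inf}\frac{dP}{dQ}\left(\log\int_\Omega X\,dQ - \int_\Omega \log X\,dQ\right) \le \log\int_\Omega X\,dP - \int_\Omega \log X\,dP \le \left\|\frac{dP}{dQ}\right\|_\infty \left(\log\int_\Omega X\,dQ - \int_\Omega \log X\,dQ\right).$$
Let us finish by saying that the reader interested in applications of the Dragomir-Jensen inequality to information inequalities can find some such applications (for the discrete case) in Dragomir's original paper [Dra].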