A full proof of Berry-Esseen inequality in the Central Limit Theorem

[Aim of this article: I will give you a full proof of the Berry-Esseen theorem, which I managed to work through after investing two hours. This theorem bounds how fast the distribution of the normalized sum converges to the normal distribution in the central limit theorem.]

[Note: I want to say from the very beginning that this post is a bit technical, but I hope it will be helpful to anyone who needs it. The proof follows the book by W. Feller, “An Introduction to Probability Theory and Its Applications”, and covers only independent and identically distributed summands. I won’t prove the non-identical case because that would make this post much longer; please find its proof in Feller’s book.]

The central limit theorem says that the distribution of the normalized sum tends to the normal distribution as the sample size goes to infinity. But the question you may raise is, “What is the rate of convergence of the distribution of the normalized sum to the standard normal distribution?” Let’s answer this question in the case where the summands are identically distributed. To be more precise, let’s state it like this:

Let x_{k} be independent random variables with identical (or common) distribution F such that E(x_{k})=0, E(x_{k}^{2})=\sigma^{2}>0, E(|x_{k}|^{3})=\rho<\infty, and let F_{n} stand for the distribution of the normalized sum \frac{x_{1}+x_{2}+\ldots+x_{n}}{\sigma\sqrt{n}}. Then for all x and n, the distance between F_{n}(x) and the standard normal distribution \phi(x) satisfies |F_{n}(x)-\phi(x)|\leq \frac{3\rho}{\sigma^{3}\sqrt{n}}.
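As a quick sanity check of the statement (it is not part of Feller's proof), here is a minimal sketch that evaluates the left side exactly for one hypothetical example: summands taking the values +1 and -1 with probability 1/2 each, so that \sigma = 1 and \rho = 1. The function names are my own choices for illustration.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance_coin(n):
    """Exact sup over x of |F_n(x) - normal_cdf(x)| for x_k = +/-1.

    The sum x_1 + ... + x_n equals 2*B - n with B ~ Binomial(n, 1/2),
    so F_n is a step function with jumps at (2*j - n) / sqrt(n).
    Between jumps F_n is constant and the normal cdf is increasing,
    so the supremum is reached at a jump or just to the left of one.
    """
    total = 2.0 ** n
    cdf = 0.0      # value of F_n just to the left of the current jump
    worst = 0.0
    for j in range(n + 1):
        x = (2 * j - n) / math.sqrt(n)
        phi = normal_cdf(x)
        worst = max(worst, abs(cdf - phi))   # left limit at the jump
        cdf += math.comb(n, j) / total
        worst = max(worst, abs(cdf - phi))   # value at the jump
    return worst

def berry_esseen_bound(n, sigma=1.0, rho=1.0):
    """Right side of the theorem: 3 * rho / (sigma^3 * sqrt(n))."""
    return 3.0 * rho / (sigma ** 3 * math.sqrt(n))

for n in (10, 100, 1000):
    print(n, sup_distance_coin(n), berry_esseen_bound(n))
```

Each printed row shows the sample size, the exact distance, and the bound from the theorem; the distance should never exceed the bound.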

Looks very boring, right? Okay, let’s start with the history of the central limit theorem (CLT).

The first proof of the CLT was given by the French mathematician Pierre-Simon Laplace in 1810. Fourteen years later, the French mathematician Siméon-Denis Poisson improved it and gave a more general proof. Laplace and his contemporaries were very interested in this theorem because they saw its importance for repeated measurements of the same quantity. They realized that if the individual measurements could be viewed as approximately independent and identically distributed, then their mean could be approximated by a normal distribution. The theorem states that, for a sufficiently large sample from a population with finite variance, the distribution of the sample mean is approximately normal, regardless of the shape of the underlying distribution.

Then, the first convergence rate for the CLT was estimated by the Russian mathematician Aleksandr M. Lyapunov. A more refined version was discovered independently by two mathematicians, Andrew C. Berry (in 1941) and Carl-Gustav Esseen (in 1942), and the convergence estimate kept being refined afterwards; the result is now named the “Berry-Esseen theorem”. The best thing about this theorem is that it uses only the first three moments.

Now, I think you are very eager to know about the proof of this theorem. Right? Let’s get started without any further delay!

From Feller’s book (lemma 1, section 3.13, page 538), the distance between F_{n}(x) and \phi(x) is bounded by

|F_{n}(x)-\phi(x)| \leq \frac{1}{\pi}\int_{-T}^{T}|\frac{\psi(\xi) - \gamma(\xi)}{\xi}| d\xi + \frac{24m}{\pi T} \rule{2cm}{0.4pt}  (1)

where

\psi(\xi) = characteristic function of F_{n}(x), which equals \psi^{n}(\frac{\xi}{\sigma\sqrt{n}}), with \psi on the right (and from here on) denoting the characteristic function of the common distribution F,

\gamma(\xi) = characteristic function of \phi(x), which equals e^{\frac{-\xi^{2}}{2}},

m = a bound on the derivative of \phi, i.e. |\phi^{'}(x)| \leq m < \infty.

The above expression comes from Fourier methods. Our proof is based on this smoothing inequality (see Feller, section 3.5) with the choice

T = \frac{4}{3}\frac{\sigma^{3} \sqrt{n}}{\rho} \leq \frac{4 \sqrt{n}}{3}.

The last inequality follows from the moment inequality \sigma^{3} \leq \rho. And, since the derivative of the normal distribution is the normal density, whose maximum is \frac{1}{\sqrt{2\pi}} \approx 0.399, we can take m < \frac{2}{5} (this is presumably why Feller chose this bound).
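The two claims above are easy to check numerically. The sketch below is only an illustration: the uniform distribution on [-1, 1] (with \sigma^{2} = 1/3 and \rho = 1/4) is a hypothetical example I picked, and the function names are mine.

```python
import math

def normal_density(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def smoothing_cutoff(n, sigma, rho):
    """T = (4/3) * sigma^3 * sqrt(n) / rho, as chosen in the text."""
    return 4.0 * sigma ** 3 * math.sqrt(n) / (3.0 * rho)

# The density is largest at x = 0, where it equals 1 / sqrt(2 * pi).
m = normal_density(0.0)
print("max density:", m, "below 2/5:", m < 2 / 5)

# Hypothetical example: summands uniform on [-1, 1],
# so sigma^2 = 1/3 and rho = E|x|^3 = 1/4.
sigma, rho = math.sqrt(1.0 / 3.0), 0.25
print("sigma^3 <= rho:", sigma ** 3 <= rho)
for n in (10, 100, 1000):
    T = smoothing_cutoff(n, sigma, rho)
    print(n, T, T <= 4.0 * math.sqrt(n) / 3.0)
```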

So equation (1) becomes,

\pi |F_{n}(x) - \phi(x)| \leq \int_{-T}^{T}|\psi^{n}(\frac{\xi}{\sigma\sqrt{n}}) - e^{\frac{-\xi^{2}}{2}}|\frac{d\xi}{|\xi|} + \frac{24\times 2}{T\times 5}

=\int_{-T}^{T}|\psi^{n}(\frac{\xi}{\sigma\sqrt{n}}) - e^{\frac{-\xi^{2}}{2}}|\frac{d\xi}{|\xi|} + \frac{9.6}{T} \rule{2cm}{0.4pt}  (2)

Now, let’s bound |\psi^{n}(\frac{\xi}{\sigma\sqrt{n}}) - e^{\frac{-\xi^{2}}{2}}|.

Doesn’t it look like a difference of two n-th powers? For that we have this elementary inequality:

|\alpha^{n} - \beta^{n}| \leq n|\alpha - \beta|\Gamma^{n-1} if |\alpha| \leq \Gamma, |\beta| \leq \Gamma.

Thus, we can say

|\psi^{n}(\frac{\xi}{\sigma\sqrt{n}}) - (e^{\frac{-\xi^{2}}{2n}})^{n}| \leq n |\psi(\frac{\xi}{\sigma\sqrt{n}}) - e^{\frac{-\xi^{2}}{2n}}|\Gamma^{n-1} \rule{2cm}{0.4pt}  (3)

, if  |\psi(\frac{\xi}{\sigma\sqrt{n}})| \leq \Gamma, |e^{\frac{-\xi^{2}}{2n}}| \leq \Gamma.
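The elementary inequality behind (3) follows from the factorization \alpha^{n} - \beta^{n} = (\alpha - \beta)(\alpha^{n-1} + \alpha^{n-2}\beta + \ldots + \beta^{n-1}), where each of the n terms has modulus at most \Gamma^{n-1}. Here is a small sketch that spot-checks it on random complex numbers; the helper names and the random sampling are my own illustration, not part of the proof.

```python
import cmath
import math
import random

def power_gap_holds(alpha, beta, n, tol=1e-12):
    """Check |alpha^n - beta^n| <= n * |alpha - beta| * Gamma^(n-1),
    using Gamma = max(|alpha|, |beta|), the smallest admissible Gamma."""
    gamma = max(abs(alpha), abs(beta))
    lhs = abs(alpha ** n - beta ** n)
    rhs = n * abs(alpha - beta) * gamma ** (n - 1)
    return lhs <= rhs + tol

def random_point_in_unit_disk(rng):
    """A random complex number with modulus at most 1."""
    r = math.sqrt(rng.random())
    return cmath.rect(r, rng.uniform(0.0, 2.0 * math.pi))

rng = random.Random(0)
ok = all(
    power_gap_holds(random_point_in_unit_disk(rng),
                    random_point_in_unit_disk(rng), n)
    for n in range(1, 31)
    for _ in range(200)
)
print("inequality held on all samples:", ok)
```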

Again, let’s make our problem much simpler by first looking at |\psi(\frac{\xi}{\sigma\sqrt{n}}) - e^{\frac{-\xi^{2}}{2n}}|.

First of all, let’s suppose t = \frac{\xi}{\sigma\sqrt{n}} so that \psi(\frac{\xi}{\sigma\sqrt{n}}) = \psi(t). Thus, \xi = t\sigma\sqrt{n} so that e^{\frac{-\xi^{2}}{2n}} = e^{\frac{-t^{2}\sigma^{2}n}{2n}} = e^{\frac{-t^{2}\sigma^{2}}{2}}.

Look how nice this becomes:

|\psi(\frac{\xi}{\sigma\sqrt{n}}) - e^{\frac{-\xi^{2}}{2n}}| = |\psi(t) - e^{\frac{-t^{2}\sigma^{2}}{2}}|

= |\psi(t) - (1 - \frac{t^{2}\sigma^{2}}{2} + \ldots)| , putting in the series of e^{\frac{-t^{2}\sigma^{2}}{2}}

\approx |\psi(t) - 1 + \frac{t^{2}\sigma^{2}}{2}| \rule{2cm}{0.4pt}  (4)

,  neglecting the higher order terms because t \to 0 for large n. (This is only a heuristic for now; the neglected terms are bounded precisely in equation (12).)

The characteristic function of the common distribution F is

\psi(t) = \int_{-\infty}^{\infty} e^{i t x} dF(x).

As I said at the very beginning, as the sample size increases the distribution of the normalized sum comes closer and closer to the normal curve. I mean this:

[Figure: illustration of the difference of cumulative distributions alluded to in the theorem, adapted from Wikipedia]

Doesn’t the smoothing idea look like Taylor’s theorem? Exactly! Taylor’s theorem lets us approximate a function by the first few terms of a series and control the error through the higher order terms. Likewise, we can compare \psi(t) with the first terms of its expansion and control the remainder. So, we will need to go like this

e^{i t x} = 1 + i t x - \frac{t^{2} x^{2}}{2!} + \sum_{d = 3}^{\infty} \frac{(i t x)^{d}}{d!}

or, \sum_{d = 3}^{\infty} \frac{(i t x)^{d}}{d!} =e^{i t x} - 1 - i t x + \frac{t^{2} x^{2}}{2!}.

Now, integrate both sides with respect to F from -\infty to \infty i.e.

\int_{-\infty}^{\infty} (\sum_{d = 3}^{\infty} \frac{(i t x)^{d}}{d!}) dF(x) =\int_{-\infty}^{\infty}  (e^{i t x} - 1 - i t x + \frac{t^{2} x^{2}}{2!}) dF(x) \rule{2cm}{0.4pt}  (5).

Since \int dF(x) = 1, \int x dF(x) = E(x_{k}) = 0 and \int x^{2} dF(x) = E(x_{k}^{2}) = \sigma^{2}, the right side equals \psi(t) - 1 + \frac{t^{2}\sigma^{2}}{2}. This is our trick:

|\psi(t) - 1 + \frac{t^{2}\sigma^{2}}{2}| = |\int_{-\infty}^{\infty} (e^{i t x} - 1 - i t x + \frac{t^{2}x^{2}}{2}) dF(x)| \rule{2cm}{0.4pt}  (6)

, from equation (5).

Also, we need another inequality, for the remainder of the exponential series:

|e^{i t x} - 1 - \frac{i t x}{1!} - \ldots - \frac{(i t x)^{n-1}}{(n-1)!}| \leq \frac{|x t|^{n}}{n!}.

For n = 3,

|e^{i t x} - 1 - i tx + \frac{t^{2}x^{2}}{2!}| \leq \frac{|x t|^{3}}{3!}.

So, the equation (6) becomes

|\psi(t) - 1 + \frac{t^{2}\sigma^{2}}{2}| \leq \int_{-\infty}^{\infty} |e^{i t x} - 1 - i t x + \frac{t^{2}x^{2}}{2}| dF(x), moving the absolute value inside the integral (F is the common distribution of the x_{k})

\leq \frac{|t|^{3}}{6} \int_{-\infty}^{\infty} |x|^{3} dF(x), applying the inequality for n = 3

= \frac{|t|^{3}}{6} \times E(|x_{k}|^{3})

= \frac{|t|^{3}}{6} \times \rho, where \rho < \infty.

Next, we apply the triangle inequality to \psi(t) = (1 - \frac{t^{2}\sigma^{2}}{2}) + (\psi(t) - 1 + \frac{t^{2}\sigma^{2}}{2}):

|\psi(t)| \leq |1 - \frac{t^{2}\sigma^{2}}{2}| + |\psi(t) - 1 + \frac{t^{2}\sigma^{2}}{2}| \leq |1 - \frac{t^{2}\sigma^{2}}{2}| + \frac{1}{6} \rho |t|^{3}.

As long as \frac{t^{2}\sigma^{2}}{2} \leq 1, the first absolute value can be dropped. This holds in our range, because |\xi| \leq T \leq \frac{4\sqrt{n}}{3} gives t^{2}\sigma^{2} = \frac{\xi^{2}}{n} \leq \frac{16}{9}.

\therefore |\psi(t)| \leq 1 - \frac{t^{2}\sigma^{2}}{2} + \frac{1}{6} \rho |t|^{3}.

Substituting back t = \frac{\xi}{\sigma\sqrt{n}}, we get,

|\psi(\frac{\xi}{\sigma\sqrt{n}})| \leq 1 - \frac{\xi^{2}}{2n} + \frac{\rho}{6} \times \frac{|\xi|^{3}}{\sigma^{3} n^{3/2}} \rule{2cm}{0.4pt}  (7).
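To see these bounds at work, here is a minimal sketch that checks the third-moment bound |\psi(t) - 1 + \frac{t^{2}\sigma^{2}}{2}| \leq \frac{\rho|t|^{3}}{6} and the bound on |\psi(t)| on a grid of t values. The two distributions (a fair +1/-1 coin and the uniform distribution on [-1, 1]) are hypothetical examples chosen because their characteristic functions are known in closed form; the function names are mine.

```python
import math

def psi_coin(t):
    """Characteristic function of x = +/-1 with probability 1/2
    (sigma = 1, rho = 1)."""
    return math.cos(t)

def psi_uniform(t):
    """Characteristic function of the uniform law on [-1, 1]
    (sigma^2 = 1/3, rho = 1/4)."""
    return math.sin(t) / t if t != 0.0 else 1.0

def third_order_bound_holds(psi, sigma, rho, t, tol=1e-12):
    """|psi(t) - 1 + sigma^2 t^2 / 2| <= rho * |t|^3 / 6."""
    lhs = abs(psi(t) - 1.0 + 0.5 * (sigma * t) ** 2)
    return lhs <= rho * abs(t) ** 3 / 6.0 + tol

def modulus_bound_holds(psi, sigma, rho, t, tol=1e-12):
    """|psi(t)| <= 1 - sigma^2 t^2 / 2 + rho * |t|^3 / 6,
    checked only where sigma^2 t^2 <= 2."""
    if (sigma * t) ** 2 > 2.0:
        return True  # outside the range where the bound is claimed
    rhs = 1.0 - 0.5 * (sigma * t) ** 2 + rho * abs(t) ** 3 / 6.0
    return abs(psi(t)) <= rhs + tol

examples = [(psi_coin, 1.0, 1.0),
            (psi_uniform, math.sqrt(1.0 / 3.0), 0.25)]
grid = [k / 100.0 for k in range(-500, 501)]

for psi, sigma, rho in examples:
    ok3 = all(third_order_bound_holds(psi, sigma, rho, t) for t in grid)
    okm = all(modulus_bound_holds(psi, sigma, rho, t) for t in grid)
    print(psi.__name__, ok3, okm)
```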

Now, the integral in (2) only involves |\xi| \leq T, so we can bound one factor of |\xi| by T = \frac{4}{3}\frac{\sigma^{3}\sqrt{n}}{\rho}. So, the equation (7) becomes

|\psi(\frac{\xi}{\sigma\sqrt{n}})| \leq 1 - \frac{\xi^{2}}{2n} + \frac{\rho}{6\sigma^{3}n^{3/2}}|\xi|^{2} |\xi|

\leq 1 - \frac{\xi^{2}}{2n} +\frac{\rho}{6\sigma^{3}n^{3/2}}|\xi|^{2} \times (\frac{4\sigma^{3}\sqrt{n}}{3\rho})

= 1- \frac{\xi^{2}}{2n} + \frac{4\xi^{2}}{18n}

= 1 - \frac{5\xi^{2}}{18n}

\therefore |\psi(\frac{\xi}{\sigma\sqrt{n}})| \leq e^{\frac{-5\xi^2}{18n}} \rule{2cm}{0.4pt}  (8)

, using the inequality 1 - x \leq e^{-x}.

Since \sigma^{3} \leq \rho, the right side \frac{3\rho}{\sigma^{3}\sqrt{n}} of the theorem is at least 1 when \sqrt{n} \leq 3, so the assertion of the theorem is trivially true there, and hence we may assume n\geq 10.

Raising both sides of equation (8) to the power n-1, we get,

|\psi(\frac{\xi}{\sigma\sqrt{n}})|^{n-1} \leq e^{\frac{-5\xi^2}{18n}\times (n-1)}

Thus, for n \geq 10, where \frac{n-1}{n} \geq \frac{9}{10},

|\psi(\frac{\xi}{\sigma\sqrt{n}})|^{n-1} \leq e^{\frac{-5\xi^2}{18\times 10}\times (10-1)} = e^{\frac{-\xi^{2}}{4}} \rule{2cm}{0.4pt}  (9).
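Equations (8) and (9) can be spot-checked on a concrete case. The sketch below assumes the hypothetical example of fair +1/-1 summands, for which \sigma = \rho = 1, \psi(t) = \cos t and T = \frac{4\sqrt{n}}{3}; the function name and the grid size are my own choices.

```python
import math

def bounds_8_and_9_hold(n, steps=2000, tol=1e-12):
    """Check (8) and (9) for x_k = +/-1 on a grid of 0 <= xi <= T.

    (8): |psi(xi / sqrt(n))| <= exp(-5 xi^2 / (18 n))
    (9): |psi(xi / sqrt(n))|^(n-1) <= exp(-xi^2 / 4), for n >= 10
    """
    sqrt_n = math.sqrt(n)
    T = 4.0 * sqrt_n / 3.0              # T for sigma = rho = 1
    for i in range(steps + 1):
        xi = T * i / steps              # by symmetry, xi >= 0 is enough
        modulus = abs(math.cos(xi / sqrt_n))
        if modulus > math.exp(-5.0 * xi * xi / (18.0 * n)) + tol:
            return False                # equation (8) fails
        if modulus ** (n - 1) > math.exp(-xi * xi / 4.0) + tol:
            return False                # equation (9) fails
    return True

for n in (10, 25, 400):
    print(n, bounds_8_and_9_hold(n))
```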

Let me remind you of equation (3):

|\psi^{n}(\frac{\xi}{\sigma\sqrt{n}}) - (e^{\frac{-\xi^{2}}{2n}})^{n}| \leq n |\psi(\frac{\xi}{\sigma\sqrt{n}}) - (e^{\frac{-\xi^{2}}{2n}})| \Gamma^{n-1} if |\psi(\frac{\xi}{\sigma\sqrt{n}})| \leq \Gamma and e^{\frac{-\xi^{2}}{2n}} \leq \Gamma.

Since e^{\frac{-\xi^{2}}{2n}} \leq e^{\frac{-5\xi^{2}}{18n}}, equation (8) lets us take \Gamma = e^{\frac{-5\xi^{2}}{18n}}. So, the right part of equation (9) may serve as the bound for \Gamma^{n-1} i.e.

\Gamma^{n-1} = e^{\frac{-5\xi^{2}}{18n}\times (n-1)} \leq e^{\frac{-\xi^{2}}{4}}

Thus, we can have

|\psi^{n}(\frac{\xi}{\sigma\sqrt{n}}) - (e^{\frac{-\xi^{2}}{2n}})^{n}| \leq n |\psi(\frac{\xi}{\sigma\sqrt{n}}) - (e^{\frac{-\xi^{2}}{2n}})|e^{\frac{-\xi^{2}}{4}} \rule{2cm}{0.4pt}  (10).

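Also, we need to formulate one more inequality. Let’s start from this:

e^{-x} \leq 1 - x + \frac{x^{2}}{2} for x > 0

\Rightarrow e^{-x} - 1 + x \leq \frac{x^{2}}{2} \rule{2cm}{0.4pt}  (11).

Since also e^{-x} \geq 1 - x, the left side of (11) is non-negative, so (11) bounds its absolute value as well.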

Oh! I almost forgot. We need to construct something very useful, i.e.

n |\psi(\frac{\xi}{\sigma\sqrt{n}}) - e^{\frac{-\xi^{2}}{2n}}| \leq n|\psi(\frac{\xi}{\sigma\sqrt{n}}) - 1 + \frac{\xi^{2}}{2n}| + n|1 - \frac{\xi^{2}}{2n} - e^{\frac{-\xi^{2}}{2n}}|, where I have added and subtracted 1 - \frac{\xi^{2}}{2n} and applied the triangle inequality

= First term + Second term \rule{2cm}{0.4pt}  (12)

which means,

First term = n|\psi(\frac{\xi}{\sigma\sqrt{n}}) - 1 + \frac{\xi^{2}}{2n}| \leq \frac{\rho |\xi|^{3}}{6 \sigma^{3}n^{3/2}}\times n = \frac{\rho |\xi|^{3}}{6 \sigma^{3}n^{1/2}}, from the third-moment bound behind equation (7),

Second term = n|1 - \frac{\xi^{2}}{2n} - e^{\frac{-\xi^{2}}{2n}}| \leq n \times \frac{1}{2} \times \frac{\xi^{4}}{(2n)^{2}} = \frac{1}{8n}\xi^{4}, from equation (11) with x = \frac{\xi^{2}}{2n}.

Substituting the above results into equation (12), we get,

n |\psi(\frac{\xi}{\sigma\sqrt{n}}) - e^{\frac{-\xi^{2}}{2n}}| \leq\frac{\rho |\xi|^{3}}{6 \sigma^{3}n^{1/2}} +\frac{1}{8n}\xi^{4}.
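Here is a small sketch that checks inequality (11) and the combined bound above on a grid. As before, the fair +1/-1 summands (\sigma = \rho = 1, \psi(t) = \cos t) are a hypothetical example of mine, and so are the function names and tolerances.

```python
import math

def exp_remainder_holds(x, tol=1e-15):
    """Inequality (11): 0 <= e^(-x) - 1 + x <= x^2 / 2 for x > 0."""
    r = math.exp(-x) - 1.0 + x
    return -tol <= r <= 0.5 * x * x + tol

def combined_bound_holds(n, xi, tol=1e-10):
    """n * |psi(xi / sqrt(n)) - exp(-xi^2 / (2 n))|
       <= |xi|^3 / (6 sqrt(n)) + xi^4 / (8 n)   for x_k = +/-1."""
    lhs = n * abs(math.cos(xi / math.sqrt(n))
                  - math.exp(-xi * xi / (2.0 * n)))
    rhs = abs(xi) ** 3 / (6.0 * math.sqrt(n)) + xi ** 4 / (8.0 * n)
    return lhs <= rhs + tol

ok_11 = all(exp_remainder_holds(k / 100.0) for k in range(1, 1001))
ok_12 = all(combined_bound_holds(n, k / 10.0)
            for n in (10, 100, 1000)
            for k in range(0, 101))
print("(11) holds on the grid:", ok_11)
print("combined bound holds on the grid:", ok_12)
```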

Since \sqrt{n} > 3, we can insert the above inequality into the integrand of (2), which means

\pi |F_{n}(x) - \phi(x)| \leq \int_{-T}^{T} |\psi^{n}(\frac{\xi}{\sigma\sqrt{n}}) - (e^{\frac{-\xi^{2}}{2n}})^{n}| \frac{d\xi}{|\xi|} + \frac{9.6}{T}

\leq \int_{-T}^{T}n |\psi(\frac{\xi}{\sigma\sqrt{n}}) - (e^{\frac{-\xi^{2}}{2n}})| e^{\frac{-\xi^{2}}{4}} \frac{d\xi}{|\xi|} + \frac{9.6}{T}, from equation (10)

\leq \int_{-T}^{T} (\frac{\rho |\xi|^{3}}{6 \sigma^{3}n^{1/2}} +\frac{1}{8n}\xi^{4})\times \frac{1}{|\xi|} \times e^{\frac{-\xi^{2}}{4}} d\xi + \frac{9.6}{T}

= \int_{-T}^{T} \frac{\rho}{6\sigma^{3}\sqrt{n}} |\xi|^{2} e^{\frac{-\xi^{2}}{4}} d\xi + \int_{-T}^{T} \frac{|\xi|^{3}}{8n} e^{\frac{-\xi^{2}}{4}} d\xi + \frac{9.6}{T} \rule{2cm}{0.4pt}  (13).

Also, we know T = \frac{4\sigma^{3}\sqrt{n}}{3\rho}. So,

\frac{\rho}{6\sigma^{3}\sqrt{n}} = \frac{2}{9T} and \frac{9.6}{T} =\frac{9.6\times 3\rho}{4\sigma^{3}\sqrt{n}} =\frac{36\rho}{5\sigma^{3}\sqrt{n}}.

For the second integral, T \leq \frac{4\sqrt{n}}{3} and \sqrt{n} > 3 give \frac{1}{n} = \frac{1}{\sqrt{n}} \times \frac{1}{\sqrt{n}} \leq \frac{4}{3T} \times \frac{1}{3} = \frac{4}{9T}, so \frac{1}{8n} \leq \frac{1}{18T}.

Both integrands in equation (13) are non-negative, so extending the limits from \pm T to \pm\infty can only increase the integrals. Thus we use

I_{1} = \int_{-\infty}^{\infty} |\xi|^{2} e^{\frac{-\xi^{2}}{4}} d\xi = 4\sqrt{\pi}

I_{2} = \int_{-\infty}^{\infty} |\xi|^{3} e^{\frac{-\xi^{2}}{4}} d\xi = 16.

These integrals can be done by parts or with the Gamma function. Note that the absolute value matters in I_{2}: without it the integrand would be odd and the integral would vanish. There is no need to evaluate \int_{-a}^{a} |\xi|^{2} e^{\frac{-\xi^{2}}{4}} d\xi for finite a, because it is bounded by its value over the whole line.

So, equation (13) becomes

\pi |F_{n}(x) - \phi(x)| \leq \frac{1}{T}(\frac{2}{9} \times 4\sqrt{\pi} + \frac{1}{18} \times 16 + 9.6) = \frac{1}{T}(\frac{8}{9}\sqrt{\pi} + \frac{8}{9} + 9.6) \approx \frac{12.06}{T} < \frac{4\pi}{T}.

\therefore |F_{n}(x) - \phi(x)| \leq \frac{4}{T} = \frac{3\rho}{\sigma^{3} \sqrt{n}}

So the final form is:

|F_{n}(x) - \phi(x)| \leq \frac{3\rho}{\sigma^{3} \sqrt{n}}
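The constant in the last step can be double-checked numerically. This is a minimal sketch under stated assumptions: the two integrals \int |\xi|^{2} e^{-\xi^{2}/4} d\xi and \int |\xi|^{3} e^{-\xi^{2}/4} d\xi enter with weights \frac{2}{9T} and \frac{1}{18T} (which follow from T = \frac{4\sigma^{3}\sqrt{n}}{3\rho}, T \leq \frac{4\sqrt{n}}{3} and \sqrt{n} > 3), and the smoothing term contributes \frac{9.6}{T}. The midpoint rule and the cutoff at |\xi| = 30 are my own choices.

```python
import math

def integrate_midpoint(f, a, b, steps=100000):
    """Plain midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def weight(xi):
    """The factor exp(-xi^2 / 4) from equation (9)."""
    return math.exp(-xi * xi / 4.0)

# Tails beyond |xi| = 30 are negligible: the weight there is exp(-225).
I1 = integrate_midpoint(lambda xi: xi * xi * weight(xi), -30.0, 30.0)
I2 = integrate_midpoint(lambda xi: abs(xi) ** 3 * weight(xi), -30.0, 30.0)

# pi * T * |F_n - phi| <= (2/9) * I1 + (1/18) * I2 + 9.6
total = (2.0 / 9.0) * I1 + (1.0 / 18.0) * I2 + 9.6

print("I1 numeric:", I1, "closed form 4*sqrt(pi):", 4.0 * math.sqrt(math.pi))
print("I2 numeric:", I2, "closed form:", 16.0)
print("total:", total, "4*pi:", 4.0 * math.pi, "total < 4*pi:", total < 4.0 * math.pi)
```

Because the total stays below 4\pi, dividing by \pi T gives |F_{n}(x) - \phi(x)| \leq \frac{4}{T}, which is exactly \frac{3\rho}{\sigma^{3}\sqrt{n}}.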


If you guys have any questions, comments, or insults, please don’t hesitate to shoot me an email or comment below.
