
Central Limit Theorem

The central limit theorem is a key result in probability theory that helps explain why normal, or Gaussian, distributions are so ubiquitous. The setup is that you have distributions for $N$ random variables $x_i$ and you want to know the distribution of $q = \sum_{i=1}^{N} x_i$. Think of each $x_i$ as coming from its own distribution, like in the figure below. For instance, $x_1$ might be the weight of spoons, $x_2$ the weight of forks, $x_3$ the weight of bowls, ..., and $x_N$ the weight of plates in your kitchen. Then $q$ would represent the total weight when you have one of each of those objects. The distribution of weights for each object might be weird because you have some mix-and-match set of silverware from your parents, grandparents, IKEA, and the thrift shop. The central limit theorem says that if you have enough objects (i.e. $N$ is large), then $q$ has a normal (Gaussian) distribution.

Moreover, the central limit theorem states that the mean value of $q$ is given by

\begin{equation} \mu_{q} = \sum_{i=1}^{N} \mu_{x_i} \end{equation}

and the variance (standard deviation squared) is given by

\begin{equation} \sigma_{q}^{2} = \sum_{i=1}^{N} \sigma^2_{x_i} \end{equation}

The mean probably isn't surprising because $q$ is just a sum, and the integral that defines the mean distributes across each term. Also, the equation for the variance is the same as the propagation of errors formula we use when we add different measurements together. However, that propagation of errors formula is derived from the central limit theorem, not vice versa.
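If you want to convince yourself of those two formulas numerically, here is a quick check (a sketch, not part of the project code): sum $N$ independent uniform(0, 1) variables, each of which has mean $1/2$ and variance $1/12$, and compare the sample mean and variance of $q$ to the sums above.

import numpy as np

# sum N = 10 independent uniform(0, 1) variables, 100000 times each
N, n_samples = 10, 100000
x = np.random.uniform(0., 1., size=(N, n_samples))
q = x.sum(axis=0)

print(np.mean(q), N*0.5)   # mu_q = sum of the means = N/2
print(np.var(q), N/12.)    # sigma_q^2 = sum of the variances = N/12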

This is a collaborative project

In this repository there is a folder called distributions with several python files. The idea is that each student will create one of these python files and we will use GitHub to collect them. Each of these files has a python class that describes a probability distribution. Each of these classes will define:

  • x_min, x_max, f_max - used for the accept/reject Monte Carlo sampling algorithm
  • pdf() - the probability density function
  • mean() - the mean of the pdf
  • std() - the standard deviation of the pdf

In addition, each of these python classes inherits from BaseDistribution, which knows how to run the accept/reject algorithm using the information above (see inside distributions/base_distribution.py; a sketch of the idea follows). To generate n_samples from the pdf, you simply call dist.rvs(n_samples), where dist is an instance of one of these python classes.
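The contents of base_distribution.py aren't reproduced here, but the accept/reject idea is simple enough to sketch. The attribute and method names below (x_min, x_max, f_max, pdf(), rvs()) are the ones this notebook actually uses; the loop itself is only a minimal sketch of one way the sampler could be implemented, not necessarily the exact code in the repository.

import numpy as np

class BaseDistribution(object):
    def rvs(self, n_samples):
        """Accept/reject sampling (sketch): propose points uniformly in the
        box [x_min, x_max] x [0, f_max] and keep the x values that land
        under the pdf curve."""
        samples = []
        while len(samples) < n_samples:
            x = np.random.uniform(self.x_min, self.x_max)
            y = np.random.uniform(0., self.f_max)
            if y < self.pdf(x):
                samples.append(x)
        return np.array(samples)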

Naming Convention: Name your file Dist_<netid>.py and your distribution the same way (without the .py). If you want to contribute more than one distribution, you can add Dist_<netid>_<index>.py, where <index> is 1,2,3,...

Here's an example:

!cat distributions/Dist_kc90.py
import numpy as np
from .base_distribution import BaseDistribution

class Dist_kc90(BaseDistribution):
	def __init__(self):
		self.f_max = 1
		self.x_min = -1
		self.x_max = 1


	def pdf(self, x):
		"""This is your PDF"""
		return np.abs(x)

	def mean(self):
		"""This is the mean of the PDF"""
		return 0.

	def std(self):
		"""This is the standard deviation of the pdf"""
		return np.sqrt(0.5)
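Before contributing a file like this, it's worth checking that your pdf() really integrates to 1 over [x_min, x_max] and that mean() and std() are consistent with it. Here is one quick way to do that with scipy.integrate.quad (a sketch; not required by the project):

import numpy as np
from scipy.integrate import quad
from distributions import Dist_kc90

dist = Dist_kc90()
norm_const, _ = quad(dist.pdf, dist.x_min, dist.x_max)
mean_val, _ = quad(lambda x: x*dist.pdf(x), dist.x_min, dist.x_max)
var_val, _ = quad(lambda x: (x - mean_val)**2 * dist.pdf(x), dist.x_min, dist.x_max)

print(norm_const)                    # should be 1.0
print(mean_val, dist.mean())         # should agree
print(np.sqrt(var_val), dist.std())  # should agree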

Example usage of the distributions

OK, now let's use the distributions that have been submitted so far.

%pylab inline --no-import-all
from scipy.stats import norm #will use this for plotting
Populating the interactive namespace from numpy and matplotlib
# import all our distributions
import distributions
# some funny python to make a list of all the distributions for convenience
all_distributions_dict = dict([(name, cls) for name, cls in distributions.__dict__.items() if isinstance(cls, type)])
all_distributions_list = [(cls) for name, cls in distributions.__dict__.items() if isinstance(cls, type)]
# print out their names
all_distributions_dict.keys()
dict_keys(['Dist_rdm445', 'Dist_pr1392', 'Dist_pw1091', 'Dist_omr234', 'Dist_ap5312', 'Dist_os852', 'Dist_phh250', 'Dist_kc90', 'Dist_fh828', 'Dist_pme240', 'Dist_knd286', 'Dist_yz4244', 'Dist_mm7253', 'Dist_cas955', 'Dist_lwp226', 'Dist_lac683', 'Dist_fay221', 'Dist_jnt299'])
len(all_distributions_dict.keys())
18
## Do tests
ok_distributions_list=[]
problems=[]
for i, cls in enumerate(all_distributions_list):
    #print(cls)
    try:
        dist = cls()
        N_test = 100000
        #print('will try to generate for %s' %(cls.__name__))
        if dist.pdf(dist.x_min + .3*(dist.x_max-dist.x_min)) < 1E-3:
            print("may have a problem with %s" %(cls.__name__))
            continue
            
        rvs = dist.rvs(N_test)
        if np.abs(np.mean(rvs) - dist.mean()) > 5*np.std(rvs)/np.sqrt(N_test):
            print("means don't match for %s: %f vs. %f" %(cls.__name__, 
                                                          np.mean(rvs), dist.mean()))
            continue
            
        elif np.abs(np.std(rvs) - dist.std()) > 5*np.std(rvs)/np.sqrt(np.sqrt(1.*N_test)):
            print("std devs. don't match for %s: %f vs. %f" %(cls.__name__, 
                                                          np.std(rvs), dist.std()))
            continue
        elif np.abs(np.std(rvs) - dist.std()) / dist.std() > 0.1:
            print("std devs. don't match for %s: %f vs. %f" %(cls.__name__, 
                                                          np.std(rvs), dist.std()))
            continue
        
        elif np.sum(dist.pdf(np.linspace(dist.x_min,dist.x_max,100))<0) > 0:
            print("pdf was negative in some places")
            continue                    

        else:
            print("%s passes tests, adding it" %(cls.__name__))
            ok_distributions_list.append(cls)
    except Exception:
        print("%s has errors, doesn't work" %(cls.__name__))
        continue

print("list of ok distributions:",[i.__name__ for i in ok_distributions_list])
Dist_os852 passes tests, adding it
means don't match for Dist_pr1392: 0.732852 vs. 0.000000
Dist_pw1091 passes tests, adding it
means don't match for Dist_ap5312: 0.721174 vs. 0.800000
Dist_kc90 passes tests, adding it
std devs. don't match for Dist_fh828: 0.217878 vs. 0.181690
means don't match for Dist_yz4244: 0.833356 vs. 0.166667
means don't match for Dist_mm7253: 0.097384 vs. 0.119000
std devs. don't match for Dist_jnt299: 0.040570 vs. 0.001646
Dist_rdm445 passes tests, adding it
Dist_phh250 passes tests, adding it
std devs. don't match for Dist_pme240: 0.095205 vs. 0.040377
Dist_knd286 passes tests, adding it
means don't match for Dist_omr234: 0.665732 vs. 1.000000
means don't match for Dist_cas955: 0.621786 vs. 0.000000
means don't match for Dist_lwp226: 0.332858 vs. 2.500000
Dist_lac683 passes tests, adding it
means don't match for Dist_fay221: 0.499351 vs. 1.500000
list of ok distributions: ['Dist_os852', 'Dist_pw1091', 'Dist_kc90', 'Dist_rdm445', 'Dist_phh250', 'Dist_knd286', 'Dist_lac683']
problems = [x for x in all_distributions_list if x not in ok_distributions_list]
[i.__name__ for i in problems]
['Dist_pr1392',
 'Dist_ap5312',
 'Dist_fh828',
 'Dist_yz4244',
 'Dist_mm7253',
 'Dist_jnt299',
 'Dist_pme240',
 'Dist_omr234',
 'Dist_cas955',
 'Dist_lwp226',
 'Dist_fay221']
# how many samples for plots?
n_samples = 100000
# Here's an example of usage
dist = distributions.Dist_kc90()
rvs = dist.rvs(n_samples)
counts, bins, patches = plt.hist(rvs, bins=50, density=True, alpha=0.2)
y = []
for bin in bins:
    y.append(dist.pdf(bin))
plt.plot(bins, y, c='r', lw=2)
[<matplotlib.lines.Line2D at 0x103343ba8>]
dist.std()
0.70710678118654757
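That value is easy to check by hand: the pdf $|x|$ on $[-1, 1]$ is symmetric, so the mean is zero and

\begin{equation} \sigma^2 = \int_{-1}^{1} x^2 \, |x| \, dx = 2 \int_0^1 x^3 \, dx = \frac{1}{2}, \end{equation}

which gives $\sigma = \sqrt{0.5} \approx 0.7071$, matching the output above.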

Let's inspect all the distributions we have

Here we will loop over the different distributions and make a plot like the one above

fig = plt.figure(figsize=2*plt.figaspect(len(ok_distributions_list)))
for i, cls in enumerate(ok_distributions_list):
    dist = cls()
    rvs = dist.rvs(10000)
    ax = fig.add_subplot(len(ok_distributions_list),1,i+1)
    counts, bins, patches = ax.hist(rvs, bins=50, density=True, alpha=0.2)
    # depending on how you code up your pdf() function, numpy might not
    # be able to "broadcast" (apply the function for each element of a numpy array)
    # but this should always work
    y = []
    for bin in bins:
        y.append(dist.pdf(bin))
        
    # ok, now plot it
    plt.plot(bins, y, c='r', lw=2)
    
    # and let's check if the distribution is ok
    print("%s: std from samples = %.2f, std from dist = %.2f" %(cls.__name__,np.std(dist.rvs(n_samples)), dist.std()))
    if np.abs(np.std(dist.rvs(n_samples)) - dist.std()) / dist.std() > 0.1:
        print("looks like a problem with this distribution: ", cls)
Dist_os852: std from samples = 0.66, std from dist = 0.65
Dist_pw1091: std from samples = 0.88, std from dist = 0.89
Dist_kc90: std from samples = 0.71, std from dist = 0.71
Dist_rdm445: std from samples = 0.71, std from dist = 0.71
Dist_phh250: std from samples = 0.24, std from dist = 0.24
Dist_knd286: std from samples = 0.31, std from dist = 0.31
Dist_lac683: std from samples = 0.58, std from dist = 0.54

Demonstration of the Central Limit Theorem

OK, let's use one of the distributions to demonstrate the central limit theorem. We will use the same distribution $N$ times.

First let's make a little helper function.

def do_convolution(dist, N):
    n_samples = 10000
    q = np.zeros(n_samples)
    var_q = 0.
    mean_q = 0.

    for i in range(N):
        q = q+dist.rvs(n_samples)
        var_q = var_q + dist.std()**2
        mean_q = mean_q + dist.mean()

    std_q = np.sqrt( var_q )


    counts, bins, patches = plt.hist(q, bins=50, density=True, alpha=0.2)
    plt.plot(bins, norm.pdf(bins,loc=mean_q, scale=std_q), lw=2, c='r')

Now let's use it for $N=\{2,4,32\}$

dist = distributions.Dist_kc90()
do_convolution(dist,2)
do_convolution(dist,4)
do_convolution(dist,32)

Gorgeous!

Now let's do the same thing randomly using different distributions

To do this we will use np.random.choice to randomly choose from our list. Here's an example

np.random.choice(['a','b','c','d'], 10)
array(['d', 'c', 'd', 'c', 'a', 'a', 'a', 'c', 'b', 'a'],
      dtype='<U1')

Now let's make a variation on the helper function above

def do_random_convolution(list_of_distributions, N):
    n_samples = 10000
    q = np.zeros(n_samples)
    var_q = 0.
    mean_q = 0.

    for dist_class in np.random.choice(list_of_distributions,N):
        dist = dist_class()
        print(dist_class.__name__, dist.std())
        q = q+dist.rvs(n_samples)
        var_q = var_q + dist.std()**2
        mean_q = mean_q + dist.mean()

    std_q = np.sqrt( var_q )


    counts, bins, patches = plt.hist(q, bins=50, density=True, alpha=0.2)
    plt.plot(bins, norm.pdf(bins,loc=mean_q, scale=std_q), lw=2, c='r')
do_random_convolution(ok_distributions_list,2)
Dist_kc90 0.707106781187
Dist_phh250 0.235702260396
do_random_convolution(ok_distributions_list,4)
Dist_os852 0.6545248448388085
Dist_pw1091 0.88669
Dist_os852 0.6545248448388085
Dist_rdm445 0.707106781187
do_random_convolution(ok_distributions_list,4)
Dist_kc90 0.707106781187
Dist_os852 0.6545248448388085
Dist_knd286 0.307054709393
Dist_os852 0.6545248448388085
do_random_convolution(ok_distributions_list,32)
Dist_phh250 0.235702260396
Dist_rdm445 0.707106781187
Dist_phh250 0.235702260396
Dist_os852 0.6545248448388085
Dist_lac683 0.539449062475
Dist_phh250 0.235702260396
Dist_os852 0.6545248448388085
Dist_os852 0.6545248448388085
Dist_phh250 0.235702260396
Dist_rdm445 0.707106781187
Dist_phh250 0.235702260396
Dist_kc90 0.707106781187
Dist_os852 0.6545248448388085
Dist_phh250 0.235702260396
Dist_lac683 0.539449062475
Dist_phh250 0.235702260396
Dist_lac683 0.539449062475
Dist_kc90 0.707106781187
Dist_pw1091 0.88669
Dist_knd286 0.307054709393
Dist_os852 0.6545248448388085
Dist_kc90 0.707106781187
Dist_os852 0.6545248448388085
Dist_knd286 0.307054709393
Dist_phh250 0.235702260396
Dist_rdm445 0.707106781187
Dist_phh250 0.235702260396
Dist_pw1091 0.88669
Dist_pw1091 0.88669
Dist_rdm445 0.707106781187
Dist_knd286 0.307054709393
Dist_pw1091 0.88669

Part 2 of project

a) preliminaries

Starting from either master or the branch you used to submit your distribution, update so that you have the current version of cranmer/intro-exp-phys-II.

You can either do this in GitHub desktop by finding the button near the top left or by typing this:

git fetch cranmer master
git merge cranmer/master

Now create a new branch called "part2".

b) execute the notebook above

c) Make a $\chi^2$ function

Below is a copy of the do_random_convolution function with a new name. Modify it so that it returns the $\chi^2$ quantity that says how closely the observed distribution matches the prediction from the Central Limit Theorem. A sketch of one way to compute the $\chi^2$ appears after the code below.

You might want to check out the chi-square-of-distribution notebook.

def do_random_convolution_with_chi2(list_of_distributions, N):
    n_samples = 100000
    q = np.zeros(n_samples)
    var_q = 0.
    mean_q = 0.

    for dist_class in np.random.choice(list_of_distributions,N):
        dist = dist_class()
        q = q+dist.rvs(n_samples)
        var_q = var_q + dist.std()**2
        mean_q = mean_q + dist.mean()

    std_q = np.sqrt( var_q )


    counts, bins, patches = plt.hist(q, bins=50, density=True, alpha=0.2)
    plt.plot(bins, norm.pdf(bins,loc=mean_q, scale=std_q), lw=2, c='r')
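As a starting point, here is one way the $\chi^2$ could be computed inside that function (a sketch only, assuming Poisson uncertainties on the raw, unnormalized bin counts; the chi-square-of-distribution notebook may do things slightly differently):

import numpy as np
from scipy.stats import norm

def chi2_against_normal(q, mean_q, std_q, n_bins=50):
    """Chi-square between a histogram of q and the normal distribution
    predicted by the CLT, assuming Poisson errors on the raw bin counts."""
    counts, edges = np.histogram(q, bins=n_bins)
    # expected counts per bin from the normal CDF
    expected = len(q) * np.diff(norm.cdf(edges, loc=mean_q, scale=std_q))
    # skip nearly-empty bins so we don't divide by ~0
    mask = expected > 5
    return np.sum((counts[mask] - expected[mask])**2 / expected[mask])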

d) Make a plot

Plot the $\chi^2$ quantity vs. $N$ for $N = 2, 4, 8, 16, 32$.
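One way to make that plot, assuming your do_random_convolution_with_chi2 returns the $\chi^2$ value (a sketch):

N_values = [2, 4, 8, 16, 32]
chi2_values = [do_random_convolution_with_chi2(ok_distributions_list, N)
               for N in N_values]
plt.figure()
plt.plot(N_values, chi2_values, 'o-')
plt.xlabel('N')
plt.ylabel(r'$\chi^2$')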

e) push a new version of this notebook to GitHub

Execute the notebook, get your plots, save it, commit the changes to the part2 branch, and then push (sync) to GitHub.com. When you are done, make a new pull request.