parameter estimation
common situation in statistics
almost any problem in statistics can be framed as a parameter estimation problem.
bayes box with a whole continuum of hypotheses
a set of possible values with associated probabilities is what is known as a probability distribution
we move from bayesian problems with a few prior probabilities to using a whole prior distribution
one could say that the parameter is now a random variable… however the author doesn't like the term "random" because it has connotations of something fluctuating… whereas in this circumstance it really refers to our uncertainty
bus example
it has some specifics and some generalities… should be good practice anyhow
setting out the problem:
- moved to Auckland
- takes the bus to work each day
- not confident with the new bus system
- first week just took the first bus that came along heading towards the city
- first week caught 5 morning buses
- 3 of them took him far away, with an extra 20-minute walk to work
- infer the proportion of buses that are "good"
infer \(\theta\) using a bayesian framework
what do we know about \(\theta\)? well, it is between 0 and 1
if we make a discrete approximation, it could be any of the following:
\[\{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1\}\]
let's create a bayes box, knowing that we have 11 possible values for the parameter, and give them a uniform prior distribution.
DataFrame 1

| \(\theta\) | prior | likelihood | prior × likelihood | posterior |
|---|---|---|---|---|
| 0.0 | 0.0909 | | | |
| 0.1 | 0.0909 | | | |
| 0.2 | 0.0909 | | | |
| 0.3 | 0.0909 | | | |
| 0.4 | 0.0909 | | | |
| 0.5 | 0.0909 | | | |
| 0.6 | 0.0909 | | | |
| 0.7 | 0.0909 | | | |
| 0.8 | 0.0909 | | | |
| 0.9 | 0.0909 | | | |
| 1.0 | 0.0909 | | | |
| Total | 1.0000 | | | |
the code to produce this table
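i didn't copy down the original code, so here is a minimal sketch of my own for producing the bayes box with pandas, assuming 2 "good" buses out of the 5 observed (3 bad):

```python
# Bayes box for the bus problem (my own reconstruction, not the book's code):
# 11 candidate theta values, a uniform prior, and a binomial likelihood
# for x = 2 good buses out of N = 5.
import numpy as np
import pandas as pd
from scipy.stats import binom

theta = np.linspace(0, 1, 11)                 # discrete grid for theta
prior = np.full(len(theta), 1 / len(theta))   # uniform prior, 1/11 each
likelihood = binom.pmf(k=2, n=5, p=theta)     # P(x = 2 | N = 5, theta)
prior_times_lik = prior * likelihood          # numerator of Bayes' rule
posterior = prior_times_lik / prior_times_lik.sum()  # normalise to sum to 1

bayes_box = pd.DataFrame({
    "theta": theta,
    "prior": prior,
    "likelihood": likelihood,
    "prior_x_likelihood": prior_times_lik,
    "posterior": posterior,
})
print(bayes_box.round(4))
```

dividing by the sum of the prior × likelihood column is what turns the unnormalised numbers into a posterior distribution.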
calculating the likelihood uses the binomial distribution:
\[P(x \mid \theta) = \binom{N}{x} \theta^x (1-\theta)^{N-x}\]
i won't go through all the calculations now, as i need to save time… but \(\theta = 0\) and \(\theta = 1\) have zero likelihood.
\(\theta = 0.4\), which is the same as 2/5, has the highest likelihood. that doesn't necessarily mean \(\theta = 0.4\) is the most likely value for \(\theta\), since that assertion also depends on the prior… but given we have assumed a uniform prior, in this case it does.
sampling distribution and likelihood
we used the binomial, but it is only the binomial with an assumed parameter value and the observed data plugged in that gives the likelihood.
the binomial itself is called the sampling distribution.
from the rosetta stone i believe the likelihood is itself sometimes called the sampling distribution, but perhaps it is more nuanced…
missed a prior
we actually assumed N = 5 as prior information without explicitly stating it.
predicting the future
three different priors with the binomial bus problem

up to page 34 - the beta distribution
the three different priors are :
- prior 1 is a uniform prior
- prior 2 emphasises the extremes
it is actually a criticism of the uniform prior that it doesn't give very high probability to extreme values.
\[f(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} \quad \text{for } 0 < \theta < 1, \, \alpha > 0, \, \beta > 0\]
- prior 3 assumes that we have quite a lot of information (in our beliefs) and that we think the parameter is close to the value of 0.5
colloquially we can refer to them as the "emphasising the extremes" prior and the "informative" prior.
the three priors we used were all examples of the beta distribution (the density above). the beta distribution is a family of distributions. it is interesting how we get the denominator of the beta distribution… it is basically the integral of the numerator over \([0, 1]\), i.e. \(B(\alpha, \beta) = \int_0^1 \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \, d\theta\), which is a normalising constant that brings the total probability to 1
although apparently the integral can also be related to factorials, via \(B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}\) (and \(\Gamma(n) = (n-1)!\) for integer \(n\))
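to convince myself, a quick numerical check (my own sketch, not from the notes) that the denominator really is the integral of the numerator, and that it matches the gamma/factorial identity, using \(\alpha = 2, \beta = 3\) as an example:

```python
# Check that B(alpha, beta) = integral of theta^(a-1) (1-theta)^(b-1)
# over [0, 1], and that it matches gamma(a) * gamma(b) / gamma(a + b).
from math import gamma
from scipy.integrate import quad
from scipy.special import beta as beta_fn

a, b = 2, 3
integral, _ = quad(lambda t: t**(a - 1) * (1 - t)**(b - 1), 0, 1)
via_gammas = gamma(a) * gamma(b) / gamma(a + b)  # = 1! * 2! / 4!
print(integral, beta_fn(a, b), via_gammas)       # all three are 1/12
```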
apparently in our bus example the posterior distributions are also beta distributions
apparently this is made possible by the beta form of the prior and the binomial likelihood
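working this out for myself (standard algebra, not copied from the notes): keeping only the factors that depend on \(\theta\),
\[f(\theta \mid x) \propto \underbrace{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}_{\text{beta prior}} \times \underbrace{\theta^{x} (1 - \theta)^{N - x}}_{\text{binomial likelihood}} = \theta^{(\alpha + x) - 1} (1 - \theta)^{(\beta + N - x) - 1}\]
which is proportional to a \(\text{Beta}(\alpha + x, \beta + N - x)\) density. with the uniform \(\text{Beta}(1, 1)\) prior and \(x = 2\) good buses out of \(N = 5\), the posterior would be \(\text{Beta}(3, 4)\).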
\[E(\theta) = \frac{\alpha}{\alpha + \beta}\]
the choice of prior has an effect on the conclusions
sometimes it has a big impact sometimes it doesn’t
if there is a lot of data, the prior tends not to matter as much
in the bus example they show that with 1000 results (rather than 5) the resulting posterior is very similar under all 3 priors
some advice:
when results are sensitive to the choice of prior… the data isn't very informative, so either get loads more data… or think very hard about your prior
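a sketch of the "prior washes out with data" point, using the conjugate update. the notes don't give the exact beta parameters for the three priors, so Beta(1, 1), Beta(0.1, 0.1) and Beta(20, 20) here are my own illustrative guesses, and the 1000-trip counts assume the same 40% "good bus" rate:

```python
# Posterior means under three different beta priors, after a little data
# (2 good out of 5) versus a lot of data (400 good out of 1000).
# Conjugate update: Beta(a, b) prior -> Beta(a + x, b + n - x) posterior.
from scipy.stats import beta

priors = {"uniform": (1, 1), "extremes": (0.1, 0.1), "informative": (20, 20)}

results = {}
for n, x in [(5, 2), (1000, 400)]:          # (trips, good buses)
    results[n] = {name: beta.mean(a + x, b + n - x)
                  for name, (a, b) in priors.items()}
    print(n, {k: round(v, 3) for k, v in results[n].items()})
```

with 5 observations the three posterior means disagree noticeably; with 1000 they are all very close to 0.4.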
we now move on to chapter 6 … Summarising the Posterior Distribution
In descriptive statistics, you often make summaries of a complex data set (e.g. the mean and the standard deviation) so that you can communicate about the data set in a concise way. In Bayesian statistics, you often do a similar thing, but instead of giving a concise description of the data, you give a concise description of the posterior distribution.
point estimates - are single number guesses for a parameter value
a rule for guessing a point estimate … is called an estimator!!!
(like the maximum likelihood estimator, i suppose… although maybe that is not the right term for a bayesian posterior distribution)
estimates are normally written with a hat over the parameter symbol, e.g. \(\hat{\theta}\).
three methods for determining an estimator:
- the posterior mean – the expected value of the parameter
- the posterior median – the value that divides the probability in half
- the posterior mode – the peak of the distribution
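the three point estimates above can be computed for the bus problem. a sketch (my own, assuming the uniform Beta(1, 1) prior and 2 good buses out of 5, so the posterior is Beta(3, 4)):

```python
# Posterior mean, median and mode for a Beta(3, 4) posterior.
from scipy.stats import beta

a, b = 3, 4
post_mean = beta.mean(a, b)         # alpha / (alpha + beta) = 3/7
post_median = beta.median(a, b)     # value with 50% of probability either side
post_mode = (a - 1) / (a + b - 2)   # peak of the density = 2/5
print(round(post_mean, 3), round(post_median, 3), post_mode)
```

note the posterior mode here is 0.4, matching the maximum-likelihood value from earlier, as expected under a uniform prior.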
some advice, before moving to a formal way of determining what a good estimate looks like:
if the posterior looks even vaguely like a normal distribution, then give:
\[\theta = \text{posterior mean} \pm \text{posterior standard deviation}\]
we are up to page 38