parameter estimation
common situation in statistics
almost any problem in statistics can be framed as a parameter estimation problem.
bayes box with a whole continuum of hypotheses
a set of possible values with associated probabilities is what is known as a probability distribution
we move from bayesian problems with a few prior probabilities to using a whole prior distribution
one could say that the parameter is now a random variable… however the author doesn't like the term "random" because it has connotations of something fluctuating… whereas in this circumstance it really refers to our uncertainty
bus example
it has some specifics and some generalities… should be good practice anyhow
setting out the problem:
- moved to Auckland
- takes the bus to work each day
- not confident with the new bus system
- first week just took the first bus that came along heading towards the city
- first week caught 5 morning buses
- 3 of them took him far away, with an extra 20-minute walk to work
- infer the proportion of buses that are "good"
infer \(\theta\) using a bayesian framework
what do we know about \(\theta\)? well, it is between 0 and 1
if we make a discrete approximation, it could be any of the following:
\[\{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1\}\]
let's create a bayes box, knowing that we have 11 possible values for the parameter, and give them a uniform prior distribution.
DataFrame 1

| \(\theta\) | prior | likelihood | prior × likelihood | posterior |
|---|---|---|---|---|
| 0.0 | 0.0909 | | | |
| 0.1 | 0.0909 | | | |
| 0.2 | 0.0909 | | | |
| 0.3 | 0.0909 | | | |
| 0.4 | 0.0909 | | | |
| 0.5 | 0.0909 | | | |
| 0.6 | 0.0909 | | | |
| 0.7 | 0.0909 | | | |
| 0.8 | 0.0909 | | | |
| 0.9 | 0.0909 | | | |
| 1.0 | 0.0909 | | | |
| Total | 1.0000 | | | |
the code to produce this table
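i didn't copy down the original code, so here is a minimal sketch of my own for producing the bayes box with pandas, assuming 2 "good" buses out of the 5 observed (3 bad):

```python
# Bayes box for the bus problem (my own reconstruction, not the book's code):
# 11 candidate theta values, a uniform prior, and a binomial likelihood
# for x = 2 good buses out of N = 5.
import numpy as np
import pandas as pd
from scipy.stats import binom

theta = np.linspace(0, 1, 11)                 # discrete grid for theta
prior = np.full(len(theta), 1 / len(theta))   # uniform prior, 1/11 each
likelihood = binom.pmf(k=2, n=5, p=theta)     # P(x = 2 | N = 5, theta)
prior_times_lik = prior * likelihood          # numerator of Bayes' rule
posterior = prior_times_lik / prior_times_lik.sum()  # normalise to sum to 1

bayes_box = pd.DataFrame({
    "theta": theta,
    "prior": prior,
    "likelihood": likelihood,
    "prior_x_likelihood": prior_times_lik,
    "posterior": posterior,
})
print(bayes_box.round(4))
```

dividing by the sum of the prior × likelihood column is what turns the unnormalised numbers into a posterior distribution.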
calculating the likelihood uses the binomial distribution:
\[P(x \mid \theta) = \binom{N}{x} \theta^x (1-\theta)^{N-x}\]
i won't go through all the calculations now, as i need to save time… but \(\theta = 0\) and \(\theta = 1\) have zero likelihood.
\(\theta = 0.4\), which is the same as 2/5, has the highest likelihood. that doesn't necessarily mean \(\theta = 0.4\) is the most likely value for \(\theta\), since that assertion also depends on the prior… but given we have assumed a uniform prior, in this case it does.
sampling distribution and likelihood
we used the binomial, but it is only the binomial with an assumed parameter value and the observed data plugged in that gives the likelihood.
the binomial itself is called the sampling distribution.
from the rosetta stone i believe the likelihood is itself sometimes called the sampling distribution, but perhaps it is more nuanced…
missed a prior
we actually assumed N = 5 as prior information without explicitly stating it.
predicting the future
three different priors with the binomial bus problem

up to page 34 - the beta distribution
the three different priors are :
- prior 1 is a uniform prior
- prior 2 emphasises the extremes
it is actually a criticism of the uniform prior that it doesn't give very high probability to extreme values.
\[f(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} \quad \text{for } 0 < \theta < 1, \, \alpha > 0, \, \beta > 0\]
- prior 3 assumes that we have quite a lot of information (in our beliefs) and that we think the parameter is close to the value of 0.5
colloquially we can refer to them as the "emphasising the extremes" prior and the "informative" prior.
the three priors we used were all examples of the beta distribution (the density above). the beta distribution is a family of distributions. it is interesting how we get the denominator of the beta distribution… it is basically the integral of the numerator over \([0, 1]\), i.e. \(B(\alpha, \beta) = \int_0^1 \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \, d\theta\), which is a normalising constant that brings the total probability to 1
although apparently the integral can also be related to factorials, via \(B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}\) (and \(\Gamma(n) = (n-1)!\) for integer \(n\))
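to convince myself, a quick numerical check (my own sketch, not from the notes) that the denominator really is the integral of the numerator, and that it matches the gamma/factorial identity, using \(\alpha = 2, \beta = 3\) as an example:

```python
# Check that B(alpha, beta) = integral of theta^(a-1) (1-theta)^(b-1)
# over [0, 1], and that it matches gamma(a) * gamma(b) / gamma(a + b).
from math import gamma
from scipy.integrate import quad
from scipy.special import beta as beta_fn

a, b = 2, 3
integral, _ = quad(lambda t: t**(a - 1) * (1 - t)**(b - 1), 0, 1)
via_gammas = gamma(a) * gamma(b) / gamma(a + b)  # = 1! * 2! / 4!
print(integral, beta_fn(a, b), via_gammas)       # all three are 1/12
```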
apparently in our bus example the posterior distributions are also beta distributions
apparently this is made possible by the beta form of the prior and the binomial likelihood
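working this out for myself (standard algebra, not copied from the notes): keeping only the factors that depend on \(\theta\),
\[f(\theta \mid x) \propto \underbrace{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}_{\text{beta prior}} \times \underbrace{\theta^{x} (1 - \theta)^{N - x}}_{\text{binomial likelihood}} = \theta^{(\alpha + x) - 1} (1 - \theta)^{(\beta + N - x) - 1}\]
which is proportional to a \(\text{Beta}(\alpha + x, \beta + N - x)\) density. with the uniform \(\text{Beta}(1, 1)\) prior and \(x = 2\) good buses out of \(N = 5\), the posterior would be \(\text{Beta}(3, 4)\).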
\[E(\theta) = \frac{\alpha}{\alpha + \beta}\]
the choice of prior has an effect on the conclusions
sometimes it has a big impact sometimes it doesn’t
if there is a lot of data, the prior tends not to matter as much
in the bus example they show that with 1000 results (rather than 5) the resulting posterior is very similar under all 3 priors
some advice:
when results are sensitive to the choice of prior… the data isn't very informative, so either get loads more data… or think very hard about your prior
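a sketch of the "prior washes out with data" point, using the conjugate update. the notes don't give the exact beta parameters for the three priors, so Beta(1, 1), Beta(0.1, 0.1) and Beta(20, 20) here are my own illustrative guesses, and the 1000-trip counts assume the same 40% "good bus" rate:

```python
# Posterior means under three different beta priors, after a little data
# (2 good out of 5) versus a lot of data (400 good out of 1000).
# Conjugate update: Beta(a, b) prior -> Beta(a + x, b + n - x) posterior.
from scipy.stats import beta

priors = {"uniform": (1, 1), "extremes": (0.1, 0.1), "informative": (20, 20)}

results = {}
for n, x in [(5, 2), (1000, 400)]:          # (trips, good buses)
    results[n] = {name: beta.mean(a + x, b + n - x)
                  for name, (a, b) in priors.items()}
    print(n, {k: round(v, 3) for k, v in results[n].items()})
```

with 5 observations the three posterior means disagree noticeably; with 1000 they are all very close to 0.4.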
we now move on to chapter 6 … Summarising the Posterior Distribution
In descriptive statistics, you often make summaries of a complex data set (e.g. the mean and the standard deviation) so that you can communicate about the data set in a concise way. In Bayesian statistics, you often do a similar thing, but instead of giving a concise description of the data, you give a concise description of the posterior distribution.
point estimates - are single number guesses for a parameter value
a rule for guessing a point estimate … is called an estimator!!!
(like the maximum likelihood estimator, i suppose… although maybe that is not the right term for a bayesian posterior distribution)
estimates are normally written with a hat over the parameter symbol, e.g. \(\hat{\theta}\).
three methods for determining an estimator:
- the posterior mean – the expected value of the parameter
- the posterior median – the value that divides the probability in half
- the posterior mode – the peak of the distribution
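the three point estimates above can be computed for the bus problem. a sketch (my own, assuming the uniform Beta(1, 1) prior and 2 good buses out of 5, so the posterior is Beta(3, 4)):

```python
# Posterior mean, median and mode for a Beta(3, 4) posterior.
from scipy.stats import beta

a, b = 3, 4
post_mean = beta.mean(a, b)         # alpha / (alpha + beta) = 3/7
post_median = beta.median(a, b)     # value with 50% of probability either side
post_mode = (a - 1) / (a + b - 2)   # peak of the density = 2/5
print(round(post_mean, 3), round(post_median, 3), post_mode)
```

note the posterior mode here is 0.4, matching the maximum-likelihood value from earlier, as expected under a uniform prior.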
some advice, before moving to a formal way of determining what a good estimate looks like:
if the posterior looks even vaguely like a normal distribution, then give:
\[\theta = \text{posterior mean} \pm \text{posterior standard deviation}\]
we are up to page 38