Bayesian vs. Frequentist Inference

Computational Mathematics and Statistics Camp University of Chicago September 2018

Bayes’ theorem

\[P(B|A) = \frac{P(A|B) \times P(B)}{P(A)}\]

Coin tossing

\(H_1 =\) “first toss is heads”
\(H_A =\) “all 5 tosses are heads”
What is \(P(H_1 | H_A)\)?

Coin tossing

\(P(H_1) = \frac{1}{2}\)
\(P(H_A) = \frac{1}{32}\)
\(P(H_A | H_1) = \frac{1}{16}\)

\[P(H_A | H_1) = \frac{P(H_A | H_1) \times P(H_1)}{P(H_A)} = \frac{\frac{1}{16} \times \frac{1}{2}}{\frac{1}{32}} = 1\]

Radar detection problem

If an aircraft is present in a certain area, a radar correctly registers its presence with probability \(0.99\). If it is not present, the radar falsely registers an aircraft presence with probability \(0.10\). We assume that an aircraft is present with probability \(0.05\). What is the probability that an airplane is present and the radar correctly detects it?

Radar detection problem

\(A = {\text{an aircraft is present}}\)
\(B = {\text{the radar registers an aircraft presence}}\)
\(P(A) = 0.05\)
\(P(B| A = 1) = 0.99\)
\(P(B| A = 0) = 0.1\)
\(P(A|B)\)

\[ \begin{align} P(A|B) & = \frac{P(A) \times P(B|A)}{P(B)} \\ & = \frac{P(A) \times P(B|A)}{P(A) \times P(B|A) + P(A = 0) \times(B | A = 0)} \\ & = \frac{0.05 \times 0.99}{0.05 \times 0.99 + 0.95 \times 0.1} \\ & \approx 0.3426 \end{align} \]

False positive fallacy

Disease test is 95% accurate
A random person drawn from a certain population has probability 0.001 of having the disease. Given that the person just tested positive, what is the probability of having the disease?

False positive fallacy

\(A = {\text{person has the disease}}\)
\(B = {\text{test result is positive for the disease}}\)
\(P(A) = 0.001\)
\(P(B | A) = 0.95\)
\(P(B | A = 0) = 0.05\)

\[ \begin{align} P(A|B) & = \frac{P(A) \times P(B|A)}{P(B)} \\ & = \frac{P(A) \times P(B|A)}{P(A) \times P(B|A) + P(A = 0) \times(B | A = 0)} \\ & = \frac{0.001 \times 0.95}{0.001 \times 0.95 + 0.999 \times 0.05} \\ & = 0.0187 \end{align} \]

Bayesian inference

Inferential statistics
Hypothesis
Data

\[P(\text{H is true}| \text{data}) = \frac{P(\text{data} | \text{H is true}) \times P(\text{H is true})}{P(\text{data})}\]

Disease screening redux

\[ \begin{align} P(A|B) & = \frac{P(A) \times P(B|A)}{P(B)} \\ & = \frac{P(A) \times P(B|A)}{P(A) \times P(B|A) + P(A = 0) \times(B | A = 0)} \\ & = \frac{0.001 \times 0.95}{0.001 \times 0.95 + 0.999 \times 0.05} \\ & = 0.0187 \end{align} \]

Old probability: \(0.001\)
New probability: \(0.0187\)

Coin tossing example

Type \(A\) coins are fair, with \(p = 0.5\) of heads
Type \(B\) coins are bent, with \(p = 0.6\) of heads
Type \(C\) coins are bent, with \(p = 0.9\) of heads
Drawer of 5 coins
- 2 of type \(A\)
- 2 of type \(B\)
- 1 of type \(C\)
What is the probability a randomly drawn coin is type \(A\) given the first flip is heads? Type \(B\)? Type \(C\)?

Terminology

\(A\), \(B\), and \(C\)
\(D\)
Find \(P(A|D), P(B|D), P(C|D)\) using Bayes’ theorem

Terminology

Experiment
Data
- \(D = \text{heads}\)
Hypotheses
Prior probability

\[P(A) = 0.4, P(B) = 0.4, P(C) = 0.2\]
Likelihood

\[P(D|A) = 0.5, P(D|B) = 0.6, P(D|C) = 0.9\]
Posterior probability

\[P(A|D), P(B|D), P(C|D)\]

Find posteriors

\[P(A|D) = \frac{P(D|A) \times P(A)}{P(D)}\] \[P(B|D) = \frac{P(D|B) \times P(B)}{P(D)}\] \[P(C|D) = \frac{P(D|C) \times P(C)}{P(D)}\]

Find posteriors

\[ \begin{align} P(D) & = P(D|A) \times P(A) + P(D|B) \times P(B) + P(D|C) \times P(C) \\ & = 0.5 \times 0.4 + 0.6 \times 0.4 + 0.9 \times 0.2 \\ & = 0.62 \end{align} \]

Find posteriors

\[P(A|D) = \frac{P(D|A) \times P(A)}{P(D)} = \frac{0.5 \times 0.4}{0.62} = \frac{0.2}{0.62}\]

\[P(B|D) = \frac{P(D|B) \times P(B)}{P(D)} = \frac{0.6 \times 0.4}{0.62} = \frac{0.24}{0.62}\]

\[P(C|D) = \frac{P(D|C) \times P(C)}{P(D)} = \frac{0.9 \times 0.2}{0.62} = \frac{0.18}{0.62}\]

hypothesis	prior	likelihood	Bayes numerator	posterior
\(H\)	\(P(H)\)	\(P(D\mid H)\)	\(P(D \mid H) \times P(H)\)	\(P(H \mid D)\)
A	0.4	0.5	0.2	0.3226
B	0.4	0.6	0.24	0.3871
C	0.2	0.9	0.18	0.2903
total	1		0.62	1

Bayes theorem

\[P(\text{hypothesis}| \text{data}) = \frac{P(\text{data} | \text{hypothesis}) \times P(\text{hypothesis})}{P(\text{data})}\]

\[P(H|D) = \frac{P(D | H) \times P(H)}{P(D)}\]

\[P(\text{hypothesis}| \text{data}) \propto P(\text{data} | \text{hypothesis}) \times P(\text{hypothesis})\]

\[\text{posterior} \propto \text{likelihood} \times \text{prior}\]

Probability functions

\(\theta\)
\(p(\theta)\)
\(p(\theta | D)\)
\(p(D | \theta)\)

Terror attack example

September 11th attacks in NYC
Probability of a terror attack on tall buildings in Manhattan before the first plane hit
Probability of a plane hitting the WTC by accident
What is the probability of terrorists crashing planes into Manhattan skyscrapers given the first plane hitting the World Trade Center?

First plane strike

Probability that terrorists would crash planes into Manhattan skyscrapers
Probability of plane hitting if terrorists are attacking Manhattan
Probability of plane hitting if terrorists are not attacking Manhattan skyscrapers (i.e. an accident)

Posterior after first plane strike

\(A =\) terror attack
\(B =\) plane hitting the World Trade Center
\(P(B|A) =\) probability of a terrorist crashing a plane into Manhattan skyscrapers
\(P(A) =\) probability of a plane hitting if terrorists are attacking Manhattan skyscrapers
\(P(B| A = 0) =\) probability of a plane hitting if terrorists are not attacking Manhattan skyscrapers

\[ \begin{align} P(A|B) &= \frac{P(B|A) \times P(A)}{P(B)} \\ &= \frac{P(B|A) \times P(A)}{P(B|A) \times P(A) + P(B| A = 0) \times P(A=0)} \\ & = \frac{0.005 \times 1}{0.005 \times 1 + 0.008 \times 0.995} \\ & \approx 0.38 \end{align} \]

Posterior after second plane strike

\[ \begin{align} P(A|B) &= \frac{P(B|A) \times P(A)}{P(B)} \\ &= \frac{P(B|A) \times P(A)}{P(B|A) \times P(A) + P(B| A = 0) \times P(A=0)} \\ & = \frac{1 \times 0.38}{01 \times 0.38 + 0.008 \times 0.62} \\ & \approx .9999 \end{align} \]

Choosing priors

Where do priors come from?
Uniform (flat) prior
Other probability distributions
Posterior probability distributions
Effect on posterior

\[P(\text{hypothesis}| \text{data}) = \frac{P(\text{data} | \text{hypothesis}) \times P(\text{hypothesis})}{P(\text{data})}\]

\[\text{posterior} \propto \text{likelihood} \times \text{prior}\]

Probability intervals

PMF \(p(\theta)\) or PDF \(f(\theta)\)
\(p\)-probability interval for \(\theta\)
- An interval \([a,b]\) with \(P(a \leq \theta \leq b) = p\)
Credible intervals
Not the same as confidence intervals

Probability intervals

I think \(\theta\) is between 0.45 and 0.65 with 50% probability.

I think \(\theta\) follows a \(\text{Beta}(8, 6)\) distribution.

Frequentist inference

\[P(H|D) = \frac{P(D | H) \times P(H)}{P(D)}\]

When the prior is known
When the prior is not known
- Bayesians
- Frequentists
  
  \[L(H | D) = P(D|H)\]
What is the meaning of probability?
Bayesians - probability of hypotheses and data
Frequentists - probability of data given a hypothesis

Null hypothesis testing

Assume an hypothesis is true
Observed data sampled from that distribution
Probability of observing data given hypothesis
\(t\)-tests
\(p\)-values

Confidence intervals

Confidence intervals \(\neq\) credible intervals
Confidence intervals describe variability in the data \(D\) for a fixed parameter \(\theta\)
- “With a large number of repeated samples, 95% of such calculated confidence intervals would include the true value of the parameter”
Credible intervals describe variability in the parameter \(\theta\) for a fixed data \(D\)
- “There is a 90% chance that the parameter falls between these two values”

Comparison

Bayesian inference
- Uses probabilities for both hypotheses and data
- Depends on the prior and likelihood of observed data
- Requires one to know or construct a “subjective prior”
- May be computationally intensive due to integration over many parameters
Frequentist inference
- Never uses or gives the probability of a hypothesis
- Depends on the likelihood \(P(D|H)\) for both observed and unobserved data
- Does not require a prior
- Tends to be less computationally intensive

Critique of Bayesian inference

The subjective prior is subjective
Philosphical difference of opinion over probability

Defense of Bayesian inference

The probability of hypotheses is exactly what we need to make decisions
Bayes’ theorem is logically rigorous
Test and compare different priors
Easy to communicate a result framed in terms of probabilities of hypotheses
Priors can be defended based on the assumptions made to arrive at it
Evidence derived from the data is independent of notions about “data more extreme” that depend on the exact experimental setup
Data can be used as it comes in

Critique of frequentist inference

It is ad-hoc and does not carry the force of deductive logic
Experiments need to be fully specified ahead of time
p-values and significance levels are notoriously prone to misinterpretation

Defense of frequentist inference

It is objective
The hypothesis testing process is applied in a consistent and rigorous fashion
Frequentist experimental design demands a careful description of the experiment and methods of analysis before starting
The frequentist approach has been used for over 100 years

p-value test

You run a two-sample \(t\)-test for equal means, with \(\alpha = 0.05\) and obtain a p-value of 0.04. What are the odds that the two samples are drawn from distributions with the same mean?

\(\frac{19}{1}\)
\(\frac{1}{19}\)
\(\frac{1}{20}\)
\(\frac{1}{24}\)

p-value test

With a p-value of less than 0.05, we reject the null hypothesis that the difference in means between the two samples is zero.

There is a 95% probability that the difference in means between the two samples falls between \([-.02, .03]\).

Computing Bayesian models

Closed form analytic solutions
Complex models/multivariate probability density/mass functions
Computational approaches
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
Mechanics are more difficult to learn and explain