Bayesian vs. Frequentist Inference

Computational Mathematics and Statistics Camp University of Chicago September 2018

Bayes’ theorem

\[P(B|A) = \frac{P(A|B) \times P(B)}{P(A)}\]

Coin tossing

  • \(H_1 =\) “first toss is heads”
  • \(H_A =\) “all 5 tosses are heads”
  • What is \(P(H_1 | H_A)\)?

Coin tossing

  • \(P(H_1) = \frac{1}{2}\)
  • \(P(H_A) = \frac{1}{32}\)
  • \(P(H_A | H_1) = \frac{1}{16}\)

    \[P(H_A | H_1) = \frac{P(H_A | H_1) \times P(H_1)}{P(H_A)} = \frac{\frac{1}{16} \times \frac{1}{2}}{\frac{1}{32}} = 1\]

Radar detection problem

If an aircraft is present in a certain area, a radar correctly registers its presence with probability \(0.99\). If it is not present, the radar falsely registers an aircraft presence with probability \(0.10\). We assume that an aircraft is present with probability \(0.05\). What is the probability that an airplane is present and the radar correctly detects it?

Radar detection problem

  • \(A = {\text{an aircraft is present}}\)
  • \(B = {\text{the radar registers an aircraft presence}}\)
  • \(P(A) = 0.05\)
  • \(P(B| A = 1) = 0.99\)
  • \(P(B| A = 0) = 0.1\)
  • \(P(A|B)\)

    \[ \begin{align} P(A|B) & = \frac{P(A) \times P(B|A)}{P(B)} \\ & = \frac{P(A) \times P(B|A)}{P(A) \times P(B|A) + P(A = 0) \times(B | A = 0)} \\ & = \frac{0.05 \times 0.99}{0.05 \times 0.99 + 0.95 \times 0.1} \\ & \approx 0.3426 \end{align} \]

False positive fallacy

  • Disease test is 95% accurate
  • A random person drawn from a certain population has probability 0.001 of having the disease. Given that the person just tested positive, what is the probability of having the disease?

False positive fallacy

  • \(A = {\text{person has the disease}}\)
  • \(B = {\text{test result is positive for the disease}}\)
  • \(P(A) = 0.001\)
  • \(P(B | A) = 0.95\)
  • \(P(B | A = 0) = 0.05\)

    \[ \begin{align} P(A|B) & = \frac{P(A) \times P(B|A)}{P(B)} \\ & = \frac{P(A) \times P(B|A)}{P(A) \times P(B|A) + P(A = 0) \times(B | A = 0)} \\ & = \frac{0.001 \times 0.95}{0.001 \times 0.95 + 0.999 \times 0.05} \\ & = 0.0187 \end{align} \]

Bayesian inference

  • Inferential statistics
  • Hypothesis
  • Data

    \[P(\text{H is true}| \text{data}) = \frac{P(\text{data} | \text{H is true}) \times P(\text{H is true})}{P(\text{data})}\]

Disease screening redux

\[ \begin{align} P(A|B) & = \frac{P(A) \times P(B|A)}{P(B)} \\ & = \frac{P(A) \times P(B|A)}{P(A) \times P(B|A) + P(A = 0) \times(B | A = 0)} \\ & = \frac{0.001 \times 0.95}{0.001 \times 0.95 + 0.999 \times 0.05} \\ & = 0.0187 \end{align} \]

  • Old probability: \(0.001\)
  • New probability: \(0.0187\)

Coin tossing example

  • Type \(A\) coins are fair, with \(p = 0.5\) of heads
  • Type \(B\) coins are bent, with \(p = 0.6\) of heads
  • Type \(C\) coins are bent, with \(p = 0.9\) of heads
  • Drawer of 5 coins
    • 2 of type \(A\)
    • 2 of type \(B\)
    • 1 of type \(C\)
  • What is the probability a randomly drawn coin is type \(A\) given the first flip is heads? Type \(B\)? Type \(C\)?

Terminology

  • \(A\), \(B\), and \(C\)
  • \(D\)
  • Find \(P(A|D), P(B|D), P(C|D)\) using Bayes’ theorem

Terminology

  • Experiment
  • Data
    • \(D = \text{heads}\)
  • Hypotheses
  • Prior probability

    \[P(A) = 0.4, P(B) = 0.4, P(C) = 0.2\]

  • Likelihood

    \[P(D|A) = 0.5, P(D|B) = 0.6, P(D|C) = 0.9\]

  • Posterior probability

    \[P(A|D), P(B|D), P(C|D)\]

Find posteriors

\[P(A|D) = \frac{P(D|A) \times P(A)}{P(D)}\] \[P(B|D) = \frac{P(D|B) \times P(B)}{P(D)}\] \[P(C|D) = \frac{P(D|C) \times P(C)}{P(D)}\]

Find posteriors

\[ \begin{align} P(D) & = P(D|A) \times P(A) + P(D|B) \times P(B) + P(D|C) \times P(C) \\ & = 0.5 \times 0.4 + 0.6 \times 0.4 + 0.9 \times 0.2 \\ & = 0.62 \end{align} \]

Find posteriors

\[P(A|D) = \frac{P(D|A) \times P(A)}{P(D)} = \frac{0.5 \times 0.4}{0.62} = \frac{0.2}{0.62}\]

\[P(B|D) = \frac{P(D|B) \times P(B)}{P(D)} = \frac{0.6 \times 0.4}{0.62} = \frac{0.24}{0.62}\]

\[P(C|D) = \frac{P(D|C) \times P(C)}{P(D)} = \frac{0.9 \times 0.2}{0.62} = \frac{0.18}{0.62}\]

hypothesis prior likelihood Bayes numerator posterior
\(H\) \(P(H)\) \(P(D\mid H)\) \(P(D \mid H) \times P(H)\) \(P(H \mid D)\)
A 0.4 0.5 0.2 0.3226
B 0.4 0.6 0.24 0.3871
C 0.2 0.9 0.18 0.2903
total 1 0.62 1

Bayes theorem

\[P(\text{hypothesis}| \text{data}) = \frac{P(\text{data} | \text{hypothesis}) \times P(\text{hypothesis})}{P(\text{data})}\]

\[P(H|D) = \frac{P(D | H) \times P(H)}{P(D)}\]

\[P(\text{hypothesis}| \text{data}) \propto P(\text{data} | \text{hypothesis}) \times P(\text{hypothesis})\]

\[\text{posterior} \propto \text{likelihood} \times \text{prior}\]

Probability functions

  • \(\theta\)
  • \(p(\theta)\)
  • \(p(\theta | D)\)
  • \(p(D | \theta)\)

Terror attack example

  • September 11th attacks in NYC
  • Probability of a terror attack on tall buildings in Manhattan before the first plane hit
  • Probability of a plane hitting the WTC by accident
  • What is the probability of terrorists crashing planes into Manhattan skyscrapers given the first plane hitting the World Trade Center?

First plane strike

  • Probability that terrorists would crash planes into Manhattan skyscrapers
  • Probability of plane hitting if terrorists are attacking Manhattan
  • Probability of plane hitting if terrorists are not attacking Manhattan skyscrapers (i.e. an accident)

Posterior after first plane strike

  • \(A =\) terror attack
  • \(B =\) plane hitting the World Trade Center
  • \(P(B|A) =\) probability of a terrorist crashing a plane into Manhattan skyscrapers
  • \(P(A) =\) probability of a plane hitting if terrorists are attacking Manhattan skyscrapers
  • \(P(B| A = 0) =\) probability of a plane hitting if terrorists are not attacking Manhattan skyscrapers

    \[ \begin{align} P(A|B) &= \frac{P(B|A) \times P(A)}{P(B)} \\ &= \frac{P(B|A) \times P(A)}{P(B|A) \times P(A) + P(B| A = 0) \times P(A=0)} \\ & = \frac{0.005 \times 1}{0.005 \times 1 + 0.008 \times 0.995} \\ & \approx 0.38 \end{align} \]

Posterior after second plane strike

\[ \begin{align} P(A|B) &= \frac{P(B|A) \times P(A)}{P(B)} \\ &= \frac{P(B|A) \times P(A)}{P(B|A) \times P(A) + P(B| A = 0) \times P(A=0)} \\ & = \frac{1 \times 0.38}{01 \times 0.38 + 0.008 \times 0.62} \\ & \approx .9999 \end{align} \]

Choosing priors

  • Where do priors come from?
  • Uniform (flat) prior
  • Other probability distributions
  • Posterior probability distributions
  • Effect on posterior

    \[P(\text{hypothesis}| \text{data}) = \frac{P(\text{data} | \text{hypothesis}) \times P(\text{hypothesis})}{P(\text{data})}\]

    \[\text{posterior} \propto \text{likelihood} \times \text{prior}\]

Probability intervals

  • PMF \(p(\theta)\) or PDF \(f(\theta)\)
  • \(p\)-probability interval for \(\theta\)
    • An interval \([a,b]\) with \(P(a \leq \theta \leq b) = p\)
  • Credible intervals
  • Not the same as confidence intervals

Probability intervals

Probability intervals

I think \(\theta\) is between 0.45 and 0.65 with 50% probability.

I think \(\theta\) follows a \(\text{Beta}(8, 6)\) distribution.

Frequentist inference

\[P(H|D) = \frac{P(D | H) \times P(H)}{P(D)}\]

  • When the prior is known
  • When the prior is not known
    • Bayesians
    • Frequentists

      \[L(H | D) = P(D|H)\]

  • What is the meaning of probability?
  • Bayesians - probability of hypotheses and data
  • Frequentists - probability of data given a hypothesis

Null hypothesis testing

  • Assume an hypothesis is true
  • Observed data sampled from that distribution
  • Probability of observing data given hypothesis
  • \(t\)-tests
  • \(p\)-values

Confidence intervals

  • Confidence intervals \(\neq\) credible intervals
  • Confidence intervals describe variability in the data \(D\) for a fixed parameter \(\theta\)
    • “With a large number of repeated samples, 95% of such calculated confidence intervals would include the true value of the parameter”
  • Credible intervals describe variability in the parameter \(\theta\) for a fixed data \(D\)
    • “There is a 90% chance that the parameter falls between these two values”

Comparison

  • Bayesian inference
    • Uses probabilities for both hypotheses and data
    • Depends on the prior and likelihood of observed data
    • Requires one to know or construct a “subjective prior”
    • May be computationally intensive due to integration over many parameters
  • Frequentist inference
    • Never uses or gives the probability of a hypothesis
    • Depends on the likelihood \(P(D|H)\) for both observed and unobserved data
    • Does not require a prior
    • Tends to be less computationally intensive

Critique of Bayesian inference

  1. The subjective prior is subjective
  2. Philosphical difference of opinion over probability

Defense of Bayesian inference

  1. The probability of hypotheses is exactly what we need to make decisions
  2. Bayes’ theorem is logically rigorous
  3. Test and compare different priors
  4. Easy to communicate a result framed in terms of probabilities of hypotheses
  5. Priors can be defended based on the assumptions made to arrive at it
  6. Evidence derived from the data is independent of notions about “data more extreme” that depend on the exact experimental setup
  7. Data can be used as it comes in

Critique of frequentist inference

  1. It is ad-hoc and does not carry the force of deductive logic
  2. Experiments need to be fully specified ahead of time
  3. p-values and significance levels are notoriously prone to misinterpretation

Defense of frequentist inference

  1. It is objective
  2. The hypothesis testing process is applied in a consistent and rigorous fashion
  3. Frequentist experimental design demands a careful description of the experiment and methods of analysis before starting
  4. The frequentist approach has been used for over 100 years

p-value test

You run a two-sample \(t\)-test for equal means, with \(\alpha = 0.05\) and obtain a p-value of 0.04. What are the odds that the two samples are drawn from distributions with the same mean?

  1. \(\frac{19}{1}\)
  2. \(\frac{1}{19}\)
  3. \(\frac{1}{20}\)
  4. \(\frac{1}{24}\)

p-value test

With a p-value of less than 0.05, we reject the null hypothesis that the difference in means between the two samples is zero.

There is a 95% probability that the difference in means between the two samples falls between \([-.02, .03]\).

Computing Bayesian models

  • Closed form analytic solutions
  • Complex models/multivariate probability density/mass functions
  • Computational approaches
    • Markov chain Monte Carlo (MCMC)
    • Gibbs sampling
  • Mechanics are more difficult to learn and explain