
Biostatistics for Step 3: The Complete Cheat Sheet

Step3Sim Team · 11 min read
Tags: biostatistics, high-yield, reference

Let me tell you something that should make you very happy: biostatistics is the single easiest section of Step 3 to master. The content is completely finite. There are maybe 15 core concepts, a handful of formulas, and a few classic trap questions. You can learn all of it in one focused evening and lock in 5-8% of the exam.

Most residents avoid biostats because it feels hard. It's not. It feels hard because you memorized it for Step 1, forgot it, re-memorized it for Step 2, and forgot it again. This time, you're going to understand it — and understanding sticks in a way that memorizing doesn't.

The 2×2 Table: Your Single Most Powerful Tool

If you take nothing else from this article, take this: every diagnostic test question on Step 3 can be solved with a 2×2 table drawn on your noteboard. Every. Single. One. Don't try to do it in your head. Draw the table, fill in the numbers, calculate.

         Disease +             Disease −
Test +   True Positive (TP)    False Positive (FP)
Test −   False Negative (FN)   True Negative (TN)

From this table, everything else is just division:

  • Sensitivity = TP / (TP + FN) — "Of everyone who's sick, how many did the test catch?"
  • Specificity = TN / (TN + FP) — "Of everyone who's healthy, how many did the test correctly clear?"
  • PPV = TP / (TP + FP) — "If the test is positive, what's the chance the patient actually has it?"
  • NPV = TN / (TN + FN) — "If the test is negative, what's the chance the patient is actually fine?"
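
All four metrics are one-line divisions from the same four cells, which makes them easy to sanity-check with a quick script. A minimal Python sketch (the cell counts here are illustrative, not from the article):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from 2x2 table cell counts."""
    return {
        "sensitivity": tp / (tp + fn),  # of everyone sick, fraction the test caught
        "specificity": tn / (tn + fp),  # of everyone healthy, fraction correctly cleared
        "ppv": tp / (tp + fp),          # positive test: chance the patient has it
        "npv": tn / (tn + fn),          # negative test: chance the patient is fine
    }

# Illustrative counts: 90 TP, 50 FP, 10 FN, 850 TN
metrics = diagnostic_metrics(tp=90, fp=50, fn=10, tn=850)
print(round(metrics["sensitivity"], 2))  # 0.9
print(round(metrics["ppv"], 2))          # 0.64
```

On the exam you do the same thing on the noteboard: fill the four cells, then divide.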

The Mnemonics That Actually Help

SnNOut: A highly Sensitive test, when Negative, rules Out disease. (Because sensitivity means few false negatives — if the test is negative, you can trust it.)

SpPIn: A highly Specific test, when Positive, rules In disease. (Because specificity means few false positives — if the test is positive, you can trust it.)

These aren't just mnemonics — they tell you why we use certain tests for screening (high sensitivity) versus confirmation (high specificity).

The PPV Trap (They Love This Question)

Here's the question they ask almost every exam: "A test has 99% sensitivity and 99% specificity. The disease prevalence is 1%. What is the PPV?"

Most residents guess somewhere around 95%. The actual answer is about 50%. Here's why:

In a population of 10,000:

  • 100 people have the disease (1% prevalence)
  • The test catches 99 of them (99% sensitivity) → 99 TPs, 1 FN
  • Of the 9,900 healthy people, 1% test positive (99% specificity) → 99 FPs, 9,801 TNs
  • PPV = 99 / (99 + 99) = 50%
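
The arithmetic above can be replayed for any prevalence, which is the fastest way to convince yourself the 50% answer is right. A small sketch using the article's numbers:

```python
population = 10_000
prevalence = 0.01
sensitivity = specificity = 0.99

diseased = population * prevalence        # 100 people
healthy = population - diseased           # 9,900 people
tp = diseased * sensitivity               # 99 true positives
fp = healthy * (1 - specificity)          # 99 false positives
ppv = tp / (tp + fp)
print(f"PPV = {ppv:.0%}")  # PPV = 50%
```

Try prevalence = 0.10 and watch the PPV jump; that is the prevalence dependence in action.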

Half of all positive results are false positives. Even with a near-perfect test. This is why screening for rare diseases produces so many false alarms, and it's why the exam keeps testing this concept — because the intuition is wrong and the math is right.

Contrarian take: Many study guides tell you to memorize "PPV increases with prevalence, NPV decreases with prevalence" as a fact. That's correct but useless. If you understand why — more sick people in the population means each positive test is more likely to be a true positive — you'll never get this wrong, even when they change the numbers. Understanding beats memorization every time in biostats.

NNT: The Most Clinically Useful Number They Test

Number Needed to Treat answers the question every patient should be asking: "How many people have to take this drug for one person to actually benefit?"

NNT = 1 / ARR (Absolute Risk Reduction)

ARR = Control Event Rate − Treatment Event Rate

Worked example: A statin reduces heart attacks from 8% to 5% over 5 years.

  • ARR = 0.08 − 0.05 = 0.03
  • NNT = 1 / 0.03 = 33.3 → round up to 34

You need to treat 34 patients for 5 years for one patient to avoid a heart attack. Whether that's "worth it" depends on cost, side effects, and patient values — but that's the number.

NNH (Number Needed to Harm) uses the same formula with the absolute risk increase for adverse events.
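
Both formulas fit in a few lines. A sketch using the statin example above (rounding NNT up is the standard convention; I apply the same convention here):

```python
import math

def nnt(control_rate, treatment_rate):
    """Number needed to treat: 1 / absolute risk reduction, rounded up."""
    arr = control_rate - treatment_rate
    return math.ceil(1 / arr)

# Statin example: heart attacks fall from 8% to 5% over 5 years.
print(nnt(0.08, 0.05))  # 34
```

For NNH, the same division applies with the absolute risk increase of the adverse event in the denominator.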

Surprising insight: Here's something most review books skip — NNT is always context-dependent on the time horizon. An NNT of 34 "over 5 years" is very different from an NNT of 34 "over 1 year." When the exam gives you NNT data, check the timeframe. Some answer choices will try to trick you by comparing NNTs from studies with different follow-up periods.

Relative Risk vs. Odds Ratio: Know Which Study Uses Which

This is where residents lose easy points — not because the math is hard, but because they mix up when to use which measure.

Relative Risk (RR): Used in cohort studies and RCTs.

  • RR = Risk in exposed / Risk in unexposed
  • RR = 1 → no association
  • RR > 1 → exposure increases risk
  • RR < 1 → exposure is protective

Odds Ratio (OR): Used in case-control studies.

  • OR = (a × d) / (b × c), where a = exposed cases, b = exposed controls, c = unexposed cases, d = unexposed controls (the cross-product of the exposure 2×2 table)
  • When disease is rare (<10%), OR ≈ RR
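
You can see the rare-disease approximation numerically. In the sketch below, a, b, c, d are the exposure-table cells (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases); the counts are made up to give a roughly 1-2% outcome risk:

```python
def relative_risk(a, b, c, d):
    """Risk in exposed divided by risk in unexposed (cohort/RCT data)."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc (case-control data)."""
    return (a * d) / (b * c)

# Rare outcome (1-2% risk): OR closely approximates RR.
print(round(relative_risk(20, 980, 10, 990), 2))  # 2.0
print(round(odds_ratio(20, 980, 10, 990), 2))     # 2.02
```

Make the outcome common (say, 40% risk) and the two numbers diverge, which is exactly why the rare-disease condition matters.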

The classic Step 3 trap: A question describes a case-control study and asks you to calculate the "relative risk." The correct approach is to calculate the odds ratio, because RR cannot be directly computed from case-control data. They're testing whether you know the difference, not whether you can do arithmetic.

Study Design: The Hierarchy They Keep Testing

You need to know what each design can and cannot tell you:

Design               Measures                     Strengths                              Weaknesses
RCT                  Causation                    Gold standard, controls confounding    Expensive, ethical limitations
Cohort               RR, incidence, temporality   Multiple outcomes, temporal sequence   Expensive, attrition, takes years
Case-Control         OR only                      Fast, cheap, good for rare diseases    Recall bias, can't calculate incidence
Cross-Sectional      Prevalence, association      Quick snapshot                         No temporality, no causation
Case Report/Series   Nothing statistical          Generates hypotheses                   No comparison group

The hierarchy matters because the exam will ask: "What study design is most appropriate to investigate [X]?" The answer depends on what you need to measure and what's ethically feasible.

Quick decision tree:

  • Need to prove causation? → RCT (if ethical)
  • Studying a rare disease? → Case-control
  • Want incidence data? → Cohort
  • Need a quick prevalence estimate? → Cross-sectional
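
That decision tree is small enough to encode directly. A toy lookup, just to make the mapping explicit (the keys and labels are my own shorthand, not exam terminology):

```python
def pick_design(need: str) -> str:
    """Map what you need to measure to the study design from the tree above."""
    tree = {
        "causation": "RCT (if ethical)",
        "rare disease": "Case-control",
        "incidence": "Cohort",
        "prevalence": "Cross-sectional",
    }
    return tree.get(need, "re-read the question: what do you need to measure?")

print(pick_design("rare disease"))  # Case-control
```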

Bias: The Seven Types They Love

I could list 20 types of bias, but Step 3 tests about seven repeatedly. Know these cold:

Selection bias — your study population doesn't represent the target population. Example: studying heart disease only in hospitalized patients misses mild cases. Fix: random sampling.

Recall bias — participants with the outcome remember exposures differently. Example: mothers of children with birth defects remember medication use more vividly than control mothers. Fix: prospective design, objective records.

Observer/measurement bias — the person measuring the outcome knows the exposure status. Fix: blinding (single-blind, double-blind).

Lead-time bias — screening makes survival appear longer by detecting disease earlier, even if death occurs at the same age. This is the most counterintuitive bias and the one they test most often. Fix: measure mortality rate, not survival from diagnosis.

Length-time bias — screening preferentially catches slow-growing, less aggressive disease (because it's present longer to be detected). Fix: RCT comparing screened vs. unscreened populations.

Attrition bias — participants drop out differentially between groups. Sicker patients in the treatment group stop taking the drug → the treatment group looks healthier. Fix: intention-to-treat analysis.

Confounding — a third variable explains the apparent association. Classic example: coffee drinking is associated with lung cancer — but only because coffee drinkers are more likely to smoke. Fix: randomization, stratification, multivariate analysis.

Statistical Tests: Don't Calculate, Just Match

The exam never asks you to actually perform a t-test or chi-squared test. It asks you to identify which test is appropriate. Here's the decision tree:

Comparing means (continuous data):

  • 2 groups, normal distribution → Student's t-test
  • 2 groups, non-normal → Mann-Whitney U
  • 3+ groups, normal → ANOVA
  • 3+ groups, non-normal → Kruskal-Wallis
  • Before/after (paired data) → Paired t-test

Comparing proportions (categorical data):

  • Large sample → Chi-squared
  • Small sample (any expected cell < 5) → Fisher's exact test

Survival data:

  • Comparing survival curves → Log-rank test
  • Displaying survival over time → Kaplan-Meier curve

That's the entire list: eight tests and one curve. The decision is always: what type of data, how many groups, and is the distribution normal?
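
Because the choice depends on just those three inputs, the whole decision tree fits in one function. A sketch (parameter names are mine):

```python
def choose_test(data_type, groups=2, normal=True, paired=False, small_sample=False):
    """Pick the statistical test from data type, group count, and distribution."""
    if data_type == "continuous":
        if paired:
            return "Paired t-test"
        if groups == 2:
            return "Student's t-test" if normal else "Mann-Whitney U"
        return "ANOVA" if normal else "Kruskal-Wallis"
    if data_type == "categorical":
        # "Small sample" here means any expected cell count below 5.
        return "Fisher's exact test" if small_sample else "Chi-squared"
    if data_type == "survival":
        return "Log-rank test"
    raise ValueError(f"unknown data type: {data_type}")

print(choose_test("continuous", groups=3, normal=False))  # Kruskal-Wallis
```

Walking a few exam stems through this function is a fast way to drill the tree.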

P-Values and Confidence Intervals: The Two Concepts That Trip Everyone

P-Value

The probability of observing your result (or something more extreme) if the null hypothesis were true. That's it. It does NOT tell you:

  • The probability that the null hypothesis is true
  • The probability that your result is real
  • How clinically important the finding is

p < 0.05 = statistically significant by convention. But a p-value of 0.001 in a study showing a 0.1 mmHg blood pressure reduction is statistically significant and clinically meaningless.

Confidence Intervals

The 95% CI tells you: "If we repeated this study many times, about 95% of the resulting CIs would contain the true value."

The decision rule for Step 3:

  • If the 95% CI for an OR or RR includes 1.0 → NOT statistically significant
  • If the 95% CI for an OR or RR does NOT include 1.0 → statistically significant
  • For mean differences: if the CI includes 0 → not significant

Classic Step 3 question: "RR = 1.4, 95% CI 0.9–2.1. Is this significant?" No. The CI crosses 1.0. The association might be real, but this study didn't prove it.
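
The decision rule reduces to a single comparison: does the interval contain the null value? A two-line sketch:

```python
def is_significant(ci_low, ci_high, null_value=1.0):
    """True when the CI excludes the null value (1.0 for RR/OR, 0 for mean differences)."""
    return not (ci_low <= null_value <= ci_high)

print(is_significant(0.9, 2.1))                 # False (the example above)
print(is_significant(1.2, 2.1))                 # True
print(is_significant(-0.5, 3.0, null_value=0))  # False
```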

Type I Error, Type II Error, and Power

  • Type I error (α): You find a difference that doesn't exist. False positive. Set at 0.05.
  • Type II error (β): You miss a difference that does exist. False negative.
  • Power = 1 − β: The probability of detecting a real effect. Target: ≥0.80.

"How do you increase power?" — This question appears on nearly every exam.

The answer they want most often: increase sample size. Other correct answers: increase effect size, increase alpha, decrease measurement variability. But "increase sample size" is correct about 80% of the time.
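
You can watch sample size drive power with the standard normal-approximation formula for comparing two proportions. This is a textbook sketch, not a validated sample-size tool, and the event rates are borrowed from the statin example earlier:

```python
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    z = NormalDist()
    p_bar = (p1 + p2) / 2
    se_null = (2 * p_bar * (1 - p_bar) / n_per_group) ** 0.5
    se_alt = (p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group) ** 0.5
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf((abs(p1 - p2) - z_alpha * se_null) / se_alt)

# Same effect (8% vs 5% event rate), doubling n per group: power climbs.
print(round(power_two_proportions(0.08, 0.05, 1000), 2))  # ~0.78
print(round(power_two_proportions(0.08, 0.05, 2000), 2))  # ~0.97
```

The same function also shows the other levers: widen the effect (p1 further from p2) or raise alpha and power goes up.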

ITT vs. Per-Protocol: Know Which One the Exam Prefers

Intention-to-treat (ITT): Analyze everyone based on their assigned group, even if they didn't complete the study or switched groups. Preserves randomization. Reflects real-world effectiveness.

Per-protocol: Analyze only participants who completed the study as designed. May overestimate treatment effect. Subject to attrition bias.

Step 3 almost always prefers ITT because it mirrors real clinical practice — patients don't always take their medications, and ITT accounts for that reality.

The Quick Reference Card

For exam day, here's everything on one mental card:

  • Sensitivity: TP/(TP+FN) — SnNOut
  • Specificity: TN/(TN+FP) — SpPIn
  • PPV: TP/(TP+FP) — affected by prevalence
  • NPV: TN/(TN+FN) — affected by prevalence
  • NNT: 1/ARR
  • RR: cohort/RCT. OR: case-control
  • CI includes 1.0 = not significant
  • Increase power = increase sample size
  • Lead-time bias: screening doesn't change outcome, just detection timing
  • ITT > per-protocol for most Step 3 questions

These questions are genuine free points. Step3Sim includes dedicated biostatistics question sets that drill every concept here with worked examples and step-by-step explanations.

FAQ

Q: How many biostatistics questions are on Step 3? Roughly 5-8% of the exam, which translates to approximately 15-25 questions across both days. That's a meaningful chunk of your score for a finite body of knowledge that can be mastered in 5-7 hours of focused study.

Q: Should I memorize all the statistical tests? Don't memorize them — learn the decision tree. The question is always: what type of data (continuous vs. categorical), how many groups, and normal vs. non-normal distribution? If you can answer those three questions, the correct test falls out automatically.

Q: Is lead-time bias or length-time bias tested more often? Lead-time bias appears more frequently and is more commonly answered incorrectly. The key insight is that early detection through screening does NOT automatically improve outcomes — it may simply add "years of being a patient" without changing the death date. Length-time bias is tested less often but catches people who assume all screen-detected cancers are representative of the full disease spectrum.

Q: Do I need to know how to calculate a chi-squared statistic? No. You need to know when to use it (categorical data, comparing proportions, adequate sample size). The exam gives you the test result or p-value — it never asks you to compute the test statistic itself.

Q: What's the most commonly tested biostats concept on Step 3? The PPV-prevalence relationship, the 2×2 table calculations, and the CI interpretation for significance. If you nail these three, you'll get the majority of biostats questions correct.