Introduction to Biostatistics in Medicine
Welcome to the website for HST 190: Introduction to Biostatistics, an introductory course in (bio)statistics for students in the Health Sciences and Technology program at Harvard Medical School (HMS) and MIT. This version of the course was most recently offered in Summer 2024.
Course Description
Evidence-based medicine—and evidence-based innovation in the health and medical sciences—depends critically on the evaluation and interpretation of data. Not only does this require a deep understanding of the scientific subject matter (i.e., a medical specialty), but it also necessitates a strong grasp of the research methodology necessary to distill experimental data into evidence. Biostatistics serves as a foundational pillar of research methodology in the biomedical and health sciences, ideally acting as an engine for evidence evaluation and as a safeguard against drawing incorrect inferences from data. This course provides training on how to comprehend, critique and communicate findings from the health and medical sciences literature, focusing especially on how to assess the importance of chance in the interpretation of experimental and observational data. Major topics to be reviewed include core elements of probability and statistics, statistical inference, the linear model and its generalizations, survival analysis, causal inference, statistical considerations for clinical trials, and reproducible statistical data analysis. These topics will be reviewed together with the critical reading of studies published in the health and medical sciences literature. Emphasis is on understanding how careful statistical thinking can help prevent mistakes that arise from relying upon intuition alone and how the rote application of statistical techniques (that is, a “toolbox approach”) can lead to logical mistakes that invalidate an analysis and undermine its scientific conclusions.
Course Outline
The course is divided into a series of modules. The exact weekly coverage of topics is subject to change, as it will depend on the progress of the class. The anticipated time for completing each module is given below. Since the course consists of roughly ten 3-hour meetings, each module will be covered in one to three class sessions. Below, required readings are listed in bold; all other readings are optional, though highly recommended. Consult the course syllabus for more details.
- Biostatistics and data science in medicine: Brief history of (bio)statistics; philosophy of science and statistics; numerical and categorical data; exploratory data analysis; principles of reproducible statistical data science; observational studies and randomized controlled trials.
  - Time anticipated: 1 meeting (slides: from 12th Aug.)
  - Books and tutorials: Vu and Harrington (2020) (Ch. 1), Altman (1990) (Ch. 1–3), Jewell (2004) (Ch. 1–2), Freedman (2009) (Ch. 1), Senn (2022) (Ch. 1–2)
  - Notes from The BMJ: Altman and Bland (1999a), Altman and Bland (1994), Altman and Bland (1998a), Altman and Bland (1999b), Bland and Altman (2011)
  - Literature and commentaries: Kluytmans-van Den Bergh et al. (2020), Altman (1998), van der Laan (2015), Stark and Saltelli (2018), Wainer (1984), Broman and Woo (2018), Peng and Hicks (2021)
- Concepts of probability: Basic axioms and laws of probability and conditional probability; Bayes’ theorem; random variables and their key properties; expectation, conditional expectation, and covariance; common families of random variables (Normal, Binomial, Poisson); marginal, joint, and conditional distributions; Bernoulli trials.
  - Time anticipated: 2 meetings (slides: from 14th Aug. and from 16th Aug.)
  - Books and tutorials: Vu and Harrington (2020) (Ch. 2–3), Altman (1990) (Ch. 4), Jewell (2004) (Ch. 3), Aronow and Miller (2019) (Ch. 1–2), Senn (2022) (Ch. 3–4)
  - Notes from The BMJ: None assigned
  - Literature and commentaries: Freedman and Stark (2001) (on probability), Altamirano et al. (2020) (measuring incidence), Milne and Whitty (1995) (using approximations)
- Concepts of statistics: Point estimation and variability of point estimates; confidence intervals and hypothesis testing; the Normal and \(t\)-distributions; one-sample inference (means and proportions); two-sample inference for paired and independent data (difference-of-means and difference-of-proportions); analysis of variance (ANOVA); contingency tables, \(\chi^2\) tests of association, and Fisher’s exact test; overview of permutation testing.
  - Time anticipated: 2 meetings (slides: from 19th Aug. and from 21st Aug.)
  - Books and tutorials: Vu and Harrington (2020) (Ch. 4–5, 8), Altman (1990) (Ch. 5, 8–10), Jewell (2004) (Ch. 4–7, 11), Aronow and Miller (2019) (Ch. 3), Senn (2022) (Ch. 5)
  - Notes from The BMJ: Bland and Altman (1994a), Bland and Altman (1995), Altman and Bland (1996), Altman and Bland (2005), Altman and Bland (2014a)
  - Literature and commentaries: Cranston et al. (2020) (inference with one sample), Chowdhary et al. (2020) (inference with two samples), Rafi and Greenland (2020), Cole, Edwards, and Greenland (2021)
- Linear regression and its generalizations: Regression to the mean; the general linear model and interpretation of model parameters; estimating linear regression parameters via least squares; point estimation of and inference for regression parameters; multiple linear regression and statistical interaction; ANOVA (revisited) and ANCOVA; introduction to generalized linear models (GLMs) via logistic regression; principles of model evaluation and model selection; introduction to regularized regression for statistical prediction.
  - Time anticipated: 2 meetings (slides: from 21st Aug. and from 23rd Aug.)
  - Books and tutorials: Vu and Harrington (2020) (Ch. 6–7), Altman (1990) (Ch. 11–12, 15), Jewell (2004) (Ch. 12–13), Freedman (2009) (Ch. 2–5), Aronow and Miller (2019) (Ch. 4–5), Senn (2022) (Ch. 6, 8)
  - Notes from The BMJ: Altman and Royston (2006), Bland and Altman (1994b), Bland and Altman (1994c), Bland and Altman (2000)
  - Literature and commentaries: Turner et al. (2020) (linear regression), Avadhanula et al. (2020) (logistic regression), Freedman (2008a)
- Survival analysis, causal inference, and clinical trials: Time-to-event outcomes and right-censoring; the Kaplan-Meier estimator of the survival curve, Greenwood’s formula, and the log-rank test; overview of the Cox proportional hazards model; randomized controlled trials and confounding; the Neyman-Rubin potential outcomes framework and the fundamental problem of causal inference; target trial emulation and the ICH estimands framework for clinical trials.
  - Time anticipated: 3 meetings (slides: from 28th Aug., from 30th Aug., and from 4th Sep.)
  - Books and tutorials: Freedman (2008b), Dalgaard (2008) (Ch. 14), Altman (1990) (Ch. 13), Jewell (2004) (Ch. 8, 17), Aronow and Miller (2019) (Ch. 6–7), Senn (2022) (Ch. 7, 9)
  - Notes from The BMJ: Altman and Bland (1998b), Bland and Altman (1998), Bland and Altman (2004), Altman and Bland (2007), Altman and Bland (2014b)
  - Literature and commentaries: Nguyen et al. (2019) (survival analysis), Hoffman et al. (2022) (causal inference), Weir, Dufault, and Phillips (2024) (estimands framework), Hernán and Taubman (2008), Hernán and Robins (2016), Hernán et al. (2016), Hernán and Monge (2023), US Food and Drug Administration (2021), Kahan et al. (2023), Kahan et al. (2024)
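To give a flavor of the final module, here is a minimal sketch of the Kaplan-Meier estimator with Greenwood's formula for its standard error. It is an illustration only, not course material: the function name `kaplan_meier` and the eight follow-up times are made up for this example, the code uses plain Python rather than whatever software the course uses, and it assumes the usual convention that events precede censorings at tied times.

```python
from collections import Counter
from math import sqrt


def kaplan_meier(times, events):
    """Return (time, survival, standard_error) at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if the subject was right-censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # events plus censorings leaving the risk set at t

    at_risk = len(times)
    survival = 1.0
    greenwood_sum = 0.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1 - d / at_risk
            if at_risk > d:
                # Greenwood's formula accumulates d / (n * (n - d)).
                greenwood_sum += d / (at_risk * (at_risk - d))
                se = survival * sqrt(greenwood_sum)
            else:
                # Everyone still at risk had the event; the Greenwood term
                # is undefined here, so report NaN rather than guess.
                se = float("nan")
            curve.append((t, survival, se))
        # Events and censorings at t both leave the risk set afterwards.
        at_risk -= exits[t]
    return curve


# Hypothetical data: eight subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 1, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s, se in curve:
    print(f"t = {t:>2}  S(t) = {s:.3f}  SE = {se:.3f}")
```

With no censoring before the first event time, the Greenwood standard error there reduces to the familiar binomial standard error, which is a useful sanity check on any implementation.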
Acknowledgments
This course is based heavily on materials from a previous version (2017-2022), developed and taught by Prof. Sebastien Haneuse (who generously shared his materials), and on resources developed for use of the Vu and Harrington (2020) textbook in Harvard’s STAT 102.
References
Dalgaard, Peter. 2008. Introductory Statistics with R. Springer. https://doi.org/10.1007/978-0-387-79054-1.