Introduction to Biostatistics in Medicine

Published

03 September 2025

Welcome to HST 190: Introduction to Biostatistics, an introductory course in (bio)statistical methods for students in the Health Sciences and Technology program at Harvard Medical School (HMS) and MIT. This version of the course was most recently offered in Summer 2025.

Course Description

Evidence-based medicine, like evidence-based innovation in the health and medical sciences, depends critically on the evaluation and interpretation of data. This requires not only a deep understanding of the scientific subject matter (i.e., a medical specialty) but also a strong grasp of the research methodology needed to distill experimental data into evidence. Biostatistics serves as a foundational pillar of research methodology in the biomedical and health sciences, ideally acting as an engine for evidence evaluation and as a safeguard against drawing incorrect inferences from data. This course provides training on how to comprehend, critique, and communicate findings from the health and medical sciences literature, focusing especially on how to assess the importance of chance in the interpretation of experimental and observational data. Major topics to be reviewed include core elements of probability and statistics, statistical inference, the linear model and its generalizations, survival analysis, causal inference, statistical considerations for clinical trials, and reproducible statistical data analysis. These topics will be reviewed together with the critical reading of studies published in the health and medical sciences literature. Emphasis is on understanding how careful statistical thinking can help prevent mistakes that arise from relying upon intuition alone and how the rote application of statistical techniques (that is, a “toolbox approach”) can lead to logical mistakes that invalidate an analysis and undermine its scientific conclusions.

Course Outline

The course is divided into a series of modules. The exact weekly coverage of topics is subject to change, as it will depend on the progress of the class. Anticipated time for the completion of each module is given below. Since the course consists of roughly ten 3-hour meetings, each module will be covered in just one to three class sessions. Below, required readings are listed in bold; all other readings are optional, though highly recommended. Consult the course syllabus for more details.

Biostatistics and data science in medicine: Brief history of (bio)statistics; philosophy of science and statistics; numerical and categorical data; exploratory data analysis; principles of reproducible statistical data science; observational studies and randomized controlled trials.
- Time anticipated: 1 meeting (slides: from 11th Aug.)
- Books and tutorials: Vu and Harrington (2020) (Sec. 1.2, 1.3.5–1.3.7, 1.4, 1.5, 1.6), Altman (1990) (Ch. 1–3), Jewell (2004) (Ch. 1–2), Freedman (2009) (Ch. 1), Senn (2022) (Ch. 1–2)
- Notes from The BMJ: Altman and Bland (1999a), Altman and Bland (1994), Altman and Bland (1998a), Altman and Bland (1999b), Bland and Altman (2011)
- Literature and commentaries: Kluytmans-van Den Bergh et al. (2020), Altman (1998), van der Laan (2015), Stark and Saltelli (2018), Wainer (1984), Broman and Woo (2018), Peng and Hicks (2021)
Concepts of probability: Basic axioms and laws of probability and conditional probability; Bayes’ theorem; random variables and their key properties; expectation, conditional expectation, and covariance; common families of random variables (Normal, Binomial, Poisson); marginal, joint, and conditional distributions; Bernoulli trials.
- Time anticipated: 1 meeting (slides: from 13th Aug., part i and part ii)
- Books and tutorials: Vu and Harrington (2020) (Sec. 2.1–2.2 and 3.1.1–3.1.3, 3.2, 3.3, 3.4), Altman (1990) (Ch. 4), Jewell (2004) (Ch. 3), Aronow and Miller (2019) (Ch. 1–2), Senn (2022) (Ch. 3–4)
- Notes from The BMJ: None assigned
- Literature and commentaries: Freedman and Stark (2001) (on probability), Altamirano et al. (2020) (measuring incidence), Milne and Whitty (1995) (using approximations)
Concepts of statistics: Point estimation and variability of point estimates; confidence intervals and hypothesis testing; the Normal and $t$-distributions; one-sample inference (means and proportions); two-sample inference for paired and independent data (difference-of-means and difference-of-proportions); analysis of variance (ANOVA); contingency tables, $\chi^{2}$ tests of association, and Fisher’s exact test; overview of permutation testing.
- Time anticipated: 2 meetings (slides: from 15th Aug. and from 18th Aug.)
- Books and tutorials: Vu and Harrington (2020) (Sec. 4.1–4.3 and 5.1–5.5 and 8.1–8.5), Altman (1990) (Ch. 5, 8–10), Jewell (2004) (Ch. 4–7, 11), Aronow and Miller (2019) (Ch. 3), Senn (2022) (Ch. 5)
- Notes from The BMJ: Bland and Altman (1994a), Bland and Altman (1995), Altman and Bland (1996), Altman and Bland (2005), Altman and Bland (2014a)
- Literature and commentaries: Cranston et al. (2020) (inference with one sample), Chowdhary et al. (2020) (inference with two samples), Rafi and Greenland (2020), Cole, Edwards, and Greenland (2021)
Linear regression and its generalizations: Regression to the mean; the general linear model and interpretation of model parameters; estimating linear regression parameters via least squares; point estimation of and inference for regression parameters; multiple linear regression and statistical interaction; ANOVA (revisited) and ANCOVA; introduction to generalized linear models (GLMs) via logistic regression; principles of model evaluation and model selection; introduction to regularized regression for statistical prediction.
- Time anticipated: 3 meetings (slides: from 20th Aug., from 22nd Aug., and from 25th Aug.)
- Books and tutorials: Vu and Harrington (2020) (Sec. 6.1–6.5 and 7.1–7.8), LaValley (2008), Altman (1990) (Ch. 11–12, 15), Jewell (2004) (Ch. 12–13), Freedman (2009) (Ch. 2–5), Aronow and Miller (2019) (Ch. 4–5), Senn (2022) (Ch. 6, 8)
- Notes from The BMJ: Altman and Royston (2006), Bland and Altman (1994b), Bland and Altman (1994c), Bland and Altman (2000)
- Literature and commentaries: Turner et al. (2020) (linear regression), Avadhanula et al. (2020) (logistic regression), Freedman (2008)
Survival analysis, causal inference, and clinical trials: Time-to-event outcomes and right-censoring; the Kaplan-Meier estimator of the survival curve, Greenwood’s formula, and the log-rank test; overview of the Cox proportional hazards model; randomized controlled trials and confounding; the Neyman-Rubin potential outcomes framework and the fundamental problem of causal inference; target trial emulation and the ICH estimands framework for clinical trials.
- Time anticipated: 3 meetings (slides: from 27th Aug., from 29th Aug., and from 3rd Sep.)
- Books and tutorials: Clark et al. (2003), Bradburn et al. (2003), Dalgaard (2008) (Ch. 14), Altman (1990) (Ch. 13), Jewell (2004) (Ch. 8, 17), Aronow and Miller (2019) (Ch. 6–7), Senn (2022) (Ch. 7, 9)
- Notes from The BMJ: Altman and Bland (1998b), Bland and Altman (1998), Bland and Altman (2004), Altman and Bland (2007), Altman and Bland (2014b)
- Literature and commentaries: Nguyen et al. (2019) (survival analysis), Hernán et al. (2008) (causal inference), Weir, Dufault, and Phillips (2024) (estimands framework), Hernán and Robins (2006), Hernán and Taubman (2008), Hernán and Robins (2016), Hernán et al. (2016), Hernán (2018), Hernán and Monge (2023), US Food and Drug Administration (2021), Kahan et al. (2023), Kahan et al. (2024)
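
As a brief orientation to two of the quantities named in the final module, the Kaplan-Meier estimator of the survival curve and Greenwood’s formula for its variance are conventionally written as follows. The symbols here are this sketch’s own choice and may differ from the notation in the course slides: $t_j$ denotes the distinct observed event times, $d_j$ the number of events at $t_j$, and $n_j$ the number of subjects still at risk just before $t_j$.

$$
\hat{S}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j}\right),
\qquad
\widehat{\mathrm{Var}}\bigl[\hat{S}(t)\bigr] \approx \hat{S}(t)^{2} \sum_{j:\, t_j \le t} \frac{d_j}{n_j\,(n_j - d_j)}.
$$

For illustration only, take a made-up sample of five subjects with events at times 2 and 4 and right-censoring at times 3, 5, and 6. Then $n_1 = 5$, $d_1 = 1$ at $t_1 = 2$ and $n_2 = 3$, $d_2 = 1$ at $t_2 = 4$, so that $\hat{S}(4) = (1 - 1/5)(1 - 1/3) = 8/15 \approx 0.53$, with Greenwood variance $(8/15)^{2}\,(1/20 + 1/6) \approx 0.062$ (standard error about $0.25$).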

Resources

Although course participants are not expected to be fluent in a high-level statistical or numerical programming language (e.g., R), modern (bio)statistics is a computing-intensive discipline; thus, familiarity with the basics of numerical/statistical programming will be helpful in wading through the material at a fast pace. Homework assignments and, to a lesser extent, in-class demonstrations will use the R language for statistical computing and graphics. Here are some resources that can help you to learn R, or that may be helpful in brushing up if you’ve already had an introduction.

- Programming with R, by Software Carpentry
- R for Reproducible Scientific Analysis, by Software Carpentry
- Data Analysis and Visualization in R for Ecologists, by Data Carpentry
- Introduction to R for Genomics, by Data Carpentry
- Introduction to Programming with R, by Harvard CS50

Beyond these specific introductions, Software Carpentry and Data Carpentry offer access to free introductory lessons on various topics in practical computing, targeted largely to applied scientists. You may want to browse their lessons: https://software-carpentry.org/lessons/ and https://datacarpentry.org/lessons/.

Acknowledgments

This course is based heavily on materials from a previous version (2017–2022), developed and taught by Prof. Sebastien Haneuse (who generously shared his materials), and on resources developed for use with the Vu and Harrington (2020) textbook in Harvard’s STAT 102.

References

Altamirano, Jonathan, Prasanthi Govindarajan, Andra L Blomkalns, Lauren E Kushner, Bryan Andrew Stevens, Benjamin A Pinsky, and Yvonne Maldonado. 2020. “Assessment of Sensitivity and Specificity of Patient-Collected Lower Nasal Specimens for Severe Acute Respiratory Syndrome Coronavirus 2 Testing.” JAMA Network Open 3 (6): e2012005. https://doi.org/10.1001/jamanetworkopen.2020.12005.

Altman, Douglas G. 1990. Practical Statistics for Medical Research. CRC Press.

———. 1998. “Statistical Reviewing for Medical Journals.” Statistics in Medicine 17 (23): 2661–74. https://doi.org/10.1002/(SICI)1097-0258(19981215)17:23<2661::AID-SIM33>3.0.CO;2-B.

Altman, Douglas G, and J Martin Bland. 1994. “Statistics Notes: Quartiles, Quintiles, Centiles, and Other Quantiles.” The BMJ 309 (6960): 996. https://doi.org/10.1136/bmj.309.6960.996.

———. 1996. “Statistics Notes: Comparing Several Groups Using Analysis of Variance.” The BMJ 312 (7044): 1472–73. https://doi.org/10.1136/bmj.312.7044.1472.

———. 1998a. “Generalisation and Extrapolation.” The BMJ 317 (7155): 409. https://doi.org/10.1136/bmj.317.7155.409.

———. 1998b. “Time to Event (Survival) Data.” The BMJ 317 (7156): 468–69. https://doi.org/10.1136/bmj.317.7156.468.

———. 1999a. “Statistics Notes: Variables and Parameters.” The BMJ 318 (7199): 1667. https://doi.org/10.1136/bmj.318.7199.1667.

———. 1999b. “Treatment Allocation in Controlled Trials: Why Randomise?” The BMJ 318 (7192): 1209. https://doi.org/10.1136/bmj.318.7192.1209.

———. 2005. “Standard Deviations and Standard Errors.” The BMJ 331 (7521): 903. https://doi.org/10.1136/bmj.331.7521.903.

———. 2007. “Missing Data.” The BMJ 334 (7590): 424. https://doi.org/10.1136/bmj.38977.682025.2C.

———. 2014a. “Uncertainty and Sampling Error.” The BMJ 349. https://doi.org/10.1136/bmj.g7064.

———. 2014b. “Uncertainty Beyond Sampling Error.” The BMJ 349. https://doi.org/10.1136/bmj.g7065.

Altman, Douglas G, and Patrick Royston. 2006. “The Cost of Dichotomising Continuous Variables.” The BMJ 332 (7549): 1080. https://doi.org/10.1136/bmj.332.7549.1080.

Aronow, Peter M, and Benjamin T Miller. 2019. Foundations of Agnostic Statistics. Cambridge University Press. https://doi.org/10.1017/9781316831762.

Avadhanula, Shirisha, Wendy J Introne, Sungyoung Auh, Steven J Soldin, Brian Stolze, Debra Regier, Carla Ciccone, et al. 2020. “Assessment of Thyroid Function in Patients with Alkaptonuria.” JAMA Network Open 3 (3): e201357. https://doi.org/10.1001/jamanetworkopen.2020.1357.

Bland, J Martin, and Douglas G Altman. 1994a. “Statistics Notes: One- and Two-Sided Tests of Significance.” The BMJ 309 (6949): 248. https://doi.org/10.1136/bmj.309.6949.248.

———. 1994b. “Statistics Notes: Regression Towards the Mean.” The BMJ 308 (6942): 1499. https://doi.org/10.1136/bmj.308.6942.1499.

———. 1994c. “Statistics Notes: Some Examples of Regression Towards the Mean.” The BMJ 309 (6957): 780. https://doi.org/10.1136/bmj.309.6957.780.

———. 1995. “Statistics Notes: The Normal Distribution.” The BMJ 310 (6975): 298. https://doi.org/10.1136/bmj.310.6975.298.

———. 1998. “Survival Probabilities (the Kaplan-Meier Method).” The BMJ 317 (7172): 1572. https://doi.org/10.1136/bmj.317.7172.1572.

———. 2000. “The Odds Ratio.” The BMJ 320 (7247): 1468. https://doi.org/10.1136/bmj.320.7247.1468.

———. 2004. “The Logrank Test.” The BMJ 328 (7447): 1073. https://doi.org/10.1136/bmj.328.7447.1073.

———. 2011. “Comparisons Within Randomised Groups Can Be Very Misleading.” The BMJ 342. https://doi.org/10.1136/bmj.d561.

Bradburn, Mike J, Taane G Clark, Sharon B Love, and Douglas G Altman. 2003. “Survival Analysis Part II: Multivariate Data Analysis – An Introduction to Concepts and Methods.” British Journal of Cancer 89 (3): 431–36. https://doi.org/10.1038/sj.bjc.6601119.

Broman, Karl W, and Kara H Woo. 2018. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10. https://doi.org/10.1080/00031305.2017.1375989.

Chowdhary, Mudit, Akansha Chowdhary, Trevor J Royce, Kirtesh R Patel, Arpit M Chhabra, Shikha Jain, Miriam A Knoll, Neha Vapiwala, Barbara Pro, and Gaurav Marwaha. 2020. “Women’s Representation in Leadership Positions in Academic Medical Oncology, Radiation Oncology, and Surgical Oncology Programs.” JAMA Network Open 3 (3): e200708. https://doi.org/10.1001/jamanetworkopen.2020.0708.

Clark, Taane G, Michael J Bradburn, Sharon B Love, and Douglas G Altman. 2003. “Survival Analysis Part I: Basic Concepts and First Analyses.” British Journal of Cancer 89 (2): 232–38. https://doi.org/10.1038/sj.bjc.6601118.

Cole, Stephen R, Jessie K Edwards, and Sander Greenland. 2021. “Surprise!” American Journal of Epidemiology 190 (2): 191–93. https://doi.org/10.1093/aje/kwaa136.

Cranston, Jessica S, Sophia Finn Tiene, Karin Nielsen-Saines, Zilton Vasconcelos, Marcos V Pone, Sheila Pone, Andrea Zin, et al. 2020. “Association Between Antenatal Exposure to Zika Virus and Anatomical and Neurodevelopmental Abnormalities in Children.” JAMA Network Open 3 (7): e209303. https://doi.org/10.1001/jamanetworkopen.2020.9303.

Dalgaard, Peter. 2008. Introductory Statistics with R. Springer. https://doi.org/10.1007/978-0-387-79054-1.

Freedman, David A. 2008. “Randomization Does Not Justify Logistic Regression.” Statistical Science 23 (2): 237–49. https://doi.org/10.1214/08-STS262.

———. 2009. Statistical Models: Theory and Practice. Cambridge University Press. https://doi.org/10.1017/CBO9780511815867.

Freedman, David A, and Philip B Stark. 2001. “What Is the Chance of an Earthquake?” 611. University of California, Berkeley. https://statistics.berkeley.edu/tech-reports/611.

Hernán, Miguel A. 2018. “The C-Word: Scientific Euphemisms Do Not Improve Causal Inference from Observational Data.” American Journal of Public Health 108 (5): 616–19. https://doi.org/10.2105/AJPH.2018.304337.

Hernán, Miguel A, Alvaro Alonso, Roger Logan, Francine Grodstein, Karin B Michels, Walter C Willett, JoAnn E Manson, and James M Robins. 2008. “Observational Studies Analyzed Like Randomized Experiments: An Application to Postmenopausal Hormone Therapy and Coronary Heart Disease.” Epidemiology 19 (6): 766–79. https://doi.org/10.1097/EDE.0b013e3181875e61.

Hernán, Miguel A, and Susana Monge. 2023. “Selection Bias Due to Conditioning on a Collider.” The BMJ 381: 1135. https://doi.org/10.1136/bmj.p1135.

Hernán, Miguel A, and James M Robins. 2006. “Estimating Causal Effects from Epidemiological Data.” Journal of Epidemiology & Community Health 60 (7): 578–86. https://doi.org/10.1136/jech.2004.029496.

———. 2016. “Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available.” American Journal of Epidemiology 183 (8): 758–64. https://doi.org/10.1093/aje/kwv254.

Hernán, Miguel A, Brian C Sauer, Sonia Hernández-Díaz, Robert Platt, and Ian Shrier. 2016. “Specifying a Target Trial Prevents Immortal Time Bias and Other Self-Inflicted Injuries in Observational Analyses.” Journal of Clinical Epidemiology 79: 70–75. https://doi.org/10.1016/j.jclinepi.2016.04.014.

Hernán, Miguel A, and Sarah L Taubman. 2008. “Does Obesity Shorten Life? The Importance of Well-Defined Interventions to Answer Causal Questions.” International Journal of Obesity 32 (S3): S8. https://doi.org/10.1038/ijo.2008.82.

Jewell, Nicholas P. 2004. Statistics for Epidemiology. CRC Press. https://doi.org/10.1201/9781482286014.

Kahan, Brennan C, Suzie Cro, Fan Li, and Michael O Harhay. 2023. “Eliminating Ambiguous Treatment Effects Using Estimands.” American Journal of Epidemiology 192 (6): 987–94. https://doi.org/10.1093/aje/kwad036.

Kahan, Brennan C, Joanna Hindley, Mark Edwards, Suzie Cro, and Tim P Morris. 2024. “The Estimands Framework: A Primer on the ICH E9 (R1) Addendum.” The BMJ 384: e076316. https://doi.org/10.1136/bmj-2023-076316.

Kluytmans-van Den Bergh, Marjolein F Q, Anton G M Buiting, Suzan D Pas, Robbert G Bentvelsen, Wouter Van Den Bijllaardt, Anne J G Van Oudheusden, Miranda M L Van Rijen, Jaco J Verweij, Marion P G Koopmans, and Jan A J W Kluytmans. 2020. “Prevalence and Clinical Presentation of Health Care Workers with Symptoms of Coronavirus Disease 2019 in Two Dutch Hospitals During an Early Phase of the Pandemic.” JAMA Network Open 3 (5): e209673. https://doi.org/10.1001/jamanetworkopen.2020.9673.

LaValley, Michael P. 2008. “Logistic Regression.” Circulation 117 (18): 2395–99. https://doi.org/10.1161/CIRCULATIONAHA.106.682658.

Milne, Eugene, and Paula Whitty. 1995. “Calculation of the Need for Paediatric Intensive Care Beds.” Archives of Disease in Childhood 73 (6): 505–7. https://doi.org/10.1136/adc.73.6.505.

Nguyen, Vy T, Ross D Zafonte, Jarvis T Chen, Kalé Z Kponee-Shovein, Sabrina Paganoni, Alvaro Pascual-Leone, Frank E Speizer, et al. 2019. “Mortality Among Professional American-Style Football Players and Professional American Baseball Players.” JAMA Network Open 2 (5): e194223. https://doi.org/10.1001/jamanetworkopen.2019.4223.

Peng, Roger D, and Stephanie C Hicks. 2021. “Reproducible Research: A Retrospective.” Annual Review of Public Health 42: 79–93. https://doi.org/10.1146/annurev-publhealth-012420-105110.

Rafi, Zad, and Sander Greenland. 2020. “Semantic and Cognitive Tools to Aid Statistical Science: Replace Confidence and Significance by Compatibility and Surprise.” BMC Medical Research Methodology 20: 244. https://doi.org/10.1186/s12874-020-01105-9.

Senn, Stephen. 2022. Dicing with Death: Living by Data. Cambridge University Press. https://doi.org/10.1017/9781009000185.

Stark, Philip B, and Andrea Saltelli. 2018. “Cargo-Cult Statistics and Scientific Crisis.” Significance 15 (4): 40–43. https://doi.org/10.1111/j.1740-9713.2018.01174.x.

Turner, Lindsey, Julien Leider, Elizabeth Piekarz-Porter, and Jamie F Chriqui. 2020. “Association of State Laws Regarding Snacks in US Schools with Students’ Consumption of Solid Fats and Added Sugars.” JAMA Network Open 3 (1): e1918436. https://doi.org/10.1001/jamanetworkopen.2019.18436.

US Food and Drug Administration. 2021. “Statistical Principles for Clinical Trials E9 (R1): Addendum: Estimands and Sensitivity Analysis in Clinical Trials.” https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e9r1-statistical-principles-clinical-trials-addendum-estimands-and-sensitivity-analysis-clinical.

van der Laan, Mark J. 2015. “Statistics as a Science, Not an Art: The Way to Survive in Data Science.” Amstat News. https://magazine.amstat.org/blog/2015/02/01/statscience_feb2015/.

Vu, Julie, and David Harrington. 2020. Introductory Statistics for the Life and Biomedical Sciences. OpenIntro. https://openintro.org/book/biostat.

Wainer, Howard. 1984. “How to Display Data Badly.” The American Statistician 38 (2): 137–47.

Weir, Isabelle R, Suzanne M Dufault, and Patrick P J Phillips. 2024. “Estimands for Clinical Endpoints in Tuberculosis Treatment Randomized Controlled Trials: A Retrospective Application in a Completed Trial.” Trials 25 (1). https://doi.org/10.1186/s13063-024-07999-w.