Comparing two population parameters
Two-sample data can be paired or unpaired (independent).
Paired measurements for each “participant” or study unit
each unit can be matched to another unit in the data
e.g., “enriched” vs. “normal” environments for pairs of rats, with pairs taken from the same litter
Two independent sets of measurements
observations cannot be matched on a one-to-one basis
e.g., rats from several distinct litters observed in either “enriched” or “normal” environments
Nature of the data dictates testing procedure: two-sample test for paired data or two-sample test for independent group data.
Example: Menstruation’s effect on energy intake
Does the dietary (energy) intake of women differ pre- versus post-menstruation (across 10 days)? (Table 9.3 of Altman 1990)
A study was conducted to assess the effect of menstruation on energy intake.
Women’s dietary (energy) intake was measured pre- and post-menstruation (10 days each)
Investigators recorded energy intake for each of the women (in kJ)
| pre  | post |
|------|------|
| 5260 | 3910 |
| 5470 | 4220 |
| 5640 | 3885 |
| 6180 | 5160 |
| 6390 | 5645 |
| 6515 | 4680 |

(Only the first six of the $n = 11$ pairs are shown; cf. $df = 10$ in the test output below.)
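The arithmetic behind the paired differences can be checked directly in base R — a sketch using only the six pairs displayed above (the full Altman dataset has 11 women, so these summaries differ from the full-data results):

```r
# six (pre, post) pairs shown above; the full Altman (1990) data have 11 women
intake <- data.frame(
  pre  = c(5260, 5470, 5640, 6180, 6390, 6515),
  post = c(3910, 4220, 3885, 5160, 5645, 4680)
)
d <- intake$pre - intake$post  # paired differences d_i
mean(d)                        # sample mean of the differences
sd(d)                          # sample standard deviation of the differences
```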
The paired $t$-test
The energy intake measurements are paired within a given study unit (each of the women)—use this structure to advantage, i.e., exploit homogeneity.
For each woman $i$, we have measurements $X_i$ and $Y_i$.
Define the measurement difference: $d_i = X_i - Y_i$, where
$X_i$ is the energy intake pre-menstruation for woman $i$
$Y_i$ is the energy intake post-menstruation for woman $i$
Base inference on $\bar{d}$, the sample mean of the $d_i$, that is, $\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i$
The paired $t$-test
Let $\delta$ be the population mean of the difference in energy intake for 10-day intervals pre- and post-menstruation for the population of women from which this random sample (of $n = 11$) was taken.
The null and alternative hypotheses are
$H_0$: $\delta = 0$, no difference in energy intake pre- vs. post-
i.e., no effect of menstruation on subsequent energy intake
$H_A$: $\delta \neq 0$, there is a difference in energy intake pre- vs. post-
i.e., menstruation does have an effect on energy intake
The paired $t$-test and confidence interval
$t$-test and confidence interval for paired data
The test statistic for the paired $t$-test is
$$ t = \frac{\bar{d} - 0}{s_d / \sqrt{n}} $$
where $\bar{d}$ is the mean of the paired differences $d_i$, $s_d$ is the sample standard deviation of the $d_i$, and $n$ is the number of differences (i.e., number of pairs).
A paired $t$-test is just a one-sample test of the differences $d_i$; the $p$-value may be calculated from a $t$ distribution with $df = n - 1$. In the above, recall that $s_d / \sqrt{n}$ is the standard error of the mean of the differences, $\bar{d}$.
Paired t-test
data: intake$pre and intake$post
t = 11.941, df = 10, p-value = 3.059e-07
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
1074.072 1566.838
sample estimates:
mean difference
1320.455
One Sample t-test
data: intake[, pre - post]
t = 11.941, df = 10, p-value = 3.059e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1074.072 1566.838
sample estimates:
mean of x
1320.455
Welch Two Sample t-test
data: intake$pre and intake$post
t = 2.6242, df = 19.92, p-value = 0.01629
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
270.5633 2370.3458
sample estimates:
mean of x mean of y
6753.636 5433.182
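The three outputs above come from `t.test()` called three ways on the `intake` data. A sketch using only the six displayed pairs (the full data have 11 women, so the numbers differ from the outputs above, but the equivalence of the paired test and the one-sample test on differences holds regardless; the unpaired Welch call is included only to show what goes wrong when pairing is ignored):

```r
# six (pre, post) pairs displayed earlier; stands in for the full 11-woman data
intake <- data.frame(
  pre  = c(5260, 5470, 5640, 6180, 6390, 6515),
  post = c(3910, 4220, 3885, 5160, 5645, 4680)
)
t.test(intake$pre, intake$post, paired = TRUE)  # paired t-test
t.test(intake$pre - intake$post)                # identical: one-sample test on d_i
t.test(intake$pre, intake$post)                 # Welch test: ignores pairing (wrong here)
```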
Assumptions for the paired $t$-test
Like the one-sample version, the paired $t$-test requires assumptions:
Differences $d_i$ arise as iid samples from $N(\delta, \sigma_d^2)$, that is, from a normal distribution…
…assumed to have zero mean (under $H_0$ of no difference)
…assumed to have unknown variance (i.e., $\sigma_d^2$)
Additional requirements include independence of pairs and non-interference (that is, one pair cannot affect another)
For example, difference of pre- and post-menstruation energy intake assumed to be normally distributed
FAMuSS: Comparing arm strength by sex
Question: Does change in non-dominant arm strength (ndrm.ch) after resistance training differ between men and women?
Framing the question — the null and alternative hypotheses are
$H_0$: $\mu_F = \mu_M$, population mean change in arm strength for women same as population mean change for men
Equivalently, let $\delta = \mu_F - \mu_M$; then $H_0$: $\delta = 0$
$H_A$: $\mu_F \neq \mu_M$ (i.e., $\delta \neq 0$), population mean change in arm strength for women differs from population mean change for men
Hypotheses may generically be written in terms of $\mu_1$ and $\mu_2$.
The parameter of interest is $\delta = \mu_1 - \mu_2$.
The point estimate is $\hat{\delta} = \bar{x}_1 - \bar{x}_2$.
The two-group (independent) -test
Two-group (independent) $t$-test
The test statistic for the two-group (independent) $t$-test is
$$ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}} $$
where $\sqrt{s_1^2 / n_1 + s_2^2 / n_2}$ is an estimate of the standard error of the difference of the two group means.
The two-sample $t$-test compares between-group differences; the $p$-value may be calculated from the $t$ distribution, but the $df$ differ.
$df$ for the two-group (independent) $t$-test
A conservative approximation is $df = \min(n_1 - 1, n_2 - 1)$
The Welch–Satterthwaite approximation (used by R):
$$ df = \frac{\left( s_1^2 / n_1 + s_2^2 / n_2 \right)^2}{\dfrac{(s_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(s_2^2 / n_2)^2}{n_2 - 1}} $$
An alternative is to use Student’s version, $df = n_1 + n_2 - 2$, under a (strong!) assumption of equal variances of the two groups, $\sigma_1^2 = \sigma_2^2$
Confidence intervals for the two-group (independent) difference-in-means
A $(100 \times (1 - \alpha))$% confidence interval (CI) for the difference in the two population means is of the form
$$ (\bar{x}_1 - \bar{x}_2) \pm t^{\star}_{df} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $$
where $t^{\star}_{df}$ is the critical point (i.e., quantile) on a $t$ distribution (with $df$ the same as for the corresponding $t$-test) and area $\alpha / 2$ in either tail.
Letting R do the work
Welch Two Sample t-test
data: famuss$ndrm.ch[female] and famuss$ndrm.ch[male]
t = 10.073, df = 574.01, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
19.07240 28.31175
sample estimates:
mean of x mean of y
62.92720 39.23512
uses the Welch–Satterthwaite approximation to the $df$
Two Sample t-test
data: famuss$ndrm.ch[female] and famuss$ndrm.ch[male]
t = 9.1425, df = 593, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
18.60259 28.78155
sample estimates:
mean of x mean of y
62.92720 39.23512
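Both outputs above come from `t.test()`; only the `var.equal` argument differs. A sketch with a small synthetic stand-in for the FAMuSS data (the real `famuss` data frame has 595 observations, so the numbers differ):

```r
# synthetic stand-in for famuss (real data: 595 subjects); structure matches
set.seed(1)
famuss <- data.frame(
  ndrm.ch = c(rnorm(20, mean = 63, sd = 30), rnorm(15, mean = 39, sd = 30)),
  sex     = factor(rep(c("Female", "Male"), times = c(20, 15)))
)
t.test(ndrm.ch ~ sex, data = famuss)                    # Welch (default, var.equal = FALSE)
t.test(ndrm.ch ~ sex, data = famuss, var.equal = TRUE)  # Student's, pooled variance
```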
uses Student’s $df = n_1 + n_2 - 2 = 595 - 2 = 593$
Assumptions for the two-group (independent) $t$-test
Observations in group $j$ arise as iid samples from $N(\mu_j, \sigma^2)$, that is, from a normal distribution common between the two populations…
…assumed to have the same population mean (under $H_0$)
…assumed to have the same unknown variance (i.e., $\sigma_1^2 = \sigma_2^2 = \sigma^2$)
Additional requirements include independence of observations and non-interference (one observation cannot affect another)
For example, change in non-dominant arm strength for men and women is assumed to follow a normal distribution with a common variance $\sigma^2$: $N(\mu_M, \sigma^2)$ for males and $N(\mu_F, \sigma^2)$ for females
Permutation-based hypothesis testing
Permutation testing is a nonparametric framework for inference
aim is to limit assumptions about the underlying populations, instead using structure that can be induced by design
method: construct the null distribution of a test statistic via (artificial) randomization, as if assignment was controlled
Permutation/randomization inference evaluates evidence against a different null hypothesis (the sharp null)
Assumptions for the permutational $t$-test
Unlike the paired or two-group $t$-test discussed so far,
…there is no need to assume a specific distributional form for how the data—observations or paired differences—arise
…non-interference issues (units cannot affect each other) are resolved by the construction of the null hypothesis
So…what do we need then?
Groups (or differences) are exchangeable (under $H_0$)—no relationship between assignment and outcome
Randomization (of assignment) is performed fairly
The permutational $t$-test in action
Exact Two-Sample Fisher-Pitman Permutation Test
data: ndrm.ch by sex (Female, Male)
Z = 8.5664, p-value < 2.2e-16
alternative hypothesis: true mu is not equal to 0
Exact: How many ways to choose 242 males from 595 individuals? $\binom{595}{242} \approx 1.294158 \times 10^{173}$!
Approximative Two-Sample Fisher-Pitman Permutation Test
data: ndrm.ch by sex (Female, Male)
Z = 8.5664, p-value < 1e-04
alternative hypothesis: true mu is not equal to 0
Approximate: What about 10000? Is that enough?
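The approximate (Monte Carlo) version above resamples group labels; the same idea can be sketched by hand in base R, without the package that produced the outputs above, using `sample()` to shuffle labels under the null (synthetic data stand in for `ndrm.ch` and `sex`):

```r
# toy stand-ins for famuss$ndrm.ch and famuss$sex; substitute the real columns
set.seed(1)
y   <- c(rnorm(30, mean = 60, sd = 30), rnorm(25, mean = 40, sd = 30))
grp <- rep(c("Female", "Male"), times = c(30, 25))

obs <- mean(y[grp == "Female"]) - mean(y[grp == "Male"])  # observed difference
perm <- replicate(10000, {
  g <- sample(grp)                                        # shuffle labels under H0
  mean(y[g == "Female"]) - mean(y[g == "Male"])
})
p_value <- mean(abs(perm) >= abs(obs))                    # two-sided permutation p-value
```

With 10,000 shuffles, the Monte Carlo error in the p-value is small; the exact test instead enumerates (or integrates over) all possible label assignments.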
Null hypotheses: The weak and the sharp
Data on $n$ units, measured as $Y_i$, under two conditions $A$ and $B$, e.g., patients assigned treatment or placebo
Weak null (averages): $H_0$: $\mathbb{E}[Y(A)] = \mathbb{E}[Y(B)]$
Is there no effect on average?
Sharp (strong) null (individuals): $H_0$: $Y_i(A) = Y_i(B)$ for all $i$
Is there no effect for everyone?
The sharp null implies the weak null—no effect at all means no effect on average too.
FAMuSS: Comparing arm strength by genotype
Going beyond two-group comparisons
Is change in non-dominant arm strength after resistance training associated with genotype?
$H_0$: mean change in arm strength is equal across the three genotypes
$H_A$: at least one genotype has mean change in arm strength differing from the others
Analysis of Variance (ANOVA)
Suppose we are interested in comparing means across more than two groups. Why not conduct several two-sample $t$-tests?
If there are $k$ groups, then $\binom{k}{2} = k(k-1)/2$ $t$-tests are needed.
Conducting multiple tests on the same data increases the overall rate of Type I error, necessitating a multiplicity correction.
ANOVA: Assesses equality of means across many groups:
$H_0$: means equal across all groups ($\mu_1 = \mu_2 = \cdots = \mu_k$)
$H_A$: at least one mean differs from the others (i.e., not all equal)
ANOVA: Are the groups all the same?
Under $H_0$, there is no real difference between the groups—so any observed variation in group means must be due to chance alone.
Think of all observations as belonging to a single, large group.
Variability between group means $\approx$ Variability within groups
ANOVA exploits differences in means from an overall mean and within-group variation to evaluate equality of a few group means.
ANOVA: Are the groups all the same?
Is the variability in the sample means large enough that it seems unlikely to be due to chance alone?
Compare two quantities:
Variability between groups (1): how different are the group means from each other, i.e., relative to the overall mean?
Variability within groups (2): how variable are the data within each group?
ANOVA: The $F$-statistic
The $F$-test for ANOVA
ANOVA measures discrepancies via the $F$-statistic,
$$ F = \frac{\text{variability between group means}}{\text{variability within groups}} $$
which follows an $F$ distribution when the population means are equal.
$F \geq 0$; larger values give stronger evidence against $H_0$.
The $F$-statistic follows an $F$ distribution with two degrees of freedom: $df_1 = k - 1$ and $df_2 = N - k$, for $k$ groups and $N$ total observations.
$p$-value: Probability that an $F_{df_1, df_2}$ random variable is larger than the observed $F$-statistic under $H_0$.
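For instance, the p-value for the FAMuSS ANOVA shown later in these slides (F = 3.231 with $df_1 = 2$, $df_2 = 592$) can be recovered directly from the F distribution:

```r
# upper-tail area of the F(2, 592) distribution at the observed statistic
pf(3.231, df1 = 2, df2 = 592, lower.tail = FALSE)  # ≈ 0.0402, matching the aov() output
```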
Assumptions for ANOVA
Just like the $t$-test, ANOVA requires assumptions about the data:
Observations are independent within and across groups
Data within each group arise from a normal distribution
Variability across the groups is about equal
When these assumptions are violated, it is hard to know whether a rejection of $H_0$ is due to real evidence or to failure of the assumptions…
As with the $t$-test, a permutational $F$-test can be formulated.
Pairwise comparisons
If the $F$-test indicates sufficient evidence of inequality of the group means, pairwise comparisons to identify the differing group(s) may follow.
Pairwise comparisons may use the two-group (independent) $t$-test:
To maintain an overall Type I error rate of $\alpha$, each comparison should be conducted at an adjusted significance level, $\alpha^{\star}$.
The Bonferroni correction is one method for adjusting $\alpha$, using $\alpha^{\star} = \alpha / K$, where $K = \binom{k}{2}$ for $k$ groups.
Note: The Bonferroni correction is conservative (i.e., stringent), and assumes all tests are independent.
Letting R do the work
# use summary(aov())
summary(aov(famuss$ndrm.ch ~ famuss$actn3.r577x))
Df Sum Sq Mean Sq F value Pr(>F)
famuss$actn3.r577x 2 7043 3522 3.231 0.0402 *
Residuals 592 645293 1090
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: $p = 0.0402 < \alpha = 0.05$, sufficient evidence to reject $H_0$ in favor of $H_A$—so at least one group differs in mean from the others.
But which group(s) is (are) the source of the difference? 🧐
Controlling Type I error rate
If the $F$-test indicates sufficient evidence of inequality of the group means, pairwise comparisons ($t$-tests) may identify those groups.
Each test should be conducted at the significance level $\alpha^{\star} = \alpha / K$ so that the overall Type I error rate remains at $\alpha$.
These $t$-tests are conducted under the assumption that the within-group variances are equal, using the pooled SD estimate.
We will use pairwise.t.test() to perform these post hoc two-sample t-tests.
Controlling Type I error rate
Pairwise comparisons using two-sample $t$-tests (CC to CT, CC to TT, CT to TT) can now be done if the Type I error rate is controlled.
Apply the Bonferroni correction.
In this setting, $K = \binom{3}{2} = 3$, so $\alpha^{\star} = 0.05 / 3 \approx 0.0167$.
Examine each of the three two-sample $t$-tests, evaluating evidence for group mean differences at the more stringent level of $\alpha^{\star} \approx 0.0167$.
Letting R do the work
Only CC versus TT resulted in a $p$-value less than $\alpha^{\star} \approx 0.0167$.
Mean strength change in the non-dominant arm for individuals with genotype CT is not distinguishable from the strength change for those with CC or TT.
However, there is evidence at level $\alpha^{\star}$ that the mean strength changes for individuals of genotypes CC and TT differ.
Pairwise comparisons using t tests with pooled SD
data: famuss$ndrm.ch and famuss$actn3.r577x
CC CT
CT 0.179 -
TT 0.011 0.144
P value adjustment method: none
Pairwise comparisons using t tests with pooled SD
data: famuss$ndrm.ch and famuss$actn3.r577x
CC CT
CT 0.537 -
TT 0.034 0.433
P value adjustment method: bonferroni
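Both p-value matrices above come from `pairwise.t.test()`; only the `p.adjust.method` argument differs (the Bonferroni column is simply the unadjusted p-value multiplied by $K = 3$, capped at 1). A sketch with synthetic stand-ins for the FAMuSS columns:

```r
# synthetic stand-ins for famuss$ndrm.ch and famuss$actn3.r577x (levels CC, CT, TT)
set.seed(1)
ndrm.ch <- rnorm(90, mean = 50, sd = 30)
actn3   <- factor(rep(c("CC", "CT", "TT"), each = 30))

pairwise.t.test(ndrm.ch, actn3, p.adjust.method = "none")        # unadjusted p-values
pairwise.t.test(ndrm.ch, actn3, p.adjust.method = "bonferroni")  # Bonferroni-adjusted
```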
Type I error rate for a single test
Hypothesis testing was intended for controlled experiments or for studies with only a few comparisons (e.g., ANOVA).
Type I errors (rejecting $H_0$ when it is true) occur with probability $\alpha$.
The Type I error rate is controlled by rejecting $H_0$ only when the $p$-value of a test is smaller than $\alpha$ (where $\alpha$ is typically kept low).
With a single two-group comparison at $\alpha = 0.05$, there is a 5% chance of incorrectly identifying an association where none exists.
And what about many tests?
Multiple testing–compounding error
What happens to Type I error when making several comparisons? When conducting many tests, many chances to make a mistake.
The significance level ($\alpha$) used in each test controls the error rate for that test only.
Experiment-wise error rate: the chance of at least one test incorrectly rejecting when all of the null hypotheses are true (the “global null”).
Controlling the experiment-wise error rate is just one approach for controlling the Type I error.
Probability of experiment-wise error
A scientist is using two $t$-tests to examine a possible association of each of two genes with a disease type. Assume the tests are independent and each is conducted at level $\alpha = 0.05$.
$A$: event of making a Type I error on the first test, $P(A) = 0.05$
$B$: event of making a Type I error on the second test, $P(B) = 0.05$
Goal: control the overall probability of any Type I error, $P(A \cup B)$.
Example of compounding of Type I error
The probability of making at least one error is the complement of the event that a Type I error is made with neither test:
$$ P(A \cup B) = 1 - P(A^c \cap B^c) = 1 - (1 - 0.05)^2 = 0.0975 $$
Probability of experiment-wise error…
10 tests… $1 - (1 - 0.05)^{10} \approx 0.401$
25 tests… $1 - (1 - 0.05)^{25} \approx 0.723$
100 tests… $1 - (1 - 0.05)^{100} \approx 0.994$
With 100 independent tests: 99.4% chance an investigator will make at least one Type I error!
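The compounding calculation is one line of R — the experiment-wise error rate for $k$ independent tests at level $\alpha$ is $1 - (1 - \alpha)^k$:

```r
alpha <- 0.05
k <- c(1, 2, 10, 25, 100)      # number of independent tests
fwer <- 1 - (1 - alpha)^k      # P(at least one Type I error across the k tests)
fwer                           # rises from 0.05 toward ~0.994 at k = 100
```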
Advanced melanoma
Advanced melanoma is an aggressive form of skin cancer that until recently was almost uniformly fatal.
Research is being conducted on therapies that might be able to trigger immune responses to the cancer that then cause the melanoma to stop progressing or disappear entirely.
In a study where 52 patients were treated concurrently with 2 new therapies, nivolumab and ipilimumab, 21 had immune responses (Wolchok et al. 2013).
Advanced melanoma
Some research questions that can be addressed with inference…
What is the population probability of immune response following concurrent therapy with nivolumab and ipilimumab?
What is a 95% confidence interval for the population probability of immune response following concurrent therapy with nivolumab and ipilimumab?
In prior studies, the proportion of patients responding to one of these agents was 30% or less. Do these data suggest a better (>30%) probability of response to the concurrent therapy?
Inference for binomial proportions
In this study of melanoma, experiencing an immune response to the concurrent therapy is a binary event (i.e., binomial data).
Suppose $X$ is a binomial RV with parameters $n$, the number of trials, and $p$, the probability of success.
Goal: Inference about the population parameter $p$, the probability of success in the population.
$\hat{p} = x / n$ is the point estimate of $p$ from the observed sample, where $x$ is the observed number of successes.
Assumptions for using the normal distribution
The sampling distribution of $\hat{p}$ is approximately normal when
The sample observations are independent, and
At least 10 successes and 10 failures are expected in the sample: $np \geq 10$ and $n(1 - p) \geq 10$.
Under these conditions, $\hat{p}$ is approximately normally distributed with mean $p$ and standard error $\sqrt{p(1 - p)/n}$, estimated as $\sqrt{\hat{p}(1 - \hat{p})/n}$.
Inference with the normal approximation
For CIs, use $\hat{p}$ in place of $p$ when estimating the standard error.
An approximate two-sided $(100 \times (1 - \alpha))$% confidence interval for $p$ is given by
$$ \hat{p} \pm z^{\star}_{1 - \alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $$
prop.test(x = 21, n = 52, conf.level = 0.95)$conf.int
prop.test(x = 21, n = 52, p = 0.30, alternative = "greater")
1-sample proportions test with continuity correction
data: 21 out of 52, null probability 0.3
X-squared = 2.1987, df = 1, p-value = 0.06906
alternative hypothesis: true p is greater than 0.3
95 percent confidence interval:
0.2906582 1.0000000
sample estimates:
p
0.4038462
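The large-sample (Wald) interval can also be computed by hand and compared with `prop.test()`, which uses the Wilson score interval with a continuity correction — hence the slightly different limits above:

```r
x <- 21; n <- 52
p_hat <- x / n                               # point estimate, ≈ 0.404
se    <- sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error
ci    <- p_hat + c(-1, 1) * qnorm(0.975) * se
ci                                           # approximate 95% Wald CI, ≈ (0.270, 0.537)
```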
Exact inference for binomial data
An exact procedure does not rely upon a limiting approximation.
The Clopper–Pearson CI is an exact method for binomial CIs: $(p_L, p_U)$, where $p_L$ is the $\alpha/2$ quantile of a $\text{Beta}(x, n - x + 1)$ distribution and $p_U$ is the $1 - \alpha/2$ quantile of a $\text{Beta}(x + 1, n - x)$ distribution.
Note: While exactness is by construction, the CIs can be conservative (i.e., too large) relative to those from an approximation.
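The Beta-quantile form is easy to compute directly, and it reproduces the confidence interval reported by `binom.test()`:

```r
x <- 21; n <- 52; alpha <- 0.05
lower <- qbeta(alpha / 2, x, n - x + 1)      # exact lower limit
upper <- qbeta(1 - alpha / 2, x + 1, n - x)  # exact upper limit
c(lower, upper)                              # two-sided 95% Clopper-Pearson CI
```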
# use pbinom
pbinom(20, 52, prob = 0.30, lower.tail = FALSE)
[1] 0.07167176
# use binom.test
binom.test(x = 21, n = 52, p = 0.30, alternative = "greater")
Exact binomial test
data: 21 and 52
number of successes = 21, number of trials = 52, p-value = 0.07167
alternative hypothesis: true probability of success is greater than 0.3
95 percent confidence interval:
0.2889045 1.0000000
sample estimates:
probability of success
0.4038462
Inference for difference of two proportions
The normal approximation can be applied to the difference in sample proportions, $\hat{p}_1 - \hat{p}_2$, if
The samples are independent, the observations in each sample are independent, and
At least 10 successes and 10 failures are expected in each sample.
The standard error of the difference in sample proportions is
$$ SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} $$
Treating HIV infants
In resource-limited settings, single-dose nevirapine is given to HIV-positive women during birth to prevent mother-to-child transmission of HIV.
Exposure of the infant to nevirapine (NVP) may foster growth of resistant strains of the virus in the child.
If the child is HIV-positive, should he/she be treated with nevirapine or a more expensive drug, lopinavir (LPV)?
Here, the possible outcomes are virologic failure (virus becomes resistant) versus stable disease (virus growth is prevented).
Treating HIV infants
The results of a study comparing NVP vs. LPV in treatment of HIV-infected infants (Violari et al. 2012). Children were randomized to receive NVP or LPV.

|                   | NVP | LPV | Total |
|-------------------|-----|-----|-------|
| Virologic Failure | 60  | 27  | 87    |
| Stable Disease    | 87  | 113 | 200   |
| Total             | 147 | 140 | 287   |

Is there evidence of a difference in NVP vs. LPV? How to see this?
Formulating hypotheses in a two-way table
Do data support the claim of a differential outcome by treatment?
If there is no difference in outcome by treatment, then knowing treatment provides no information about outcome, i.e., treatment assignment and outcome are independent (i.e., not associated).
$H_0$: Treatment and outcome are not associated.
$H_A$: Treatment and outcome are associated.
Question: What would we expect if no association (i.e., under $H_0$)?
The $\chi^2$ test of independence
Idea: How do the observed cell counts differ from those expected under $H_0$ (i.e., as if the null hypothesis were true)? Let’s use this.
The Pearson $\chi^2$ test of independence formulates a test statistic to quantify the magnitude of deviation of observed results from what would be expected under $H_0$.
Large test statistic: Stronger evidence against the null hypothesis of independence.
Small test statistic: Weaker evidence (or lack thereof) against the null hypothesis of independence.
Assumptions for the $\chi^2$ test
Independence: Each case contributing a count to the table must be independent of all other cases in the table.
Sample size: Each expected cell count must be greater than or equal to 10. (Without enough data, what are you comparing?)
Under these assumptions, the Pearson $\chi^2$ test statistic attains a $\chi^2$ distribution with $df$ tied to the number of comparisons.
The $\chi^2$ test statistic
The test statistic is calculated as
The $\chi^2$ test
$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
and is approximately distributed as $\chi^2$ with $df = (r - 1) \times (c - 1)$, where $r$ is the number of rows and $c$ is the number of columns (e.g., in a $2 \times 2$ table, $df = 1$).
$O_{ij}$ represents the observed count in row $i$, column $j$; $E_{ij}$ is its expected counterpart.
Applying the test: Treating HIV infants
If treatment has no effect on outcome, what would we expect?
Under the null hypothesis of independence, $P(A \cap B) = P(A) \times P(B)$, where $A$ = {assignment to NVP} and $B$ = {virologic failure}. So, the expected cell count in the upper left corner of the table from Violari et al. (2012) would be $E_{11} = \frac{(a + b)(a + c)}{n} = \frac{87 \times 147}{287} \approx 44.56$. Thus, the scaled deviation, $(O_{11} - E_{11})^2 / E_{11} = (60 - 44.56)^2 / 44.56$, would be $\approx 5.35$.
Same logic applies for the other cells…repeat three times and sum to get the $\chi^2$ test statistic.
By comparing cells of these two tables (of $O_{ij}$ and $E_{ij}$), we get the $\chi^2$ test statistic.
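The expected counts and the statistic itself take only a few lines of R; `correct = FALSE` disables the Yates continuity correction so the hand computation matches `chisq.test()` exactly (the corrected statistic, 14.733, appears in the output below):

```r
# observed counts: rows = failure/stable, cols = NVP/LPV (Violari et al. 2012)
O <- matrix(c(60, 87, 27, 113), nrow = 2,
            dimnames = list(c("Failure", "Stable"), c("NVP", "LPV")))
E <- outer(rowSums(O), colSums(O)) / sum(O)  # expected counts under independence
stat <- sum((O - E)^2 / E)                   # Pearson chi-squared statistic, ≈ 15.74
stat
```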
R’s chisq.test() does all of this:
chisq.test(hiv_table)
Pearson's Chi-squared test with Yates' continuity correction
data: hiv_table
X-squared = 14.733, df = 1, p-value = 0.0001238
Conclusion: The test finds evidence against the claimed independence of NVP and virologic failure at most reasonable significance standards ($p$-value $\approx 1.2 \times 10^{-4} < 0.001$).
Fisher’s exact test
R.A. Fisher proposed an exact test for contingency tables by exploiting the counts and margins, with the “reasoned basis” for the test being randomization and exchangeability (Fisher 1936).
|                   | NVP     | LPV     | Total   |
|-------------------|---------|---------|---------|
| Virologic Failure | $a$     | $b$     | $a + b$ |
| Stable Disease    | $c$     | $d$     | $c + d$ |
| Total             | $a + c$ | $b + d$ | $n$     |
Here, $P(X = a)$, the probability of seeing as many cases of virologic failure in the NVP group as we do see, follows exactly a hypergeometric distribution:
$$ P(X = a) = \frac{\binom{a + b}{a} \binom{c + d}{c}}{\binom{n}{a + c}} $$
R’s fisher.test() does this:
fisher.test(hiv_table)
Fisher's Exact Test for Count Data
data: hiv_table
p-value = 0.0001037
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.643280 5.127807
sample estimates:
odds ratio
2.875513
Conclusion: Evidence for independence of NVP and virologic failure is lacking ($p$-value $\approx 0.0001$).
Lowering stakes: The lady tasting tea
“Dr. Muriel Bristol, a colleague of Fisher’s, claimed that when drinking tea she could distinguish whether milk or tea was added to the cup first (she preferred milk first). To test her claim, Fisher asked her to taste eight cups of tea, four of which had milk added first and four of which had tea added first.” Agresti (2012)
If the lady could not discriminate tea-milk ordering
|                     | Tea first (real) | Milk first (real) | Total |
|---------------------|------------------|-------------------|-------|
| Tea first (thinks)  | 2                | 2                 | 4     |
| Milk first (thinks) | 2                | 2                 | 4     |
| Total               | 4                | 4                 | 8     |
$H_0$: Order doesn’t impact taste, so Dr. Bristol can’t tell
$H_A$: Order does impact taste and Dr. Bristol can tell
What’s the probability of getting all four right?
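Under $H_0$, all $\binom{8}{4}$ ways of picking the four "milk first" cups are equally likely, so the chance of guessing all four correctly is:

```r
choose(8, 4)      # 70 equally likely ways to pick four cups out of eight
1 / choose(8, 4)  # probability all four guesses are correct under H0: 1/70 ≈ 0.0143
```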
If the lady could perfectly discriminate the ordering
|                     | Tea first (real) | Milk first (real) | Total |
|---------------------|------------------|-------------------|-------|
| Tea first (thinks)  | 4                | 0                 | 4     |
| Milk first (thinks) | 0                | 4                 | 4     |
| Total               | 4                | 4                 | 8     |
Fisher's Exact Test for Count Data
data: ladys_tea
p-value = 0.02857
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.339059 Inf
sample estimates:
odds ratio
Inf
The relative risk in a $2 \times 2$ table
Relative risk (RR) measures the risk of an event occurring in one group relative to the risk of the same event occurring in another group.
The risk of virologic failure among the NVP group is $60 / 147 \approx 0.41$
The risk of virologic failure among the LPV group is $27 / 140 \approx 0.19$
The RR of virologic failure for NVP vs. LPV is $\frac{60/147}{27/140} \approx 2.12$, so children treated with NVP are estimated to be more than twice as likely to experience virologic failure.
The odds ratio in a $2 \times 2$ table
Odds ratio (OR) measures the odds of an event occurring in one group relative to the odds of the event occurring in another group.
The odds of virologic failure among the NVP group are $60 / 87 \approx 0.69$
The odds of virologic failure among the LPV group are $27 / 113 \approx 0.24$
The OR of virologic failure for NVP vs. LPV is $\frac{60/87}{27/113} \approx 2.89$, so the odds of virologic failure when given NVP are nearly three times as large as the odds when given LPV.
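Both measures follow directly from the counts in the 2×2 table:

```r
# counts from the Violari et al. (2012) table: failures / stable by treatment arm
risk_nvp <- 60 / 147        # risk of virologic failure, NVP arm
risk_lpv <- 27 / 140        # risk of virologic failure, LPV arm
rr <- risk_nvp / risk_lpv   # relative risk, ≈ 2.12

odds_nvp <- 60 / 87         # odds of failure, NVP arm
odds_lpv <- 27 / 113        # odds of failure, LPV arm
or <- odds_nvp / odds_lpv   # sample odds ratio, ≈ 2.89
```

Note that `fisher.test()` reports a conditional maximum-likelihood estimate of the OR (2.876 in the earlier output), which differs slightly from this simple sample odds ratio.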
Relative risk versus odds ratio
The relative risk cannot be used in studies that use outcome-dependent sampling, such as a case-control study:
Suppose in the HIV study, researchers had identified 100 HIV-positive infants who had experienced virologic failure (cases) and 100 who had stable disease (controls), then recorded the number in each group who had been treated with NVP or LPV.
With this design, the sample proportion of infants with virologic failure no longer estimates the population proportion (it is biased by design).
Similarly, the sample proportion of infants with virologic failure in a treatment group no longer estimates the proportion of infants who would experience virologic failure in a hypothetical population treated with that drug.
The odds ratio remains valid even when it is not possible to estimate incidence of an outcome from sample data.
References
Agresti, Alan. 2012. Categorical Data Analysis. Vol. 792. John Wiley & Sons.
Altman, Douglas G. 1990. Practical Statistics for Medical Research. CRC Press.
Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57 (1): 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.
Fisher, Ronald Aylmer. 1936. “Design of Experiments.” British Medical Journal 1 (3923): 554.
Violari, Avy, Jane C Lindsey, Michael D Hughes, Hilda A Mujuru, Linda Barlow-Mosha, Portia Kamthunzi, Benjamin H Chi, et al. 2012. “Nevirapine Versus Ritonavir-Boosted Lopinavir for HIV-Infected Children.” New England Journal of Medicine 366 (25): 2380–89. https://doi.org/10.1056/NEJMoa1113249.
Vu, Julie, and David Harrington. 2020. Introductory Statistics for the Life and Biomedical Sciences. OpenIntro. https://openintro.org/book/biostat.
Wolchok, Jedd D, Harriet Kluger, Margaret K Callahan, Michael A Postow, Naiyer A Rizvi, Alexander M Lesokhin, Neil H Segal, et al. 2013. “Nivolumab Plus Ipilimumab in Advanced Melanoma.” New England Journal of Medicine 369 (2): 122–33. https://doi.org/10.1056/NEJMoa1302369.
Elements of Statistical Inference, Part II Nima Hejazi nhejazi@hsph.harvard.edu Harvard Biostatistics June 19, 2025