Introduction to the analysis of time-to-event data
June 21, 2025
Survival analysis is studied distinctly from other topics since survival data have some important distinguishing features.
Nonetheless, our tour of survival analysis will (re)visit many of the topics we have considered so far.
Survival analysis is the area of (bio)statistics that deals with time-to-event data.
“Survival” analysis isn’t restricted to analyzing time until death: it applies whenever the time until study subjects experience a relevant endpoint is of interest.
The terms “failure” and “event” are used interchangeably for the endpoint of interest.
Example: Researchers record time from diagnosis until death for a sample of patients diagnosed with laryngeal cancer.
What if the mean or median survival time were reported?
What if survival rate at
Denote the measured time by the random variable $T$.
Time is recorded relative to when a person enters the study, not according to the calendar.
The probability of an individual in our population surviving past a given time
Here, we are interested in inference about a function, not just a single population value (e.g.,
Can we estimate
Analyses of survival times usually include censored/missing data; valid inference with missing data is a topic of ongoing research in statistics – generally, assumptions are required.
Time-to-event data exhibit a particular form of missingness: units are lost to follow-up (or the study ends) prior to their experiencing the event of interest; this is called right-censoring.
Censored data provide partial information: we don’t know how long a patient lived, but we know they lived at least as long as the time elapsed prior to being lost to follow-up.
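A minimal sketch of how right-censored observations are commonly stored: each subject contributes a follow-up time plus a flag saying whether the event was observed. The class name, field names, and values below are hypothetical, chosen only for illustration.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    time: float   # follow-up time measured from study entry
    event: bool   # True = event observed, False = right-censored


def what_we_know(obs: Observation) -> str:
    """Describe the information an observation carries about the true time T."""
    if obs.event:
        return f"T = {obs.time}"    # event seen: the time is known exactly
    return f"T >= {obs.time}"       # censored: only a lower bound is known


# Hypothetical records (illustrative values, not from any study)
records = [
    Observation(5.0, True),
    Observation(3.2, False),
    Observation(8.1, True),
]

for obs in records:
    print(what_we_know(obs))
```

The censored record still contributes: it says the subject survived at least 3.2 time units, which is the partial information the estimators below make use of.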
Why would a person be lost to follow-up? They could have…
Critical assumption: Censoring is non-informative—that is, assume that being lost to follow-up is unrelated to prognosis. (Is this plausible?)
If this assumption can’t be made, inference becomes more complicated and requires strong(er) assumptions.
Counterexample: Researchers administer a new chemotherapy drug to 10 cancer patients to assess survival while on the drug.
If non-informative censoring were assumed, the drug would probably appear falsely impressive.
The Kaplan-Meier estimator (Kaplan and Meier 1958) gives an estimate of the survival function.
For specific times
Probability of surviving to time
What if we applied this recursive trick repeatedly?
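Written out under assumed notation (ordered times $t_1 < t_2 < \dots < t_k$ and survival time $T$; these symbols are an assumption, not fixed by the text above), repeating the conditioning step would give:

$$
P(T > t_k) = P(T > t_1) \prod_{j=2}^{k} P(T > t_j \mid T > t_{j-1}).
$$

Each factor is a one-step conditional survival probability, so the survival probability at any time is a product of such steps.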
Survival function
Without censoring:
Note: a patient alive but censored at
Under censoring, we estimate
Denominator counts those at risk for the event at time
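As a rough sketch of the two counting ideas above: without censoring, survival past a time is just the fraction of subjects whose event time exceeds it; with censoring, the denominator at each time is the number still at risk. The function names and data are hypothetical.

```python
def empirical_survival(event_times, t):
    """Fraction of subjects whose event time exceeds t (valid only without censoring)."""
    return sum(1 for x in event_times if x > t) / len(event_times)


def number_at_risk(followup_times, t):
    """Subjects still under observation just before t.

    A subject counts as at risk at t if their follow-up (ending in an event
    or in censoring) lasts at least until t.
    """
    return sum(1 for x in followup_times if x >= t)


# Hypothetical follow-up times for five subjects
times = [2, 3, 3, 5, 8]

print(empirical_survival(times, 3))  # two of five subjects have times beyond 3
print(number_at_risk(times, 3))      # four subjects are still followed at time 3
```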
With independent censoring, we can invoke the logic of conditional probability:
For convenience, define the following quantities
Now, notice that
From this, at observed times
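For reference, the standard product-limit form of the estimator can be written as follows. The notation is an assumption here: $t_j$ are the distinct observed event times, $d_j$ is the number of events at $t_j$, and $n_j$ is the number at risk just before $t_j$.

$$
\hat{S}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j}\right)
$$

Each factor estimates the conditional probability of surviving past $t_j$ given survival up to $t_j$.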
Consider
year | fail | censored |
---|---|---|
2 | 7 | 2 |
4 | 16 | 5 |
6 | 19 | 8 |
8 | 14 | 7 |
10 | 11 | 4 |
12 | 5 | 2 |
Computing estimated survival probabilities with the KM estimator just applies the formula:
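A minimal sketch of the calculation for the table above. Two assumptions are made that the table does not state: the starting cohort is taken to be the total of all failures and censorings (100), and subjects censored in a given year are counted as at risk for that year's failures.

```python
# Rows copied from the table above: (year, failures, censored)
table = [(2, 7, 2), (4, 16, 5), (6, 19, 8), (8, 14, 7), (10, 11, 4), (12, 5, 2)]


def kaplan_meier(rows, n_start):
    """Return a list of (year, number at risk, estimated survival)."""
    at_risk = n_start
    surv = 1.0
    out = []
    for year, fail, cens in rows:
        surv *= 1 - fail / at_risk    # conditional survival at this time
        out.append((year, at_risk, surv))
        at_risk -= fail + cens        # failures and censorings leave the risk set
    return out


# Assumption: everyone in the cohort appears in the table exactly once
n_start = sum(fail + cens for _, fail, cens in table)

estimates = kaplan_meier(table, n_start)
for year, at_risk, surv in estimates:
    print(f"year {year:>2}: at risk = {at_risk:>3}, S_hat = {surv:.4f}")
```

Under these assumptions the risk sets are 100, 91, 70, 43, 22, and 7, and the first estimate is 1 - 7/100 = 0.93.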
The Kaplan-Meier Estimator
Estimating
Using a log-transformation for this improves upon the normal approximation. A step-by-step sketch:
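One way to carry out these steps, assuming Greenwood's formula for the variance and a plain log transformation (the complementary log-log transformation is another common choice and may be what is intended). The rows reuse the failure and censoring counts tabulated earlier, and a starting cohort of 100 is an assumption.

```python
import math

# (time, failures, censored), as in the earlier table
rows = [(2, 7, 2), (4, 16, 5), (6, 19, 8), (8, 14, 7), (10, 11, 4), (12, 5, 2)]


def km_with_log_ci(rows, n_start, z=1.96):
    """Return (time, S_hat, lower, upper) with a log-scale confidence interval."""
    at_risk = n_start
    surv = 1.0
    var_log = 0.0   # running Greenwood sum, estimating Var(log S_hat)
    out = []
    for time, fail, cens in rows:
        if at_risk <= 0:
            break
        # Step 1: update the KM estimate
        surv *= 1 - fail / at_risk
        if fail < at_risk:
            # Step 2: add this time's term d / (n * (n - d)) to the variance
            var_log += fail / (at_risk * (at_risk - fail))
            se_log = math.sqrt(var_log)
            # Step 3: build the interval on the log scale, then exponentiate
            lower = surv * math.exp(-z * se_log)
            upper = min(1.0, surv * math.exp(z * se_log))
        else:
            # Everyone at risk failed: S_hat is 0 and log(S_hat) is undefined
            lower = upper = 0.0
        out.append((time, surv, lower, upper))
        at_risk -= fail + cens
    return out


for time, surv, lower, upper in km_with_log_ci(rows, n_start=100):
    print(f"t = {time:>2}: S_hat = {surv:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Building the interval on the log scale keeps the lower limit above 0; the upper limit is capped at 1 here as a simple safeguard.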
In addition to estimating survival, we may want to compare two groups’ survival functions formally (i.e., via a hypothesis test).
Rather than testing for differences in a summary measure (e.g., mean in groups 1 vs. 2), we compare the entire survival curves.
To precisely define what we’re testing, we need to define a new function called the hazard function,
The hazard function
Equivalently, this is the instantaneous conditional event rate
In discrete time, this is the probability of an event occurring at time
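A small sketch of the discrete-time view: the hazard at each time is estimated as events divided by the number at risk, and survival is the running product of one minus the hazard. The counts and function names are hypothetical.

```python
def discrete_hazards(events, at_risk):
    """Estimated hazard at each time: events among those still at risk."""
    return [d / n for d, n in zip(events, at_risk)]


def survival_from_hazards(hazards):
    """Survival past each time as the running product of (1 - hazard)."""
    surv = 1.0
    out = []
    for h in hazards:
        surv *= 1 - h
        out.append(surv)
    return out


# Hypothetical counts at three successive times
events = [2, 1, 1]
at_risk = [10, 8, 5]

hazards = discrete_hazards(events, at_risk)
survival = survival_from_hazards(hazards)
print(hazards)
print(survival)
```

Because the hazard conditions on being at risk, censored subjects simply drop out of later denominators without any further adjustment.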
Special relationships of the survival function
See Freedman (2008) for an overview of these critical relationships.
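For a continuous survival time, the standard relationships can be summarized as follows. The notation is assumed: $T$ is the survival time with density $f(t)$, $S(t) = P(T > t)$, $h(t)$ is the hazard, and $H(t)$ is the cumulative hazard.

$$
h(t) = \lim_{\Delta \to 0} \frac{P(t \le T < t + \Delta \mid T \ge t)}{\Delta} = \frac{f(t)}{S(t)} = -\frac{d}{dt} \log S(t)
$$

$$
H(t) = \int_0^t h(u)\, du, \qquad S(t) = \exp\{-H(t)\}
$$

Knowing any one of $h$, $H$, $S$, or $f$ therefore determines the others.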
Hazard functions are natural constructs for handling censoring.
Formulate null and alternative hypotheses to test whether two groups have different survival functions based on the hazard:
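The log-rank test subdivides time into intervals delimited by the distinct observed event times and, at each one, compares the number of events observed in each group with the number expected under the null hypothesis.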
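A minimal sketch of the two-sample log-rank calculation, assuming the usual hypergeometric variance at each event time and a chi-square reference distribution with 1 degree of freedom. The function name and data are hypothetical.

```python
import math


def logrank_test(group1, group2):
    """Two-sample log-rank test.

    Each group is a list of (time, event) pairs, with event True when the
    event was observed and False when the subject was censored.
    Returns (chi-square statistic, p-value).
    """
    event_times = sorted({t for t, e in group1 + group2 if e})
    observed1 = expected1 = variance = 0.0
    for t in event_times:
        n1 = sum(1 for x, _ in group1 if x >= t)        # at risk in group 1
        n2 = sum(1 for x, _ in group2 if x >= t)        # at risk in group 2
        d1 = sum(1 for x, e in group1 if e and x == t)  # events in group 1
        d2 = sum(1 for x, e in group2 if e and x == t)  # events in group 2
        n, d = n1 + n2, d1 + d2
        observed1 += d1
        expected1 += d * n1 / n   # expected group 1 events if hazards are equal
        if n > 1:
            variance += n1 * n2 * d * (n - d) / (n * n * (n - 1))
    if variance == 0:
        return 0.0, 1.0
    stat = (observed1 - expected1) ** 2 / variance
    p_value = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square, 1 df
    return stat, p_value


# Hypothetical data: group A fails early, group B fails late
group_a = [(1, True), (2, True), (3, True)]
group_b = [(4, True), (5, True), (6, True)]

stat, p_value = logrank_test(group_a, group_b)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")
```

Censored subjects enter only through the risk sets `n1` and `n2`, which is why the test handles censoring without extra adjustments.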
For continuous outcomes (
For binary outcomes (
What about for survival data? Any ideas?
Survival curves don’t necessarily follow a given distribution…
Cox’s proportional hazards model is a special semiparametric regression heavily used in survival analysis (Cox 1972).
Assumption: hazard
In this regression model,
Given hazard rate
Proportional hazards: Unit change in covariate
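A small numerical illustration of the proportionality property, assuming the usual form of the model with a single covariate, $h(t \mid x) = h_0(t)\exp(\beta x)$. The baseline hazards, the coefficient value, and all function names are hypothetical.

```python
import math


def cox_hazard(baseline_hazard, beta, x, t):
    """Hazard at time t for covariate value x under the assumed Cox form."""
    return baseline_hazard(t) * math.exp(beta * x)


def hazard_ratio(baseline_hazard, beta, x, t):
    """Ratio of hazards for covariate values x + 1 versus x at time t."""
    return cox_hazard(baseline_hazard, beta, x + 1, t) / cox_hazard(baseline_hazard, beta, x, t)


def constant_baseline(t):
    return 0.1         # hypothetical flat baseline hazard


def increasing_baseline(t):
    return 0.02 * t    # hypothetical baseline hazard that grows with time


beta = 0.7  # hypothetical log hazard ratio

for h0 in (constant_baseline, increasing_baseline):
    for t in (1.0, 5.0, 10.0):
        print(h0.__name__, t, round(hazard_ratio(h0, beta, 0, t), 4))
```

Every printed ratio equals `exp(beta)`: the baseline hazard cancels, so a one-unit change in the covariate multiplies the hazard by the same factor at all times, whatever shape the baseline takes.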
We’ve said nothing about the baseline hazard rate
HST 190: Introduction to Biostatistics