Introduction Sir David Cox observed that if the proportional hazards assumption holds (or, is assumed to hold) then it is possible to estimate the effect parameter(s), denoted \beta_i below, without any consideration of the full hazard function. This approach to survival data is called application of the
Cox proportional hazards model, sometimes abbreviated to
Cox model or to
proportional hazards model. However, Cox also noted that biological interpretation of the proportional hazards assumption can be quite tricky. is L(\beta) = \prod_{i=1}^ML_i(\beta) = \prod_{i:C_i=1} L_i(\beta), where the subjects for which an event has occurred are indicated by
Ci = 1 and all others by
Ci = 0. The corresponding log partial likelihood is \ell(\beta) = \sum_{i:C_i=1} \left(X_i \cdot \beta - \log \sum_{j:Y_j\ge Y_i}\theta_j\right), where we have written \sum_{j=i}^N using the indexing introduced above in a more general way, as \sum_{j:Y_j\ge Y_i}. Crucially, the effect of the covariates can be estimated without the need to specify the hazard function \lambda_0(t) over time. The partial likelihood can be maximized over
β to produce maximum partial likelihood estimates of the model parameters. The partial
score function is \ell^\prime(\beta) = \sum_{i:C_i=1} \left(X_i - \frac{\sum_{j:Y_j\ge Y_i}\theta_jX_j}{\sum_{j:Y_j\ge Y_i}\theta_j}\right), and the
Hessian matrix of the partial log likelihood is \ell^{\prime\prime}(\beta) = -\sum_{i:C_i=1} \left(\frac{\sum_{j:Y_j\ge Y_i}\theta_jX_jX_j^\prime}{\sum_{j:Y_j\ge Y_i}\theta_j} - \frac{\left[\sum_{j:Y_j\ge Y_i}\theta_jX_j\right] \left[\sum_{j:Y_j\ge Y_i}\theta_jX_j^\prime\right]}{\left[\sum_{j:Y_j\ge Y_i}\theta_j\right]^2}\right). Using this score function and Hessian matrix, the partial likelihood can be maximized using the
Newton-Raphson algorithm. The inverse of the Hessian matrix, evaluated at the estimate of
β, can be used as an approximate variance-covariance matrix for the estimate, and used to produce approximate
standard errors for the regression coefficients.
Likelihood when there exist tied times Several approaches have been proposed to handle situations in which there are ties in the time data. ''Breslow's method
describes the approach in which the procedure described above is used unmodified, even when ties are present. An alternative approach that is considered to give better results is Efron's method
. Let t''
j denote the unique times, let
Hj denote the set of indices
i such that
Yi =
tj and
Ci = 1, and let
mj = |
Hj|. Efron's approach maximizes the following partial likelihood. L(\beta) = \prod_j \frac{\prod_{i\in H_j}\theta_i}{\prod_{\ell=0}^{m_j-1} \left[\sum_{i:Y_i\ge t_j}\theta_i - \frac{\ell}{m_j} \sum_{i\in H_j} \theta_i\right] }. The corresponding log partial likelihood is \ell(\beta) = \sum_j \left(\sum_{i\in H_j} X_i \cdot \beta -\sum_{\ell=0}^{m_j-1}\log\left(\sum_{i:Y_i\ge t_j}\theta_i - \frac{\ell}{m_j} \sum_{i\in H_j}\theta_i\right)\right), the score function is \ell^\prime(\beta) = \sum_j \left(\sum_{i\in H_j} X_i -\sum_{\ell=0}^{m_j-1}\frac{\sum_{i:Y_i\ge t_j}\theta_iX_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_iX_i}{\sum_{i:Y_i\ge t_j}\theta_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_i}\right), and the Hessian matrix is \ell^{\prime\prime}(\beta) = -\sum_j \sum_{\ell=0}^{m_j-1} \left(\frac{\sum_{i:Y_i\ge t_j}\theta_iX_iX_i^\prime - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_iX_iX_i^\prime}{\phi_{j,\ell,m_j}} - \frac{Z_{j,\ell,m_j} Z_{j,\ell,m_j}^\prime}{\phi_{j,\ell,m_j}^2}\right), where \phi_{j,\ell,m_j} = \sum_{i:Y_i\ge t_j}\theta_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_i Z_{j,\ell,m_j} = \sum_{i:Y_i\ge t_j}\theta_iX_i - \frac{\ell}{m_j}\sum_{i\in H_j}\theta_iX_i. Note that when
Hj is empty (all observations with time
tj are censored), the summands in these expressions are treated as zero.
Examples Below are some worked examples of the Cox model in practice.
A single binary covariate Suppose the endpoint we are interested in is patient survival during a 5-year observation period after a surgery. Patients can die within the 5-year period, and we record when they died, or patients can live past 5 years, and we only record that they lived past 5 years. The surgery was performed at one of two hospitals,
A or
B, and we would like to know if the hospital location is associated with 5-year survival. Specifically, we would like to know the
relative increase (or decrease) in hazard from a surgery performed at hospital A compared to hospital B. Provided is some (fake) data, where each row represents a patient:
T is how long the patient was observed for before death or 5 years (measured in months), and
C denotes if the patient died in the 5-year period. We have encoded the hospital as a binary variable denoted
X: 1 if from hospital
A, 0 from hospital
B. Our single-covariate Cox proportional model looks like the following, with \beta_1 representing the hospital's effect, and
i indexing each patient: \overbrace{\lambda(t|X_{i})}^{\text{hazard for i}} = \underbrace{\lambda_0(t)}_{\text{baseline} \atop \text{hazard} }\cdot\overbrace{\exp(\beta_1 X_{i})}^{\text{scaling factor for i}} Using statistical software, we can estimate \beta_1 to be 2.12. The hazard ratio is the
exponential of this value, \exp(\beta_1) = \exp(2.12). To see why, consider the ratio of hazards, specifically: \frac{\lambda(t|X=1)}{\lambda(t|X=0)} = \frac{\cancel{\lambda_0(t)}\exp(\beta_1 \cdot 1)}{\cancel{\lambda_0(t)}\exp(\beta_1 \cdot 0)} = \exp(\beta_1) Thus, the hazard ratio of hospital A to hospital B is \exp(2.12) = 8.32 . Putting aside statistical significance for a moment, we can make a statement saying that patients in hospital A are associated with a 8.3x higher risk of death occurring in any short period of time compared to hospital B. There are important caveats to mention about the interpretation: • a 8.3x higher risk of death does not mean that 8.3x more patients will die in hospital A: survival analysis examines how quickly events occur, not simply whether they occur. • More specifically, "risk of death" is a measure of a rate. A rate has units, like meters per second. However, a
relative rate does not: a bicycle can go two times faster than another bicycle (the reference bicycle), without specifying any units. Likewise, the risk of death (comparable to the speed of a bike) in hospital
A is 8.3 times higher (faster) than the risk of death in hospital
B (the reference group). • the inverse quantity, 1/8.32 = \frac{1}{\exp(2.12)} = \exp(-2.12) = 0.12 is the hazard ratio of hospital
B relative to hospital
A. • We haven't made any inferences about
probabilities of survival between the hospitals. This is because we would need an estimate of the baseline hazard rate, \lambda_0(t), as well as our \beta_1 estimate. However, standard estimation of the Cox proportional hazard model does not directly estimate the baseline hazard rate. • Because we have ignored the only time varying component of the model, the baseline hazard rate, our estimate is timescale-invariant. For example, if we had measured time in years instead of months, we would get the same estimate. • It is tempting to say that the hospital
caused the difference in hazards between the two groups, but since our study is not causal (that is, we do not know how the data was generated), we stick with terminology like "associated".
A single continuous covariate To demonstrate a less traditional use case of survival analysis, the next example will be an economics question: what is the relationship between a company's price-to-earnings ratio (P/E) on their first IPO anniversary and their future survival? More specifically, if we consider a company's "birth event" to be their first IPO anniversary, and any bankruptcy, sale, going private, etc. as a "death" event the company, we'd like to know the influence of the companies' P/E ratio at their "birth" (first IPO anniversary) on their survival. Provided is a (fake) dataset with survival data from 12 companies:
T represents the number of days between first IPO anniversary and death (or an end date of 2022-01-01, if did not die).
C represents if the company died before 2022-01-01 or not. P/E represents the company's price-to-earnings ratio at its 1st IPO anniversary. Unlike the previous example where there was a binary variable, this dataset has a continuous variable, P/E; however, the model looks similar: \lambda(t|P_{i}) = \lambda_0(t)\cdot\exp(\beta_1 P_{i}) where P_i represents a company's P/E ratio. Running this dataset through a Cox model produces an
estimate of the value of the unknown \beta_1, which is -0.34. Therefore, an estimate of the entire hazard is: \lambda(t|P_{i}) = \lambda_0(t)\cdot\exp(-0.34 P_{i}) Since the baseline hazard, \lambda_0(t), was not estimated, the entire hazard is not able to be calculated. However, consider the ratio of the companies
i and
j's hazards: \begin{align} \frac{\lambda(t|P_{i})}{\lambda(t|P_{j})} &= \frac{ \cancel{\lambda_0(t)}\cdot\exp(-0.34 P_{i})}{\cancel{\lambda_0(t)}\cdot \exp(-0.34 P_{j})} \\ &= \exp(-0.34 (P_{i} - P_{j})) \end{align} All terms on the right are known, so calculating the ratio of hazards between companies is possible. Since there is no time-dependent term on the right (all terms are constant), the hazards are
proportional to each other. For example, the hazard ratio of company 5 to company 2 is \exp(-0.34 (6.3 - 3.0)) = 0.33. This means that, within the interval of study, company 5's risk of "death" is 0.33 ≈ 1/3 as large as company 2's risk of death. There are important caveats to mention about the interpretation: • The
hazard ratio is the quantity \exp(\beta_1), which is \exp(-0.34) = 0.71 in the above example. From the last calculation above, an interpretation of this is as the ratio of hazards between two "subjects" that have their variables differ by one unit: if P_{i} = P_{j} + 1, then \exp(\beta_1 (P_{i} - P_{j}) = \exp(\beta_1 (1)). The choice of "differ by one unit" is convenience, as it communicates precisely the value of \beta_1. • The baseline hazard can be represented when the scaling factor is 1, i.e. P=0. \lambda(t|P_{i}=0) = \lambda_0(t)\cdot\exp(-0.34 \cdot 0) = \lambda_0(t) Can we interpret the baseline hazard as the hazard of a "baseline" company whose P/E happens to be 0? This interpretation of the baseline hazard as "hazard of a baseline subject" is imperfect, as the covariate being 0 is impossible in this application: a P/E of 0 is meaningless (it means the company's stock price is 0, i.e., they are "dead"). A more appropriate interpretation would be "the hazard when all variables are nil". • It is tempting to want to understand and interpret a value like \exp(\beta_1 P_{i}) to represent the hazard of a company. However, consider what this is actually representing: \exp(\beta_1 P_{i}) = \exp(\beta_1 (P_{i}-0))= \frac{\exp(\beta_1 P_{i})}{\exp(\beta_1 0)} = \frac{\lambda(t|P_{i})}{\lambda(t|0)}. There is implicitly a ratio of hazards here, comparing company
i's hazard to an imaginary baseline company with 0 P/E. However, as explained above, a P/E of 0 is impossible in this application, so \exp(\beta_1 P_{i}) is meaningless in this example. Ratios between plausible hazards are meaningful, however. ==Time-varying predictors and coefficients==