Unbox the Cox: A Hidden Dark Secret of Cox Regression

Diving into the perfect predictors
If you have been following my previous blog posts, you might recall that logistic regression runs into trouble when fitting perfectly separated data, leading to an infinite odds ratio. In Cox regression, where hazard replaces odds, you might wonder whether a similar issue arises with perfect predictors. It does, but unlike in logistic regression, it is much less apparent how it arises, or even what constitutes a "perfect predictor". As will become clear later, perfect predictors are predictors x whose ranks exactly match the ranks of the event times (their Spearman correlation is one).
Previously, on "Unbox the Cox", we explained maximum likelihood estimation and introduced a made-up dataset with five subjects, where a single predictor, x, represented the dosage of a life-extending drug. To make x a perfect predictor of event times, here we swapped the event times of subjects C and D:
import numpy as np
import pandas as pd
import plotnine as p9
from cox.plots import (
    plot_subject_event_times,
    animate_subject_event_times_and_mark_at_risk,
    plot_cost_vs_beta,
)

perfect_df = pd.DataFrame({
    'subject': ['A', 'B', 'C', 'D', 'E'],
    'time': [1, 3, 4, 5, 6],
    'event': [1, 1, 1, 1, 0],
    'x': [-1.7, -0.4, 0.0, 0.9, 1.2],
})

plot_subject_event_times(perfect_df, color_map='x')
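As a quick sanity check, we can confirm that x is now a perfect predictor by computing its Spearman correlation with time (using scipy.stats.spearmanr here is my choice of tool, not something from the original series):

from scipy.stats import spearmanr

# The ranks of x match the ranks of the event times exactly,
# so the Spearman correlation is 1.
print(spearmanr(perfect_df['x'], perfect_df['time']).correlation)  # 1.0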

To understand why these "perfect predictors" can be problematic, let's pick up right where we left off and check the negative log-likelihood cost plotted against β:
negloglik_sweep_betas_perfect_df = neg_log_likelihood_all_subjects_sweep_betas(
    perfect_df,
    betas=np.arange(-5, 5, 0.1)
)
plot_cost_vs_beta(negloglik_sweep_betas_perfect_df, width=0.1)
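(If you don't have the helper functions from the previous post at hand, here is a minimal sketch of what they compute: the Cox partial log-likelihood of one subject's event, and the summed negative log-likelihood over a grid of βs. The exact signatures and the returned column names are my assumptions, not necessarily identical to the originals.)

def log_likelihood(df, subject, beta):
    # Partial log-likelihood contribution of one subject's event:
    # log of (own hazard / sum of hazards of all subjects still at risk).
    row = df.loc[df['subject'] == subject].iloc[0]
    at_risk = df.loc[df['time'] >= row['time']]
    return beta * row['x'] - np.log(np.sum(np.exp(beta * at_risk['x'])))

def neg_log_likelihood_all_subjects_sweep_betas(df, betas):
    # Sum the contributions of all observed events for each candidate beta.
    events = df.query("event == 1")['subject']
    return pd.DataFrame({
        'beta': betas,
        'neg_log_likelihood': [
            -sum(log_likelihood(df, s, b) for s in events)
            for b in betas
        ],
    })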

You can see right away that the cost no longer has a minimum: the more negative β becomes, the lower the cost, yielding near-perfect log-likelihood fits for all events.
Now, let's dive into the math behind this and take a look at the likelihood of event A. We'll dig into how the numerator and denominator change as we tweak β:
$$L(A) = \frac{e^{\beta x_A}}{e^{\beta x_A} + e^{\beta x_B} + e^{\beta x_C} + e^{\beta x_D} + e^{\beta x_E}}$$
When β is a large positive number, the last term in the denominator, representing the hazard of subject E (who has the largest x, 1.2), dominates the entire denominator and becomes exceedingly large. The likelihood therefore becomes small and approaches zero:
$$L(A) \approx \frac{e^{\beta x_A}}{e^{\beta x_E}} = e^{\beta (x_A - x_E)} \to 0 \quad \text{as } \beta \to +\infty$$
This makes the negative log-likelihood cost large. The same thing happens to every individual likelihood, because the hazard of subject E always exceeds whichever hazard sits in the numerator. As a result, the negative log-likelihood increases for subjects A through D: a large positive β drags all the likelihoods down at once, producing poor fits for all events.
Now, when β is a large negative number, the first term in the denominator, representing the hazard of subject A, dominates, since A has the lowest x value. Because that same hazard of subject A also appears in the numerator, the likelihood L(A) can be pushed arbitrarily close to 1 by making β ever more negative, creating an almost perfect fit:
$$L(A) \approx \frac{e^{\beta x_A}}{e^{\beta x_A}} = 1 \quad \text{as } \beta \to -\infty$$
The same goes for all the other individual likelihoods: negative βs now boost the likelihoods of all events at the same time. In other words, a negative β comes with no downsides. Meanwhile, some individual hazards increase (subjects A and B, with negative x), some stay the same (subject C, with x = 0), and others decrease (subject D, with positive x). But remember, what really matters here are ratios of hazards. We can verify this hand-waving math by plotting the individual likelihoods:
def plot_likelihoods(df, beta_range=(-20, 20)):
    # Sweep beta over the given range; log_likelihood() is the helper
    # from the previous post (sketched above).
    betas = np.arange(beta_range[0], beta_range[1], 0.5)
    subjects = df.query("event == 1")['subject'].tolist()
    likelihoods_per_subject = []
    for subject in subjects:
        likelihoods = [
            np.exp(log_likelihood(df, subject, beta))
            for beta in betas
        ]
        likelihoods_per_subject.append(
            pd.DataFrame({
                'beta': betas,
                'likelihood': likelihoods,
                'subject': [subject] * len(betas),
            })
        )
    lik_df = pd.concat(likelihoods_per_subject)
    return (
        p9.ggplot(lik_df, p9.aes('beta', 'likelihood', color='subject'))
        + p9.geom_line(size=2)
        + p9.theme_classic()
    )
plot_likelihoods(perfect_df)

The way the likelihoods are put together, as a ratio of one hazard to the sum of hazards of all subjects still at risk, means that negative β values produce a near-perfect fit for the likelihood of each subject whose event-time rank is greater than or equal to their predictor rank! As a side note, if x had a perfect negative Spearman correlation with event times, things would be flipped around: arbitrarily large positive βs would give us arbitrarily good fits.
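We can make this less hand-wavy with a quick numeric check, evaluating each event's likelihood at a strongly negative β (β = -10 is an arbitrary illustrative value, and log_likelihood() is the helper from the previous post, sketched earlier):

for subject in ['A', 'B', 'C', 'D']:
    print(subject, np.exp(log_likelihood(perfect_df, subject, -10)))
# Each likelihood comes out close to 1, i.e. a near-perfect fit.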
Misaligned predictor and time ranks
We can actually see this in action, and show what goes down when event-time ranks and predictor ranks do not line up, using another made-up example:
sample_df = pd.DataFrame({
    'subject': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'x': [-1.7, -0.4, 0.0, 0.5, 0.9, 1.2, 1.3, 1.45],
    'time': [1, 2, 4, 3, 5, 7, 6, 8],
    'rank_x': [1, 2, 3, 4, 5, 6, 7, 8],
    'event': [1, 1, 1, 1, 1, 1, 1, 0],
})
sample_df
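A quick programmatic way to spot the misalignment discussed next (the rank_time helper column is my addition, not part of the original table):

sample_df['rank_time'] = sample_df['time'].rank().astype(int)
# Subjects whose predictor rank exceeds their event-time rank:
print(sample_df.query("rank_x > rank_time")['subject'].tolist())  # ['D', 'G']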

In this particular example, the time column ranges from 1 to 8, so each value is also its own rank. We also have a rank_x column, which ranks the predictor x. Now, here's the key observation: for subjects D and G, rank_x is actually higher than the rank of their time. As a result, the likelihoods of D and G won't experience the cancellation effect between the numerator and denominator when we have large negative values of β:
$$L(D) = \frac{e^{\beta x_D}}{e^{\beta x_C} + e^{\beta x_D} + e^{\beta x_E} + e^{\beta x_F} + e^{\beta x_G} + e^{\beta x_H}} \approx e^{\beta (x_D - x_C)} \to 0 \quad \text{as } \beta \to -\infty$$

$$L(G) = \frac{e^{\beta x_G}}{e^{\beta x_F} + e^{\beta x_G} + e^{\beta x_H}} \approx e^{\beta (x_G - x_F)} \to 0 \quad \text{as } \beta \to -\infty$$

Since x_C < x_D and x_F < x_G, the hazards of C and F dominate the denominators for very negative β, pushing these two likelihoods back toward zero instead of one.
Their likelihoods are now maximal at some intermediate finite values of β. Let's take a look at a plot of individual likelihoods to see this:
plot_likelihoods(sample_df)

These "misaligned" ranks between a time and a predictor play a crucial role: they stop all likelihoods from essentially collapsing into one when we have significantly negative _β_s.
In conclusion: in Cox regression, to obtain a finite coefficient β for a predictor x, we require at least one subject whose predictor rank is higher than their event-time rank (equivalently, at least one mismatch between the two rankings).
Perfect is indeed the enemy of the good (p-value)
So, how do these perfect predictors actually behave in real-life scenarios? Well, to find out, let's once again turn to the lifelines library for some investigation:
from lifelines import CoxPHFitter
perfect_cox_model = CoxPHFitter()
perfect_cox_model.fit(
    perfect_df,
    duration_col='time',
    event_col='event',
    formula='x'
)
perfect_cox_model.print_summary()
#> /.../coxph_fitter.py:1586: ConvergenceWarning:
#> The log-likelihood is getting suspiciously close to 0 and the delta is still large.
#> There may be complete separation in the dataset.
#> This may result in incorrect inference of coefficients.
#> See https://stats.stackexchange.com/q/11109/11867 for more.
#> /.../__init__.py:1165: ConvergenceWarning:
#> Column x has high sample correlation with the duration column.
#> This may harm convergence.
#> This could be a form of 'complete separation'.
#> See https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression
#> /.../coxph_fitter.py:1611:
#> ConvergenceWarning: Newton-Rhaphson failed to converge sufficiently.
#> Please see the following tips in the lifelines documentation:
#> https://lifelines.readthedocs.io/en/latest/Examples.html#problems-with-convergence-in-the-cox-proportional-hazard-model

Just like in logistic regression, we're encountering convergence warnings and getting extremely wide confidence intervals for our predictor coefficient β. As a result, we end up with a p-value of 0.93!
If we simply filtered models based on p-values, without taking this issue into account or investigating further, we would overlook these perfect predictors.
To tackle this convergence problem, the lifelines library documentation and some helpful StackOverflow threads suggest a potential solution: incorporating a regularization term into the cost function. This term effectively increases the cost for large coefficient values, and you can activate L2 regularization by setting the penalizer argument to a value greater than zero:
perfect_pen_cox_model = CoxPHFitter(penalizer=0.01, l1_ratio=0)
perfect_pen_cox_model.fit(perfect_df, duration_col='time', event_col='event', formula='x')
perfect_pen_cox_model.print_summary()
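For reference, L2 regularization adds a quadratic penalty on β to the negative partial log-likelihood, so the minimized cost looks schematically like this (lifelines' exact scaling of the penalty term may differ):

$$\text{cost}(\beta) = -\log L(\beta) + \frac{\text{penalizer}}{2}\,\beta^{2}$$

Because the penalty grows without bound as β → -∞, the cost is now guaranteed to have a minimum at a finite β.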

This approach fixes the convergence warning, but it doesn't make much of a dent in that pesky p-value: even with this regularization trick, the p-value for a perfect predictor still hangs around a fairly large 0.11.
Time is relative: only ranks matter
Lastly, we'll verify that the absolute values of event times have no impact on a Cox regression fit, using our previous example. To do this, we'll introduce a new column called time2, containing random numbers sorted in the same order as the time column:
sample_df = pd.DataFrame({
    'subject': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'x': [-1.7, -0.4, 0.0, 0.5, 0.9, 1.2, 1.3, 1.45],
    'time': [1, 2, 4, 3, 5, 7, 6, 8],
    'rank_x': [1, 2, 3, 4, 5, 6, 7, 8],
    'event': [1, 1, 1, 1, 1, 1, 1, 0],
}).sort_values('time')

np.random.seed(42)
sample_df['time2'] = sorted(np.random.randint(low=-42, high=888, size=8))
sample_df

The fits using time and time2 are indeed identical:
sample_cox_model = CoxPHFitter()
sample_cox_model.fit(
    sample_df,
    duration_col='time',
    event_col='event',
    formula='x'
)
sample_cox_model.print_summary()

sample_cox_model = CoxPHFitter()
sample_cox_model.fit(
    sample_df,
    duration_col='time2',
    event_col='event',
    formula='x'
)
sample_cox_model.print_summary()
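The two summaries match because time and time2 have identical rank orderings, which we can confirm directly (again using scipy.stats.spearmanr, my choice of tool here):

from scipy.stats import spearmanr

print(spearmanr(sample_df['time'], sample_df['time2']).correlation)  # 1.0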

Conclusion
What did we learn from all of this?
- Perfect predictors in survival models are those predictors whose ranks perfectly match the ranks of event times.
- Cox regression cannot fit these perfect predictors with a finite coefficient β, leading to wide confidence intervals and big p-values.
- The actual values of event times don't really matter – it's all about their ranks.
- When the ranks of event times and predictors don't align, we don't get that handy cancellation effect in the likelihoods for large negative β values. So, we need at least one subject whose ranks don't match to obtain a fit with a finite coefficient.
- Even if we try some fancy regularization techniques, perfect predictors can still give us those annoyingly wide confidence intervals and high p-values in real-life situations.
- Just like in logistic regression, if we don't really care about those p-values, using a regularization method can still provide us with a handy model fit that gets the prediction right!
If you would like to run the code yourself, feel free to use the Jupyter notebook from my GitHub: https://github.com/igor-sb/blog/blob/main/posts/cox_perfect.ipynb
Farewell until the next post!