How do we remove the “healthy vaccinee” bias? Part 3

Eyal Shahar
Sep 2, 2023


Unlike most of my writing, this text is highly technical.

There are two ways to remove the healthy vaccinee bias, and both invoke counterfactual reasoning for deconfounding. The first method relies only on that reasoning; the second is a hybrid of classical conditioning and counterfactual reasoning. I will try to show here that the second method, although likely valid in some circumstances, is unjustified and possibly inferior.

The first method

We compute a corrected risk ratio (RR) in the sample: the RR for Covid death (vaccinated vs. unvaccinated) divided by the RR for non-Covid death.

Since the population denominators, whether counts or person-time, cancel out, the correction may be reduced to counts of deaths alone. The corrected RR is then an odds ratio: the odds of Covid death versus non-Covid death among the vaccinated, divided by the same odds among the unvaccinated.
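To see the cancellation concretely, here is a minimal Python sketch. The counts are hypothetical, invented only for illustration, and the sketch assumes the correction divides the RR for Covid death by the RR for non-Covid death.

```python
# Hypothetical counts, invented for illustration only.
vax = {"n": 80_000, "covid_deaths": 40, "noncovid_deaths": 400}
unvax = {"n": 20_000, "covid_deaths": 50, "noncovid_deaths": 250}


def risk(deaths, n):
    """Risk = deaths divided by the population denominator."""
    return deaths / n


# Risk ratios (vaccinated vs. unvaccinated) for each cause of death.
rr_covid = risk(vax["covid_deaths"], vax["n"]) / risk(unvax["covid_deaths"], unvax["n"])
rr_noncovid = risk(vax["noncovid_deaths"], vax["n"]) / risk(unvax["noncovid_deaths"], unvax["n"])

# Corrected RR: the Covid-death RR divided by the non-Covid-death RR.
corrected_rr = rr_covid / rr_noncovid

# The same quantity from death counts alone: an odds ratio.
odds_vax = vax["covid_deaths"] / vax["noncovid_deaths"]
odds_unvax = unvax["covid_deaths"] / unvax["noncovid_deaths"]
odds_ratio = odds_vax / odds_unvax

print(round(corrected_rr, 6), round(odds_ratio, 6))
```

The two printed numbers agree because the denominators (80,000 and 20,000 here) appear once in the numerator and once in the denominator of the corrected RR.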

Yuan et al. call the numerator and denominator “Covid Excess Mortality Percentage”. Technically, these are odds that compare two categories in a three-value variable (covid death, non-covid death, alive).

Therefore, we can use a 2x2 table to compute that odds ratio. Alternatively, we may use a simple logistic regression model, as noted by Yuan et al.
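The equivalence between the 2x2 table and a simple logistic regression can be checked with a short sketch. The death counts are hypothetical, and the small Newton-Raphson fitter below is written out only to keep the example self-contained; it is not taken from Yuan et al., and in practice any standard logistic regression routine would do.

```python
import math

# Hypothetical death counts, invented for illustration only.
# Each tuple: (vax status V, Covid deaths, all deaths).
groups = [(0, 50, 300), (1, 40, 440)]


def fit_logistic(groups, iterations=25):
    """Fit logit P(Covid death | death) = b0 + b1*V by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for v, events, total in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * v)))
            resid = events - total * p      # score contribution
            w = total * p * (1.0 - p)       # information weight
            g0 += resid
            g1 += resid * v
            h00 += w
            h01 += w * v
            h11 += w * v * v
        det = h00 * h11 - h01 * h01
        # Newton step: beta += inverse(information) * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic(groups)
or_from_model = math.exp(b1)

# Odds ratio straight from the 2x2 table of deaths.
or_from_table = (40 / 400) / (50 / 250)

print(round(or_from_model, 6), round(or_from_table, 6))
```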

The correction is simple and valid.

The second method

The second method is a three-step procedure: 1) classical conditioning; 2) counterfactual reasoning; 3) a weighted average.

This sequence is not easily apparent in the multivariable case where the following is done:

First, we fit two multivariable regression models, one for Covid death and one for non-Covid death, to remove as much confounding as we can. Then, we divide the risk ratio (vaccinated vs. unvaccinated) from the model for Covid death by the corresponding risk ratio from the model for non-Covid death.

This type of computation was proposed in response to the letter in The New England Journal of Medicine.

Corrected hazard ratio: 0.10/0.23 = 0.43
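The arithmetic, and the vaccine effectiveness (VE) it implies, can be checked in a few lines. The sketch assumes that 0.10 is the adjusted hazard ratio for Covid death, that 0.23 is the adjusted hazard ratio for non-Covid death, and that VE is computed as 1 minus the hazard ratio.

```python
hr_covid = 0.10      # adjusted hazard ratio, Covid death (assumed reading of the figures)
hr_noncovid = 0.23   # adjusted hazard ratio, non-Covid death (assumed reading)

corrected_hr = hr_covid / hr_noncovid
ve = 1 - corrected_hr  # vaccine effectiveness as 1 - HR

print(f"corrected HR = {corrected_hr:.2f}, VE = {ve:.0%}")
```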

Over the course of my professional career, I have learned that the logic of regression and multivariable modeling should also hold in the simplest case. To find out what was actually done above (corrected hazard ratio = 0.43), let’s switch from hazard ratios to odds ratios, and from models that contain 13 covariates (some categorical and some continuous) to models that contain only one binary covariate (sex).

Adding the binary sex variable to a logistic regression model that contains vax status is equivalent to the following:

  1. Stratify the sample on sex
  2. Compute the odds ratio (vax vs. unvax) separately for men and women
  3. Assume there is no effect modification by sex
  4. Combine the two estimates by a weighted average
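The four steps above can be sketched as follows. The counts are hypothetical, and the inverse-variance (Woolf) weights on the log odds ratio are one common choice of weights, used here as an assumption; the maximum-likelihood estimate from the logistic model is close to this weighted average but not identical to it.

```python
import math


def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = events, non-events among vaccinated;
    c, d = events, non-events among unvaccinated."""
    return (a * d) / (b * c)


def pooled_odds_ratio(tables):
    """Inverse-variance (Woolf) weighted average of stratum log ORs.
    Assumes a common OR across strata (no effect modification)."""
    num = den = 0.0
    for a, b, c, d in tables:
        log_or = math.log(odds_ratio(a, b, c, d))
        weight = 1.0 / (1 / a + 1 / b + 1 / c + 1 / d)
        num += weight * log_or
        den += weight
    return math.exp(num / den)


# Step 1: stratify on sex (hypothetical counts, invented for illustration).
strata = {
    "men": (30, 270, 40, 160),
    "women": (10, 130, 10, 90),
}

# Step 2: sex-specific odds ratios.
or_by_sex = {sex: odds_ratio(*table) for sex, table in strata.items()}

# Steps 3 and 4: assume no effect modification and combine.
pooled = pooled_odds_ratio(strata.values())

print({sex: round(v, 3) for sex, v in or_by_sex.items()}, round(pooled, 3))
```

The pooled estimate always falls between the two sex-specific estimates, which is what "weighted average" means here.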

As we know, the classical reason for this procedure is confounding (by sex) and the underlying principle is deconfounding by conditioning (on sex).

Applied to our case of the healthy vaccinee bias, the procedure is as follows: after stratification, we invoke counterfactual reasoning for each sex, compute a correction for the healthy vaccinee bias for each sex, and combine the estimates by a weighted average.
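In symbols, a sketch of that procedure might look as follows. The notation is an assumption introduced here: C and N are counts of Covid and non-Covid deaths, the subscript 1 or 0 marks vaccinated or unvaccinated, s indexes sex, and the weights w_s are left unspecified.

```latex
% Sex-specific correction: an odds ratio from death counts within stratum s
\[
\widehat{OR}_s \;=\; \frac{C_{1s}/N_{1s}}{C_{0s}/N_{0s}},
\qquad s \in \{\text{men}, \text{women}\}
\]

% Combined estimate: a weighted average on the log scale
\[
\log \widehat{OR}_{\text{combined}}
\;=\; \frac{\sum_s w_s \,\log \widehat{OR}_s}{\sum_s w_s}
\]
```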

The procedure seems valid, but why are we doing it?

Why should we prefer the second model below over the first?
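The original figure is not reproduced here. Assuming the two models are the simple logistic regression from the first method and the same model with sex added, they would take roughly this form, where V is vax status, S is sex, and the outcome is Covid death (vs. non-Covid death) among those who died:

```latex
% First model: vax status only
\[
\operatorname{logit}\,\Pr(\text{Covid death} \mid \text{death}, V)
\;=\; \beta_0 + \beta_1 V
\]

% Second model: vax status and sex
\[
\operatorname{logit}\,\Pr(\text{Covid death} \mid \text{death}, V, S)
\;=\; \beta_0' + \beta_1' V + \beta_2' S
\]
```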

This is not the typical case of deconfounding. The coefficient of V from the first model is not “confounded”. The exponentiated coefficient (OR) is a valid estimate of the vax effect after removing the healthy vaccinee bias, namely, after removing confounding bias. It is derived from counterfactual reasoning and does not need to be “corrected”.

If so, what do we gain from the second model? Why do we stratify on sex and then combine the sex-specific estimates? What is the justification for the procedure? Extending the questions to the multivariable case (tables above): Why should we prefer the multivariable correction of the hazard ratio (VE=57%) over a simple correction (VE=0%)?

I have no answers to these questions. Perhaps someone else does. But I know what we may lose by using a multivariable model:

  1. We may mistakenly condition on a collider and add bias
  2. We may lose precision: the variance of the vax coefficient may increase
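The first risk can be illustrated with an exact calculation rather than a simulation. In this sketch, all probabilities are hypothetical: V (vax status) and Y (the outcome) are independent by construction, and C is a covariate affected by both, that is, a collider. The marginal odds ratio is 1, yet conditioning on C creates an association.

```python
from itertools import product

# Hypothetical probabilities, invented for illustration only.
p_v = 0.5   # P(V = 1)
p_y = 0.2   # P(Y = 1), independent of V by construction
# P(C = 1 | V, Y): C is a common effect of V and Y (a collider).
p_c = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.9}


def joint(v, y, c):
    """Joint probability P(V=v, Y=y, C=c)."""
    pv = p_v if v else 1 - p_v
    py = p_y if y else 1 - p_y
    pc = p_c[(v, y)] if c else 1 - p_c[(v, y)]
    return pv * py * pc


def odds_ratio(cells):
    """OR of Y by V from a dict of probabilities keyed by (v, y)."""
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])


pairs = list(product((0, 1), repeat=2))
marginal = {(v, y): joint(v, y, 0) + joint(v, y, 1) for v, y in pairs}
within_c1 = {(v, y): joint(v, y, 1) for v, y in pairs}

marginal_or = odds_ratio(marginal)        # no conditioning: no association
conditional_or = odds_ratio(within_c1)    # conditioning on C = 1: bias appears

print(round(marginal_or, 3), round(conditional_or, 3))
```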

Lastly, if that procedure is used, the correct computation calls for a single multinomial regression, where the dependent variable takes three values (covid death, non-covid death, alive), or a model with just two values (covid death, non-covid death). You can read about that issue in the reference below:

Paul D. Allison. Logistic Regression Using the SAS System: Theory and Application (pages 122–3).
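To see why a single multinomial model delivers the correction directly, consider a sketch with vax status V as the only covariate. Taking "alive" as the reference category is an assumption made here for illustration:

```latex
\[
\log\frac{\Pr(\text{Covid death} \mid V)}{\Pr(\text{alive} \mid V)}
\;=\; \alpha_1 + \beta_1 V
\]
\[
\log\frac{\Pr(\text{non-Covid death} \mid V)}{\Pr(\text{alive} \mid V)}
\;=\; \alpha_2 + \beta_2 V
\]

% Subtracting the second equation from the first:
\[
\log\frac{\Pr(\text{Covid death} \mid V)}{\Pr(\text{non-Covid death} \mid V)}
\;=\; (\alpha_1 - \alpha_2) + (\beta_1 - \beta_2)\,V
\]
```

Under these assumptions the corrected odds ratio is exp(β1 − β2), and because both coefficients come from one fitted model, its variance can be obtained from that model's covariance matrix.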

Let me end with a favorite quote on my (inactive) professional website:

“A first principle not formally recognized by scientific methodologists:
when you run into something interesting, drop everything else and study
it.” — B.F. Skinner



Eyal Shahar

Professor Emeritus of Public Health (University of Arizona); MD (Tel-Aviv University, Israel); MPH, Epidemiology (University of Minnesota)