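Suppose we omit z from the regression, and suppose the relation between x and z is given by $z = d + fx + e$, where d and f are parameters and e is an error term. Substituting the second equation into the first, $y = a + bx + cz + u$, gives $y = (a + cd) + (b + cf)x + (u + ce)$.
If y is regressed on x alone, the coefficient on x therefore estimates $b + cf$ rather than the direct effect $b$. These differ if both c and f are non-zero. As an example, consider a linear model of the form $y_i = x_i \beta + z_i \delta + u_i$, $i = 1, \dots, n$. If the independent variable z is omitted from the regression, then the estimated values of the response parameters of the other independent variables will be given by the usual least squares calculation, $\hat{\beta} = (X'X)^{-1} X'Y$. Substituting $Y = X\beta + Z\delta + U$ gives $\hat{\beta} = \beta + (X'X)^{-1} X'Z\delta + (X'X)^{-1} X'U$.
On taking expectations, the contribution of the final term is zero; this follows from the assumption that U is uncorrelated with the regressors X.
On simplifying the remaining terms: $E[\hat{\beta} \mid X] = \beta + (X'X)^{-1} E[X'Z \mid X]\,\delta$. Note that the bias, the second term, is equal to the weighted portion of $z_i$ which is "explained" by $x_i$.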
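The scalar result can be checked numerically. The following is a minimal simulation sketch, assuming the true model y = a + bx + cz + u with z = d + fx + e; the parameter values, sample size, seed, and the helper names `ols_slope` and `simulate_short_regression` are all illustrative assumptions, not part of the original derivation. It regresses y on x alone and compares the slope with b + cf.

```python
import random

def ols_slope(xs, ys):
    """Slope from a simple least squares regression of ys on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def simulate_short_regression(a=1.0, b=2.0, c=3.0, d=0.5, f=0.7, n=20000, seed=0):
    """Generate data from the full model, then regress y on x only (z omitted)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = d + f * x + rng.gauss(0.0, 1.0)          # z = d + f*x + e
        y = a + b * x + c * z + rng.gauss(0.0, 1.0)  # y = a + b*x + c*z + u
        xs.append(x)
        ys.append(y)
    return ols_slope(xs, ys)

slope = simulate_short_regression()
# The slope should be close to b + c*f = 4.1, not to the direct effect b = 2.0.
print(f"short-regression slope: {slope:.3f}")
```

Setting either c or f to zero in this sketch should bring the slope back to roughly b, matching the condition stated above.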
The Gauss-Markov theorem states that regression models which fulfill the classical linear regression model assumptions provide the most efficient linear unbiased estimators.
In ordinary least squares, the relevant assumption of the classical linear regression model is that the error term is uncorrelated with the regressors.
The presence of omitted-variable bias violates this particular assumption.
I discuss implications of the causal structure for bias, and provide brief illustrative examples. The paper addresses additional issues in missing data, and concludes with a brief discussion. Before proceeding, it will be useful to review the standard definitions of three types of missingness (missingness completely at random, at random, and not at random) as well as the definition of complete case analysis.
Data are missing completely at random (MCAR) when the probability of missingness depends on values of neither observed nor unobserved data.
Data are missing at random (MAR) when the probability of missingness depends only on observed data. Data are missing not at random (MNAR; alternately, there are non-ignorable missing data or non-random missingness) when the probability of missingness depends in part on unobserved data.
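A small simulation can make the three mechanisms concrete. The sketch below is an illustration under assumed values, not an analysis from the text: x is a binary covariate that is always observed, y = x + noise is the variable subject to missingness, and the observation probabilities (0.7, 0.9, 0.3) are arbitrary choices. It reports the complete-case mean of y under each mechanism.

```python
import random

def mean(values):
    return sum(values) / len(values)

def complete_case_means(n=50000, seed=1):
    """Mean of y among complete cases under three missingness mechanisms."""
    rng = random.Random(seed)
    full, full_x1 = [], []
    mcar, mar, mar_x1, mnar = [], [], [], []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0   # always observed
        y = x + rng.gauss(0.0, 1.0)          # possibly missing
        full.append(y)
        if x == 1:
            full_x1.append(y)
        # MCAR: y is observed with the same probability for everyone.
        if rng.random() < 0.7:
            mcar.append(y)
        # MAR: the probability of observing y depends only on the observed x.
        if rng.random() < (0.3 if x == 1 else 0.9):
            mar.append(y)
            if x == 1:
                mar_x1.append(y)
        # MNAR: the probability of observing y depends on y itself.
        if rng.random() < (0.3 if y >= 0.5 else 0.9):
            mnar.append(y)
    return {
        "full": mean(full),
        "full, x=1": mean(full_x1),
        "MCAR": mean(mcar),
        "MAR": mean(mar),
        "MAR, x=1": mean(mar_x1),
        "MNAR": mean(mnar),
    }

results = complete_case_means()
for label, value in results.items():
    print(f"{label:10s} {value:.3f}")
```

Under these assumptions the crude complete-case mean should track the full-data mean only for MCAR. Under MAR the crude mean is off, but the mean within a level of x should still agree with the full data, which is what makes MAR tractable when x is used in the analysis.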
Collider bias (or collider-stratification bias, or collider-conditioning bias) [2, 3, 7] is bias resulting from conditioning on a common effect of at least two causes.
In Figure 1, attendance at clinic C is an effect of both exposure E and disease D. Conditioning on C can induce an association between E and D among those selected. This association is represented by a dotted line in Figure 1B.
First, collider stratification is usually (though by no means always) explained in a situation in which exposure and disease are marginally independent; it is important to note that stratification on a collider can also introduce bias when exposure and disease are not independent.
Second, while some explanations of collider bias emphasize stratification, today we understand that similar biases are introduced by any form of conditioning on a collider, including restriction as well as stratification.
While an apparently minor point, this recognition gives us a key pivot for moving from selection bias to missing data. Restriction to a single level of a collider C is strongly analogous to restricting data to persons who are not missing.
If the study is conducted at an antenatal care clinic, then both pregnancy and a new diagnosis of AIDS may affect presence at the clinic, and conduct of the study in that setting may lead to a biased estimate of the relationship between pregnancy and time to AIDS.
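The effect of restricting to a collider can be worked through exactly with a small hypothetical example. In the sketch below, E and D are independent in the source population and both raise the probability of clinic attendance (C = 1); every probability is an assumption chosen for illustration, and exact fractions are used so that no sampling noise is involved.

```python
from fractions import Fraction as F

def odds_ratio(p):
    """Odds ratio for E and D from a dict keyed by (e, d)."""
    return (p[1, 1] * p[0, 0]) / (p[1, 0] * p[0, 1])

# Hypothetical source population in which E and D are independent.
p_e, p_d = F(3, 10), F(2, 10)
joint = {(e, d): (p_e if e else 1 - p_e) * (p_d if d else 1 - p_d)
         for e in (0, 1) for d in (0, 1)}

# Hypothetical probability of attending clinic (C = 1), raised by both E and D.
attend = {(0, 0): F(1, 10), (1, 0): F(5, 10), (0, 1): F(5, 10), (1, 1): F(9, 10)}

# Distribution of (E, D) among attendees, left unnormalized because the
# odds ratio does not depend on the overall scale.
among_attendees = {key: joint[key] * attend[key] for key in joint}

or_full = odds_ratio(joint)
or_clinic = odds_ratio(among_attendees)
print(or_full, or_clinic)  # prints: 1 9/25
```

With these assumed attendance probabilities, the odds ratio is exactly 1 in the source population but 9/25 among clinic attendees: restriction to C = 1 creates an inverse association between E and D that does not exist in the full population.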
Figure 2 shows a causal structure in which neither E nor D has any causal effect on C. Thus, conditioning on C — or restricting to a level of C — is equivalent to taking a simple random sample of the original cohort.
From a selection-bias perspective, this obviously will introduce no bias; from a missing-data perspective, this is equivalent to data missing completely at random.
Figure 2. Neither E nor D affects factor C, so conditioning on or restricting to a level of C amounts to simple random sampling.
Table 1 shows the hypothetical cohort of patients we would have observed if we had studied the effect of E on D in, for example, a population sampled at random from the total eligible population, including some who attended clinic and some who did not.
Because C is unaffected by E or D, this is equivalent to simple random sampling; we observe a fixed proportion of individuals regardless of values of E and D (in this case, some fraction f).
In this case, conditioning on clinic attendance amounts to a simple random sample of size fN from the original N subjects, repeated independently for every combination of E and D.
As can be readily seen in Table 2, all measures are unbiased. Clinic attendance might be influenced by various additional factors. Independence of these additional factors from both E and D is sufficient but not necessary for lack of bias when conditioning on C.
If attendance at our clinic is due only to distance of home from the clinic (and not to pregnancy status or AIDS diagnosis, directly or indirectly), then analyses of these women will be unbiased.
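The reason a common sampling fraction leaves every measure unchanged can be written out in a few lines. The notation here is an assumption of this sketch rather than the paper's: N_{ed} denotes the number of cohort members with exposure E = e and disease D = d, and each cell is observed with the same expected fraction f.

```latex
% Assumed notation: N_{ed} = cohort count with E = e and D = d;
% f = common sampling fraction applied to every (e, d) cell.
\begin{align*}
\text{Risk}^{\text{observed}}_{E=e}
  &= \frac{f N_{e1}}{f N_{e1} + f N_{e0}}
   = \frac{N_{e1}}{N_{e1} + N_{e0}}
   = \text{Risk}^{\text{cohort}}_{E=e}, \\
\text{OR}^{\text{observed}}
  &= \frac{(f N_{11})(f N_{00})}{(f N_{10})(f N_{01})}
   = \frac{N_{11} N_{00}}{N_{10} N_{01}}
   = \text{OR}^{\text{cohort}}.
\end{align*}
```

Because f cancels from the risk in each exposure group, risk differences and risk ratios are preserved as well; the same cancellation applies to crude prevalences. These equalities hold in expectation, since any particular random sample will vary around them.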
Figure 3 shows a case in which exposure E is the only cause of C. From a selection-bias perspective, restricting on C will amount to simple random sampling within level of exposure; from a missing data perspective, data are missing at random, or completely at random within level of exposure.
As can be ascertained from Table 3, a crude estimate of exposure or disease prevalence will in general be biased under these conditions. However, because data are missing completely at random within exposure category, the risk by exposure status can be calculated without bias. In consequence, all contrasts of risks, including risk differences, risk ratios, and odds ratios, are unbiased in this setting.
Figure 3. E, but not D, affects factor C, so conditioning on or restricting to a level of C amounts to simple random sampling within level of E.
However, in real-data analysis it is almost never the case that the causal diagram is as simple as Figure 3; with more complications, it is less likely that this condition will hold.
For example, if we add to Figure 3 a third variable F that causes both C and D, then C is a collider for E and F, and conditioning on C biases the E-D relationship via F (see the figure in the book by Rothman and colleagues).
Assume our clinic does not provide extensive antenatal care beyond antiretroviral therapy, and so attendance at our clinic is lower among women after they become pregnant.
If attendance is not affected by AIDS diagnosis or any other factors, then a contrast of risk of AIDS comparing pregnant and non-pregnant women attending our clinic will be unbiased.
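A worked example with made-up counts shows what is and is not preserved when only exposure affects attendance. The cohort counts and attendance probabilities below are illustrative assumptions (they are not the values in Table 3), and the observed table holds expected counts, so the arithmetic is exact.

```python
from fractions import Fraction as F

# Hypothetical full cohort: counts keyed by (e, d). Illustrative only.
cohort = {(1, 1): 200, (1, 0): 800, (0, 1): 100, (0, 0): 900}

# Hypothetical probability of attending clinic: depends on E only.
attend_given_e = {1: F(1, 4), 0: F(3, 4)}

# Expected counts among clinic attendees.
observed = {(e, d): n * attend_given_e[e] for (e, d), n in cohort.items()}

def risk(table, e):
    """Risk of disease within exposure level e."""
    return F(table[e, 1]) / (table[e, 1] + table[e, 0])

def exposure_prevalence(table):
    return F(table[1, 1] + table[1, 0]) / sum(table.values())

def disease_prevalence(table):
    return F(table[1, 1] + table[0, 1]) / sum(table.values())

for name, table in (("cohort", cohort), ("attendees", observed)):
    print(name,
          "risk(E=1) =", risk(table, 1),
          "risk(E=0) =", risk(table, 0),
          "P(E) =", exposure_prevalence(table),
          "P(D) =", disease_prevalence(table))
```

In this example the risks are 1/5 and 1/10 in both tables, so every contrast of risks is unchanged, while the crude exposure prevalence moves from 1/2 to 1/4 and the crude disease prevalence from 3/20 to 1/8.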
Figure 4 shows a case in which disease status D is the only cause of C. Conditioning on C leads to simple random sampling within level of the outcome (Table 4).
As with Figure 3, the causal structure in Figure 4 leads to biased estimates of prevalence; but in addition, this structure leads to biased estimates of risk.
Figure 4. D, but not E, affects factor C, so conditioning on or restricting to a level of C amounts to simple random sampling within level of D.
Selection that depends only on outcome status resembles sampling in a case-control study. In such a case-control study, the case-control odds ratio provides an unbiased estimate of the cohort odds ratio; this is true in Table 4 as well.
Just as in such a case-control study, we are unable to directly estimate absolute risks, risk differences, or risk ratios without additional information.
Thus if outcome status is the sole direct cause of selection into a study or analysis, or of missing data, the study is analogous to a case-control study under a particular control-sampling scheme: the cohort odds ratio will be unbiased in complete case analysis (assuming no additional variables of interest, as in previous examples).
However, when the true effect of an exposure on the outcome is null, bias will not be introduced into the risk difference and risk ratio.
Assume that women are more likely to miss clinic visits if they become seriously ill, and so attendance in clinic is affected by AIDS status.
If attendance at clinic is not affected by pregnancy status or any other factors and there is a non-null association between pregnancy and time to AIDS, then the risk difference and risk ratio for AIDS comparing pregnant and non-pregnant women will generally be biased, while an odds ratio for AIDS comparing pregnant and non-pregnant women will be generally unbiased.
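The same kind of exact calculation illustrates selection that depends on outcome alone. The counts and attendance probabilities below are hypothetical (not the values in Table 4), and the helper names `measures` and `restrict_on_d` are introduced only for this sketch. It compares the risk difference, risk ratio, and odds ratio before and after restriction, once with a non-null effect and once with a null effect.

```python
from fractions import Fraction as F

def measures(table):
    """Risk difference, risk ratio and odds ratio from counts keyed by (e, d)."""
    r1 = F(table[1, 1]) / (table[1, 1] + table[1, 0])
    r0 = F(table[0, 1]) / (table[0, 1] + table[0, 0])
    odds_ratio = F(table[1, 1] * table[0, 0]) / (table[1, 0] * table[0, 1])
    return {"RD": r1 - r0, "RR": r1 / r0, "OR": odds_ratio}

def restrict_on_d(cohort, attend_given_d):
    """Expected counts among attendees when attendance depends on D only."""
    return {(e, d): n * attend_given_d[d] for (e, d), n in cohort.items()}

# Hypothetical attendance probabilities: diseased subjects attend less often.
attend_given_d = {1: F(1, 4), 0: F(3, 4)}

# Non-null effect of E on D (risks 1/5 versus 1/10 in the full cohort).
cohort = {(1, 1): 200, (1, 0): 800, (0, 1): 100, (0, 0): 900}
full = measures(cohort)
selected = measures(restrict_on_d(cohort, attend_given_d))

# Null effect of E on D (risk 1/10 in both exposure groups).
null_cohort = {(1, 1): 100, (1, 0): 900, (0, 1): 100, (0, 0): 900}
null_selected = measures(restrict_on_d(null_cohort, attend_given_d))

print("full cohort:", full)
print("attendees:  ", selected)
print("null effect, attendees:", null_selected)
```

In this example the odds ratio is 9/4 both before and after restriction, whereas the risk ratio changes from 2 to 28/13 and the risk difference from 1/10 to 15/364. Under the null cohort, the restricted risk ratio remains 1 and the risk difference remains 0.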
One critical special case is when E and D are non-interacting.
A distinction of sampling bias (albeit not a universally accepted one) is that it undermines the external validity of a test (the ability of its results to be generalized to the rest of the population), while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand.
In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.
Attrition bias is closely related to survivorship bias, where only the subjects that "survived" a process are included in the analysis, or failure bias, where only the subjects that "failed" a process are included.
It includes dropout, nonresponse (lower response rate), withdrawal and protocol deviators. For example, in a test of a dieting program, the researcher may simply reject everyone who drops out of the trial, but most of those who drop out are those for whom it was not working.
Differential loss of subjects in the intervention and comparison groups may change the characteristics of these groups and their outcomes irrespective of the studied intervention.
Data are filtered not only by study design and measurement, but by the necessary precondition that there has to be someone doing a study. In situations where the existence of the observer or the study is correlated with the data, observation selection effects occur, and anthropic reasoning is required.
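An example is the past impact event record of Earth: if large impacts cause mass extinctions and ecological disruptions that preclude the evolution of intelligent observers for long periods, no one will observe evidence of large impacts in the recent past. Hence there is a potential bias in the impact record of Earth.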
In the general case, selection biases cannot be overcome with statistical analysis of existing data alone, though Heckman correction may be used in special cases.
An assessment of the degree of selection bias can be made by examining correlations between exogenous background variables and a treatment indicator.
However, in regression models, it is correlation between unobserved determinants of the outcome and unobserved determinants of selection into the sample that biases estimates, and this correlation between unobservables cannot be directly assessed from the observed determinants of treatment.