
HANDS-ON COURSE: EPIDEMIOLOGY
CONFOUNDING. STRATIFICATION AND MULTIVARIATE ANALYSIS MADE SIMPLE
Kitty Jager, Amsterdam, the Netherlands

Dr. K.J. Jager
ERA-EDTA Registry - Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands
Slide 1
So hello everyone.
Slide 2
In clinical journals you often see phrases such as "after correction for diabetes", "an independent effect of a particular factor" or "the real effect of a particular factor", and plots that are adjusted for certain factors. If you read this, you get the idea that the researchers want to get rid of something. What they want to get rid of is confounding.
Slide 3
What is confounding? Confounding is a mixing of effects. In what way a mixing of effects? Researchers look at a particular relationship between an exposure, which I define as a factor that patients are exposed to, and an outcome, which is then a particular disease. They would like to know whether there is some other factor that is confounding this relationship.
Slide 4
In order to be a confounder a factor has to satisfy three conditions. First of all, it should be a risk factor for the disease. The second is, it should be associated with the exposure. These two things should be associated. And three, this confounder should not be an effect of the exposure.
Slide 5
I will give you an example. Suppose we are studying the relationship between grey hair and death, and we wonder whether age is a confounder in this relationship. First we ask ourselves: is age a risk factor for death? Well, I think we all agree it is, unfortunately. The second thing we look at is the relationship between the confounder and the exposure, grey hair. Is there a relationship between age and grey hair? Yes, there is. People who are older more often have grey hair. Finally, we also want to know: is age an effect of grey hair? No, it's not. So now we can safely conclude that age is a confounder in the relationship between grey hair and death.
Slide 6
How to prevent or control confounding? There are two ways. First, you can try to prevent it during your study design and second, you can try to control for it during your data analysis. The first way to prevent confounding is through randomisation. If you're doing an experiment with random assignment to study groups, then it is chance alone that determines which group people end up in, so that's one way.
The second way is to restrict your group. If you think that sex could be a confounder in your relationship, it’s very simple to investigate only females.
Slide 7
So the aim of stratification is to control for confounding. What you want to do is derive an adjusted estimate of the effect of a variable. Suppose that in a particular relationship you think there is confounding by age. You also know that there's a huge variation in age, from the early 30s to somewhere in the 70s. So then you have a problem.
Slide 8
The first step you then need to take is to create strata, subgroups in which the confounder does not vary very much. So this is 35-44 etc. Within each of these strata you make comparisons to calculate the effect for that specific stratum, like this.
Slide 9
What you then need to do is to aggregate the information over all the strata to calculate the overall adjusted effect size. So you take them all together and there are 2 ways of doing that. There’s pooling or standardisation. What they both do is calculate some weighted average.
Slide 10
Pooling does this with complicated statistical formulas, but don't worry, you don't need to do this yourself. This is just one of them, for the odds ratio, and all statistical packages will do it for you. But I will explain very briefly what they do.
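The formula shown on the slide is not reproduced in this transcript. As an illustration of what pooling does, here is a minimal sketch of one widely used pooled estimator for the odds ratio, the Mantel-Haenszel estimator, on the assumption that this is the kind of formula meant. The function names and the counts are invented for the example; the two age strata are constructed so that the exposure has no effect within either stratum.

```python
def mantel_haenszel_or(strata):
    """Pooled (Mantel-Haenszel) odds ratio over a list of 2x2 tables.

    Each stratum is a tuple (a, b, c, d):
      a = exposed with disease,    b = exposed without disease,
      c = unexposed with disease,  d = unexposed without disease.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n    # each stratum contributes in proportion to its size
        denominator += b * c / n
    return numerator / denominator


def crude_or(strata):
    """Odds ratio after collapsing all strata into a single 2x2 table."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a * d) / (b * c)


# Hypothetical counts: (young stratum, old stratum).
# Older people are more often exposed and more often diseased.
example_strata = [(5, 95, 10, 190), (40, 60, 20, 30)]

pooled = mantel_haenszel_or(example_strata)  # odds ratio adjusted for age
crude = crude_or(example_strata)             # odds ratio ignoring age
print(round(crude, 2), round(pooled, 2))
```

In this invented example the crude odds ratio is well above 1 while the pooled odds ratio is 1, which is the pattern you would expect when the whole crude association is due to the stratification variable.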
Slide 11
The other method of aggregating information is standardisation. As I said, standardisation also takes a weighted average in order to calculate an adjusted effect size. Standardisation is mainly used to compare data from different populations, for example, different countries.
There are two ways of standardisation. First of all there's direct standardisation, in which you use weights from an external standard population. An example is the European Standard population. If we as a European registry want to calculate adjusted incidences or prevalences, we take the European Standard population as the reference and in that way we adjust for, for example, the effect of age.
Slide 12
An example of direct standardisation. In this case it is direct standardisation of a disease rate for diabetes mellitus. Suppose we know that in our study population the rate of a specific disease is 10 per thousand patient years in diabetics and 5 per thousand patient years in non-diabetics. We also know that in the European Standard population the prevalence of diabetes mellitus is 5%. Then it's very simple to standardise. We take 5% of the rate in diabetics, we take the other part, 95%, of the rate in non-diabetics, we add them up and we get 5.25 per thousand patient years. That's a rate that is adjusted for diabetes.
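The arithmetic of this example can be written out as a short sketch. The function name and the dictionary keys are made up for illustration; the rates and the 5% and 95% weights are the ones from the example above.

```python
def directly_standardised_rate(stratum_rates, standard_weights):
    """Weighted average of stratum-specific rates.

    stratum_rates:    rate observed in each stratum of the study population
    standard_weights: share of each stratum in the standard population
                      (must add up to 1)
    """
    if abs(sum(standard_weights.values()) - 1.0) > 1e-9:
        raise ValueError("standard population weights must add up to 1")
    return sum(stratum_rates[k] * standard_weights[k] for k in stratum_rates)


# Rates per 1000 patient years in the study population
rates = {"diabetic": 10.0, "non_diabetic": 5.0}
# Distribution of diabetes in the standard population
weights = {"diabetic": 0.05, "non_diabetic": 0.95}

adjusted_rate = directly_standardised_rate(rates, weights)
print(round(adjusted_rate, 2))  # 5.25 per 1000 patient years
```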
Slide 13
So if we have made strata, we should realise that a stratified analysis controls confounding only between the strata. When there are relatively few strata, only 2 or 3, for a continuous variable such as age, there may be residual confounding within those strata, so you will not get rid of all confounding in that way. Then we could say, ok, that's very simple: the only thing we need to do is make more strata and then we will make a better adjustment for confounding. However, there's a problem with that too, because if you make more strata to avoid this residual confounding, you may end up with too few events within the groups, and that may lead to imprecise results. So you always need to find a balance between the two.
Slide 14
There's something else. In the case that I showed, there was stratification for only one variable, age. Stratification is a very effective means to control for confounding. However, the number of strata multiplies quickly: if you want to stratify by gender and by 5 age categories, you already have 10 strata. If you have 5 variables with 3 categories each, you end up with 243 strata, so then you have a problem. In that case stratified analysis is not such a practical method, and you are much better off doing multivariate analysis.
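The multiplication behind these numbers can be checked with a few lines. This is only a sketch; the function name is made up for illustration.

```python
from math import prod


def number_of_strata(categories_per_variable):
    """Number of strata when stratifying on all variables at once.

    categories_per_variable: number of categories of each stratification
    variable, e.g. [2, 5] for gender and 5 age categories.
    """
    return prod(categories_per_variable)


gender_and_age = number_of_strata([2, 5])          # 2 * 5 = 10
five_three_level_vars = number_of_strata([3] * 5)  # 3 ** 5 = 243
print(gender_and_age, five_three_level_vars)
```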
Slide 15
So now I will try to explain some more about multivariate analysis.
Slide 16
What we mostly do is regression analysis and regression describes the relationship between two variables, by predicting one variable, for example a disease, from a known value of another variable. You should realise that in all these papers that we read people use specific terms and synonyms. For example, the predictor variable is also called independent variable or explanatory variable and the outcome is sometimes called dependent variable or response variable, so these are all synonyms.
Slide 17
There are different types of regression analysis. It does not really matter what kind of variables the predictors are: you can adjust for age or for sex, where age can be a continuous variable and sex is of course a binary variable. Rather, the type of regression analysis is determined by the outcome variable. If the outcome variable is continuous, for example haemoglobin, you should use linear regression. If it is categorical, for example whether a patient is compliant or not, you should use logistic regression. And when you want to do a time-to-event analysis, for example the time to death or the time to myocardial infarction, you should use Cox proportional hazards regression.
Slide 18
This is the simplest mathematical equation for linear regression: y = a + bx. The outcome y, the disease that you are predicting, also called the dependent or response variable, equals a constant a plus a regression coefficient b times the value x of the exposure, also called the independent, predictor or explanatory variable, whatever you call it.
Slide 19
So what linear regression does is determine the straight line that best describes the relationship between the x variable, the predictor variable, and the y variable, the outcome variable. In other words, it estimates the average values of y for the different values of x.
Slide 20
So I just said this is the regression line and this is the mathematical equation. What does this mathematical equation mean in statistical terms? Well, if you look at this regression line, we can say that a, the constant, which is also called the intercept, is the value of y when x is 0.
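To make the constant and the regression coefficient concrete, here is a minimal sketch of how a least-squares line can be computed by hand. The function name and the data points are invented for illustration and are not the data from the slides; in practice a statistical package does this for you.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x.

    Returns (a, b): the intercept and the regression coefficient (slope).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # average change in y per unit increase in x
    a = mean_y - b * mean_x  # value of the fitted line at x = 0
    return a, b


# Hypothetical data points
xs = [0, 1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(xs, ys)
predicted_at_zero = a + b * 0  # equals the intercept a
print(round(a, 2), round(b, 2))
```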
Slide 21
What do these mean? Well, the ANOVA table tells you whether this model is a significant model. The model summary tells you something about the R Square, that is, the explained variation that Friedo was just talking about, which gives you a feel for the goodness of fit of the model.
Hopefully this R Square is high. In this particular relationship it's very low, but this is just for the sake of explaining confounding. So only 5% of the variation in body fat is explained by whether you start on PD.
Slide 22
However, suppose you would ask yourself I have this relationship between PD and body fat, could age be a confounder? Then again we should ask ourselves is age a risk factor for the percentage of body fat? I think it is.
Do you think that there is an association between PD and age? Any suggestions? Would there be an association between PD and age, between starting on PD and age?
Slide 23
Then we think we should do multivariate analysis. Again, this is the very simple equation for simple univariate regression. What if we want to do a multivariate regression? Again the outcome is y and again there's a constant. The only thing that happens is that you put in one factor together with its regression coefficient, a second factor with its own regression coefficient, etc., and you may go on and on as long as you like. Well, not too long, because there are specific rules for that, but in principle you can include many factors in such a model.
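The extended equation, a constant plus each factor times its own regression coefficient, can be sketched as a small function. The function name and the numbers are hypothetical and chosen only to show the arithmetic; they are not the coefficients from the SPSS output.

```python
def predict_outcome(constant, coefficients, values):
    """Evaluate y = constant + b1 * x1 + b2 * x2 + ... for one patient."""
    if len(coefficients) != len(values):
        raise ValueError("one value is needed for every coefficient")
    return constant + sum(b * x for b, x in zip(coefficients, values))


# Hypothetical model: constant 20, coefficient 3.0 for PD (1 = yes, 0 = no)
# and coefficient 0.1 per year of age.
patient_on_pd_aged_50 = predict_outcome(20.0, [3.0, 0.1], [1, 50])
print(patient_on_pd_aged_50)
```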
Slide 24
Ok, we know that, and we've put it into SPSS again. What we see now are just the same tables. Again we can see that the model is significant. We have an R Square that has increased a bit; it's still only 10%, but it has increased. Now we look at the table of coefficients, and the formula is just extended. The percentage of body fat is the constant, plus 3.236 if the patient uses PD, plus a specific regression coefficient for each year that a patient is older.
Slide 25
As I said, adjusting for confounding can be done by stratification and by multivariate analysis. But you can also measure confounding, for example with multivariate analysis, just by looking at the difference between the crude and the adjusted measures of effect, which is what we just did with the regression coefficients. If these are almost equal, then there's no confounding. What you should not do is look at the p value that is given for these regression coefficients. Remember the triangles: confounding does not depend on statistical significance. It depends on the strength of the associations between the confounder on the one hand and the exposure and the disease on the other hand.
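Comparing the crude and the adjusted regression coefficient can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are made up, the eight patients are invented, and the body fat values are constructed so that the true effect of PD is 2 percentage points while the PD patients are on average younger. It uses the closed-form least-squares solution for one and for two predictors.

```python
def centred_cross_sum(us, vs):
    """Sum of (u - mean of us) * (v - mean of vs) over all pairs."""
    mean_u = sum(us) / len(us)
    mean_v = sum(vs) / len(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))


def crude_coefficient(exposure, outcome):
    """Regression coefficient of the exposure with no other variable in the model."""
    return centred_cross_sum(exposure, outcome) / centred_cross_sum(exposure, exposure)


def adjusted_coefficient(exposure, confounder, outcome):
    """Regression coefficient of the exposure with the confounder also in the model."""
    s11 = centred_cross_sum(exposure, exposure)
    s22 = centred_cross_sum(confounder, confounder)
    s12 = centred_cross_sum(exposure, confounder)
    s1y = centred_cross_sum(exposure, outcome)
    s2y = centred_cross_sum(confounder, outcome)
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


# Invented data: 4 patients not on PD, then 4 patients on PD (who are younger)
pd_use = [0, 0, 0, 0, 1, 1, 1, 1]
age = [40, 50, 60, 70, 35, 45, 55, 65]
body_fat = [28.0, 30.0, 32.0, 34.0, 29.0, 31.0, 33.0, 35.0]

crude_b = crude_coefficient(pd_use, body_fat)
adjusted_b = adjusted_coefficient(pd_use, age, body_fat)
print(round(crude_b, 3), round(adjusted_b, 3))
```

In this invented example the crude coefficient for PD is only half the adjusted one, so age is confounding the relationship. Note that no p value is involved in that judgement.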
Slide 26
Thank you.