HANDS-ON COURSE: EPIDEMIOLOGY

CONFOUNDING. STRATIFICATION AND MULTIVARIATE ANALYSIS MADE SIMPLE

Kitty Jager, Amsterdam, the Netherlands

 

Dr. K.J. Jager
ERA-EDTA Registry - Department of Medical Informatics,
Academic Medical Center, University of Amsterdam,
Amsterdam, The Netherlands

Slide 1

So hello everyone.

Slide 2

In clinical journals you often see phrases such as “after correction for diabetes”, “an independent effect of a particular factor” or “the real effect of a particular factor”, and plots that are adjusted for certain factors. Reading this, you get the idea that the researchers want to get rid of something. What they want to get rid of is confounding.

Slide 3

What is confounding? Confounding is a mixing of effects. In what way a mixing of effects? Researchers look at a particular relationship between an exposure, which I define as a factor that patients are exposed to, and an outcome, which is then a particular disease. They would like to know whether some other factor is confounding this relationship.

In most analyses confounding is something that you do not want because what you want to do is to find true associations not associations that are mixed up by something else.

Slide 4

In order to be a confounder a factor has to satisfy three conditions. First, it should be a risk factor for the disease. Second, it should be associated with the exposure. And third, the confounder should not be an effect of the exposure.

Slide 5

I will give you an example. Suppose we are studying the relationship between grey hair and death and we would wonder whether age is a confounder in this relationship. Then we would ask ourselves first is age a risk factor for death? Well, I think we all agree it is, unfortunately. The second thing we look at is the relationship between the confounder and the exposure, grey hair. Is there a relationship between age and grey hair? Yes there is. People who are older more often have grey hair. However, we also want to know is age an effect of grey hair? No it’s not. So now we can safely decide that age is confounding in the relationship between grey hair and death.

Slide 6

How to prevent or control confounding? There are two ways. First, you can try to prevent it during your study design, and second, you can try to control for it during your data analysis. The first way to prevent confounding is through randomisation. If you’re doing an experiment with random assignment to study groups, then it is chance alone that determines which group each person ends up in, so that’s one way.
The second way is restriction. If you think that sex could be a confounder in your relationship, it’s very simple to investigate only females.

A third method you can use is matching. You take a case and you make sure that the controls do not differ from your case with respect to the confounders. To control for confounding in the analysis you can do two things: stratification and multivariate analysis. I will first try to explain what stratification is.

Slide 7

So the aim of stratification is to control for confounding. What you want to do is derive an adjusted estimate of the effect of a variable. Suppose that in a particular relationship you think that there is confounding by age. You also know that there’s a huge variation in age, from the early 30s to somewhere in the 70s. So then you have a problem.

Slide 8

The first step you need to take is to create strata, or subgroups, in which the confounder does not vary very much. So this is 35-44, etc. Then in each of these strata you make comparisons to calculate the effect for that specific stratum.
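The step of cutting a continuous confounder into strata can be sketched in a few lines of Python. This is only an illustration: the function name `age_band`, the ten-year width and the example ages are assumptions, chosen so that one band matches the 35-44 group mentioned above.

```python
from collections import defaultdict


def age_band(age, start=35, width=10):
    """Return a label such as '35-44' for the band that contains `age`.

    `start` and `width` are illustrative assumptions, not values from the talk.
    """
    lower = start + ((age - start) // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical ages, made up for this sketch.
ages = [36, 41, 44, 45, 52, 58, 63, 67, 71]

# Group the patients by stratum; the effect would then be
# calculated separately within each group.
strata = defaultdict(list)
for age in ages:
    strata[age_band(age)].append(age)

for band in sorted(strata):
    print(band, strata[band])
```

Within each band the ages differ by at most nine years, which is what "the confounder does not vary very much" means in practice.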

Slide 9

What you then need to do is to aggregate the information over all the strata to calculate the overall adjusted effect size. So you take them all together and there are 2 ways of doing that. There’s pooling or standardisation. What they both do is calculate some weighted average.

Slide 10

Pooling does this with complicated statistical formulas. This is just one of them, for the odds ratio. But don’t worry: all statistical packages will do this for you, so fortunately you don’t need to do it yourself. I will explain very briefly what they do.

As I said, suppose you have a crude effect that is still confounded by age, for example an effect of 2. You then calculate the effects in each of your strata and find that these are different: 2.4, 2.3 and 2.4 again. What the formula then does for you is calculate a weighted average, and after pooling it turns out that the adjusted effect of your factor is 2.3.
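The formula on the slide is not reproduced in this transcript. As a sketch of what such a pooling formula can look like, the code below uses the Mantel-Haenszel estimator, which is a common choice for a pooled odds ratio; that choice is an assumption, and the three 2x2 tables are made-up counts, not data from the talk.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of one 2x2 table.

    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Pooled odds ratio over strata (Mantel-Haenszel estimator).

    Equivalent to a weighted average of the stratum odds ratios,
    with weight b * c / n for each stratum.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Hypothetical counts for three age strata (a, b, c, d).
tables = [
    (20, 80, 10, 90),
    (30, 70, 15, 85),
    (40, 60, 20, 80),
]

stratum_ors = [odds_ratio(*t) for t in tables]
pooled_or = mantel_haenszel_or(tables)

print([round(x, 2) for x in stratum_ors])
print(round(pooled_or, 2))
```

Because the pooled value is a weighted average, it always lies between the smallest and the largest stratum-specific odds ratio.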

Slide 11

The other method of aggregating information is standardisation. Standardisation also takes a weighted average in order to calculate an adjusted effect size. It is mainly used to compare data from different populations, for example different countries.
There are two ways of standardisation. First of all there’s direct standardisation, in which you use weights from an external standard population. An example is the European Standard population. If we as a European registry want to calculate adjusted incidences or prevalences, we take the European Standard population as the reference, and in that way we adjust for, for instance, the effect of age.

There’s also something called indirect standardisation and that’s a little bit different. Indirect standardisation uses the exposed group as the standard population.

Slide 12

Here is an example of direct standardisation, in this case of a disease rate for diabetes mellitus. Suppose we know that in our study population the rate of a specific disease is 10 per thousand patient years in diabetics and 5 per thousand patient years in non-diabetics. We also know that in the European Standard population the proportion with diabetes mellitus is 5%. Then it’s very simple to standardise. We take 5% of the rate in diabetics and the other part, 95%, of the rate in non-diabetics, we add them up, and we get 0.05 × 10 + 0.95 × 5 = 5.25 per thousand patient years. That’s a rate that is adjusted for diabetes.
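The arithmetic of this example can be written as a small Python function. The rates and the 5% weight are the ones quoted above; the function and variable names are only illustrative.

```python
def directly_standardised_rate(stratum_rates, standard_weights):
    """Weighted average of stratum rates, using standard-population weights."""
    if abs(sum(standard_weights.values()) - 1.0) > 1e-9:
        raise ValueError("standard weights must add up to 1")
    return sum(stratum_rates[k] * standard_weights[k] for k in stratum_rates)


# Rates per 1000 patient years in the study population (from the example).
rates = {"diabetic": 10.0, "non_diabetic": 5.0}

# Share of each group in the standard population (from the example).
weights = {"diabetic": 0.05, "non_diabetic": 0.95}

adjusted_rate = directly_standardised_rate(rates, weights)
print(round(adjusted_rate, 2))  # 5.25 per 1000 patient years
```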

I will not go into an example of indirect standardisation. The standardised mortality ratio is an example of it, but that would go too far for this session.

Slide 13

So if we have made strata, we should realise that a stratified analysis controls confounding only between the strata. When there are relatively few strata, only 2 or 3, for a continuous variable such as age, there may be residual confounding within those strata, so you will not get rid of all confounding in that way. Then we could say, ok, that’s very simple: the only thing we need to do is make more strata and then we will make a better adjustment for confounding. However, there’s a problem with that too, because if you make more strata to avoid this residual confounding, you may end up with too few events within the groups, and that may lead to imprecise results. So you always need to find a balance between the two.

Slide 14

There’s something else. In the case that I showed, there was stratification for only one variable, age. Stratification is a very effective means to control for confounding. However, the number of strata multiplies with every variable you add. For example, if you want to stratify by gender and by 5 age categories, you already have 10 strata. If you have 5 variables with 3 categories each you end up with 243 strata, so then you have a problem. In that case stratified analysis is not such a practical method, and you would do much better with multivariate analysis.
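The two counts mentioned here follow from multiplying the number of categories of each variable. A minimal Python check, with an illustrative function name:

```python
import math


def number_of_strata(categories_per_variable):
    """Number of strata when stratifying on all variables at once."""
    return math.prod(categories_per_variable)


# Gender (2 categories) and 5 age categories.
gender_and_age = number_of_strata([2, 5])

# Five variables with 3 categories each.
five_variables = number_of_strata([3, 3, 3, 3, 3])

print(gender_and_age)   # 10
print(five_variables)   # 243
```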

Slide 15

So now I will try to explain some more about multivariate analysis.

Slide 16

What we mostly do is regression analysis and regression describes the relationship between two variables, by predicting one variable, for example a disease, from a known value of another variable. You should realise that in all these papers that we read people use specific terms and synonyms. For example, the predictor variable is also called independent variable or explanatory variable and the outcome is sometimes called dependent variable or response variable, so these are all synonyms.

Slide 17

There are different types of regression analysis. It does not really matter what the predictor variables are. You can adjust for age or for sex; age can be a continuous variable and sex is of course a binary variable. So the type of predictor variable is not very important; the type of regression analysis is determined by the outcome variable. If the outcome variable is continuous, for example haemoglobin, you should use linear regression. If it is categorical, for example whether a patient is compliant or not, you should use logistic regression. And when you want to do a time to event analysis, for example the time to death or the time to myocardial infarction, you should use Cox proportional hazards regression.
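The rule of thumb above can be summarised as a simple lookup. The sketch below is only a mnemonic in Python; the function name and the three labels for the kind of outcome are assumptions made for this illustration.

```python
def regression_type(outcome_kind):
    """Return the regression method suggested in the talk for an outcome type."""
    choices = {
        "continuous": "linear regression",            # e.g. haemoglobin
        "categorical": "logistic regression",         # e.g. compliant or not
        "time_to_event": "Cox proportional hazards regression",  # e.g. time to death
    }
    if outcome_kind not in choices:
        raise ValueError(f"unknown kind of outcome: {outcome_kind!r}")
    return choices[outcome_kind]


print(regression_type("continuous"))
print(regression_type("categorical"))
print(regression_type("time_to_event"))
```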

Slide 18

This is the simplest mathematical equation for linear regression: the outcome equals a constant plus a regression coefficient times the value of your predictor variable. So here the disease that you are predicting, also called the dependent, outcome or response variable, equals a constant plus a regression coefficient times the exposure, also called the independent, predictor or explanatory variable, whatever you call it.

Slide 19

So what linear regression does is determine the regression line that best describes the straight-line relationship between the predictor variable, x, and the outcome variable, y. It estimates the average value of y for the different values of x.

Slide 20

So, as I just said, this is the regression line and this is the mathematical equation. What does this equation mean in statistical terms? Well, if you look at this regression line, a, the constant, which is also called the intercept, is the value of y when x is 0.

b is the slope, the regression coefficient, and that is the mean increase in y for an increase of 1 unit in x. Suppose then that for some particular reason we would like to study the relationship between starting on peritoneal dialysis and the percentage of body fat of a patient. If we do that in SPSS we get this output. I just took a study population for this session and calculated this relationship. What you get out of it is a model summary, an ANOVA table and a table of coefficients.

Slide 21

What do these mean? Well, the ANOVA table tells you whether this model is a significant model. The model summary tells you something about the R Square, that is the explained variation that Friedo was just talking about, which gives you a feel for the goodness of fit of the model.
Hopefully this R Square is high. In this particular relationship it’s very low, but this is just for the sake of explaining confounding. So only 5% of the variation in body fat is explained by the fact that you start with PD.

This is a very important table. This is the table that relates to your mathematical equation. It tells you that the percentage of body fat equals the constant, 23, plus 3.979 if you use PD. So for patients who start on PD it is 23 plus 4, so the mean body fat is 27%. If they start on hemodialysis, that is not on PD, it’s only 23%. Please remember this 3.9 because we will use it later.
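To see where the constant, the regression coefficient and the R Square come from, here is a minimal least-squares fit in plain Python. The eight data points are made up for this sketch and chosen so that the group means echo the 23% and 27% above; they are not the study data, so the R Square does not match the 5% in the SPSS output.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Share of the variation in y that is explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total


# Hypothetical data: x = 1 if the patient starts on PD, 0 otherwise;
# y = percentage of body fat.
on_pd = [0, 0, 0, 0, 1, 1, 1, 1]
body_fat = [20, 22, 24, 26, 24, 26, 28, 30]

a, b = fit_line(on_pd, body_fat)
r2 = r_squared(on_pd, body_fat, a, b)

print(a)             # mean body fat in the group with x = 0
print(b)             # difference in mean body fat between the groups
print(round(r2, 2))  # explained variation
```

With a binary predictor the constant is the mean of the group coded 0 and the regression coefficient is the difference between the two group means, which is how the table of coefficients is read above.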

Slide 22

However, suppose you would ask yourself I have this relationship between PD and body fat, could age be a confounder? Then again we should ask ourselves is age a risk factor for the percentage of body fat? I think it is.
Do you think that there is an association between PD and age? Any suggestions? Would there be an association between PD and age, between starting on PD and age?

Well, in most countries the patients that start on PD are much younger than the patients that start on hemodialysis. So, depending on your country, there is indeed a relationship between age and peritoneal dialysis. But we should also check the third criterion: is age an effect of PD? No, it’s not. So we may decide that for a particular population age may be confounding the relationship between peritoneal dialysis on the one hand and percentage of body fat on the other hand.

Slide 23

Then we think we should do multivariate analysis. Again, this is the very simple equation for simple univariate regression. What if we want to do a multivariate regression? Again the outcome is y, again there’s a constant, and the only thing that happens is that you put in one factor together with its regression coefficient, then a second factor with its own regression coefficient, etc., and you may go on as long as you like. Well, not too long, there are specific rules for that, but in principle you can include many factors in such a model.

So this is what makes a regression multivariate: the number of factors and regression coefficients. What multivariate regression does is provide estimates of effects that are mutually unconfounded, so they are adjusted for each other. These things that we have here we call partial regression coefficients; I will show that to you in a minute. For example, this b1, as we saw in the figure of the regression line, represents the amount by which y increases on average if we increase this factor by 1 unit but keep all other Xs constant. In this way you can control or adjust for them. What we then say is that b1 represents the effect of this factor x1 on y that is independent of all other Xs.
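The equations on the slides are not reproduced in this transcript. Written out with the symbols used in the talk (a for the constant, b for a regression coefficient, x for a factor, y for the outcome), they would plausibly look as follows; the index k for the number of factors is an assumption added for this sketch.

```latex
% Simple (univariate) linear regression
\[ y = a + b x \]

% Multivariate regression with k factors, each with its own
% partial regression coefficient
\[ y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k \]
```

Increasing x_1 by 1 unit while all other x's stay the same changes y by exactly b_1, which is the sense in which b_1 is adjusted for the other factors.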

Slide 24

Ok, we know that, and we’ve put it in SPSS again. What we see now are just the same tables. Again we can see that the model is significant. The R Square has increased a bit; it’s still only 10%, but it has increased. Now we look at the table of coefficients. The formula is just extended: the percentage of body fat is the constant, plus 3.236 if the patient uses PD, plus a specific regression coefficient for each year that a patient is older.
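The extended formula can be tried out in Python. Only the PD coefficient of 3.236 comes from the talk; the constant and the age coefficient are not given in the transcript, so the values used below are hypothetical placeholders.

```python
def predicted_body_fat(on_pd, age, constant, b_pd, b_age):
    """Predicted percentage of body fat from the two-factor model."""
    return constant + b_pd * on_pd + b_age * age


B_PD = 3.236      # from the table of coefficients in the talk
CONSTANT = 17.0   # hypothetical, not given in the transcript
B_AGE = 0.1       # hypothetical, not given in the transcript

# Two patients of the same age, one on PD and one not.
pd_patient = predicted_body_fat(1, 60, CONSTANT, B_PD, B_AGE)
non_pd_patient = predicted_body_fat(0, 60, CONSTANT, B_PD, B_AGE)

difference = pd_patient - non_pd_patient
print(round(difference, 3))  # 3.236, whatever age is chosen
```

Because both patients have the same age, the age term cancels out and the difference equals the PD coefficient. That is what "adjusted for age" means here.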

Then look again at this regression coefficient. You may remember that the other one was 3.9 and now it says 3.2. Do you think that age is indeed a confounder in this relationship? Any views why? Well, you can see that the regression coefficient changed, by approximately 20%. There are not really very strict rules, but what most books tell you is that if a regression coefficient changes by more than 10%, you may decide that the new variable is indeed confounding your relationship and keep it in your model. So we keep it in.
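The size of the change can be checked with the two coefficients from the SPSS tables, 3.979 before and 3.236 after adding age. The function name and the way the 10% rule of thumb is coded are illustrative.

```python
def percent_change(crude, adjusted):
    """Relative change of the adjusted coefficient against the crude one, in %."""
    return abs(crude - adjusted) / abs(crude) * 100


crude_b = 3.979      # PD coefficient without age in the model
adjusted_b = 3.236   # PD coefficient with age in the model

change = percent_change(crude_b, adjusted_b)
keep_age_in_model = change > 10  # the 10% rule of thumb

print(round(change, 1))     # roughly 19, i.e. "approximately 20%"
print(keep_age_in_model)    # True
```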

Slide 25

As I said, adjusting for confounding can be done by stratification and multivariate analysis. But you can also measure confounding, for example by doing multivariate analysis, just by looking at the difference between the crude and the adjusted measures of effect, as we just did with the regression coefficients. If these are almost equal, then there’s no confounding. What you should not do is look at the p value that is given for these regression coefficients. That is because, remember the triangles, confounding does not depend on statistical significance. It depends on the strength of the associations between the confounder on the one hand and the exposure and the disease on the other hand.

Slide 26

Thank you.