
HANDS-ON COURSE: EPIDEMIOLOGY
WHAT’S WRONG WITH CORRELATION ANALYSIS? NOTHING! UNLESS…
Friedo Dekker, Leiden, the Netherlands
Dr. F. Dekker
Department of Clinical Epidemiology, LUMC, Leiden, The Netherlands
Slide 1
Thank you, Mr Chairman, ladies and gentlemen. I have given you some time to think about this title and to contemplate for yourselves the answer to the question. I have already given away part of today’s message, which is “nothing”, but I do not want to make you feel too comfortable, so there is always an “unless” with this sort of question.
Slide 2
Let’s start by thinking about correlation. I’ll give you an example, a very well-known one: the correlation between GFR estimated with the MDRD formula and with the Cockcroft-Gault formula. What would you say the correlation is, more or less? How much is it? 0.4? 0.9? Someone else? 0.7.
Slide 3
Well, why are there so many different answers to this? Because I guess all these answers are right, or can be right. Why is that? Because it depends: the correlation coefficient depends on a number of things. I want to explore with you the factors which influence the correlation coefficient.
Slide 4
So within this framework, this example, do you think sample size is important for the correlation coefficient? Is it the case that if you have a bigger sample, your correlation will increase? What do you think?
Well, this is how you can see it. Here are only 10 patients with a certain correlation, and here are some hundred patients. It is the same shape with more patients, but the correlation is the same. I’m sorry, but the correlation itself is not influenced by the sample size.
So what about statistical significance? Does the significance influence the size of the correlation coefficient? No, that’s the correct answer, because if I have a correlation coefficient and I know the number of patients, then I know by definition the p value, the significance. So the significance is the result of a correlation coefficient in a certain context with a certain number of patients. The correlation coefficient is not influenced by the significance itself.
Now what about measurement error? Random error, variability, does that influence the correlation coefficient? If you design a study and you have a measurement which is imprecise, which has a lot of random variation in itself, does that influence the correlation coefficient? It does, right? Because if I had a measure which is only random error, then it cannot correlate with anything, all right? So it does depend on measurement error.
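The point that the p value follows from the correlation coefficient and the number of patients can be sketched in a few lines. This is an illustration only: it uses Fisher’s z-transformation with a normal approximation (an assumption made here for simplicity; statistical packages normally use an exact t-test), and the function name is hypothetical.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided p value for H0: no correlation.

    Uses Fisher's z-transformation: atanh(r) * sqrt(n - 3) is roughly
    standard normal under H0. This is an approximation, not the exact
    t-test that statistical software usually reports.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail area.
    return math.erfc(abs(z) / math.sqrt(2))


# The same correlation of 0.5 with different numbers of patients.
for n in (10, 100, 1000):
    print(n, approx_p_value(0.5, n))
```

The coefficient is held fixed at 0.5 throughout; only the p value changes as the number of patients grows.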
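The effect of random measurement error can be checked with a small simulation. Everything here is synthetic: the “true” GFR values, the error sizes and the seed are assumptions chosen for illustration, not data from the talk.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
true_gfr = [rng.uniform(10, 90) for _ in range(2000)]  # synthetic values


def measure(values, error_sd):
    """Return the values with independent random error added."""
    return [v + rng.gauss(0, error_sd) for v in values]


# Two fairly precise measurements of the same underlying GFR.
r_precise = pearson(measure(true_gfr, 5), measure(true_gfr, 5))
# One precise and one very imprecise measurement.
r_noisy = pearson(measure(true_gfr, 5), measure(true_gfr, 30))
# A "measurement" that is nothing but random error.
r_pure_noise = pearson(true_gfr, [rng.gauss(0, 1) for _ in true_gfr])

print(r_precise, r_noisy, r_pure_noise)
```

The more random error a measurement carries, the lower the correlation, and a measure that is only random error correlates with nothing.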
Slide 5
Now what about the mean GFR in a population? If I have 2 populations, one with a mean GFR of 20 and one with a mean GFR of 60, will the correlation be different there? Well, it could be, but not in this example, because this is just the same shape moved a little bit this way. So I have a higher mean here compared to there, and likewise on the y axis, but the correlation is exactly the same. So if I have more patients, the correlation is no different, and if I have a different population with a higher mean value, it is also the same correlation. It will perhaps be different if the range of the GFR is different: if I have one population with a range between 10 and 30 and another population with a range between 10 and 60. There the correlation is different, and I can show it here: here I have this correlation in a small range, and now I have a population with an extended range, and then the correlation is much higher.
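The influence of the range can be illustrated with synthetic data. The ranges 10 to 30 and 10 to 60 follow the example in the talk; the scatter (a standard deviation of 8), the sample size and the seed are assumptions made for this sketch.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Population with the extended range: GFR between 10 and 60.
gfr = [rng.uniform(10, 60) for _ in range(3000)]
# An estimate of that GFR with the same scatter everywhere (assumed SD of 8).
estimate = [g + rng.gauss(0, 8) for g in gfr]

# The same patients, but only those with a GFR between 10 and 30.
narrow = [(g, e) for g, e in zip(gfr, estimate) if g <= 30]

r_wide = pearson(gfr, estimate)
r_narrow = pearson([g for g, _ in narrow], [e for _, e in narrow])

print(r_wide, r_narrow)
```

The tightness of the cloud is identical in both selections; only the length differs, and that alone lowers the coefficient in the narrow range.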
Slide 6
So the range I study on the x axis and on the y axis does influence the size of the correlation coefficient. So that is a bit difficult. Now this range of the GFR, what mainly influences it? Who or what determines the range of the GFR in your population? You do, as a researcher. As a researcher you choose your population, so you choose whether you study this range or that range. So the correlation coefficient is in your own hands; you can determine it. If you want a high correlation, just make sure you have a wide range.
Slide 7
Now what about systematic difference, or bias? Does that in itself influence the correlation coefficient? That’s also a bit difficult. I have two situations here. Here there is a certain correlation, and this is just the line of identity; over here it is the same line of identity. Here the cloud has moved upwards, so there is a systematic bias: this variable has a higher mean GFR than this measurement. So that is systematic difference, or bias.
Slide 8
What about the correlation? It is exactly the same. So the correlation is not influenced by bias. Now, perhaps the most difficult one: the slope of the relationship, the slope of the line. Will that influence the correlation? No: this correlation and this one are exactly the same.
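That neither a systematic shift nor a change of slope alters the coefficient can be verified directly. The data below are synthetic, and the shift of 20 and the factor of 0.5 are arbitrary illustrative choices.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(3)  # fixed seed so the sketch is reproducible
x = [rng.uniform(20, 100) for _ in range(500)]
y = [v + rng.gauss(0, 10) for v in x]

y_shifted = [v + 20 for v in y]   # systematic bias: the cloud moves upwards
y_tilted = [0.5 * v for v in y]   # a flatter slope

r_original = pearson(x, y)
r_shifted = pearson(x, y_shifted)
r_tilted = pearson(x, y_tilted)

print(r_original, r_shifted, r_tilted)
```

All three values agree up to rounding. Multiplying by a negative factor would flip the sign of the coefficient but leave its size unchanged.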
Slide 9
So now we have some feeling for this correlation coefficient and the factors that influence it. This is the background for thinking about what a correlation coefficient is. It is something which ranges between -1 and +1. It can be 0. It’s a measure of linear association. It’s the extent to which the data are on a straight line. On a straight line: I don’t know which line at the moment, and that is perhaps a little bit the problem. It can be a line shifted upwards or a line with a different slope, but as long as the data are on a straight line, the correlation will, most of the time, be the same. Another thing: you can square the correlation coefficient, and then you get something which is called explained variance. If all the data are perfectly on a straight line, then all the variation on the one axis is reflected in variation on the other axis. So 100% of the variation on the y-axis is explained by the variation on the x-axis. That is explained variance, and you see it reported in many studies.
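The link between the squared correlation coefficient and explained variance can be made concrete. The sketch fits an ordinary least-squares line to synthetic data (intercept, slope and noise are arbitrary assumptions) and compares the share of the y variation accounted for by the line with the square of r.

```python
import math
import random

rng = random.Random(4)  # fixed seed so the sketch is reproducible
x = [rng.uniform(20, 100) for _ in range(400)]
y = [5 + 0.8 * v + rng.gauss(0, 12) for v in x]

n = len(x)
mx = sum(x) / n
my = sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

# Correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Ordinary least-squares line through the same points.
slope = sxy / sxx
intercept = my - slope * mx

# Variation in y left over around the line, relative to total variation in y.
ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
explained_variance = 1 - ss_residual / syy

print(r ** 2, explained_variance)
```

The two printed numbers coincide: the square of r is the fraction of the variation in y that the fitted straight line accounts for.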
Slide 10
Now this correlation coefficient, as we feel it, has something to do with the shape of this cloud of points on these axes. It has something to do with the length, so to say, and the width or the tightness. It’s a kind of ratio between how long it is and how thick it is. It’s not a real ratio, but it has something to do with it. The longer it is, the higher the correlation. We saw that already with the range, which influences it.
Slide 11
Now again for you: what is the correlation of this cloud? This one is 1, very well, and this one is of course 0, no correlation, so this one is somewhere in between. Well, I’m not so sure, it’s just an estimate. On the Internet there are quizzes of this kind where you get a cloud like this and can test your knowledge of what the correlation is. Perhaps it’s 0.5, perhaps 0.7, somewhat higher. What about this one? It’s negative, minus the same value. OK.
Slide 12
What about this correlation? It is a straight line, right, so it could be 1. But it’s not 1; there is no correlation at all, because if this measure varies, the other measure does not vary along with it. So the correlation here is 0, and the same for this one: again the correlation is just 0. So it depends a little bit.
Slide 13
Now what is a correlation coefficient? How do you calculate it? You would never calculate it by hand; the computer does it for you. But how does the computer do it, more or less? Because if you understand it a little bit, then you can understand why it has these properties. The first thing you do is transform. So again, these are just my data values, and you transform them: you shift them a little bit, or you shift the axes, so that the mean of the x variable is 0 and the mean of the y variable is 0. So you just move the cloud a little bit so that the means are 0. That’s the first step. Then you do something else: you compress it. You compress it in such a way that the variance, or the spread, on the x-axis is the same as on the y-axis. So if the slope was a little bit flat, it is now 45 degrees. These are the 2 basic steps before the computer does anything else for you. It just moves the cloud towards a mean of 0 and it tilts it a little bit to get equal spread.
Slide 14
Then you can see I have 4 parts. Here both variables are plus, and here both are minus. Then what you do is calculate cross-products. For each point you know the x and the y, and you just multiply them. If you have many points over here, then you have many positive values. If you have many over here, again you have many positive values. So if you add up all those values, you end up with a high sum of products; you divide by the number of people, which is not so interesting, and then you get a high, positive correlation coefficient. Well, if you had many patients over here, then one variable is plus and the other minus, so you get a negative result, and the same here. So if the shape is like this, you get a negative correlation. It’s just because of these products, and the trick is that you transformed the data, you shifted them, so that it works this way. The only thing is that it works only when both variables are transformed, or standardised so to say, towards a mean of 0 and an equal standard deviation or variance, and then you can just do it like this.
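The two steps described above, standardising and then averaging the cross-products, can be written out as a short sketch. The function names are hypothetical, and the standard deviation is computed with a divisor of n so that dividing the sum of products by the number of people gives the coefficient exactly.

```python
import math


def standardise(values):
    """Shift to a mean of 0 and rescale to a standard deviation of 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def correlation_by_cross_products(xs, ys):
    """Average of the products of the standardised x and y values."""
    zx = standardise(xs)
    zy = standardise(ys)
    # Points with both values above or both below the mean contribute
    # positive products; points with one above and one below contribute
    # negative products.
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


print(correlation_by_cross_products([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
print(correlation_by_cross_products([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))
print(correlation_by_cross_products([1, 2, 3, 4], [1, 3, 2, 4]))
```

Because the standardising step removes the mean and the spread of each variable, shifting or rescaling the original data cannot change the result.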
Slide 15
Ok now we understand why it’s the shape which determines the correlation. We understand why if you have a bigger spread, then you have a higher correlation coefficient. If you shift it up or shift it down, then the correlation will stay the same. If you do a little bit like this or a little bit like that, the correlation will be basically the same because it’s transformed.
Now what is wrong with this correlation analysis? What can be wrong? The first thing is that, as I’ve said already, the size of the coefficient is determined by the researcher. That’s a problem, but it’s also a very nice feature, because most of the time we want high correlations. We want significant results to get published, so we want high correlations, and we can arrange that just by making sure we have a large range, from children to adults and so on.
Slide 16
When I measure agreement, like here with the MDRD and the Cockcroft-Gault, the correlation is not influenced if there is a mean difference, as I’ll show, and also not if the slope is different. So that does not influence this correlation. So those are two aspects: what you can do as a researcher and what the data do for you, and they influence the correlation in different ways.
Slide 17
Now this is the summary of this part. All these three clouds have exactly the same correlation.
Slide 18
Whether it’s shifted or tilted, it’s all the same correlation. That’s one of the big problems with it. The correlation does tell us to what extent the data are on a straight line, but not on which straight line, and most of the time that’s what we want to know. The other thing is that if you have a high correlation between 2 GFR estimations, you don’t know in clinical terms what error you make. You know the correlation is 0.90 or 0.50 or whatever, but you don’t know what the difference is, what problem I have when I substitute one measure for the other. It’s not a clinical term; this correlation is a kind of black box, and as a clinician I want to know how far apart the 2 measurements can be.
Slide 19
So this is a study published a couple of years ago in Kidney International. There are many studies like this. This is the Cockcroft-Gault in children and in adults: 2 subpopulations in the same paper, Cockcroft-Gault against inulin clearance. Now which correlation is highest here, in children or in adults? It’s the same. Perhaps here it’s somewhat tighter than over here. If you look it up in the discussion, the authors themselves say that in children the correlations were better, and they provide us with the numbers as well: it’s 0.81 in the children and 0.67 in the adults, and that’s perhaps because this is somewhat tighter than here.
Slide 20
So their correlation is somewhat higher, not very much different, just a little bit. So the correlation is higher. How could that be? Could it be real biology, that it’s a bit higher in children than in adults? Could it be something else? So why is it different? If you look at the range of the inulin clearance: in children it is between 50 and 140 for the majority, if you omit these 2, and in the adults it is between 25 and only 90. So the range is different. So remember: if the range is smaller, the correlation can be smaller.
Slide 21
So I have to compare them on the same scale, so to say. If I do it like this, then it’s between 20 and 180, between 20 here and 180 or so. So these are the correlations. Now you understand why this one is higher than that one: the tightness is the same, but the length is larger here. And you get the impression that if you extended this range, if you included more adults with higher GFRs, then perhaps the correlation would be as big as in the children. So it’s a bit of a strange conclusion here, if you don’t realise that the populations are so different.
Slide 22
Now what they also did very nicely in the same paper is compare the Cockcroft-Gault and the Schwartz in the children, because these are two formulas, and which one should we use? For the Cockcroft-Gault we already saw a correlation of 0.81; for the Schwartz it is even higher, 0.87. So the correlation is higher here, and the range is exactly the same because these are just the same patients. So that’s not the trick. The correlation is higher, the range is the same, the number of people is the same. So would you say the agreement is better for the Schwartz than for the Cockcroft-Gault? It looks like it, but in clinical terms it’s a little bit difficult. Should we use this one instead of that one? What is the expected error we could make? If I draw the line of identity in red, then you see that for the Cockcroft-Gault there is no bias in the mean: the mean Cockcroft-Gault is exactly the same as the mean inulin clearance, so as a mean value it’s OK. I have some over- or underestimation depending on where I am. But for the Schwartz there is a mean difference: on average the Schwartz tends to overestimate the inulin clearance a little bit. Again, the slope is a bit different, but it is still difficult to see what error we can expect to make if we use the Schwartz instead of the inulin clearance.
Slide 23
So therefore two British statisticians published a landmark paper in the Lancet 20 years ago, in 1986, about agreement and how to analyse it. They said that agreement is difficult to see from a scatterplot. It’s somewhat better if you draw the regression line or give the formula, and it’s even better to plot the difference between the 2 measurements against their mean value, as I’ll show you in a minute. If you do that, then you can present the limits of agreement in a clinical way: the range within which 95% of the differences are expected to lie. That is not the same as a confidence interval, because a confidence interval gets smaller if you get more patients. We know that the more patients you have, the smaller your confidence interval is, whereas here I just want to know the error I make, so it is based on the standard deviation. It’s a little bit different.
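The difference between limits of agreement and a confidence interval can be sketched as follows. The limits are taken as the mean difference plus or minus 1.96 standard deviations of the differences, the usual normal-theory choice; the data are synthetic, and the function names, noise level and sample sizes are assumptions made for this illustration.

```python
import math
import random


def mean_and_sd(values):
    """Sample mean and sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd


def agreement_summary(method_a, method_b):
    """Mean difference, 95% limits of agreement, and 95% CI of the mean."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    mean_diff, sd_diff = mean_and_sd(diffs)
    half_loa = 1.96 * sd_diff                          # based on the SD
    half_ci = 1.96 * sd_diff / math.sqrt(len(diffs))   # based on the SE
    return {
        "mean_difference": mean_diff,
        "limits_of_agreement": (mean_diff - half_loa, mean_diff + half_loa),
        "ci_of_mean_difference": (mean_diff - half_ci, mean_diff + half_ci),
    }


def simulate(n, rng):
    """Synthetic reference values and an unbiased estimate with SD 20."""
    reference = [rng.uniform(20, 140) for _ in range(n)]
    estimate = [g + rng.gauss(0, 20) for g in reference]
    return agreement_summary(estimate, reference)


rng = random.Random(5)  # fixed seed so the sketch is reproducible
small = simulate(200, rng)
large = simulate(5000, rng)

for label, result in (("n=200", small), ("n=5000", large)):
    print(label, result)
```

With more patients the confidence interval of the mean difference shrinks, while the limits of agreement stay roughly where they were, because they describe the spread of individual differences rather than the precision of the mean.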
Slide 24
What do we do? This is again the Schwartz, 0.87, which we have already seen. What you do is rotate the plot so that the red line lies horizontal. Here you have the difference between the Schwartz and the inulin clearance. It should be 0, but it’s not: there is a mean overestimation by the Schwartz of 20. Here you see the range, and on this axis I take the mean value, the mean of the inulin clearance and the Schwartz. Why you use the mean instead of one or the other is a bit statistical, but this is the best way to demonstrate it. So here you see that the mean difference is 20, and these are all the points, all the patients, and you see that 95% of the differences are between 20 and minus 60. So if I use the Schwartz instead of the inulin clearance, I can be either 20 too high or 60 too low in this range. Now, as a clinician, we have a feeling for the size of the error we make: plus 20 or minus 60. That’s quite a lot, I guess, for patients over here. If it can be 60 lower than 80, that’s quite a difference. So the correlation was very high, an impressive 0.87, almost 0.90, so we would say there is perfect agreement, and when you see it like this you say “nice agreement”. But on a clinical scale you see that the difference between the two can be quite large, and you also see that the difference tends to be larger at higher values, as you saw a little bit over there.
Slide 25
OK, so you see the mean difference, you see the limits of agreement, as they are called, in clinical terms, and you see that the difference gets larger at higher values. Now you can again compare the Cockcroft-Gault, where we already saw 0.81, and the Schwartz, with 0.87. For the Schwartz we just saw in the previous slide a mean difference of 20, and you see that for the Cockcroft-Gault the mean difference is 0. What about the limits of agreement? Here they run from minus 45 to plus 45, so a range of 90, and here the range is from minus 60 to plus 20, so it’s 80, more or less the same.
Slide 26
So now I’m not so sure any more that the Schwartz is really better than the Cockcroft-Gault, because the mean is better over here and the range is more or less the same. So now we know what we’re talking about. So when you assess agreement between measurements, use this kind of Bland-Altman plot instead of correlations. It makes the expected difference easier to interpret clinically.
Slide 27
So, to end: what’s wrong with correlation analysis? Nothing is wrong, unless you do not do it in a correct way, so be careful. Thank you very much.