The word Correlation is made up of two parts, CO and Relation, but never use the word Relation while explaining Correlation.

Let's take an example:

Y = f(X)

Y is a function of X, so a change in X brings a change in Y, and X is the cause. The property wherein a cause is involved is known as causality, and whenever there is a cause involved, it is a Relation.

Now, let's take another example:

Ice Cream Sales = f(Weather)
Water Sports Activity = f(Weather)

Ice cream sales are a function of the weather, and water sports activity is also a function of the weather. In summer both ice cream sales and water sports activity increase, and in winter both decrease. So there is an Association between the two, but neither one is the cause of the other.

NOTE: Whenever there is a cause involved, it is a Relation; if there is no cause involved, it is called an Association.

Association: a cause is not required.
Relation: a cause has to be there.
Relation: it is always an Association.
Association: it may or may not be a Relation.

* So while explaining Correlation, use the keywords Association and Cause/Causality. Never use the keyword Relation while explaining Correlation.

Now, Y = f(X):

X  Y
1  1
2  2
3  3
4  4
5  5

In the above case, as the value of X increases the value of Y increases; both move in the same direction. This is a monotonic (increasing) case, and of course there is an Association between the two. The slope of the graph is positive.

Y = f(X):

X  Y
1  5
2  4
3  3
4  2
5  1

In the above case, as the value of X increases the value of Y decreases; the two move in opposite directions. This is still a monotonic case (monotonically decreasing), not a non-monotonic one. The slope of the graph is negative.

Y = f(X):

X  Y
1  1
2  1
3  1
4  1
5  1

In the above case, as the value of X increases the value of Y does not change, hence there is no association between the two.
Now, take the first table and work out the deviations from the means:

X  Y  XMEAN  YMEAN  X-XMEAN  Y-YMEAN  (X-XMEAN)^2  (Y-YMEAN)^2  (X-XMEAN)*(Y-YMEAN)
1  1  3      3      -2       -2       4            4            4
2  2  3      3      -1       -1       1            1            1
3  3  3      3       0        0       0            0            0
4  4  3      3       1        1       1            1            1
5  5  3      3       2        2       4            4            4

Summation of (X-XMEAN)^2 is 10.
Summation of (Y-YMEAN)^2 is 10.
Summation of (X-XMEAN)*(Y-YMEAN) is 10.

The formula of Correlation is (where \bar{X} and \bar{Y} stand for XMEAN and YMEAN):

$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} $$

r = 10 / square root of (10 * 10)
r = 10 / 10 = +1

Similarly, in the case of

X  Y
1  5
2  4
3  3
4  2
5  1

the summation of (X-XMEAN)*(Y-YMEAN) is -10, so the value of r comes out to be -10/10 = -1.

And in the case of

X  Y
1  1
2  1
3  1
4  1
5  1

every Y-YMEAN is 0, so the numerator is 0. Strictly speaking the summation of (Y-YMEAN)^2 is 0 as well, so the formula gives 0/0 and r is undefined; since Y never changes with X, this is read as no association, i.e. r is taken as 0.

Now, there are two types of Correlation:

1. Pearson Correlation: This correlation is computed on the actual values. r is called the coefficient of correlation or correlation coefficient.

If the value of r is close to 0, there is less Association, and if the value of r is towards +1 or -1, there is more Association.

Let's take an example. Consider r = 0.63 and r = -0.7. Here r = -0.7 shows the stronger association, because 0.7 is larger than 0.63 in absolute value. Plus and minus only tell the direction, i.e. whether it is a positive correlation or a negative correlation.

Correlation is a good example of Bi-variate Analysis in which both the variables are continuous in nature.

The sample coefficient is written as r and the population coefficient is written as RHO.

Null Hypothesis: RHO = 0
Alternate Hypothesis: RHO not equal to 0

For correlation to exist, we have to reject the Null Hypothesis.

2. Ranked Correlation or Spearman Correlation: This type of correlation is used when we have outliers in the dataset and we cannot remove them. It is not computed on the actual values; instead it is computed on the rank values, or their bucket numbers.

For example, we have the numbers:

1 2 3 4 5 6 6000

In the above case, 6000 is an outlier, and just because of this one number the value of r will be biased.
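A small sketch can make the bias visible. The x values below are the seven numbers from the example; the y values are hypothetical, invented only so that there is a second variable to correlate against. The PEARSON and SPEARMAN options of PROC CORR request both coefficients in one run, so the two can be compared side by side.

```sas
/* x: the numbers from the example, including the outlier 6000.   */
/* y: hypothetical values, assumed only for this illustration.    */
data outlier_demo;
  input x y;
  datalines;
1 2
2 1
3 4
4 3
5 6
6 5
6000 7
;
run;

/* Request Pearson (actual values) and Spearman (ranks) together. */
proc corr data=outlier_demo nosimple pearson spearman;
  var x;
  with y;
run;
```

Pearson works on the actual values, so the single point at 6000 dominates the result, whereas Spearman only sees that 6000 is the largest x and treats it as rank 7.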
So instead of using these numbers, we divide them into buckets (ranks) and then perform the correlation on their bucket numbers. We do this just to reduce the effect of outliers.

Syntax:

proc corr data= Name of the Data Set;
  with Variable Name;
  var Variable Name;
run;

Note: If we just write

proc corr;
run;

then the most recently created data set is taken automatically, and there will not be any error in this case.

Let's take a data set to understand more about Correlation. The INPUT statement uses column pointers (@1, @10, ...), so the data lines must stay aligned to those columns:

data fitness;
input @1 Name $8. @10 Gender $1. @12 RunTime 5.2 @18 Age 2. @21 Weight 5.2
      @27 Oxygen_Consumption 5.2 @33 Run_Pulse 3. @37 Rest_Pulse 2. @40 Maximum_Pulse 3.;
Performance=260-round(10*runtime + 2*age + 4*(Gender='F'));
datalines;
Donna    F  8.17 42 68.15 59.57 166 40 172
Gracie   F  8.63 38 81.87 60.06 170 48 186
Luanne   F  8.65 43 85.84 54.30 156 45 168
Mimi     F  8.92 50 70.87 54.63 146 48 155
Chris    M  8.95 49 81.42 49.16 180 44 185
Allen    M  9.22 38 89.02 49.87 178 55 180
Nancy    F  9.40 49 76.32 48.67 186 56 188
Patty    F  9.63 52 76.32 45.44 164 48 166
Suzanne  F  9.93 57 59.08 50.55 148 49 155
Teresa   F 10.00 51 77.91 46.67 162 48 168
Bob      M 10.07 40 75.07 45.31 185 62 185
Harriett F 10.08 49 73.37 50.39 168 67 168
Jane     F 10.13 44 73.03 50.54 168 45 168
Harold   M 10.25 48 91.63 46.77 162 48 164
Sammy    M 10.33 54 83.12 51.85 166 50 170
Buffy    F 10.47 52 73.71 45.79 186 59 188
Trent    M 10.50 52 82.78 47.47 170 53 172
Jackie   F 10.60 47 79.15 47.27 162 47 164
Ralph    M 10.85 43 81.19 49.09 162 64 170
Jack     M 10.95 51 69.63 40.84 168 57 172
Annie    F 11.08 51 67.25 45.12 172 48 172
Kate     F 11.12 45 66.45 44.75 176 51 176
Carl     M 11.17 54 79.38 46.08 156 62 165
Don      M 11.37 44 89.47 44.61 178 62 182
Effie    F 11.50 48 61.24 47.92 170 52 176
George   M 11.63 47 77.45 44.81 176 58 176
Iris     F 11.95 40 75.98 45.68 176 70 180
Mark     M 12.63 57 73.37 39.41 174 58 176
Steve    M 12.88 54 91.63 39.20 168 44 172
Vaughn   M 13.08 44 81.42 39.44 174 63 176
William  M 14.03 45 87.66 37.39 186 56 192
;
run;

proc corr data=fitness;
run;

Once we run the above code, there will be two blocks in the results window: the first is the Simple Statistics block and the second is the correlation block. Pearson correlation is the default.

The second block shows an 8 by 8 matrix, as there are 8 variables that are continuous in nature. Each combination has two values: the top value is r and the bottom value is the probability (p-value). The diagonal has the value 1.

Now we have to look for all the combinations where the value of p is less than alpha. (Alpha is the significance level, i.e. the error we are willing to accept while performing the experiment; here alpha is taken as 0.05.)

proc corr data=fitness nosimple;
run;

The above code makes sure that the Simple Statistics block is not shown in the results window.

Now,

proc corr data=fitness nosimple;
  with performance;
  var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

The above code gives a 1 by 7 matrix in which the variable Performance has a correlation value with each of the other variables.

proc corr data=fitness nosimple rank;
  with performance;
  var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

The above code puts the correlation values in order, i.e. the variable with the highest correlation (in absolute value) comes first, and so on.

proc corr data=fitness nosimple out=numbers;
  with performance;
  var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

The above code gives a data set by the name of numbers that is saved in the WORK library. (The output data set options documented for PROC CORR are OUTP=, OUTS=, OUTH= and OUTK=; if your SAS release rejects OUT=, use OUTP= instead.) The data set will have all 7 variables (runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse), and along with them there will be two automatic variables by the name of _TYPE_ and _NAME_.
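To see what such an output data set looks like, here is a minimal sketch on a hypothetical five-row data set (the names tiny and tiny_corr and the x, y values are made up for illustration). It uses OUTP=, the documented option for a Pearson output data set, and prints the result so that the _TYPE_ and _NAME_ columns can be inspected.

```sas
/* Hypothetical toy data, used only to show the layout of the output data set. */
data tiny;
  input x y;
  datalines;
1 2
2 4
3 5
4 4
5 7
;
run;

/* NOPRINT suppresses the printed report; OUTP= writes the Pearson results to a data set. */
proc corr data=tiny noprint outp=tiny_corr;
  var x y;
run;

/* Rows are labelled by _TYPE_ (MEAN, STD, N, CORR); _NAME_ holds the variable name on CORR rows. */
proc print data=tiny_corr;
run;
```

The MEAN, STD and N rows carry the summary statistics, and only the rows with _TYPE_ equal to CORR carry the correlation values.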
So, just to see the correlation values, the code will be:

data test;
  set numbers;
  where _TYPE_='CORR';
run;

Now,

proc corr data=fitness nosimple outp=numbers;
  with performance;
  var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

outp= creates the data set by the name of numbers in which the values are calculated by the Pearson method.

proc corr data=fitness nosimple outs=numbers;
  with performance;
  var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

outs= creates the data set by the name of numbers in which the values are calculated by the Spearman method.

** CORRELATION is defined as the strength of linear Association. So if the value of r is 0, this means that either there is no Association or the Association is non-linear in nature. A measure that can pick up a non-linear association is the Hoeffding correlation (Hoeffding's D). We can get the Hoeffding values by mentioning outh= in the above code. If the Pearson and Spearman values are low for some variable and the Hoeffding value is high, then we can say that it is a non-linear association.

** Correlation is used while developing a model to check the Association between the dependent variable and the independent variables. It is also used to check the association among the independent variables.

** PROC CORR is used to check the association between the dependent variable and the independent variables. It is not used to check the relationship (cause) between the variables.
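The point about non-linear association can be tried out with a small sketch. The data set below is hypothetical (y = x squared on a symmetric range of x, chosen only because it is a purely non-linear pattern), and the PEARSON, SPEARMAN and HOEFFDING options ask PROC CORR to print all three measures in one run.

```sas
/* Hypothetical data: y = x**2 on a symmetric grid, a purely non-linear association. */
data parabola;
  do x = -5 to 5;
    y = x**2;
    output;
  end;
run;

/* Print Pearson, Spearman and Hoeffding's D side by side. */
proc corr data=parabola nosimple pearson spearman hoeffding;
  var x;
  with y;
run;
```

Because the pattern is symmetric around x = 0, Pearson and Spearman should both come out as 0, while Hoeffding's D is expected to be positive, flagging an association that the other two miss.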