CORRELATION:

The word Correlation is made up of two parts, CO and Relation, but never use the word
Relation while explaining Correlation.

Now, let's take an example:-

Y=f(X); Y is a function of X, so a change in X will bring a change in Y. Here X is the
Cause, and the property wherein one variable causes another is known as causality.
Whenever there is a cause involved, it is a Relation.

Now, let's take another example:-

Ice Cream Sales:- F(Weather)
Water Sports Activity:- F (Weather)

Now, Ice Cream Sales is a function of Weather and Water Sports Activity is a function of
Weather. So in summer both Ice Cream Sales and Water Sports Activity will increase and
in winter both will decrease. The two move together, so there is an Association between
them, but neither one causes the other. Weather is the common cause behind both.
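
The idea can be tried out on simulated data. The sketch below is only an illustration:
the data set name seasonal, the variable names and every number in it are made up, and
temperature stands in for Weather. The first PROC CORR shows the association between the
two activities; the second uses the PARTIAL statement to hold temperature fixed, which
should pull the correlation back towards 0.

```sas
/* Hypothetical data: both series depend on temperature, not on each other */
data seasonal;
    call streaminit(2018);                      /* fixed seed so the run is repeatable */
    do day = 1 to 200;
        temperature     = 5 + 30 * rand('uniform');
        ice_cream_sales = 20 + 3 * temperature + rand('normal', 0, 10);
        water_sports    = 5 + 2 * temperature + rand('normal', 0, 10);
        output;
    end;
run;

/* Association between the two activities */
proc corr data=seasonal nosimple;
    var ice_cream_sales water_sports;
run;

/* Same pair after removing the effect of the common cause */
proc corr data=seasonal nosimple;
    var ice_cream_sales water_sports;
    partial temperature;
run;
```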


NOTE:- whenever there is a cause involved, it is a Relation, and if there is
no cause involved, it is called an Association.

Association:- cause will not be there
Relation:-  Cause has to be there

Relation:- It is always an Association as well.
Association:- may or may not be a Relation.

* So while Explaining Correlation, use the keywords Association, Cause/Causality. Never
use the keyword Relation while explaining Correlation.

Now,

Y=F(X)
X  Y
1  1
2  2
3  3
4  4
5  5

In the above case, as the value of X increases the value of Y also increases; both of
them move in the same direction. So this is a Monotonic (increasing) case and of course
there is an Association between the two. The slope of the above graph will be positive.

Y=F(X)
X Y
1 5
2 4
3 3
4 2
5 1

In the above case, as the value of X increases the value of Y decreases; both of
them move in opposite directions. This is still a Monotonic scenario (monotonically
decreasing), and there is an Association between the two. The slope of the above graph
will be negative.

Y=F(X)
X Y
1 1
2 1
3 1
4 1
5 1

In the above case, as the value of X increases the value of Y does not change, hence
there is no association between the two. The slope of the above graph will be zero.

Now,

X  Y  XMEAN  YMEAN  X-XMEAN  Y-YMEAN  (X-XMEAN)^2  (Y-YMEAN)^2  (X-XMEAN)*(Y-YMEAN)
1  1    3      3      -2       -2          4            4                4
2  2    3      3      -1       -1          1            1                1
3  3    3      3       0        0          0            0                0
4  4    3      3       1        1          1            1                1
5  5    3      3       2        2          4            4                4


Summation of (X-XMEAN)^2 is 10.
Summation of (Y-YMEAN)^2 is 10.
Summation of (X-XMEAN)*(Y-YMEAN) is 10.

Formula of Correlation is:-
r = Summation of (X-XMEAN)*(Y-YMEAN) /
    Square root of [Summation of (X-XMEAN)^2 * Summation of (Y-YMEAN)^2]

r = 10 / square root of (10*10)

r = 10/10 = +1
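
The same hand calculation can be written in SAS. This is a minimal sketch: the data set
name xy and the column names mx and my are made up for the illustration. The inner query
attaches the two means to every row and the outer query applies the formula.

```sas
data xy;
    input x y;
    datalines;
1 1
2 2
3 3
4 4
5 5
;
run;

proc sql;
    /* inner query: put XMEAN and YMEAN on every row (SAS remerges the means) */
    /* outer query: sum of products / square root of product of sums of squares */
    select sum((x - mx) * (y - my))
           / sqrt(sum((x - mx)**2) * sum((y - my)**2)) as r
    from (select x, y, mean(x) as mx, mean(y) as my from xy);
quit;

/* PROC CORR on the same data, for comparison with the hand calculation */
proc corr data=xy nosimple;
    var x y;
run;
```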

Similarly in case of,

X Y
1 5
2 4
3 3
4 2
5 1

The value of r comes out to be -1.

and, Similarly in case of

X Y
1 1
2 1 
3 1
4 1
5 1 

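Here every Y-YMEAN is 0, so the numerator of the formula is 0, but the denominator is 0
as well. Strictly speaking r is undefined (0/0) in this case. As Y does not change with
X at all, we treat it as a case of no association, i.e. r is taken as 0.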
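
Both of these cases can be checked with PROC CORR. In this sketch the data set name cases
and the column names y_down and y_flat are made up. The decreasing column should give
r = -1. For the constant column the denominator of the formula is 0, so the assumption
here is that SAS prints a missing value (.) rather than a number.

```sas
data cases;
    input x y_down y_flat;
    datalines;
1 5 1
2 4 1
3 3 1
4 2 1
5 1 1
;
run;

/* one row per WITH variable, each correlated against x */
proc corr data=cases nosimple;
    with y_down y_flat;
    var x;
run;
```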


Now, there are two types of Correlation:-

1. Pearson Correlation:-
This correlation is computed on the exact (actual) values of the variables.

r is called the coefficient of correlation or correlation coefficient.

Now, if the value of r is close to 0, then the Association is weak,
and if the value of r is close to +1 or -1, the Association is strong.

Let's take an Example.

Consider the values r=0.63
                and r=-0.7

In the above scenario, the pair with r=-0.7 is more strongly associated than the pair with
r=0.63, because 0.7 is larger than 0.63 in absolute terms. Plus and minus just tell the
direction, i.e. whether it is a positive correlation or a negative correlation.

Correlation is a good example of Bi-variate Analysis, where both the variables are
continuous in nature.

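r is the value calculated from the sample. The correlation in the population is written
as RHO, and the hypotheses are stated in terms of RHO.

Null Hypothesis:- RHO=0
Alternate Hypothesis:- RHO not equal to 0.

For correlation to exist, we will have to reject the Null Hypothesis.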
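
For the Pearson correlation, this test is usually done with a t statistic that has n-2
degrees of freedom. The sketch below works it out in a data step. The value r=0.63 is
only an example and the sample size n=31 is an assumption made for the illustration.

```sas
data _null_;
    r = 0.63;                               /* example correlation value */
    n = 31;                                 /* assumed number of observations */
    t = r * sqrt(n - 2) / sqrt(1 - r**2);   /* t statistic under Null Hypothesis RHO=0 */
    p = 2 * (1 - probt(abs(t), n - 2));     /* two-sided p-value */
    put t= p=;                              /* written to the SAS log */
run;
```

If p is less than alpha, the Null Hypothesis is rejected.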


2. Ranked Correlation or Spearman Correlation:-

This type of Correlation is used when we have outliers in the dataset and we cannot
remove them. This correlation is not computed on the actual values; instead
it is computed on the rank values (their bucket numbers).

So, for Example we have numbers:-
1
2
3
4
5
6
6000

So, in the above case, 6000 is an outlier and just because of this one number, the value
of r will be biased. So instead of using these numbers, we replace them with their ranks
(1 to 7 here, so 6000 simply becomes 7) and then we perform the correlation on the ranks.
We are doing this just to reduce the effect of outliers.
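
The ranking step can be done by hand with PROC RANK. In this sketch the X values are the
numbers above, while the Y values and the names outlier, ranked, rx and ry are made up
for the illustration. A Pearson correlation on the ranks should match the Spearman value
that PROC CORR reports when the spearman option is used.

```sas
data outlier;
    input x y;
    datalines;
1 2
2 1
3 4
4 3
5 6
6 5
6000 7
;
run;

/* replace each value by its rank; 6000 simply becomes rank 7 */
proc rank data=outlier out=ranked;
    var x y;
    ranks rx ry;
run;

/* Pearson correlation on the ranks */
proc corr data=ranked nosimple pearson;
    var rx ry;
run;

/* Pearson and Spearman on the original values, for comparison */
proc corr data=outlier nosimple pearson spearman;
    var x y;
run;
```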


Syntax:-

Proc corr data= Name of the Data Set;
with Variable Name;
var Variable Name;
run;

Note:- If we just write

Proc corr;
run;

then the most recently created Data Set will be taken automatically and there will not be
any error in this case.

Let's take a Data Set to understand more about Correlation:-

data fitness;
    input @1 Name $8. @10 Gender $1. @12 RunTime 5.2 @18 Age 2. @21 Weight 5.2 
        @27 Oxygen_Consumption 5.2 @33 Run_Pulse 3.
        @37 Rest_Pulse 2. @40 Maximum_Pulse 3.;
    Performance=260-round(10*runtime + 2*age + 4*(Gender='F'));
    datalines;
Donna    F  8.17 42 68.15 59.57 166 40 172
Gracie   F  8.63 38 81.87 60.06 170 48 186
Luanne   F  8.65 43 85.84 54.30 156 45 168
Mimi     F  8.92 50 70.87 54.63 146 48 155
Chris    M  8.95 49 81.42 49.16 180 44 185
Allen    M  9.22 38 89.02 49.87 178 55 180
Nancy    F  9.40 49 76.32 48.67 186 56 188
Patty    F  9.63 52 76.32 45.44 164 48 166
Suzanne  F  9.93 57 59.08 50.55 148 49 155
Teresa   F 10.00 51 77.91 46.67 162 48 168
Bob      M 10.07 40 75.07 45.31 185 62 185
Harriett F 10.08 49 73.37 50.39 168 67 168
Jane     F 10.13 44 73.03 50.54 168 45 168
Harold   M 10.25 48 91.63 46.77 162 48 164
Sammy    M 10.33 54 83.12 51.85 166 50 170
Buffy    F 10.47 52 73.71 45.79 186 59 188
Trent    M 10.50 52 82.78 47.47 170 53 172
Jackie   F 10.60 47 79.15 47.27 162 47 164
Ralph    M 10.85 43 81.19 49.09 162 64 170
Jack     M 10.95 51 69.63 40.84 168 57 172
Annie    F 11.08 51 67.25 45.12 172 48 172
Kate     F 11.12 45 66.45 44.75 176 51 176
Carl     M 11.17 54 79.38 46.08 156 62 165
Don      M 11.37 44 89.47 44.61 178 62 182
Effie    F 11.50 48 61.24 47.92 170 52 176
George   M 11.63 47 77.45 44.81 176 58 176
Iris     F 11.95 40 75.98 45.68 176 70 180
Mark     M 12.63 57 73.37 39.41 174 58 176
Steve    M 12.88 54 91.63 39.20 168 44 172
Vaughn   M 13.08 44 81.42 39.44 174 63 176
William  M 14.03 45 87.66 37.39 186 56 192
;
run;

proc corr data=fitness;
run;

Once we run the above code there will be two main blocks in the results window: the 1st
will be the simple statistics block and the 2nd will be the correlation block. By default
PROC CORR calculates the pearson correlation. The second block is an 8 by 8 matrix, as
there are 8 numeric variables in the data set (the character variables Name and Gender are
left out). There will be two values for each combination: the top value is the value of r
and the bottom value is the p-value for the Null Hypothesis that the correlation is 0. The
diagonal will have the value 1, as every variable is perfectly correlated with itself.
Now, we have to look for all the combinations wherein the value of p is less than alpha.
(alpha is the significance level, i.e. the chance of wrongly rejecting the Null Hypothesis
that we are ready to accept, and it is commonly set to 0.05).

Proc corr data=fitness nosimple;
run;

The above code will make sure that the Simple Statistics block will not be there in the
results window.

Now,

proc corr data=fitness nosimple;
with performance;
var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

The above code will give a 1 by 7 matrix that holds the correlation value of the variable
Performance with each of the other seven variables.

proc corr data=fitness nosimple rank;
with performance;
var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

The above code will make sure that the correlation values are shown in order, from the
highest to the lowest absolute value, i.e. the most strongly correlated variable will be
at the first place and so on.

proc corr data=fitness nosimple out=numbers;
with performance;
var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

The above code will create a dataset by the name of numbers that will be saved in the
WORK library. The data set will have all the 7 variables (runtime age weight
Oxygen_consumption run_pulse rest_pulse maximum_pulse). Along with that there will be two
automatic variables by the name of _TYPE_ and _NAME_. _TYPE_ tells which statistic a row
holds (for example MEAN, STD, N or CORR).

So, just to see the correlation values, the code will be.

data test;
set numbers;
where _Type_='CORR';
run;

Now,

proc corr data=fitness nosimple outp=numbers;
with performance;
var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

outp= names the output dataset (numbers here) in which the values are calculated
by the pearson method.

proc corr data=fitness nosimple outs=numbers;
with performance;
var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;

outs= names the output dataset (numbers here) in which the values are calculated
by the spearman method.

** CORRELATION (pearson) is defined as the strength of linear Association.
So if the value of r is 0, then this means that either there is no Association or the
Association is non-linear in nature.

To pick up a non-linear Association we can use the Hoeffding correlation. We can get the
hoeffding values by mentioning outh= in place of outp= or outs= in the above code.

If some pair of variables has low values of pearson and spearman and a high value of
hoeffding, then we can say that it is a non-linear association.
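
A small made-up example shows the idea. In this sketch the data set name curve is
hypothetical and Y is built as the square of X, so Y depends completely on X but not in a
linear way. By symmetry the pearson value is 0 here; the point of the sketch is to compare
it with the hoeffding value printed in the same output.

```sas
/* U-shaped data: y depends fully on x, but the pattern is not linear */
data curve;
    do x = -10 to 10;
        y = x**2;
        output;
    end;
run;

/* ask for all three measures in one run */
proc corr data=curve nosimple pearson spearman hoeffding;
    var x;
    with y;
run;
```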

** Correlation is used while developing a model to check the Association between
the dependent variable and the independent variables. It is also used to check the
association among the independent variables themselves.

**PROC CORR is used to check the association between the dependent variable and the
independent variables. It does not tell us whether there is a Relation (a cause) between
the variables.
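
For example, the association among the independent variables of the fitness data can be
checked with the sketch below. The best= option is used here so that only the few
strongest correlations are printed for each variable; the value 3 is an arbitrary choice
for the illustration.

```sas
/* correlations among the independent variables only */
/* best=3 keeps the 3 largest (in absolute value) per variable */
proc corr data=fitness nosimple best=3;
    var runtime age weight Oxygen_consumption run_pulse rest_pulse maximum_pulse;
run;
```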
