women is a built-in data set in R, which holds the height and weight of 15 American women from ages 30 to 39.

  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
'data.frame':   15 obs. of  2 variables:
 $ height: num  58 59 60 61 62 63 64 65 66 67 ...
 $ weight: num  115 117 120 123 126 129 132 135 139 142 ...
# There are 15 rows
[1] 15

Compute the covariance: $\mathrm{cov}_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$

print(cov(women$weight, women$height))
[1] 69
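
As a quick check, the covariance formula can be evaluated term by term and compared with the built-in cov. This is only a sketch; the names x, y, n and manual_cov are introduced here for illustration and are not part of the original notes.

```r
# Covariance computed directly from its definition
x <- women$weight
y <- women$height
n <- length(x)

# Sum of products of deviations from the means, divided by n - 1
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

print(manual_cov)
print(all.equal(manual_cov, cov(x, y)))  # TRUE if both agree
```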


print(cov(women$weight, women$height*2.54))
[1] 175.26

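Covariance depends on the units of measurement: converting height from inches to centimeters (multiplying by 2.54) scaled the covariance from 69 to 175.26, even though the relationship itself did not change. Pearson's correlation coefficient removes this dependence on scale.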

Pearson’s correlation coefficient is usually denoted by r and its equation is given as follows:

$$ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\,s_x s_y} $$

which is the covariance divided by the product of the two variables’ standard deviations.

print(cor(women$weight, women$height))
[1] 0.9954948
print(cor(women$weight, women$height*2.54))
[1] 0.9954948
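
The description of r as covariance divided by the product of the standard deviations can be verified directly. The sketch below uses only cov, sd and cor from base R; the variable names (manual_r, manual_r_cm) are illustrative assumptions.

```r
# Pearson's r as covariance over the product of standard deviations
x <- women$weight
y <- women$height

manual_r <- cov(x, y) / (sd(x) * sd(y))
print(all.equal(manual_r, cor(x, y)))  # TRUE if both agree

# Rescaling y multiplies both cov(x, y) and sd(y) by the same factor,
# so the factor cancels and r is unchanged
y_cm <- y * 2.54
manual_r_cm <- cov(x, y_cm) / (sd(x) * sd(y_cm))
print(all.equal(manual_r, manual_r_cm))
```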

If there is a strong relationship between two variables, but the relationship is not linear, it cannot be represented accurately by Pearson’s r.

xs <- 1:100
print(cor(xs, xs+100)) # linear relationship: r is exactly 1
print(cor(xs, xs^3)) # cubic relationship: r falls below 1 although y is fully determined by x
[1] 1
[1] 0.917552

Pearson’s r assumes a linear relationship between two variables. There are, however, other correlation coefficients that are more tolerant of non-linear relationships. Probably the most common of these is Spearman’s rank coefficient, also called Spearman's rho. Spearman’s rho is calculated by taking the Pearson correlation not of the values, but of their ranks.

$$ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} $$

$\rho$ = Spearman rank correlation

$d_i$ = the difference between the two ranks of the $i$-th observation

$n$ = number of observations

xs <- 1:100
print(cor(xs, xs+100, method="spearman")) # linear: rho is 1
print(cor(xs, xs^3, method="spearman")) # cubic but monotonic: rho is still 1
[1] 1
[1] 1
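
To make the rank-based definition concrete, the sketch below computes Spearman's rho three ways: with method="spearman", as the Pearson correlation of the ranks, and with the rank-difference form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which holds when there are no tied ranks. The vectors x and y are made-up illustration data, not taken from any data set in these notes.

```r
# Small made-up example without ties
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# 1. Built-in Spearman correlation
rho_builtin <- cor(x, y, method = "spearman")

# 2. Pearson correlation applied to the ranks
rho_ranks <- cor(rank(x), rank(y))

# 3. Rank-difference formula (valid only when no ranks are tied)
d <- rank(x) - rank(y)
n <- length(x)
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

print(c(rho_builtin, rho_ranks, rho_formula))  # all three are 0.8
```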


iris.nospecies <- iris[, -5]  # drop column 5 (Species), the only non-numeric column


  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

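Passing a data frame of numeric columns such as iris.nospecies to cor produces a correlation matrix, which holds the correlation of every pair of columns (when it is done with cov instead, the result is called a covariance matrix).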
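
A minimal sketch of both matrices, assuming the four numeric iris columns selected earlier. The names cor_mat and cov_mat are illustrative; cov2cor is the base R (stats) helper that rescales a covariance matrix into a correlation matrix.

```r
# Keep only the four numeric columns of iris
iris.nospecies <- iris[, -5]

cor_mat <- cor(iris.nospecies)  # 4 x 4 correlation matrix
cov_mat <- cov(iris.nospecies)  # 4 x 4 covariance matrix

print(round(cor_mat, 2))

# The diagonal of a correlation matrix is all 1s, and rescaling the
# covariance matrix recovers the correlation matrix
print(all.equal(cov2cor(cov_mat), cor_mat))
```

Both matrices are symmetric, so each pairwise value appears twice, once above and once below the diagonal.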

