Introduction to R, Part 4
The built-in R data set women holds the height (in inches) and weight (in pounds) of 15 American women aged 30 to 39.
print(head(women))
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
str(women)
'data.frame': 15 obs. of 2 variables:
$ height: num 58 59 60 61 62 63 64 65 66 67 ...
$ weight: num 115 117 120 123 126 129 132 135 139 142 ...
# There are 15 rows
print(nrow(women))
[1] 15
Compute the covariance, a first measure of how two variables vary together: $cov_{xy}=\frac{\sum (x - \bar{x})(y-\bar{y})}{n-1}$
print(cov(women$weight, women$height))
[1] 69
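As a sanity check, the covariance formula can be evaluated by hand and compared with cov. This is only a minimal sketch; the names x, y, n, and manual_cov are introduced here for illustration.

```r
# Evaluate the sample covariance formula directly (illustrative names).
x <- women$weight
y <- women$height
n <- length(x)

# Sum of the products of deviations from the means, divided by n - 1.
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

print(manual_cov)
print(all.equal(manual_cov, cov(x, y)))  # TRUE if the two results agree
```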
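The value returned by cov depends on the scale of the data. In the example below, converting the heights from inches to centimetres leaves the relationship between the two variables unchanged, yet the covariance changes.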
print(cov(women$weight, women$height*2.54))
[1] 175.26
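The size of the change is not arbitrary: multiplying one variable by a constant a multiplies the covariance by the same constant, so cov(x, a*y) = a*cov(x, y). A small sketch of that check follows; the names cm_per_inch, cov_inches, and cov_cm are assumptions introduced for illustration.

```r
# Rescaling one variable rescales the covariance by the same factor.
cm_per_inch <- 2.54  # illustrative name for the conversion factor

cov_inches <- cov(women$weight, women$height)
cov_cm     <- cov(women$weight, women$height * cm_per_inch)

print(cov_cm / cov_inches)                           # ratio of the two covariances
print(all.equal(cov_cm, cm_per_inch * cov_inches))   # TRUE if the scaling rule holds
```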
To remove this dependence on scale, we can use Pearson's correlation coefficient.
Pearson’s correlation coefficient is usually denoted by r and its equation is given as follows:
$$ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)s_x s_y}$$
which is the covariance divided by the product of the two variables’ standard deviations.
print(cor(women$weight, women$height))
[1] 0.9954948
print(cor(women$weight, women$height*2.54))
[1] 0.9954948
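The two results agree because rescaling a variable multiplies both the covariance and that variable's standard deviation by the same factor, which cancels in the ratio. The sketch below rebuilds r from cov and sd to make this visible; the names w, h, h_cm, manual_r, and manual_r_cm are illustrative.

```r
# Pearson's r as covariance divided by the product of standard deviations.
w <- women$weight
h <- women$height
manual_r <- cov(w, h) / (sd(w) * sd(h))

# Same computation after converting height from inches to centimetres.
h_cm <- h * 2.54
manual_r_cm <- cov(w, h_cm) / (sd(w) * sd(h_cm))

print(manual_r)
print(manual_r_cm)
print(all.equal(manual_r, cor(w, h)))     # TRUE if the manual value matches cor()
print(all.equal(manual_r, manual_r_cm))   # TRUE if r is unaffected by the rescaling
```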
If there is a strong relationship between two variables, but the relationship is not linear, it cannot be represented accurately by Pearson’s r.
xs <- 1:100
print(cor(xs, xs+100)) # OK
print(cor(xs, xs^3)) # cubic relationship NO!!!
[1] 1
[1] 0.917552
Pearson’s r assumes a linear relationship between two variables. There are, however, other correlation coefficients that are more tolerant of non-linear relationships. Probably the most common of these is Spearman’s rank coefficient, also called Spearman's rho. Spearman’s rho is calculated by taking the Pearson correlation not of the values, but of their ranks.
$$ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 -1)}$$
$\rho$ = Spearman rank correlation
$d_i$ = the difference between the two ranks of the $i$-th observation
$n$ = number of observations
xs <- 1:100
print(cor(xs, xs+100, method="spearman")) # OK
print(cor(xs, xs^3, method="spearman")) # OK
[1] 1
[1] 1
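To see how the two descriptions of Spearman's rho relate, the sketch below computes it three ways on a small made-up data set: as the Pearson correlation of the ranks, from the rank-difference formula in its usual form $1 - 6\sum d_i^2 / (n(n^2-1))$ (which holds when there are no ties), and with the built-in method. The vectors x and y and all other names are hypothetical and chosen only for illustration.

```r
# Toy data without ties (hypothetical values, for illustration only).
x <- c(10, 20, 30, 40, 50)
y <- c(1, 3, 2, 5, 4)

# 1) Pearson correlation of the ranks.
rx <- rank(x)
ry <- rank(y)
rho_ranks <- cor(rx, ry)

# 2) Rank-difference formula (valid because there are no ties).
d <- rx - ry
n <- length(x)
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# 3) Built-in Spearman method.
rho_builtin <- cor(x, y, method = "spearman")

print(c(rho_ranks, rho_formula, rho_builtin))
```

For this toy data all three values come out as 0.8.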
Column 5 of the iris data set (Species) is not numeric, so we remove it.
iris.nospecies <- iris[, -5]
print(head(iris.nospecies))
print(cor(iris.nospecies))
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4
This produces a correlation matrix (when the same is done with the covariance, the result is called a covariance matrix).
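For comparison, a covariance matrix can be obtained by passing the same numeric columns to cov. The sketch below also uses cov2cor from the stats package to rescale the covariance matrix into a correlation matrix; the names iris_numeric and cov_matrix are illustrative.

```r
# Drop the non-numeric Species column, as above.
iris_numeric <- iris[, -5]

# Covariance matrix of the four numeric columns.
cov_matrix <- cov(iris_numeric)
print(cov_matrix)

# cov2cor() rescales a covariance matrix into the matching correlation matrix.
print(all.equal(cov2cor(cov_matrix), cor(iris_numeric)))  # TRUE if they agree
```

The diagonal of the covariance matrix holds each column's variance, whereas the diagonal of the correlation matrix is always 1.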