Introduction to R, Part 4

women is a built-in data set in R. It holds 15 observations of average height (in inches) and weight (in pounds) for American women aged 30 to 39.

print(head(women))
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
str(women)
'data.frame':   15 obs. of  2 variables:
 $ height: num  58 59 60 61 62 63 64 65 66 67 ...
 $ weight: num  115 117 120 123 126 129 132 135 139 142 ...
# there are 15 rows
print(nrow(women))
[1] 15

Compute the covariance: $cov_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n-1}$

print(cov(women$weight, women$height))
[1] 69
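
As a quick check, the formula can be evaluated by hand and compared with the built-in cov function. This is a minimal sketch; the variable names x, y, n and cov_manual are introduced here for illustration only.

```r
# Covariance computed directly from the formula, using the built-in women data
x <- women$weight
y <- women$height
n <- length(x)

# Sum of the products of deviations from the means, divided by n - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

print(cov_manual)
print(all.equal(cov_manual, cov(x, y)))  # TRUE if both computations agree
```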

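Covariance depends on the scale of the variables. In the example below, the height is converted from inches to centimeters: the relationship between height and weight is unchanged, yet the covariance changes.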

print(cov(women$weight, women$height*2.54))
[1] 175.26
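
The change is not random: multiplying one variable by a constant multiplies the covariance by the same constant. A small sketch of that check follows; the names inch_to_cm, cov_inches and cov_cm are illustrative and not part of the original post.

```r
# Rescaling one variable rescales the covariance by the same factor
inch_to_cm <- 2.54

cov_inches <- cov(women$weight, women$height)
cov_cm     <- cov(women$weight, women$height * inch_to_cm)

# The ratio of the two covariances equals the conversion factor
# (up to floating-point error)
print(cov_cm / cov_inches)
print(all.equal(cov_cm, inch_to_cm * cov_inches))
```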

To remove this dependence on scale, we can use Pearson's correlation coefficient.

Pearson’s correlation coefficient is usually denoted by r and its equation is given as follows:

$$ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)s_x s_y}$$

which is the covariance divided by the product of the two variables’ standard deviations.

print(cor(women$weight, women$height))
[1] 0.9954948
print(cor(women$weight, women$height*2.54))
[1] 0.9954948
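
The definition can be verified directly by dividing the covariance by the product of the standard deviations. This is only a sketch; r_manual is a name chosen here for illustration.

```r
# Pearson's r as covariance divided by the product of standard deviations
x <- women$weight
y <- women$height

r_manual <- cov(x, y) / (sd(x) * sd(y))

print(r_manual)
print(all.equal(r_manual, cor(x, y)))  # TRUE if both computations agree
```

Because sd scales by the same factor as cov when a variable is multiplied by a positive constant, the factor cancels out, which is why r is the same in inches and in centimeters.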

If there is a strong relationship between two variables, but the relationship is not linear, it cannot be represented accurately by Pearson’s r.

xs <- 1:100
print(cor(xs, xs+100)) # linear relationship: r is 1
print(cor(xs, xs^3)) # cubic relationship: r drops below 1 although y is fully determined by x
[1] 1
[1] 0.917552

Pearson’s r assumes a linear relationship between two variables. There are, however, other correlation coefficients that are more tolerant of non-linear relationships, as long as they are monotonic. Probably the most common of these is Spearman’s rank coefficient, also called Spearman's rho. Spearman’s rho is calculated by taking the Pearson correlation not of the values, but of their ranks.

When there are no tied ranks, this is equivalent to:

$$ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 -1)}$$

$\rho$ = Spearman rank correlation

$d_i$ = the difference between the two ranks of the $i$-th observation

$n$ = number of observations

xs <- 1:100
print(cor(xs, xs+100, method="spearman")) # linear relationship: rho is 1
print(cor(xs, xs^3, method="spearman")) # cubic but monotonic relationship: rho is still 1
[1] 1
[1] 1
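
To see where this result comes from, Spearman's rho can be reproduced in two ways: as the Pearson correlation of the ranks, and with the closed form $1 - 6\sum d_i^2 / (n(n^2-1))$, which holds when there are no ties. The sketch below assumes tie-free data (true for xs and xs^3); the names ys, rho_ranks and rho_formula are illustrative.

```r
# Spearman's rho reproduced by hand for the cubic example
xs <- 1:100
ys <- xs^3

# 1) Pearson correlation of the ranks
rho_ranks <- cor(rank(xs), rank(ys))

# 2) Closed form based on rank differences (valid only without ties)
d <- rank(xs) - rank(ys)
n <- length(xs)
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

print(rho_ranks)
print(rho_formula)
print(cor(xs, ys, method = "spearman"))
```

Cubing preserves the ordering of the values, so every rank difference is zero and all three results agree.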

Column 5 of the iris data set (Species) is not numeric, so we remove it.

iris.nospecies <- iris[, -5]
print(head(iris.nospecies))
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

print(cor(iris.nospecies))

This produces a correlation matrix (when it is done with the covariance, it is called a covariance matrix).
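
The two matrices are closely related. As a rough sketch, the covariance matrix can be built with cov and then rescaled into a correlation matrix with cov2cor from base R's stats package; the names iris_numeric, cov_matrix and cor_from_cov are chosen here for illustration.

```r
# Covariance matrix of the numeric iris columns
iris_numeric <- iris[, -5]
cov_matrix <- cov(iris_numeric)
print(cov_matrix)

# Rescale the covariance matrix into a correlation matrix
cor_from_cov <- cov2cor(cov_matrix)

# Should match the matrix returned by cor() on the same data
print(all.equal(cor_from_cov, cor(iris_numeric)))
```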

About Wang Zhijun
A programmer who loves machine learning