ブートストラップ回帰分析(case ,residual bootstrap )の設例による自由研究

This presents an elementary bootstrapping simple regression model using Anscombe's quartet data.

アンスコムの４図（Anscombe's quartet)はよく知られたグラフで計量経済学のテキストでもよく引用されている。たとえばG.S.Maddala(1992) Introduction to Econometrics Prentice-Hall和合肇（訳）計量経済分析の方法（第2版）、シーエーピー出版　1996　でも取り上げられている。アンスコムは（J.Amscombe,,"Graphs in Statistical Analysis"The American Statistician Vol.27No.1,February,1973）の中で４つの設例を使って、幾つかの教訓を示してくれている。回帰分析をする場合に単に統計ソフトの数値出力結果のみを見て判断すると回帰式の適切な診断ができないこともある。アンスコムの４つのケースでは単純回帰分析をすると回帰係数、決定係数等は等しい結果を得るが、グラフで見るとデータの散布状態が大きく異なっている。ここで幾つかの教訓がみえてくる。例えば（１）回帰分析をする前にグラフを描いてデータのパターンをよく調べ、回帰モデルが線形か非線形のどちらが適当かどうか見極める必要性と（２）データ数が少ないときに外れ値（outlier)が存在すると最小２乗法での推定結果に大きな影響を与えることなどだろう。最小２乗法は残差の２乗和を最小にするように計算するので回帰直線から大きく乖離した外れ値は推定計算に大きな影響を及ぼすことになる。

アンスコムの設例についてRを使って回帰分析をしてみると以下のようになる。


x1<-c(10,8,13,9,11,14,6,4,12,7,5)
x1

##  [1] 10  8 13  9 11 14  6  4 12  7  5

x4<-c(8,8,8,8,8,8,8,19,8,8,8)
x4

##  [1]  8  8  8  8  8  8  8 19  8  8  8

y1<-c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)
y1

##  [1]  8.04  6.95  7.58  8.81  8.33  9.96  7.24  4.26 10.84  4.82  5.68

y2<-c(9.14,8.14,8.74,8.77,9.26,8.1,6.13,3.1,9.13,7.26,4.74)
y2

##  [1] 9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74

y3<-c(7.46,6.77,12.74,7.11,7.81,8.81,6.08,5.39,8.15,6.42,5.73)
y3

##  [1]  7.46  6.77 12.74  7.11  7.81  8.81  6.08  5.39  8.15  6.42  5.73

y4<-c(6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.5,5.56,7.91,6.89)
y4

##  [1]  6.58  5.76  7.71  8.84  8.47  7.04  5.25 12.50  5.56  7.91  6.89

par(mfrow=c(2,2))

plot(x1,y1)
modx1<-lm(y1 ~ x1)
summx1<-summary(modx1)

title(main=paste("(A)      y1 = ", round(summx1$coefficients[1],2), "+ ", 
                round(summx1$coefficients[2],2 ), "x1    決定係数 =",
                 round(summx1$r.squared,2)),cex.main=0.7)

abline(modx1,col="blue",lwd=2)

plot(x1,y2)
modx2<-lm(y2 ~ x1)
summx2<-summary(modx2)

title(main=paste("(B)     y2 = ", round(summx2$coefficients[1],2), "+ ", 
                 round(summx2$coefficients[2],2 ), "x1    決定係数=",
                 round(summx2$r.squared,2)),cex.main=0.7)

abline(modx2,col="blue",lwd=2)

plot(x1,y3)
modx3<-lm(y3 ~ x1)
summx3<-summary(modx3)
summx3

## 
## Call:
## lm(formula = y3 ~ x1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.179 -0.610 -0.229  0.152  3.249 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.012      1.127    2.67   0.0255 * 
## x1             0.498      0.118    4.22   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.664,  Adjusted R-squared:  0.627 
## F-statistic: 17.8 on 1 and 9 DF,  p-value: 0.00225

title(main=paste("(C)     y3 = ", round(summx3$coefficients[1],2), "+ ", 
                 round(summx3$coefficients[2],2 ), "  決定係数 =",
                 round(summx3$r.squared,2)),cex.main=0.7)

abline(modx3,col="blue",lwd=2)



plot(x4,y4)
modx4<-lm(y4 ~ x4)
summx4<-summary(modx4)

title(main=paste("(D)     y4 = ", round(summx4$coefficients[1],2), "+ ", 
                 round(summx4$coefficients[2],2 ), "  決定係数=",
                 round(summx4$r.squared,2)),cex.main=0.7)

abline(modx4,col="blue",lwd=2)

4つの設例でのxyplotをみるとそれぞれ特徴がありB図では非線形のモデルが適合するように見える。C図とD図は大きな外れ値が存在している。しかし、どのケースでも単純回帰分析をすると定数項、傾きなどの回帰係数は同じになり決定係数も同じになっている。

外れ値の取扱はそれ自体が大きなテーマとなり、とても手に負えないので深入りはできないが、安易に、これを測定ミス、集計ミスと判断して除外することもできない。ここでは、例えば水面下で何か構造変化が進んでおり氷山の一角が外れ値として顕在化したと仮定して、そのまま使うことにする。(suppose that structural changes are progressing below the surface of the water, and that the tip of the iceberg has become apparent in the form of outliers like case C.)

ブートストラップ法は回帰分析にも応用できるのでアンスコムの４図のうち（C)のケースについてブートストラップ回帰を試みることにする。なお回帰モデルは単純線形回帰に限定して適用してみる。

従属変数ｙ、説明変数ｘとしてｙ＝ a ＋ bx　とういう線形回帰モデルを考え、例えばｙは食品消費量、ｘは家計所得としてもよい。実際の観測値（y₁,x₁),(y₂,x₂).....(y_n,x_n)といったｎ個のデータ（初期標本）が得られたとする。回帰分析にブートストラップ法を応用するには主に２つのアプローチがある。

（１）ペアによるブートストラップ
(prired bootstrap)

初期標本データ（y₁,x₁),(y₂,x₂).....(y_n,x_n)についてペアのまま無作為復元抽出をしてブートストラップ標本を生成する。（y_i,x_i),は１つのオブジェクトとして扱われるので複数回、同じペア（オブジェクト）が抽出される可能性がある。このようにして生成された標本を使って最小二乗法でa、bの推定値を計算する。この作業を繰り返すことで回帰係数、a,bの推定量のブートストラップ分布が求まる。これはpairs bootstrap,あるいはcase　bootstrapと呼ばれている。(（y_i,x_i) is treated as one object and we sample with replacement n times from these n objects for resampling.)

（２）残差(residual)によるブートストラップ(residual bootstrap)

もう一つの方法は初期標本から回帰モデルの係数を推定して従属変数の実績値yと予測値y^{^}の残差uを計算する。ここで得たｎ個の残差uを基に無作為復元抽出でブートストラップ残差を求める。ここからa＋bx＋u＝ｙ^＊の計算で従属変数ｙのブートストラップ標本ｙ^＊が計算できる。このようにして得た（y^＊₁,x₁)、（y^＊₂,x₁)．．．のデータを使って回帰係数を推定する。これを繰り返すことで回帰係数a、bの推定量のブートストラップ分布が得られる。このアプローチは回帰モデルの関数形が適合していることが前提になっている。(The bootstrapping residuals involves resampling the residuals from a fit regression model to generate the bootstrap samples.)

なお、Rを使ってブートストラップ回帰を行う場合はパッケージcar の関数Bootが使いやすい。回帰モデルをオブジェクトとしてパッケージbootに引渡し、詳細なパラメーター設定の手間を省いてブートストラップ回帰が行えるように工夫されている。

はじめにアンスコムのケース（A)についてcase bootstrap　とresidual bootstrap　を行ってみる。

最初にcase bootstrapを行ってみる。Boot（Bが大文字である点に注意）ではmethodのdefaultがcase bootstrapになっている。　

par(mfrow=c(1,1))


################bootstrap regression 
## x1  y2


suppressMessages(library(car))

xydata<-as.data.frame(cbind(y2,x1))


mod2b <- lm(y2 ~ x1, data=xydata)

set.seed(123) 
mod2boot <- Boot(mod2b, R=10000)

##  要求されたパッケージ boot をロード中です

summary(mod2boot, high.moments=TRUE)

## 
## Number of bootstrap replications R = 10000 
##             original bootBias bootSE bootMed bootSkew bootKurtosis
## (Intercept)      3.0   0.1083  1.550   2.925  0.82526         1.74
## x1               0.5   0.0014  0.166   0.502  0.00905         1.17

mod2boot

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = dd, statistic = boot.f, R = R, .fn = f, parallel = p_type, 
##     ncpus = ncores, cl = cl2)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1*   3.0009 0.1083313     1.55049
## t2*   0.5000 0.0014008     0.16617

Confint(mod2boot, parm=1:2, level=c(.90, .95), type="perc")

## Bootstrap percent confidence intervals
## 
##             Estimate   2.5 %     5 %    95 %  97.5 %
## (Intercept)   3.0009 0.50755 0.92840 5.95413 6.70511
## x1            0.5000 0.16023 0.22774 0.77381 0.84671

hist(mod2boot, legend="separate")

傾きの偏りは少なく初期標本による回帰モデルの推定値の精度を支持するような結果となっている。

次に残差ブートストラップを行ってみる。Ｂｏｏｔでmethod=residualの追加の指示を与える。

residboot <- Boot(mod2b, R=10000, method="residual")
summary(residboot, high.moments=TRUE)

## 
## Number of bootstrap replications R = 10000 
##             original  bootBias bootSE bootMed bootSkew bootKurtosis
## (Intercept)      3.0 -0.001901  1.160   3.041  -0.1916       -0.194
## x1               0.5  0.000191  0.122   0.499   0.0193       -0.176

residboot

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = dd, statistic = boot.f, R = R, .fn = f, parallel = p_type, 
##     ncpus = ncores, cl = cl2)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1*   3.0009 -0.00190075     1.16027
## t2*   0.5000  0.00019122     0.12174

Confint(residboot, parm=1:2, level=c(.90, .95), type="perc")

## Bootstrap percent confidence intervals
## 
##             Estimate   2.5 %     5 %    95 %  97.5 %
## (Intercept)   3.0009 0.68310 1.02451 4.82627 5.14011
## x1            0.5000 0.26402 0.29974 0.70178 0.73617

hist(residboot, legend="separate")

残差ブートストラップでも、傾きの偏りも少なく初期標本による回帰モデルの推定値の精度を支持するような結果となっている。

次にアンスコムのケース（Ｃ）についてブートストラップ回帰を行ってみる。（Ｃ）では外れ値が含まれている。データの分布が歪んでいたり大きな外れ値があるような場合にはcase bootstrap　は良い結果が得られず危険であるという専門家の指摘もあるが、ここでは試しにどんな結果になるか実験してみる。通常、経済学やその他社会科学では説明変数を制御しつつ従属変数のデータを得るといった実験データ分析が適用できないので（y_i,x_i)をペアとしてリサンプルせざるを得ないといった事情もあるので敢えてcase bootstrapを試みてみる。

################bootstrap regression 
## x1  y3

xydata2<-as.data.frame(cbind(y3,x1))

mod3b <- lm(y3 ~ x1, data=xydata2)

set.seed(123) 
mod3boot <- Boot(mod3b, R=10000)
summary(mod3boot, high.moments=TRUE)

## 
## Number of bootstrap replications R = 10000 
##             original bootBias bootSE bootMed bootSkew bootKurtosis
## (Intercept)    3.012 -0.05042  1.059   3.055   -1.208         3.15
## x1             0.498  0.00199  0.147   0.494    0.701         0.20

mod3boot

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = dd, statistic = boot.f, R = R, .fn = f, parallel = p_type, 
##     ncpus = ncores, cl = cl2)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1*  3.01200 -0.0504241     1.05855
## t2*  0.49836  0.0019905     0.14695

Confint(mod3boot, parm=1:2, level=c(.90, .95), type="perc")

## Bootstrap percent confidence intervals
## 
##             Estimate   2.5 %     5 %    95 %  97.5 %
## (Intercept)  3.01200 0.56259 1.07179 4.02556 4.02923
## x1           0.49836 0.34214 0.34252 0.76504 0.82544

hist(mod3boot, legend="separate")

回帰係数の歪度をみると０．７でやや右側に偏っている。傾きの分布を見ても山が２つになり右方向（プラス方向）に歪んでいるのが分かる。外れ値のような観測値が将来に頻出してくれば傾きは０．５よりも大きな方向に変化することを暗示しているとも解釈できるかもしれないが、初期標本による回帰モデルの推定値の精度を支持するのにはこのような結果は使いにくいかもしれない。

次に残差ブートストラップを行ってみる

residboot2 <- Boot(mod3b, R=10000, method="residual")
summary(residboot2, high.moments=TRUE)

## 
## Number of bootstrap replications R = 10000 
##             original bootBias bootSE bootMed bootSkew bootKurtosis
## (Intercept)    3.012  0.03013  1.184   2.890   0.5965        0.670
## x1             0.498 -0.00289  0.123   0.496  -0.0328        0.592

residboot2

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = dd, statistic = boot.f, R = R, .fn = f, parallel = p_type, 
##     ncpus = ncores, cl = cl2)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1*  3.01200  0.0301347      1.1843
## t2*  0.49836 -0.0028943      0.1233

Confint(residboot2, parm=1:2, level=c(.90, .95), type="perc")

## Bootstrap percent confidence intervals
## 
##             Estimate   2.5 %    5 %   95 %  97.5 %
## (Intercept)  3.01200 0.99740 1.2964 5.1852 5.78178
## x1           0.49836 0.23747 0.2881 0.7015 0.74369

hist(residboot2, legend="separate")

残差ブートストラップでは、傾きの偏りも少なく初期標本による回帰モデルの推定値の精度を支持するような結果となっている。初期標本から得られた回帰モデルをベースにしているので外れ値があっても傾きの分布に極端な歪みは見られない。初期標本から推定した回帰モデルの関数形が適合しているとすれば、回帰モデルの推定値を精度を支持しているように見える。

最後に従属変数ｙを対数変換した場合のブートストラップ回帰をしてみる。外れ値がある場合に対数変換することもあるので試しに実験してみた。もちろん安易に対数変換するとゼロに近い小数を対数変換すると新たなマイナスの外れ値を生み出す危険もあるがアンスコムの例ではそのようなデータがないので対数変換してみた。


par(mfrow=c(1,1))


## case (C)で従属変数yだけ対数変換してみる

mod4blg<- lm(log(y3) ~ x1, data=xydata2)
plot(x1,log(y3),data=xydata2)
abline(mod4blg)

summary(mod4blg)

## 
## Call:
## lm(formula = log(y3) ~ x1, data = xydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1298 -0.0498 -0.0102  0.0122  0.3029 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.4121     0.1057   13.37  3.1e-07 ***
## x1            0.0638     0.0111    5.76  0.00027 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.116 on 9 degrees of freedom
## Multiple R-squared:  0.787,  Adjusted R-squared:  0.763 
## F-statistic: 33.2 on 1 and 9 DF,  p-value: 0.000272

従属変数を対数変換すると回帰モデルの決定係数が少し大きくなっている。

これにcase bootstrapを行う。

set.seed(123) 
mod4lgboot <- Boot(mod4blg, R=10000)
summary(mod4lgboot, high.moments=TRUE)

## 
## Number of bootstrap replications R = 10000 
##             original  bootBias bootSE bootMed bootSkew bootKurtosis
## (Intercept)   1.4121 -0.003603 0.0975  1.4172   -1.066       2.4856
## x1            0.0638  0.000185 0.0138  0.0632    0.673       0.0454

mod4lgboot

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = dd, statistic = boot.f, R = R, .fn = f, parallel = p_type, 
##     ncpus = ncores, cl = cl2)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 1.412135 -0.00360258    0.097516
## t2* 0.063826  0.00018455    0.013759

Confint(mod4lgboot, parm=1:2, level=c(.90, .95), type="perc")

## Bootstrap percent confidence intervals
## 
##             Estimate    2.5 %      5 %     95 %   97.5 %
## (Intercept) 1.412135 1.190567 1.233384 1.522147 1.530557
## x1          0.063826 0.046958 0.047703 0.089047 0.094222

hist(mod4lgboot, legend="separate")

傾きの分布を見るとやはり右に偏っている。

次に残差ブートストラップを行う。

residboot3lg <- Boot(mod4blg, R=10000, method="residual")
summary(residboot3lg, high.moments=TRUE)

## 
## Number of bootstrap replications R = 10000 
##             original  bootBias bootSE bootMed bootSkew bootKurtosis
## (Intercept)   1.4121  0.002858 0.1113  1.4020    0.562        0.659
## x1            0.0638 -0.000274 0.0116  0.0636   -0.033        0.587

residboot3lg

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = dd, statistic = boot.f, R = R, .fn = f, parallel = p_type, 
##     ncpus = ncores, cl = cl2)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 1.412135  0.00285811    0.111325
## t2* 0.063826 -0.00027408    0.011587

Confint(residboot3lg, parm=1:2, level=c(.90, .95), type="perc")

## Bootstrap percent confidence intervals
## 
##             Estimate    2.5 %      5 %     95 %   97.5 %
## (Intercept) 1.412135 1.221095 1.250768 1.617242 1.670427
## x1          0.063826 0.039196 0.044087 0.082888 0.087045

hist(residboot3lg, legend="separate")

par(mfrow=c(1,1))

残差ブートストラップでは、傾きの偏りも少なく初期標本による回帰モデルの推定値の精度を支持するような結果となっている。