Subsample Winner Algorithm for Large-p Regression

Y. Fan, J. Sun and X. Qiao

The Subsample Winner Algorithm (SWA) for regression finds the important variables (features) among p candidate variables (X) for explaining a response (Y) in large-p regression data (X, Y). The SWA is analogous to how a national merit scholar is selected. It first applies a base procedure, here a linear regression, to each of many subsamples drawn from the p variables, and then scores all p features according to their performance across these subsample analyses. It selects the `semifinalists' based on the resulting scores and finally determines the `finalists,' i.e., the important features, from among the semifinalists.

Table of Contents
1. Software subsamp
2. How to install, load and unload the package subsamp in R
3. How to run subsamp in R
4. References

1. Software subsamp

The SWA implementation, subsamp(), lets the user specify the parameters (s, q, m), where "s" is the subsample size, "q" is the number of semifinalists, and "m" is the number of subsamples (the iteration size). The output of subsamp() is a list of the finalists, i.e., the important features together with their p-values, estimated coefficients and other corresponding statistics, with an option to plot a multi-panel weights plot (W-plot) that guides the selection of the subsample size "s". Here is the default usage:

subsamp(x, y, s, m = 1000, qnum, wplot = F)

Thus the default number of subsamples is m = 1000, and "qnum" is the argument corresponding to "q" above. To view the multi-panel plot for confirming or selecting "s", set "wplot = TRUE" inside "subsamp".
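
To make the roles of (s, q, m) concrete, below is a minimal sketch of the subsample-and-score idea in plain R. It is not the subsamp() implementation: the scoring rule (average absolute t-statistic over the subsamples), the function name swa_sketch, and the simplified finalist step are illustrative assumptions only.

  # Illustrative sketch of the subsample-winner idea (NOT the subsamp() source code).
  # Assumes s is well below the number of observations, so each subsample regression is well posed.
  swa_sketch <- function(x, y, s, qnum, m = 1000) {
    p <- ncol(x)
    score <- numeric(p)
    hits  <- numeric(p)
    for (b in 1:m) {
      idx <- sample(p, s)                         # draw a random subsample of s features
      fit <- lm(y ~ x[, idx])                     # base procedure: ordinary linear regression
      tval <- abs(summary(fit)$coefficients[-1, "t value"])
      score[idx] <- score[idx] + tval             # credit each feature for its performance
      hits[idx]  <- hits[idx] + 1
    }
    avg  <- ifelse(hits > 0, score / hits, 0)     # average score per feature
    semi <- order(avg, decreasing = TRUE)[1:qnum] # 'semifinalists': top-q features by score
    lm(y ~ x[, semi])                             # simplified finalist step: refit on the semifinalists
  }

Averaging the scores over many random subsamples lets every feature compete in many small, well-posed regressions even though the full p > n regression is ill-posed.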

2. How to install, load and unload the package subsamp in R

In the following, R commands are shown indented, followed by their output.
  install.packages("subsamp")   # install subsamp package 
  library(subsamp) 		# load subsamp package 
  detach(package:subsamp)   	# unload subsamp package 
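
To confirm that the package is installed, one can check its version with base R (shown here only as a convenience; any reported version is fine):

  packageVersion("subsamp")     # prints the installed version of subsamp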

3. How to run subsamp in R

Load the package and get help:
  library(subsamp)
  ?subsamp                      #To see the help file
Consider our built-in example by running the following R commands:
  n <- 80; p <- 100               # n observations, p candidate features (p > n)
  set.seed(2017)
  x <- matrix(rnorm(n*p), n)      # n x p design matrix of standard normal features
  coefs <- c(.1, .5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 5)
  y <- x[, 1:length(coefs)]%*%coefs + rnorm(n)   # only the first 10 features carry signal
  d <- data.frame(y, x)           # columns named y, X1, ..., X100
The data frame "d" contains 10 true features. The first three, X1-X3, are hard to distinguish from the noise; X4 and X5 carry a weak signal, X6 a moderate signal, and X7-X10 a strong signal relative to the noise in the data. A good feature-selection procedure should capture at least X7-X10. The ideal "s" is therefore 10 or slightly larger.
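
As a quick sanity check (not part of the SWA workflow), the marginal correlations between the ten true features and y mirror these signal strengths; the values for X7-X10 should be noticeably larger than those for X1-X3:

  round(cor(x[, 1:10], y), 2)     # marginal correlations of the 10 true features with y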

One can first try standard linear regression with lm():

    lm(y ~ ., data = d) 
 
As expected, the fit is rank deficient (p = 100 predictors but only n = 80 observations), so its output is not useful and is not shown here.
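
To quantify this: with 100 predictors plus an intercept but only 80 observations, lm() can estimate at most 80 coefficients and must report the rest as NA. A quick check (the name full.fit is just a throwaway label used here):

  full.fit <- lm(y ~ ., data = d)
  sum(is.na(coef(full.fit)))      # coefficients lm() cannot estimate; 21 with this design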

Next, we run the SWA with the default values for m and wplot, and with a smaller value of s than ideal (since in practice the ideal value is unknown):

   subsamp(x, y, s=8, qnum=10)  	#1st run 
 
which leads to the following output:

Call:
lm(formula = y ~ X10 + X9 + X8 + X7 + X82 + X47 + X5, data = Data.final)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7524 -2.0673 -0.1295  1.3865  6.2082 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1205     0.3183   0.379   0.7061    
X10           5.1715     0.3214  16.090  < 2e-16 ***
X9            4.0475     0.3197  12.658  < 2e-16 ***
X8            3.4370     0.3354  10.249 1.03e-15 ***
X7            2.6745     0.3318   8.062 1.18e-11 ***
X82           0.5581     0.3245   1.720   0.0898 .  
X47           0.5766     0.3478   1.658   0.1016    
X5            1.5756     0.3529   4.465 2.91e-05 ***
---
Signif. codes:  0  `***'   0.001 `**'   0.01 `*'  0.05 `.' 0.1 ` ' 1 

Residual standard error: 2.67 on 72 degrees of freedom
Multiple R-squared:  0.9102,    Adjusted R-squared:  0.9015 
F-statistic: 104.2 on 7 and 72 DF,  p-value: < 2.2e-16

This first application of SWA has captured X5 and X7-X10 (along with two noise features, X47 and X82), with an adjusted R^2 of 0.9015.
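
Since the data are simulated, one can also check the selection against the truth by hand; the vectors below are simply transcribed from the printed call above:

  truth    <- paste0("X", 1:10)                               # features with nonzero coefficients
  selected <- c("X10", "X9", "X8", "X7", "X82", "X47", "X5")  # finalists from the 1st run
  intersect(selected, truth)      # true features recovered (X10, X9, X8, X7, X5)
  setdiff(selected, truth)        # noise features picked up (X82, X47)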

Next, we try a bigger s and include a W-plot for diagnostics:

 #png("swa-ex1.png")                          #Uncomment this to save the resulting picture into a png file   
 subsamp(x, y, s=10, qnum=10, wplot=TRUE)     #2nd run 
 
which outputs:
Call:
lm(formula = y ~ X10 + X8 + X9 + X7 + X6 + X5 + X17 + X73, data = Data.final)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2751 -1.3235  0.0921  1.1333  3.5947 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.1025     0.2175   0.471   0.6389    
X10           5.0646     0.2153  23.524  < 2e-16 ***
X8            3.4123     0.2280  14.963  < 2e-16 ***
X9            3.7985     0.2189  17.354  < 2e-16 ***
X7            2.7257     0.2237  12.182  < 2e-16 ***
X6            1.8814     0.2042   9.215 9.42e-14 ***
X5            1.4865     0.2334   6.368 1.65e-08 ***
X17          -0.3072     0.2068  -1.485   0.1418    
X73           0.3280     0.1803   1.819   0.0731 .  
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` '  1

Residual standard error: 1.805 on 71 degrees of freedom
Multiple R-squared:  0.9595,    Adjusted R-squared:  0.955 
F-statistic: 210.4 on 8 and 71 DF,  p-value: < 2.2e-16

This second application of SWA has captured X5-X10, as expected from the truth; the two remaining noise features, X17 and X73, are not significant at the 5% level. The adjusted R^2 is 0.955, a substantial improvement over the 0.9015 from the first run. The multi-panel W-plot is shown below:
[swa-ex1.png]
Figure 1: Multi-panel W-plot for a set of s values.

We look for a value of s from which the plot is `stabilized' onward. By `stabilized' we mean that the upper-arm sets in the sub-panels of the W-plot are similar from that value of s, say 10, onward. This diagnostic weights plot indicates that s = 10 is a good choice.
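
If a numerical check is preferred in addition to the W-plot, one can simply re-run subsamp() for a few candidate values of s and see which finalists persist. The loop below is only a hypothetical convenience around the documented call, assuming the returned object prints like the summaries shown above; print() is used because auto-printing does not happen inside a loop.

  for (s.try in c(8, 10, 12, 14)) {            # illustrative candidate subsample sizes
    cat("\n----- s =", s.try, "-----\n")
    print(subsamp(x, y, s = s.try, qnum = 10)) # finalists that are stable across s support that choice of s
  }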

4. References

[1] Fan, Y. and Sun, J. (2017). Subsampling for Feature Selection in Large Regression Data. In revision.
[2] Qiao, X. and Sun, J. (2017). SWA-classification. Manuscript.
[3] Qiao, X., Sun, J., and Fan, Y. (2017). SWA-classification package. http://people.math.binghamton.edu/qiao/swa.html
[4] Mirror site of this package: sr2c.case.edu/swa-reg