Monday, May 28, 2012

 Setting up benchmark tests


This GSoC project aims to improve the execution time of some of the linear models implemented in scikit-learn. Before I start with any code enhancements, I want to evaluate how the current implementations compare with other state-of-the-art implementations. These evaluations will be based on real-world data sets with different characteristics. The following outlines some of the ideas and the questions that will be addressed next.


1. Data sets

binary:

name         | size       | N (train/test)   | p        | #nz (train)        | used in | format
Leukemia     | 1.9M       | 72               | 3,571    | dense              | [1]     | RData
Newsgroup    | 9.4M       | 11,314           | 777,811  | 0.05%              | [1]     | RData
Internet-Ad  | 49K        | 2,359            | 1,430    | 1.2%               | [1]     | RData
a9a          | 2.3M/1.1M  | 32,561 / 16,281  | 123      | 451,592 (11%)      | [2]     | libsvm
real-sim     | 33.6M      | 72,309           | 20,958   | 3,709,083 (0.2%)   | [2]     | libsvm
rcv1         | 13.1M/432M | 20,242 / 677,399 | 47,236   | 49,556,258 (0.15%) | [2]     | libsvm



multiclass:

name   | size      | #class | N (train/test) | p         | #nz               | used in | format
news20 | 3.6M/0.9M | 20     | 15,935 / 3,993 | 1,355,191 | 9,097,916 (0.03%) | [2]     | libsvm
Cancer | 22M       | 14     | 144            | 16,063    | dense             | [1]     | RData




regression:

name                 | size | N  | p | #nz   | used in | format
Prostate Cancer Data |      | 97 | 9 | dense | [3]     | RData


References:
[1] Friedman, J., T. Hastie, and R. Tibshirani. "Regularization Paths for Generalized Linear Models via Coordinate Descent." Journal of Statistical Software 33, no. 1 (2010): 1–22. (Data download described on p. 20.)
[2] Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. "Strong Rules for Discarding Predictors in Lasso-type Problems." Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2011). (Data download described on p. 3214.)
[3] Tibshirani, R. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B (Methodological) (1996): 267–288.

Q: Which data sets should be used?
Q: In which format should they be stored?
Q: Are there other regression-type data sets?
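
Since the listed data sets come in two formats (libsvm/svmlight text files and RData workspaces), a small loading helper will be needed. A minimal sketch of reading both formats from Python is shown below; the file names and the names/structure of the objects inside the RData workspaces are placeholders and will depend on how the files are actually packaged.

import numpy as np
from sklearn.datasets import load_svmlight_file
import rpy2.robjects as robjects

# libsvm/svmlight format (a9a, real-sim, rcv1, news20): returns a sparse
# CSR matrix and the target vector.
X_train, y_train = load_svmlight_file("a9a")
X_test, y_test = load_svmlight_file("a9a.t", n_features=X_train.shape[1])

# RData format (Leukemia, Newsgroup, Internet-Ad, Cancer, prostate):
# load the R workspace through rpy2 and pull the objects out of R's
# global environment; the object name 'Leukemia' and its 'x'/'y'
# components are placeholders.
robjects.r["load"]("Leukemia.RData")
leukemia = robjects.globalenv["Leukemia"]
X = np.array(leukemia.rx2("x"))
y = np.array(leukemia.rx2("y"))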


2. Problems to benchmark
  • l2 loss*
  • log loss*
  • multi-logit*
* each with an l1 penalty and with a combined l1 & l2 (elastic net) penalty

Q: Which settings should be used for benchmarking (penalty values etc.)?
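
As a starting point, a rough mapping of these problems onto scikit-learn estimators is sketched below; the regularization values (alpha, rho, C) are placeholders, not the benchmark settings, which is exactly the open question above.

from sklearn.linear_model import Lasso, ElasticNet, LogisticRegression

# l2 loss with l1 penalty, and with a combined l1 & l2 penalty
# (rho controls the l1/l2 mix)
lasso = Lasso(alpha=0.1)
enet = ElasticNet(alpha=0.1, rho=0.5)

# log loss with l1 penalty (binary classification, liblinear solver)
logreg_l1 = LogisticRegression(penalty="l1", C=1.0)

# multi-logit: scikit-learn's LogisticRegression handles multiclass problems
# one-vs-rest, while glmnet fits a true multinomial model; this has to be
# kept in mind when comparing the two.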

2.1 External reference implementations
  • glmnet
    • glmnet-python (which version?)
    • R glmnet package + rpy2 (latest version)
  • liblinear
    • liblinear + Python interface (latest version)
Q: How should execution be timed to achieve a fair comparison?
Q: Which glmnet interface should be used?
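
Regarding the timing question, one option is a small harness that times only the fit call (data loading and conversion to each library's native format happen outside the timed region), repeats it a few times, and keeps the best wall-clock time so that transient system load does not penalize one library over another. A minimal sketch, assuming every wrapped implementation is exposed as a zero-argument callable:

import time

def bench(fit, n_repeat=5):
    # return the best wall-clock time (in seconds) over n_repeat calls
    best = float("inf")
    for _ in range(n_repeat):
        t0 = time.time()
        fit()                      # e.g. lambda: clf.fit(X_train, y_train)
        best = min(best, time.time() - t0)
    return best

Whether wall-clock time or CPU time is the fairer measure, especially when R is called through rpy2, is still part of the open question.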


3. Tracking the speed development of the scikit-learn implementations
Track the execution times of the scikit-learn implementations of the problems in (2) on the data sets in (1) over the course of development, so that speed regressions and improvements become visible.

Candidate tool: vbench

Q: Is there already some example code available for scikit-learn?
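
pandas already tracks its performance with vbench, and a benchmark entry there is defined roughly as in the sketch below; the data set, estimator, and the exact vbench API details are assumptions that still have to be checked against the vbench documentation.

from vbench.benchmark import Benchmark

setup = """
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
X, y = load_svmlight_file('a9a')               # placeholder data set
clf = LogisticRegression(penalty='l1', C=1.0)  # placeholder estimator/settings
"""

bench_logreg_l1_a9a = Benchmark("clf.fit(X, y)", setup, name="logreg_l1_a9a")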