Monday, 28 May 2012

 Setting up benchmark tests

This GSoC project aims to improve the execution time of some of the linear models implemented in scikit-learn. Before starting on any code enhancements, I want to evaluate how the current implementations compare with other state-of-the-art implementations. These evaluations will be based on real-world data sets with different characteristics. The following lists some of the ideas and the open questions that will be addressed next.

1 data sets


name      size        N (train/test)    p       #nz (train)         used in  format
a9a       2.3M/1.1M   32,561 / 16,281   123     451,592 (11%)       [2]      libsvm
real-sim  33.6M       72,309            20,958  3,709,083 (0.2%)    [2]      libsvm
rcv1      13.1M/432M  20,242 / 677,399  47,236  49,556,258 (0.15%)  [2]      libsvm
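The percentages in the #nz column can be read as the share of non-zero entries in the N x p data matrix. A minimal sketch of that calculation, assuming this is how the percentages were obtained (the function name `density` is made up for illustration):

```python
def density(nnz, n_samples, n_features):
    """Fraction of entries in an n_samples x n_features matrix that are non-zero."""
    return nnz / (n_samples * n_features)


# Figures copied from the a9a and real-sim rows of the table above.
a9a = density(451_592, 32_561, 123)
real_sim = density(3_709_083, 72_309, 20_958)

print(f"a9a:      {a9a:.1%}")
print(f"real-sim: {real_sim:.1%}")
```

Both values agree with the rounded percentages listed for these two data sets.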


name    size       #class  N (train/test)  p  #nz                used in  format
news20  3.6M/0.9M  20      15,935 /           9,097,916 (0.03%)


name                  size  N   p  #nz    used in  format
Prostate Cancer Data        97  9  dense  [3]      RData

[1] Friedman, J., T. Hastie, and R. Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33, no. 1 (2010): 1.
p.20 download data
[2] Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. “Strong Rules for Discarding Predictors in Lasso-type Problems.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2011).
p.3214 download data
[3] Tibshirani, R. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267–288.

Q: Which data sets should be used?
Q: Which format should be used to store them?
Q: Are there other regression-type data sets worth including?
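Most of the data sets above are distributed in the libsvm text format, where each line holds a label followed by `index:value` pairs for the non-zero features. As a rough illustration of what storing data in this format means, here is a small pure-Python sketch that parses such lines and derives N, p and #nz. The helper names and the three sample lines are made up for illustration, and the sketch ignores optional extras such as comments.

```python
def parse_libsvm_line(line):
    """Parse one 'label idx:val idx:val ...' line into (label, {idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features


def summarize(lines):
    """Return (N, p, nnz) for an iterable of libsvm-formatted lines."""
    n_samples, n_features, nnz = 0, 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _, features = parse_libsvm_line(line)
        n_samples += 1
        nnz += sum(1 for value in features.values() if value != 0.0)
        if features:
            # Feature indices are 1-based, so the largest index gives p.
            n_features = max(n_features, max(features))
    return n_samples, n_features, nnz


# Made-up sample in libsvm format (not taken from any of the data sets).
sample = ["+1 1:0.5 3:1", "-1 2:2.0", "+1 3:0.25 4:1"]
print(summarize(sample))
```

For real use, scikit-learn ships `sklearn.datasets.load_svmlight_file`, which reads this format into a sparse matrix directly.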

2 problems to benchmark
  • l2 loss*
  • log loss*
  • multi-logit*
* each with an l1 penalty and with a combined l1 & l2 penalty
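To make the problems concrete, the objective being minimized is a loss term plus a penalty term. The sketch below evaluates the l2 (squared) loss and the log loss together with a combined l1 & l2 penalty in plain Python. The parametrization with `lam` and `alpha` follows the convention used in [1] and is an assumption here, as are all function names; other libraries scale the terms differently, and multi-logit is left out for brevity.

```python
import math


def dot(x, beta):
    return sum(xj * bj for xj, bj in zip(x, beta))


def penalty(beta, lam, alpha):
    """lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2).

    alpha = 1 gives a pure l1 penalty, 0 < alpha < 1 mixes l1 and l2.
    """
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


def squared_loss(X, y, beta):
    """Mean over samples of 1/2 * (y_i - x_i . beta)^2."""
    return sum(0.5 * (yi - dot(x, beta)) ** 2 for x, yi in zip(X, y)) / len(y)


def log_loss(X, y, beta):
    """Mean over samples of log(1 + exp(-y_i * x_i . beta)), labels in {-1, +1}."""
    total = 0.0
    for x, yi in zip(X, y):
        margin = yi * dot(x, beta)
        # Numerically stable form of log(1 + exp(-margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(y)


# Tiny made-up example, only to show how loss and penalty combine.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, -1.0]
beta = [1.0, -2.0]
objective_l2 = squared_loss(X, y, beta) + penalty(beta, lam=0.1, alpha=0.5)
objective_log = log_loss(X, y, beta) + penalty(beta, lam=0.1, alpha=1.0)
print(objective_l2, objective_log)
```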

Q: Which settings should be used for benchmarking (penalty value etc.)?

2.1 external reference implementations
  • glmnet
    • glmnet-python (version?)
    • R glmnet package + rpy2 (latest version)
  • liblinear
    • liblinear + python interface (latest version)
Q: How should execution be timed to achieve a fair comparison?
Q: Which glmnet interface should be used?
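One possible answer to the timing question is to measure only the fitting call (not data loading or conversion between interfaces), repeat it several times, and report the minimum. The sketch below shows that idea using only the standard library. `time_fit` and `dummy_fit` are hypothetical names, and `dummy_fit` merely stands in for a real call into glmnet, liblinear or scikit-learn.

```python
import time


def time_fit(fit, repeats=5):
    """Return the minimum wall-clock time in seconds over `repeats` calls of fit()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit()
        timings.append(time.perf_counter() - start)
    return min(timings)


def dummy_fit():
    # Placeholder workload; a real benchmark would call the solver here,
    # with the data already loaded and converted beforehand.
    sum(i * i for i in range(100_000))


best = time_fit(dummy_fit, repeats=3)
print(f"best of 3: {best:.4f} s")
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity, though whether this is the fairest choice for all implementations remains an open question.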

3 tracking the speed of the scikit-learn implementations
Tracking of execution times for the scikit-learn implementations of the problems in (2) on the data sets in (1).
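A simple way to track execution times over the course of development would be to append one row per measurement to a CSV log. The sketch below is only one possible layout: the column names and the `append_result` helper are assumptions, and the timing value is a placeholder, not a measurement.

```python
import csv
import io
from datetime import date

FIELDS = ["date", "dataset", "problem", "seconds"]


def append_result(stream, day, dataset, problem, seconds):
    """Write one benchmark result as a CSV row to an open text stream."""
    csv.writer(stream).writerow([day.isoformat(), dataset, problem, f"{seconds:.4f}"])


# An in-memory stream is used for the demo; a real log would be a file on disk.
log = io.StringIO()
csv.writer(log).writerow(FIELDS)  # header row
append_result(log, date(2012, 5, 28), "a9a", "log loss + l1", 0.0)  # placeholder value

rows = list(csv.reader(io.StringIO(log.getvalue())))
print(rows)
```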


Q: Is there already some example code available for scikit-learn?