Monday, May 28, 2012

 Setting up benchmark tests


This GSoC project aims to improve the execution time of some of the linear models implemented in scikit-learn. Before I start with any code enhancements, I want to evaluate how the current implementations compare with other state-of-the-art implementations. These evaluations will be based on real-world data sets with different characteristics. The following outlines some of the ideas and the questions that will be addressed next.


1. Data sets

binary:

name         | size       | N (train/test)   | p        | #nz (train)        | used in | format
Leukemia     | 1.9M       | 72               | 3,571    | dense              | [1]     | RData
Newsgroup    | 9.4M       | 11,314           | 777,811  | 0.05%              | [1]     | RData
Internet-Ad  | 49K        | 2,359            | 1,430    | 1.2%               | [1]     | RData
a9a          | 2.3M/1.1M  | 32,561 / 16,281  | 123      | 451,592 (11%)      | [2]     | libsvm
real-sim     | 33.6M      | 72,309           | 20,958   | 3,709,083 (0.2%)   | [2]     | libsvm
rcv1         | 13.1M/432M | 20,242 / 677,399 | 47,236   | 49,556,258 (0.15%) | [2]     | libsvm



multiclass:

name   | size      | #class | N (train/test) | p         | #nz               | used in | format
news20 | 3.6M/0.9M | 20     | 15,935 / 3,993 | 1,355,191 | 9,097,916 (0.03%) | [2]     | libsvm
Cancer | 22M       | 14     | 144            | 16,063    | dense             | [1]     | RData




regression:

name                 | size | N  | p | #nz   | used in | format
Prostate Cancer Data |      | 97 | 9 | dense | [3]     | RData


References:
[1] Friedman, J., T. Hastie, and R. Tibshirani. "Regularization Paths for Generalized Linear Models via Coordinate Descent." Journal of Statistical Software 33, no. 1 (2010): 1–22. (Data download described on p. 20.)
[2] Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. "Strong Rules for Discarding Predictors in Lasso-type Problems." Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2011). (Data download described on p. 3214.)
[3] Tibshirani, R. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B (Methodological) (1996): 267–288.

Q: Which data sets should be used?
Q: In which format should they be stored?
Q: Are there other regression-type data sets?
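
Since the listed data sets come in two formats (libsvm/svmlight text files and RData workspaces), a small loading helper will be needed. A minimal sketch of reading both formats from Python is shown below; the file names and the names/structure of the objects inside the RData workspaces are placeholders and will depend on how the files are actually packaged.

import numpy as np
from sklearn.datasets import load_svmlight_file
import rpy2.robjects as robjects

# libsvm/svmlight format (a9a, real-sim, rcv1, news20): returns a sparse
# CSR matrix and the target vector.
X_train, y_train = load_svmlight_file("a9a")
X_test, y_test = load_svmlight_file("a9a.t", n_features=X_train.shape[1])

# RData format (Leukemia, Newsgroup, Internet-Ad, Cancer, prostate):
# load the R workspace through rpy2 and pull the objects out of R's
# global environment; the object name 'Leukemia' and its 'x'/'y'
# components are placeholders.
robjects.r["load"]("Leukemia.RData")
leukemia = robjects.globalenv["Leukemia"]
X = np.array(leukemia.rx2("x"))
y = np.array(leukemia.rx2("y"))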


2. Problems to benchmark
  • l2 loss*
  • log loss*
  • multi-logit*
* each with an l1 penalty and with a combined l1 & l2 (elastic net) penalty

Q: Which settings should be used for benchmarking (penalty values etc.)?
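
As a starting point, a rough mapping of these problems onto scikit-learn estimators is sketched below; the regularization values (alpha, rho, C) are placeholders, not the benchmark settings, which is exactly the open question above.

from sklearn.linear_model import Lasso, ElasticNet, LogisticRegression

# l2 loss with l1 penalty, and with a combined l1 & l2 penalty
# (rho controls the l1/l2 mix)
lasso = Lasso(alpha=0.1)
enet = ElasticNet(alpha=0.1, rho=0.5)

# log loss with l1 penalty (binary classification, liblinear solver)
logreg_l1 = LogisticRegression(penalty="l1", C=1.0)

# multi-logit: scikit-learn's LogisticRegression handles multiclass problems
# one-vs-rest, while glmnet fits a true multinomial model; this has to be
# kept in mind when comparing the two.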

2.1 External reference implementations
  • glmnet
    • glmnet-python (which version?)
    • R glmnet package + rpy2 (latest version)
  • liblinear
    • liblinear + Python interface (latest version)
Q: How should execution be timed to achieve a fair comparison?
Q: Which glmnet interface should be used?
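
Regarding the timing question, one option is a small harness that times only the fit call (data loading and conversion to each library's native format happen outside the timed region), repeats it a few times, and keeps the best wall-clock time so that transient system load does not penalize one library over another. A minimal sketch, assuming every wrapped implementation is exposed as a zero-argument callable:

import time

def bench(fit, n_repeat=5):
    # return the best wall-clock time (in seconds) over n_repeat calls
    best = float("inf")
    for _ in range(n_repeat):
        t0 = time.time()
        fit()                      # e.g. lambda: clf.fit(X_train, y_train)
        best = min(best, time.time() - t0)
    return best

Whether wall-clock time or CPU time is the fairer measure, especially when R is called through rpy2, is still part of the open question.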


3. Tracking the speed development of the scikit-learn implementations
Track the execution times of the scikit-learn implementations of the problems in (2) on the data sets in (1) over the course of development, so that speed regressions and improvements become visible.

Candidate tool: vbench

Q: Is there already some example code available for scikit-learn?
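
pandas already tracks its performance with vbench, and a benchmark entry there is defined roughly as in the sketch below; the data set, estimator, and the exact vbench API details are assumptions that still have to be checked against the vbench documentation.

from vbench.benchmark import Benchmark

setup = """
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
X, y = load_svmlight_file('a9a')               # placeholder data set
clf = LogisticRegression(penalty='l1', C=1.0)  # placeholder estimator/settings
"""

bench_logreg_l1_a9a = Benchmark("clf.fit(X, y)", setup, name="logreg_l1_a9a")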