This GSoC project aims to improve the execution time of some of the linear models implemented in scikit-learn. Before starting on any code enhancements, I want to evaluate how the current implementations compare with other state-of-the-art implementations. These evaluations will be based on real-world data sets with different characteristics. The following lists some of the ideas and the open questions that will be addressed next.
1. data sets
binary:
name | size | N (train/test) | p | #nz (train) | used in | format |
Leukemia | 1.9M | 72 | 3571 | dense | [1] | RData |
Newsgroup | 9.4M | 11,314 | 777,811 | 0.05% | [1] | RData |
Internet-Ad | 49K | 2359 | 1430 | 1.2% | [1] | RData |
a9a | 2.3M/1.1M | 32,561 / 16,281 | 123 | 451,592 (11%) | [2] | libsvm |
real-sim | 33.6M | 72,309 | 20,958 | 3,709,083 (0.2%) | [2] | libsvm |
rcv1 | 13.1M/432M | 20,242 / 677,399 | 47,236 | 49,556,258 (0.15%) | [2] | libsvm |
multiclass:
name | size | #class | N(train/test) | p | #nz | used in | format |
news20 | 3.6M/0.9M | 20 | 15,935 / 3,993 | 1,355,191 | 9,097,916 (0.03%) | [2] | libsvm |
Cancer | 22M | 14 | 144 | 16,063 | dense | [1] | RData |
regression:
name | size | N | p | #nz | used in | format |
Prostate Cancer Data | | 97 | 9 | dense | [3] | RData |
ref:
[1] Friedman, J., T. Hastie, and R. Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33, no. 1 (2010): 1.
p.20 download data
[2] Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. “Strong Rules for Discarding Predictors in Lasso-type Problems.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2011).
p.3214 download data
[3] Tibshirani, R. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267–288.
Q: Which data sets should be used?
Q: Which format should be used to store them?
Q: Are there other regression data sets that should be included?
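As a rough illustration of the N, p and #nz columns in the tables above, the sketch below computes them from lines in libsvm text format (`label index:value index:value ...`). It is only a minimal sketch: the function names `parse_libsvm_line` and `density_stats` are made up for this post, feature indices are assumed to be 1-based, p is estimated as the largest index seen (which can underestimate the true p), and multi-label lines are not handled.

```python
def parse_libsvm_line(line):
    """Parse one libsvm-format line into (label, {feature_index: value}).

    Returns None for blank or comment-only lines.
    """
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line:
        return None
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def density_stats(lines):
    """Return (N, p, nnz, density) for an iterable of libsvm-format lines."""
    n_samples = 0
    max_index = 0  # assumption: 1-based indices, so this estimates p
    nnz = 0
    for line in lines:
        parsed = parse_libsvm_line(line)
        if parsed is None:
            continue
        _, features = parsed
        n_samples += 1
        nnz += sum(1 for value in features.values() if value != 0.0)
        if features:
            max_index = max(max_index, max(features))
    total = n_samples * max_index
    density = nnz / total if total else 0.0
    return n_samples, max_index, nnz, density


# Tiny made-up sample, not taken from any of the data sets above.
sample = [
    "+1 1:0.5 3:1.0",
    "-1 2:2.0",
    "+1 1:1.0 2:1.0 4:0.25",
]
n, p, nnz, density = density_stats(sample)
print(n, p, nnz, f"{density:.1%}")  # prints: 3 4 6 50.0%
```

For a real file, the same function can be fed an open file object, since it only iterates over lines.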
2. problems to benchmark
- l2 loss*
- log loss*
- multi-logit*
Q: Which settings should be used for benchmarking (penalty value etc.)?
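To make the three problems concrete, here is a minimal sketch of the per-sample losses in plain Python. The conventions are assumptions made for this illustration, not necessarily what scikit-learn, glmnet or liblinear use internally: the l2 loss carries a factor 1/2, binary labels are coded as -1/+1, and the penalty is written as an elastic-net term with the hypothetical parameter names `alpha` and `l1_ratio`.

```python
import math


def l2_loss(y, prediction):
    """Squared-error loss for one sample (regression)."""
    return 0.5 * (y - prediction) ** 2


def log_loss(y, score):
    """Logistic loss for one sample; y is -1 or +1, score is the linear output."""
    z = -y * score
    # log(1 + exp(z)), rearranged so that exp() never overflows
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def multi_logit_loss(true_class, scores):
    """Multinomial logistic loss: -log softmax(scores)[true_class]."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[true_class]


def penalized_objective(losses, weights, alpha, l1_ratio):
    """Mean loss plus an elastic-net penalty on the weights.

    alpha scales the penalty, l1_ratio mixes l1 (1.0) and l2 (0.0);
    both names are illustrative assumptions.
    """
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    penalty = alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)
    return sum(losses) / len(losses) + penalty


print(l2_loss(3.0, 1.0))  # prints: 2.0
```

The penalty value asked about above corresponds to `alpha` in this sketch; a benchmark would typically run each solver over a grid of such values rather than a single one.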
2.1 external reference implementations
- glmnet
- glmnet-python (version?)
- R glmnet package + rpy2 (latest version)
- liblinear
- liblinear + python interface (latest version)
Q: Which glmnet interface should be used?
3. tracking the speed of the scikit-learn implementations during development
Track the execution times of the scikit-learn implementations of the problems in (2) on the data sets in (1).
Candidate tool: vbench
Q: Is there already some example code available for scikit-learn?
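Independently of the tool that ends up being used, the core of such tracking is a loop that times every solver on every data set. The following is only a sketch of that idea using the standard library; it does not use or imitate the vbench API, the names `time_call` and `run_benchmarks` are hypothetical, and the workload is a stand-in rather than an actual estimator.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def run_benchmarks(solvers, datasets, repeats=3):
    """Time every solver on every data set.

    Returns a dict mapping (solver_name, data_name) to seconds.
    """
    results = {}
    for data_name, data in datasets.items():
        for solver_name, fit in solvers.items():
            results[(solver_name, data_name)] = time_call(
                fit, data, repeats=repeats
            )
    return results


# Stand-in workloads; a real run would plug in fit functions for the
# scikit-learn estimators and the data sets listed in section 1.
solvers = {"sum_of_squares": lambda xs: sum(x * x for x in xs)}
datasets = {"small": list(range(1_000)), "large": list(range(100_000))}

results = run_benchmarks(solvers, datasets)
for (solver_name, data_name), seconds in sorted(results.items()):
    print(f"{solver_name:15s} {data_name:6s} {seconds:.6f}s")
```

Taking the minimum over several repeats reduces noise from other processes; storing the results together with the revision of the code would then allow regressions to be spotted over time.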