Sonntag, 29. April 2012

Samstag, 14. April 2012

This is my proposal for the Google Summer of Code 2012.

Optimizing sparse linear models using coordinate descent and strong rules.


Scikit-learn is a Python machine learning library that aims to be easy to use through a well designed interface. Dependencies are kept to a minimum and the extensive use of NumPy, SciPy and Matplotlib give great computing power and ease the understanding of the codebase.
Data sets with far more features than samples are rapidly gaining on popularity and bring the class of linear models back to focus. This project will bring additional state of the art optimization routines for sparse linear models to scikit-learn and even further reduce dependencies. This task includes high test coverage, documentation, examples and intensive benchmarking.

Linear models are used for regression as well as for classification tasks. In both cases penalty terms can be used to obtain an implicit feature selection. To efficiently solve these penalized linear models is the main focus of this project.

The here proposed improvements will be beneficial for a wide range of general problems and specialized domains such as gene expression analysis, compressed sensing, and sparse encoding.

During the project, I will implement glmnet[1], a cyclic coordinate descent algorithm to solve penalized linear regression and classification problems. Considerable speedup is achieved for example through the use of weighted updates and active set of features.
With the implementation of strong rules[2] a large number of variables can be excluded before the optimization. This rule is especially powerful in sparse problems. A speedup factor of 20 and more is reported[1] for certain problems over the standard glmnet implementation.
As a complement to glmnet the LARS algorithm will be extended to deal with elastic net (L1 +L2) penalties.