Google Summer of Code (GSOC) is a three-month program in which selected

college students get a stipend from Google to work on ambitious open

source software projects. The program gives students the opportunity to

participate in real-life projects, thereby learning to work

- in a non-university environment,

- in big distributed teams,

- in open source communities with the guidance of an experienced mentor.

After the program, participants should be able to work on such projects

in a self-motivated and self-reliant way.

For further information on GSOC, visit the website:

http://www.google-melange.com/gsoc/homepage/google/gsoc2012

2) Scikit-learn.org

Scikit-learn is a well-established Python machine learning library that

is becoming a part of the Python scientific computing ecosystem, alongside NumPy and

SciPy. Scikit-learn is designed to be very user-friendly, with a simple and consistent API and extensive documentation.

Strictly enforced coding standards and high test coverage guarantee high

quality implementations. Behind Scikit-learn is a very active community,

steadily improving the library.

3) My Contribution

My goal was to implement cutting-edge research results for the penalized

linear models in Scikit-learn. During the three-month period, my main implementations were:

- Extending the coordinate descent solver for the penalized linear

models with covariance updates [0] by:

  - prototyping the solver in Python,

  - gaining a speedup using Cython,

  - using BLAS calls where appropriate,

  - varying the ways to cache vector operations.

The final implemented solver was, however, not faster than the current

implementation in Scikit-learn. Since my time for this task was limited,

I had to move on to the next implementation. The speedup here

remains an open challenge.

- Implementing the Strong Rules [1], a heuristic method for estimating which

coefficients are non-zero in elastic-net penalized linear regression

models. The aim of the Strong Rules is to reduce the input variables to the

relevant ones before fitting the model, thereby considerably reducing

the computing time of the solver. By implementing the Strong Rules

together with an active-set strategy, I achieved considerable speedups

for certain problems. For none of the problems was the performance worse than with

the current Scikit-learn implementation, so overall the solver is

an improvement for the library.
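
To make the two ideas above more concrete, here is a minimal NumPy sketch. It is not the code I wrote for Scikit-learn. It is restricted to the plain lasso (no elastic-net mixing), it assumes the objective 1/(2n) * ||y - Xw||^2 + alpha * ||w||_1, and all function names (`lasso_cd_covariance`, `strong_rule_keep`, `lasso_with_strong_rule`) are hypothetical. The first function is coordinate descent with covariance updates in the spirit of [0]. The other two apply the sequential strong rule from [1] and then check the optimality conditions on the discarded variables, adding back any violators.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd_covariance(X, y, alpha, w=None, features=None,
                        max_iter=1000, tol=1e-10):
    """Coordinate descent restricted to `features`, using covariance updates.

    Instead of updating the residual vector, we maintain q = X.T @ (y - X @ w)
    and cache the Gram column X.T @ x_j the first time w[j] changes.
    """
    n, p = X.shape
    w = np.zeros(p) if w is None else np.array(w, dtype=float)
    features = np.arange(p) if features is None else np.asarray(features)
    col_sq = (X ** 2).sum(axis=0)      # <x_j, x_j> for every column
    q = X.T @ y                        # equals X.T @ (y - X @ w) once w is applied
    gram = {}                          # lazily cached Gram columns
    for k in np.flatnonzero(w):        # account for a warm start
        gram[k] = X.T @ X[:, k]
        q -= gram[k] * w[k]
    for _ in range(max_iter):
        max_delta = 0.0
        for j in features:
            if col_sq[j] == 0.0:       # skip all-zero columns
                continue
            rho = q[j] + col_sq[j] * w[j]   # x_j . (partial residual)
            w_new = soft_threshold(rho, n * alpha) / col_sq[j]
            delta = w_new - w[j]
            if delta != 0.0:
                if j not in gram:
                    gram[j] = X.T @ X[:, j]
                q -= gram[j] * delta
                w[j] = w_new
                max_delta = max(max_delta, abs(delta))
        if max_delta < tol:            # stop when no coefficient moved
            break
    return w


def strong_rule_keep(X, y, w_prev, alpha_prev, alpha):
    """Sequential strong rule: keep j unless |x_j . r| / n < 2*alpha - alpha_prev,
    where r is the residual of the solution w_prev at the larger alpha_prev."""
    n = X.shape[0]
    corr = np.abs(X.T @ (y - X @ w_prev)) / n
    return corr >= 2.0 * alpha - alpha_prev


def lasso_with_strong_rule(X, y, alpha, w_prev, alpha_prev, kkt_tol=1e-7):
    """Solve on the screened set, then re-add discarded KKT violators."""
    n = X.shape[0]
    w = np.array(w_prev, dtype=float)
    keep = strong_rule_keep(X, y, w, alpha_prev, alpha) | (w != 0)
    while True:
        w = lasso_cd_covariance(X, y, alpha, w=w,
                                features=np.flatnonzero(keep))
        grad = np.abs(X.T @ (y - X @ w)) / n
        violators = ~keep & (grad > alpha + kkt_tol)
        if not violators.any():
            return w
        keep |= violators              # the rule is a heuristic, so repair it
```

A typical use would start from `alpha_max = max|X.T @ y| / n`, where the solution is all zeros, and then call `lasso_with_strong_rule` along a decreasing sequence of `alpha` values, passing each solution as `w_prev` for the next one. The check for violators is what makes the heuristic safe: the strong rule alone can, in rare cases, discard a variable that belongs in the model.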

During the project, I ran into some problems and learned the following

lessons:

- Don't underestimate difficulties not explicitly mentioned in research

papers!

- High-performance solvers are much harder to implement than their

classroom counterparts ;)

4) Why I would recommend GSOC & Scikit-learn

First of all, the mentoring was really great. Many thanks to Alexandre

Gramfort, who responded to my problems very quickly and supported me

day and night ;). Further, there is a very active mailing list, as well as

code reviews on GitHub.

Three reasons to participate in Scikit-learn:

- you can work on a clean code base,

- you have fast development cycles,

- you get support from a very active and friendly developer base.

I was lucky to meet some of them personally shortly after GSOC at the

EuroSciPy conference (thanks to NumFOCUS for the sponsorship).

5) Future Contribution

I intend to make some code enhancements to my Strong Rules implementation,

so that it can be included in the next Scikit-learn release, and I will

certainly continue to contribute to this great project in the future.

[0] Friedman, J., T. Hastie, and R. Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” *Journal of Statistical Software* 33, no. 1 (2010): 1.

[1] Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. “Strong Rules for Discarding Predictors in Lasso-type Problems.” *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* (2011).