Tuesday, September 18, 2012

My Google Summer of Code (2012) Experience with scikit-learn

1) Google Summer of Code
Google Summer of Code (GSOC) is a three-month program in which selected
college students receive a stipend from Google to work on ambitious open
source software projects. The program gives students the opportunity to
participate in real-life projects, thereby learning to work
- in a non-university environment,
- in big, distributed teams,
- in open source communities under the guidance of an experienced mentor.

After the program, participants should be able to work on such projects
in a self-motivated and self-reliant way.
For further information on GSOC, visit the website:
http://www.google-melange.com/gsoc/homepage/google/gsoc2012


2) Scikit-learn.org
Scikit-learn is a well-established Python machine learning library that
is becoming part of the Python scientific computing ecosystem alongside
NumPy and SciPy. Scikit-learn is designed to be very user-friendly, with
a simple and consistent API and extensive documentation. Strictly
enforced coding standards and high test coverage guarantee high-quality
implementations. Behind Scikit-learn is a very active community,
steadily improving the library.
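To give an impression of that consistent API, here is a tiny usage sketch with made-up toy data (I picked Lasso because it is one of the penalized linear models I worked on):

    import numpy as np
    from sklearn.linear_model import Lasso

    # Toy data, only to illustrate the fit/predict pattern.
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0.0, 1.0, 2.0, 3.0])

    model = Lasso(alpha=0.1)       # estimators are configured in the constructor,
    model.fit(X, y)                # trained with fit(X, y),
    print(model.coef_)             # expose fitted attributes with a trailing underscore,
    print(model.predict([[4.0]]))  # and make predictions with predict(X)

Every estimator in the library follows this same pattern, which is what makes it so easy to swap models in and out.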

3) My Contribution
My goal was to implement cutting-edge research results for the penalized
linear models in Scikit-learn. During the three-month period, my main
implementations were:

- Extending the coordinate descent solver for the penalized linear
models with covariance updates [0] by:
- prototyping the solver in Python,
- gaining a speedup using Cython,
- using BLAS calls, where appropriate,
- varying the ways to cache vector operations.

The final solver was, however, not faster than the current
implementation in Scikit-learn. Since my time for this task was limited,
I had to move on to the next implementation; the speedup here remains an
open challenge.
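To make the covariance-update idea concrete, here is a minimal Python sketch of coordinate descent for the lasso, roughly following [0]. It is an illustrative prototype under simplifying assumptions (dense data, the full Gram matrix precomputed up front), not the Cython solver I worked on:

    import numpy as np

    def soft_threshold(z, gamma):
        # Soft-thresholding operator S(z, gamma).
        return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

    def lasso_cd_covariance(X, y, alpha, n_iter=100, tol=1e-6):
        # Coordinate descent for (1/(2n))*||y - X b||^2 + alpha*||b||_1.
        # Covariance updates: the inner products <x_j, y> and <x_j, x_k>
        # are computed once and reused, instead of updating the full
        # residual vector after every single coordinate step.
        n_samples, n_features = X.shape
        Xty = np.dot(X.T, y) / n_samples   # <x_j, y> / n, computed once
        gram = np.dot(X.T, X) / n_samples  # <x_j, x_k> / n, cached (here: all of it)
        diag = np.diag(gram)               # <x_j, x_j> / n
        beta = np.zeros(n_features)
        for _ in range(n_iter):
            max_delta = 0.0
            for j in range(n_features):
                # Correlation of x_j with the partial residual, expressed
                # purely through the cached inner products.
                rho = Xty[j] - np.dot(gram[j], beta) + diag[j] * beta[j]
                new_bj = soft_threshold(rho, alpha) / diag[j]
                max_delta = max(max_delta, abs(new_bj - beta[j]))
                beta[j] = new_bj
            if max_delta < tol:
                break
        return beta

In the real solver described in [0], the Gram entries are only computed for features that have actually entered the model, which is where the potential savings over the naive residual updates come from.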

- Implementing Strong Rules [1], a heuristic method to estimate the
non-zero coefficients of elastic-net penalized linear regression
models. The aim of Strong Rules is to reduce the input variables to the
relevant ones before fitting the model, thereby considerably reducing
the computing time of the solver. By implementing the Strong Rules
together with an active-set strategy, I achieved considerable speedups
for certain problems. On no problem was the performance worse than with
the current Scikit-learn implementation, so overall the solver is an
improvement for the library.
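For illustration, here is a minimal sketch of the sequential strong rule for the plain lasso with objective 1/2*||y - X b||^2 + lam*||b||_1, as in [1]. The function names are my own, and the elastic-net variant I actually implemented differs in the details:

    import numpy as np

    def strong_rule_candidates(X, y, beta_prev, lam, lam_prev):
        # Sequential strong rule from [1]: when moving from lam_prev to
        # the smaller penalty lam, keep feature j only if
        #     |x_j' r(lam_prev)| >= 2*lam - lam_prev,
        # where r(lam_prev) is the residual of the previous solution.
        residual = y - np.dot(X, beta_prev)
        corr = np.abs(np.dot(X.T, residual))
        return np.flatnonzero(corr >= 2.0 * lam - lam_prev)

    def kkt_violations(X, y, beta, lam, discarded):
        # The rule is only a heuristic, so after fitting on the reduced
        # feature set the KKT conditions are checked on the discarded
        # features; any violators must be added back and the model refit.
        discarded = np.asarray(discarded)
        residual = y - np.dot(X, beta)
        corr = np.abs(np.dot(X[:, discarded].T, residual))
        return discarded[corr > lam]

Along the regularization path these two steps alternate: filter the features with the rule, fit on the survivors (warm-started from the previous solution), then check the KKT conditions and refit if anything was wrongly discarded.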

During the project, I ran into some problems and took away the following
lessons:
- Don't underestimate difficulties that are not explicitly mentioned in
research papers!
- High-performance solvers are much harder to implement than their
classroom counterparts ;)


4) Why I would recommend GSOC & Scikit-learn
First of all, the mentoring was really great. Many thanks to Alexandre
Gramfort, who responded to my problems very quickly and supported me
day and night ;). Further, there is a very active mailing list, and code
reviews take place on GitHub.

3 reasons to participate in Scikit-learn:
- You can work on a clean code base,
- you have fast development cycles,
- you get support from a very active and friendly developer community.

I was lucky to meet some of them personally shortly after GSOC at the
EuroSciPy conference (thanks to NumFOCUS for the sponsorship).


5) Future Contribution
I intend to do some code enhancements for my Strong-Rule implementation,
so that it can be included in the next Scikit-learn release and I will
certainly continue to contribute to this great project in the future.

[0] Friedman, J., T. Hastie, and R. Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33, no. 1 (2010): 1.

[1] Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. “Strong Rules for Discarding Predictors in Lasso-type Problems.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2011).
