GSOC 2012 and scikit-learn: June 2012

Thursday, 28 June 2012

Time Is Flying By

Good news: the refactoring of the elastic-net and lasso classes is done and was merged into scikit-learn master some time ago.

I invested some time learning about the HDF5 data format in order to upload some datasets to mldata.org. Sadly, the documentation of mldata.org is very sparse and, in the case of the HDF5 format it uses, also outdated. I have been in contact with the maintainer, but I still don't have a working spec, so I decided to set this task aside for the moment.

I have a working Python prototype of the new coordinate descent algorithm, which I am now about to integrate step by step into scikit-learn.

The scikit-learn version will be written in Cython to speed things up. I hope that I can beat the execution time of the current implementation soon.
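For readers who have not seen coordinate descent before, here is a minimal NumPy sketch of the textbook cyclic variant for the elastic net. It is not the new algorithm mentioned above and not scikit-learn's code; the function names `soft_threshold` and `enet_cd` are made up for this illustration, and the objective is assumed to be 1/(2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z towards zero by t (the proximal step for the L1 penalty)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def enet_cd(X, y, alpha=1.0, l1_ratio=0.5, n_iter=100):
    """Cyclic coordinate descent for the elastic net (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    # Penalties rescaled by n_samples so they match the 1/(2n) loss scaling.
    l1 = alpha * l1_ratio * n_samples
    l2 = alpha * (1.0 - l1_ratio) * n_samples
    w = np.zeros(n_features)
    residual = np.asarray(y, dtype=float).copy()  # equals y - X @ w for w = 0
    col_sq_norms = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(n_features):
            if col_sq_norms[j] == 0.0:
                continue  # an all-zero column cannot change the fit
            w_old = w[j]
            # Correlation of column j with the residual that excludes feature j.
            rho = X[:, j] @ residual + col_sq_norms[j] * w_old
            w[j] = soft_threshold(rho, l1) / (col_sq_norms[j] + l2)
            if w[j] != w_old:
                # Keep the residual in sync instead of recomputing X @ w.
                residual -= X[:, j] * (w[j] - w_old)
    return w
```

The inner loop over features is exactly the kind of code that is slow in pure Python, which is why a Cython version is attractive.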

Sunday, 10 June 2012

Refactoring linear model code and first benchmark plots

During the last week, I spent some time refactoring the Elastic Net and Lasso classes in scikit-learn. The goal is to obtain a single interface for the sparse and dense implementations of this model. The format of the input data determines whether the sparse implementation is used or not. This has the advantage that the user can use the same class for sparse and dense data.
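The idea of choosing the code path from the input format can be sketched as follows. This is only an illustration under my own assumptions: `TinyLinearModel`, `path_` and `xty_` are hypothetical names, and the product `X.T @ y` merely stands in for the real solver. Only `scipy.sparse.issparse` and `scipy.sparse.csc_matrix` are actual SciPy calls.

```python
import numpy as np
from scipy import sparse


class TinyLinearModel:
    """Hypothetical estimator showing format-based dispatch (not scikit-learn code)."""

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        if sparse.issparse(X):
            # Sparse path: CSC gives cheap access to individual columns.
            self.path_ = "sparse"
            X = sparse.csc_matrix(X)
        else:
            # Dense path: a plain float array.
            self.path_ = "dense"
            X = np.asarray(X, dtype=float)
        # Stand-in for the actual solver: both formats support this product.
        self.xty_ = np.asarray(X.T @ y).ravel()
        return self
```

The caller uses one class and never has to say which implementation to run:

```python
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
y = np.array([1.0, 2.0, 3.0])
dense_model = TinyLinearModel().fit(X, y)
sparse_model = TinyLinearModel().fit(sparse.csr_matrix(X), y)
```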

The first benchmark reports are available thanks to Vlad's work on adapting vbench to scikit-learn (check out his blog post).

The plot shows how the execution time for fitting an elastic-net penalized logistic regression on the leukemia data set changed over the past 180 days.

Scikit-learn currently wraps the LIBLINEAR library to fit logistic regression models. For comparison, the R package glmnet (version 1.7.4) took 188 milliseconds to fit the leukemia data set. Next week will bring more benchmark evaluations on different data sets.