GSoC 2015 Proposal: Metric Learning module
Name: Artem Sobolev
Email: [github login with dash replaced by dot]@gmail.com
Github: Barmaley-exe
Blog: http://barmaley.exe.name/
I'm an MSc-level student in Computer Science and Software Engineering at Saint-Petersburg State University, Russia. I also study Data Mining at the Computer Science Center and the Yandex School of Data Analysis. I have completed several Machine Learning classes and did a project on recommendation systems, and I have done several internships, one of which was a research-oriented ML internship.
My proposal is to introduce a new module for Metric Learning. This is an established area of research whose methods would be valuable to have implemented in scikit-learn: the learned metrics can be used to improve distance-based classifiers (such as KNN) and clustering.
Most metric learning models learn a positive-semidefinite matrix A, which corresponds to a Mahalanobis distance. Writing A = L^T L (e.g. via the Cholesky decomposition), we get x^T A y = x^T L^T L y = (L x)^T (L y), so this is equivalent to a (linear) mapping of the data into a new space followed by the ordinary Euclidean distance. Thus, all of the, so to say, linear metric learners can be implemented as transformers: we just apply L from the decomposition to our data.
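To make this concrete, here is a minimal numpy sketch (not part of the proposed API) checking that the Mahalanobis distance under A coincides with the Euclidean distance after applying L:

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.RandomState(0)
x, y = rng.randn(5), rng.randn(5)

B = rng.randn(5, 5)
A = B.T.dot(B)   # an arbitrary positive definite matrix

L = cholesky(A)  # upper-triangular factor with A = L^T L

d_mahalanobis = np.sqrt((x - y).dot(A).dot(x - y))
d_euclidean = np.linalg.norm(L.dot(x) - L.dot(y))
assert np.allclose(d_mahalanobis, d_euclidean)
```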
When it comes to nonlinear metrics, there is an interesting trick known as the KernelPCA trick. Basically, one can simply `Pipeline` kernel PCA with a (linear) metric learning algorithm to get the same effect as training a kernelized version of the latter. Unfortunately, this does not work for every algorithm, but of those I am proposing (LMNN, NCA and ITML, more on that later), LMNN and NCA do work this way; ITML should not be combined with Kernel PCA.
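As an illustration, here is a hedged sketch of that composition; `LMNNTransformer` is the transformer proposed in this project (it does not exist in scikit-learn yet), and `X_train`, `y_train` are assumed to be given:

```python
from sklearn.pipeline import Pipeline
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import KNeighborsClassifier

pl = Pipeline([
    ('kpca', KernelPCA(kernel='rbf')),  # implicit nonlinear mapping
    ('lmnn', LMNNTransformer()),        # proposed linear metric learner on top
    ('knn', KNeighborsClassifier()),
])
pl.fit(X_train, y_train)  # behaves like a kernelized LMNN
```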
It's worth noting that there is a nonlinear version of NCA, which uses multilayer neural networks (a stack of RBMs) to find a nonlinear transformation f(x). The method seems quite heavy, and it is not clear how easily the current RBM implementation could be reused (the authors rely on fine-tuning by backpropagation). Therefore, I decided it is not worth implementing.
The core contribution of this project would be a `metric_learning` module (all names are preliminary) with several different algorithms. Each of them is a transformer that makes use of `y` during `fit`, where `y` is the usual vector of training labels, just as in classification. Another possible application is obtaining a similarity matrix according to the learned metric. Thus, there will be two transformers for each algorithm: one maps input data from the original space into a linearly transformed one, and the other maps input data into a square similarity matrix that can be used, for example, for clustering.
Each transformer will also have a `metric_` attribute to get an instance of `DistanceMetric` that can be used in KNN. For example:
```python
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

ml = LMNNTransformer()
knn = KNeighborsClassifier()
pl = Pipeline([('ml', ml), ('knn', knn)])  # steps are a list of (name, estimator) pairs
pl.fit(X_train, y_train)
pl.predict(X_test)
```
Similarity learning:
```python
from sklearn.pipeline import Pipeline
from sklearn.cluster import SpectralClustering

ml = LMNNSimilarity()
sc = SpectralClustering(affinity="precomputed")
pl = Pipeline([('ml', ml), ('sc', sc)])
# SpectralClustering has no predict method, so we use fit_predict
# to obtain cluster labels for the training data.
pl.fit_predict(X_train, y_train)
```
Alternatively, since similarity is just an RBF kernel on top of the usual distance, and to avoid code duplication, all the `Similarity` transformers can be implemented using an adapter similar to `OneVsRestClassifier` on top of the usual transformers.
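A minimal sketch of such an adapter, with hypothetical names (`SimilarityAdapter`, `gamma` for the RBF width) and assuming the wrapped transformer follows the usual fit/transform contract:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics.pairwise import euclidean_distances


class SimilarityAdapter(BaseEstimator, TransformerMixin):
    """Turn a metric-learning transformer into a similarity transformer
    by applying an RBF kernel to distances in the learned space."""

    def __init__(self, transformer, gamma=1.0):
        self.transformer = transformer
        self.gamma = gamma

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        self.X_fit_ = self.transformer.transform(X)
        return self

    def transform(self, X):
        # Squared Euclidean distances in the learned space, then RBF on top.
        Z = self.transformer.transform(X)
        D = euclidean_distances(Z, self.X_fit_, squared=True)
        return np.exp(-self.gamma * D)
```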
I propose to implement several highly recognized and widely cited algorithms:
- LMNN — Distance Metric Learning for Large Margin Nearest Neighbor Classification
- ITML — Information-Theoretic Metric Learning
- NCA — Neighbourhood Components Analysis (a sketch of its objective follows below). There is an issue to add it, but it never turned into a pull request.
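To make the NCA item concrete, here is a minimal numpy sketch of its objective from the Goldberger et al. paper: the expected number of correctly classified points under stochastic neighbour selection. The function name and signature are illustrative only:

```python
import numpy as np

def nca_objective(L, X, y):
    """NCA objective: expected number of points classified correctly
    when each point picks a neighbour with softmax probability."""
    Z = X.dot(L.T)                          # project the data
    diff = Z[:, None, :] - Z[None, :, :]
    D = (diff ** 2).sum(axis=-1)            # pairwise squared distances
    np.fill_diagonal(D, np.inf)             # a point never picks itself
    P = np.exp(-D)
    P /= P.sum(axis=1, keepdims=True)       # p_ij, each row sums to 1
    same_class = (y[:, None] == y[None, :])
    return (P * same_class).sum()
```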
Timeline:
- Get a clear understanding of all algorithms and sketch their design.
- Prepare codebase (Base classes, if needed)
- Implement NCA
- Write tests and documentation for NCA
- Submit NCA and the initial `metric_learning` module for review #1
- Implement LMNN
- Pass the mid-term
- Tests and documentation for LMNN
- Submit LMNN for review #2
- Complete review #1
- Implement ITML
- Tests and documentation for ITML
- Submit ITML for review #3
- Get all reviews completed and ready to merge.
- If time permits:
  - Kernelized ITML
  - Tests and documentation for kernelized ITML
- Pencils down.
- Get everything merged.
- Submit everything to Google.
- Rule the Galaxy.