Using Cora as the dataset and logistic regression with the given paper labels:

- Baseline: varying `n_components` over (8, 16, ..., 2048, `graph.number_of_nodes() - 1`)
- Elbow sweep: varying `n_elbows` over [1, 2, ..., 5]

In the elbow sweep, the dimensions are massively reduced due to the `log2(g.number_of_nodes())` cap applied when calculating the initial SVD that the elbow cuts are done over. It should be noted that, due to the log2 and the size of this graph, the max dimension we can embed into with the elbow finder is 12. The dotted line marks the dimension where the algorithm determined there was an elbow:

If we do elbow cuts on the singular values of the full SVD (this is what topologic does), this is what I see:

@bryantower mentioned that perhaps we should be doing a full SVD and then taking the log2 of the resulting singular values to use for elbow finding:

Using the very first image, you can get an idea of the accuracy we'd see if those elbows had been chosen. This raises a couple of interesting questions.
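For concreteness, here's a minimal sketch of that sweep plus @bryantower's log2 idea. `load_cora()` is a hypothetical loader (assume it returns a symmetrized adjacency matrix and the paper labels), and I'm assuming `select_dimension` accepts a 1-d array of singular values:

```python
import numpy as np
from graspologic.embed import AdjacencySpectralEmbed, select_dimension
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

adj, labels = load_cora()  # hypothetical loader: symmetrized adjacency + paper labels
n = adj.shape[0]

# Baseline: fix the embedding dimension directly, no elbow selection.
for d in [8, 16, 32, 64, 128, 256, 512, 1024, 2048, n - 1]:
    X = AdjacencySpectralEmbed(n_components=d).fit_transform(adj)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels).mean()
    print(f"n_components={d}: accuracy={acc:.3f}")

# Elbow sweep: dimension chosen by Zhu-Ghodsi, capped by the log2 "svd lite".
for k in range(1, 6):
    X = AdjacencySpectralEmbed(n_elbows=k).fit_transform(adj)
    print(f"n_elbows={k}: embedded into {X.shape[1]} dims")

# @bryantower's suggestion: full SVD, then elbow-find on log2 of the spectrum.
sing_vals = np.linalg.svd(adj, compute_uv=False)
sing_vals = sing_vals[sing_vals >= 1]  # keep log2 nonnegative; drops tiny/zero values
elbows, _ = select_dimension(np.log2(sing_vals), n_elbows=2)
print("elbows on the log2 spectrum:", elbows)
```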
Some other things that may or may not be interesting:
---
https://github.com/microsoft/graspologic/blob/dev/graspologic/embed/svd.py#L131
Are we sure we want to take the log2 of the size for our initial "SVD lite" before we actually do the Zhu-Ghodsi elbow finding?
Obviously the actual size is crazy pants, but log2 is ... I mean, it's small. On a 50-node graph we only have 6 dimensions to try to find the 2nd elbow in. On a 10k-node graph, we only have 14 dimensions to find the 2nd elbow.
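To put numbers on that cap (assuming it's `ceil(log2(n))`, which matches the 6-for-50 here and the 12 on Cora):

```python
import numpy as np

# Dimensions available to the elbow finder when n_components is None,
# assuming the cap at the linked line is ceil(log2(n)).
for n in [50, 2_708, 10_000, 1_000_000]:
    print(f"{n} nodes -> {int(np.ceil(np.log2(n)))} dims for the initial SVD")
# 50 -> 6, 2708 (Cora) -> 12, 10000 -> 14, 1000000 -> 20
```

Even a million-node graph would only give the elbow finder 20 dimensions to search over.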
@Nyecarr ran a bunch of parameter sweeps with ASE, in conjunction with sklearn's logistic regression, to try to assess how much dimensionality matters for the accuracy of the predicted labels.
Using the elbow finder, the dimensionality is always so low (even if you pick an elbow cut like 10) that at most we were getting ~12 dimensions, and the best accuracy was around 40%. If we manually set the dimensionality to something like 100 (`elbow_cut=None`), we were getting accuracy around 80%, and something like ~92% if we used `n_components=matrix.shape[0]`.
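In code, the two configurations look roughly like this (reusing the hypothetical `adj` and `labels` from the sweep sketch above):

```python
from graspologic.embed import AdjacencySpectralEmbed
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Elbow-based: dimension capped by the log2 "svd lite" (~12 on Cora).
X_elbow = AdjacencySpectralEmbed().fit_transform(adj)

# Manual: bypass elbow selection and embed straight into 100 dimensions.
X_100 = AdjacencySpectralEmbed(n_components=100).fit_transform(adj)

for name, X in [("elbow", X_elbow), ("n_components=100", X_100)]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels).mean()
    print(f"{name}: accuracy={acc:.3f}")
```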
Nick is going to post some graphs showing his results in a response to this discussion, but I'm curious to hear everyone else's thoughts.
@j1c @asaadeldin11 @bdpedigo @bryantower