This Python notebook shows the process of benchmarking the search result ranking for the Charles Explorer application. It is a part of my diploma thesis at the Faculty of Mathematics and Physics, Charles University, Prague.
Find out more about the thesis in the GitHub repository.
Made by Jindřich Bär, 2024.
As we have noted before, the search results ranking we have acquired from Scopus is mostly based on the raw query relevance. Because of this, it might also suffer from the problems we have discussed in the thesis, such as the susceptibility to search engine optimization of certain papers.
In search for a better ranking for Charles Explorer, we try to predict the citation count of the papers based on the local graph measures we have calculated in the previous notebooks. The citation count is a good indicator of the importance of a paper, and it cannot be easily manipulated by the authors. Unfortunately, the citation count is not available in our dataset (hence the need to predict or approximate it).
For our previously acquired (balanced) dataset of representative queries and results, we collect the publication citation counts by querying the Scopus API.
import pandas as pd
queries = pd.read_csv('./best_queries.csv')
scopus = pd.read_csv('./scopus_results.csv')
explorer = pd.read_csv('./filtered_search_results.csv')
charles = set(explorer['name'].str.lower().str.slice(0, 50))
elsevier = set(scopus['title'].str.lower().str.slice(0, 50))
explorer[~explorer['name'].str.lower().str.slice(0, 50).isin(elsevier)].reset_index(drop=True).to_csv('missing/explorer_missing.csv', index=False)
import json
import subprocess
def get_query_object(query):
    # Search request body: first 10 results, restricted to publications affiliated with the Czech Republic.
    return {
        "documentClassificationEnum": "primary",
        "query": f"TITLE-ABS-KEY({query})",
        "cluster": ["scoaffilctry,\"Czech Republic\",t"],
        "filters": {"yearFrom": "Before 1960","yearTo": "Present"},
        "sort": "r-f",
        "itemcount": 10,
        "offset": 0,
        "showAbstract": False
    }

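def get_curl_call(query):
    # The curl command built here is not preserved in this export.
    ...

def call_scopus_api(query):
    # Run the curl command in a shell and parse its JSON output.
    result = subprocess.run(get_curl_call(query), check=True, shell=True, stdout=subprocess.PIPE)
    return json.loads(result.stdout.decode('utf-8'))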
import pandas as pd
import re
missing_publications = pd.read_csv('missing/explorer_missing.csv')
missing_names = missing_publications['name'].str.lower().str.slice(0, 50).unique()
pub_data = pd.read_csv('missing/scopus_new_data_complete.csv')
for i, name in enumerate(list(missing_names[3040:])):
    try:
        # Strip punctuation from the title prefix before using it as the search query.
        result = call_scopus_api(re.sub(r'[^0-9a-zA-Z\s]+', ' ', name))['items']
        if pub_data is None:
            pub_data = pd.DataFrame.from_dict(result)
        else:
            pub_data = pd.concat([pub_data, pd.DataFrame.from_dict(result)])
        if i % 10 == 0:
            print(f"Processed {3040+i}/{len(missing_names)}")
    except Exception as e:
        print(f"Error processing {name}: {e}")

pub_data.to_csv('missing/scopus_new_data_complete.csv', index=False)
import pandas as pd
import re
loaded = pd.read_csv('missing/scopus_new_data_complete.csv')
missing = pd.read_csv('missing/explorer_missing.csv')
missing['index'] = missing.index
# Normalize titles on both sides the same way: punctuation to spaces, lowercase, first 50 characters.
missing['normalized_name'] = missing['name'].apply(lambda x: re.sub(r'[^0-9a-zA-Z\s]+', ' ', x).lower()[0:50])
loaded['normalized_name'] = loaded['title'].apply(lambda x: re.sub(r'[^0-9a-zA-Z\s]+', ' ', x).lower()[0:50])
loaded.set_index('normalized_name', inplace=True)
joined = missing.join(loaded, on='normalized_name', how='left')
joined = joined[~joined['title'].isna()]
import ast
joined['citation_count'] = joined['citations'].map(lambda x: ast.literal_eval(x)['count'])
filtered = pd.DataFrame(joined[['id', 'name', 'citation_count']])
filtered.reset_index(drop=True).to_csv('missing/citation_counts.csv', index=False)
In the process described by the code cells above, we go over a set of 7189 publications we have retrieved from Charles Explorer and try to match them to publications in Scopus by their name. If a match is found, we store the citation count in a new column.
In the case of our balanced dataset, we have found a match for 962 publications (roughly 13%). While this might seem low, it still provides us with a good enough dataset to train and test our models.
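The matching step can be illustrated on a toy example. The following is a minimal sketch, assuming the same normalization as in the cells above (non-alphanumeric characters replaced with spaces, lowercased, truncated to 50 characters). The sample titles, the record layout and the helper names `normalize_title` and `match_citations` are hypothetical and only serve the illustration.

```python
import re

def normalize_title(title: str, prefix_len: int = 50) -> str:
    """Replace runs of non-alphanumeric characters with a space, lowercase, and truncate."""
    return re.sub(r'[^0-9a-zA-Z\s]+', ' ', title).lower()[:prefix_len]

def match_citations(local_titles, scopus_records):
    """Map each local title to a citation count if a Scopus record shares its normalized title."""
    # Index the (hypothetical) Scopus records by their normalized title.
    by_title = {normalize_title(r['title']): r['citations'] for r in scopus_records}
    matched = {}
    for title in local_titles:
        key = normalize_title(title)
        if key in by_title:
            matched[title] = by_title[key]
    return matched

local_titles = [
    'Stereotypes, norms, and political correctness',
    'Epilogue',
    'An unmatched title',
]
scopus_records = [
    {'title': 'Stereotypes, Norms, and Political Correctness', 'citations': 2},
    {'title': 'Epilogue', 'citations': 0},
]

matched = match_citations(local_titles, scopus_records)
match_rate = len(matched) / len(local_titles)
print(matched, f"{match_rate:.0%}")
```

Note that this normalization does not collapse repeated whitespace, so two titles that differ only in where the punctuation sits (for example a missing comma) still fail to match. The 50-character prefix also means that distinct publications sharing a long common title prefix are treated as the same record.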
As the next step, we load the graph measures we have calculated in the previous notebooks and join those with the records containing the citation counts.
import pandas as pd
katz = pd.read_csv('katz.csv')
node_cuts = pd.read_csv('node_cuts.csv')
centrality = pd.read_csv('local_centrality.csv')
degrees = pd.read_csv('degrees.csv')
citations = pd.read_csv('missing/citation_counts.csv')
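# Left-join each table of graph measures onto the citation counts by publication id.
df = citations.merge(
    katz, left_on='id', right_on='id', how='left'
).merge(
    node_cuts, left_on='id', right_on='id', how='left'
).merge(
    centrality, left_on='id', right_on='id', how='left'
).merge(
    degrees, left_on='id', right_on='id', how='left'
)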
 | id | name | citation_count | katz_centrality | node_cut | query_x | centrality_1 | centrality_2 | query_y | degree | query |
0 | 80319 | Deception in social psychological experiments:... | 40 | 0.4000 | 99 | social psychology | 0.000015 | 0.000005 | social psychology | 2 | social psychology |
1 | 323832 | Epilogue | 0 | 0.2000 | 189 | social psychology | 0.000000 | 0.000000 | social psychology | 1 | social psychology |
2 | 571872 | Stereotypes, norms, and political correctness | 2 | 0.2000 | 56 | social psychology | 0.000000 | 0.000000 | social psychology | 1 | social psychology |
3 | 510254 | Signs of Myelin Impairment in Cerebrospinal Fl... | 10 | 0.6000 | 283 | clinical neurology | 0.000082 | 0.000306 | clinical neurology | 3 | clinical neurology |
4 | 512548 | Atypical language representation in children w... | 10 | 1.6000 | 208 | clinical neurology | 0.000765 | 0.004019 | clinical neurology | 8 | clinical neurology |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1231 | 533264 | Selective IgM deficiency: clinical and laborat... | 26 | 1.4000 | 127 | immunopathology | 0.000360 | 0.000809 | immunopathology | 7 | immunopathology |
1232 | 477343 | Two types of CMV ocular complications in patie... | 2 | 1.0000 | 343 | immunopathology | 0.000171 | 0.005273 | immunopathology | 5 | immunopathology |
1233 | 641559 | Kirche der Freiheit and A Systems Theory Of Ch... | 0 | 0.3584 | 2 | systems theory | 0.000000 | 0.000000 | systems theory | 1 | systems theory |
1234 | 168916 | On the Issue of Learning of Social Science in ... | 0 | 0.4000 | 214 | systems theory | 0.001700 | 0.000072 | systems theory | 2 | systems theory |
1235 | 639120 | Sport Migration Influences on Cultural Brand I... | 1 | 0.4000 | 54 | systems theory | 0.000204 | 0.001406 | systems theory | 2 | systems theory |
1236 rows × 11 columns
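The `query_x` and `query_y` columns in the table above are what pandas produces when both sides of a merge carry a column with the same name that is not part of the join key. The following is a minimal sketch of one way to avoid them, using hypothetical miniature tables and a hypothetical helper `merge_measures`. It assumes that every `id` appears at most once per table; if a publication appears under several queries, joining on both `id` and `query` would be needed instead.

```python
from functools import reduce
import pandas as pd

# Hypothetical miniature versions of the measure tables; two of them repeat the 'query' column.
citations = pd.DataFrame({'id': [1, 2], 'citation_count': [40, 0]})
katz = pd.DataFrame({'id': [1, 2], 'katz_centrality': [0.4, 0.2]})
node_cuts = pd.DataFrame({'id': [1, 2], 'node_cut': [99, 189], 'query': ['q', 'q']})
degrees = pd.DataFrame({'id': [1, 2], 'degree': [2, 1], 'query': ['q', 'q']})

def merge_measures(base, tables):
    """Left-join every table onto base by 'id', keeping a single copy of shared columns."""
    def merge_one(left, right):
        # Drop columns (other than the key) that the left side already has,
        # so pandas does not need to add the _x / _y suffixes.
        duplicated = [c for c in right.columns if c in left.columns and c != 'id']
        return left.merge(right.drop(columns=duplicated), on='id', how='left')
    return reduce(merge_one, tables, base)

merged = merge_measures(citations, [katz, node_cuts, degrees])
print(merged.columns.tolist())
```

Dropping the duplicated column before the merge keeps the first copy of `query` and silently discards the later ones, which is only safe when the copies are known to agree.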
To see whether there is any correlation between the citation count and the graph measures, we calculate the correlation matrix on the graph measures and the citation count per publication.
df[['citation_count', 'katz_centrality', 'node_cut', 'centrality_1', 'centrality_2', 'degree']].corr()
 | citation_count | katz_centrality | node_cut | centrality_1 | centrality_2 | degree |
citation_count | 1.000000 | 0.282355 | 0.071930 | 0.134054 | 0.003482 | 0.368689 |
katz_centrality | 0.282355 | 1.000000 | 0.046096 | 0.251490 | 0.062201 | 0.819643 |
node_cut | 0.071930 | 0.046096 | 1.000000 | 0.119445 | 0.176158 | 0.148508 |
centrality_1 | 0.134054 | 0.251490 | 0.119445 | 1.000000 | 0.576928 | 0.307011 |
centrality_2 | 0.003482 | 0.062201 | 0.176158 | 0.576928 | 1.000000 | 0.091475 |
degree | 0.368689 | 0.819643 | 0.148508 | 0.307011 | 0.091475 | 1.000000 |
Note that this is (approximately) in line with the results of experiments conducted by Alireza Abbasi and Jörn Altmann.
We see that the highest correlation between two variables is 0.8196, between katz_centrality and degree. This is partially expected from the definition of the Katz centrality, which iteratively, with growing distance, sums the number of nodes reachable from the given node at this distance.
Aside from this correlation, we also see a correlation (0.5769) between the ego- and 2-hop-neighborhood betweenness centralities. This is also expected, as the 2-hop neighborhood considered in the second measure is a superset of the ego-neighborhood and may capture similar information.
As for the correlation with the citation count, we see that the highest correlation is with the degree measure, which is 0.3687. This is a good sign, as the degree is a simple measure of the number of connections of the node. It also partially confirms our assumption that the publication degree can be a good proxy for publication importance, and that publications with more coauthors tend to be more cited. Perhaps due to the high correlation between the degree and the Katz centrality, the Katz centrality also shows a comparatively good correlation with the citation count (0.2824).
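The link between the Katz centrality and the degree can be made concrete on a small graph. The sketch below uses the textbook definition of Katz centrality as an attenuated sum over walks of growing length; the toy adjacency matrix and the choice of the attenuation factor `alpha` are assumptions made for the illustration. The `katz_centrality` column used in this notebook comes from an earlier notebook and may be defined differently (for example truncated or normalized).

```python
import numpy as np

# A small undirected toy graph (hypothetical), given as an adjacency matrix.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)
degree = A.sum(axis=1)

# The attenuation factor must stay below 1 / (largest eigenvalue) for the series to converge.
alpha = 0.5 / np.max(np.abs(np.linalg.eigvalsh(A)))

# Katz centrality as a truncated series: sum over k >= 1 of alpha^k * (walks of length k).
ones = np.ones(len(A))
katz_series = np.zeros(len(A))
term = ones.copy()
for k in range(1, 200):
    term = alpha * (A @ term)  # now equals alpha^k * A^k * 1
    katz_series += term

# Closed form of the same sum: (I - alpha * A)^-1 * 1 - 1.
katz_closed = np.linalg.solve(np.eye(len(A)) - alpha * A, ones) - ones

# The k = 1 term of the series is just the degree scaled by alpha.
first_order = alpha * degree
print(np.round(katz_closed, 4), np.round(first_order, 4))
```

Because the first term of the series is the scaled degree and all later terms are non-negative, a node with a high degree starts with a high Katz score, which is consistent with the strong correlation between the two columns.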
To see how well the graph measures can predict the citation count, we train the same models as in the previous experiment - a simple linear regression model and a neural network model.
## Using sklearn, we train a regularized linear regression (Ridge) model to predict the citation count of a publication from its graph measures
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
projection = df[[
    'centrality_1', 'centrality_2',
    'katz_centrality', 'node_cut',
    'degree', 'citation_count', 'id'
]].copy()
projection['katz_centrality'] = projection['katz_centrality'].replace([float('inf')], projection['katz_centrality'].median())
feature_columns = ['centrality_1', 'centrality_2', 'degree', 'katz_centrality', 'node_cut']
X = projection[[*feature_columns, 'id']]
y = projection['citation_count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
linear_model = Ridge()
linear_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('linear', linear_model)
])
linear_pipeline.fit(X_train[feature_columns], y_train)
y_pred = linear_pipeline.predict(X_test[feature_columns])
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(list(zip(feature_columns, linear_model.coef_)))
MSE: 6725.050765413087
[('centrality_1', 13.606487405580742), ('centrality_2', -12.04390116805357), ('degree', 35.422981810917456), ('katz_centrality', -5.464560624987053), ('node_cut', 3.060149167695211)]
from sklearn.neural_network import MLPRegressor
neural_network = MLPRegressor(max_iter=1000, hidden_layer_sizes=(100, 100))
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('neural_network', neural_network)
])
pipeline.fit(X_train[feature_columns], y_train)
y_pred = pipeline.predict(X_test[feature_columns])
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
MSE: 7425.5727259536125
We see that unlike in the reranking experiment, the regularized linear regression did better than the neural network.
Both methods performed quite badly, though, with the linear regression having a mean squared error of 6725.0508 and the neural network having a mean squared error of 7425.5727.
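The mean squared errors printed by the two cells above are easier to interpret after taking the square root, which brings them back to the unit of the target (citations). The short sketch below does only this arithmetic; the two constants are copied from the printed cell outputs, and the variable names are chosen for the illustration.

```python
import math

# MSE values copied from the outputs of the two training cells above.
mse_linear = 6725.050765413087
mse_neural = 7425.5727259536125

# The root-mean-square error is expressed in citations, like the target itself.
rmse_linear = math.sqrt(mse_linear)
rmse_neural = math.sqrt(mse_neural)

print(f"Ridge RMSE: {rmse_linear:.1f} citations")
print(f"MLP RMSE:   {rmse_neural:.1f} citations")
```

Both models are therefore off by roughly 82 and 86 citations in the root-mean-square sense, which is large compared to the citation counts visible in the sample rows of the joined table.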
As expected from the correlation matrix exploration, the linear regression model has the highest coefficient for the degree measure, which is 35.4230. The coefficients for the other measures are much lower (some even negative), which might be due to the high correlation between the degree and the other measures.
This is also in line with the experiments referenced above: while they use a variable called degree centrality, this is simply the node degree in the graph.
While the results of these experiments do not seem to provide much insight into the citation count prediction, we still might try to utilize them for the search result ranking in the Charles Explorer application.
Note that the ranking benchmarks (and human users of search engines) might not be interested in the exact citation count, but rather in the relative importance of the publications. Both benchmarks and users also tend to discount the publications further down in the search result list.
Our original goal is to reorder the search results so that the more "globally" important papers rank higher. For this purpose, the graph measures might serve better than the prediction errors suggest. We calculate the NDCG score with the citation count as the relevance feedback and the predicted citation count as the ordering.
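As a reminder of how the score works: the discounted cumulative gain (DCG) sums the relevance of each result divided by the base-2 logarithm of its 1-based position plus one, and the NDCG divides this by the DCG of the ideal (descending) ordering. The following is a minimal pure-Python sketch with hypothetical citation counts; returning NaN for a query whose results all have zero citations is a choice made for this illustration.

```python
import math
from typing import Sequence

def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: the result at 0-based position i is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal ordering; NaN if all relevances are zero."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else float('nan')

# Hypothetical citation counts of three results, in the order the search engine returned them.
citations_in_ranked_order = [3, 0, 10]
score = ndcg(citations_in_ranked_order)
print(f"DCG = {dcg(citations_in_ranked_order):.3f}, NDCG = {score:.3f}")
```

Here the most cited paper sits in the last position, so its contribution is halved (10 / log2(4) = 5), and the ranking scores well below the ideal value of 1.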
We start by utilizing the graph measures to predict the citation count for the publications and add this as a new attribute column.
df['katz_centrality'] = df['katz_centrality'].replace([float('inf')], df['katz_centrality'].median())
df['predicted_citation_count_l'] = linear_pipeline.predict(df[feature_columns])
df['predicted_citation_count_n'] = pipeline.predict(df[feature_columns])
from typing import List
import numpy as np
import pandas as pd
def get_dcg(relevances: List[int]):
    # Discounted cumulative gain: the result at 0-based position i is discounted by log2(i + 2).
    dcg = 0
    for i, relevance in enumerate(relevances):
        dcg += relevance / np.log2(i + 2)
    return dcg

original_dcgs = []
idcgs = []
reranked_dcgs_l = []
reranked_dcgs_n = []

for query in df['query'].unique():
    # DCG of the original order, of the ideal order, and of both model-induced orders.
    original_dcgs.append(get_dcg(df[df['query'] == query]['citation_count']))
    idcgs.append(get_dcg(df[df['query'] == query]['citation_count'].sort_values(ascending=False)))
    reranked_dcgs_l.append(get_dcg(df[df['query'] == query].sort_values('predicted_citation_count_l', ascending=False)['citation_count']))
    reranked_dcgs_n.append(get_dcg(df[df['query'] == query].sort_values('predicted_citation_count_n', ascending=False)['citation_count']))

dcgs = pd.DataFrame({
    'query': df['query'].unique(),
    'original_ndcg': np.array(original_dcgs) / np.array(idcgs),
    'reranked_ndcg_l': np.array(reranked_dcgs_l) / np.array(idcgs),
    'reranked_ndcg_n': np.array(reranked_dcgs_n) / np.array(idcgs)
})

pl = dcgs.plot(kind='box')
pl.set_title('NDCG scores for rerankings wrt. citation count')
pl.set_ylabel('NDCG over 100 queries')
pl.set_xticklabels(['Relevance based\n ranking', 'Graph rerank +\nLinear aggregation', 'Graph rerank + \n Neural network aggregation'])
/tmp/ipykernel_133778/ RuntimeWarning: invalid value encountered in divide
'original_ndcg': np.array(original_dcgs) / np.array(idcgs),
/tmp/ipykernel_133778/ RuntimeWarning: invalid value encountered in divide
'reranked_ndcg_l': np.array(reranked_dcgs_l) / np.array(idcgs),
/tmp/ipykernel_133778/ RuntimeWarning: invalid value encountered in divide
'reranked_ndcg_n': np.array(reranked_dcgs_n) / np.array(idcgs)
 | original_ndcg | reranked_ndcg_l | reranked_ndcg_n |
count | 100.000000 | 100.000000 | 100.000000 |
mean | 0.719249 | 0.848755 | 0.850198 |
std | 0.219230 | 0.185739 | 0.185111 |
min | 0.278943 | 0.278943 | 0.333333 |
25% | 0.531919 | 0.678916 | 0.702819 |
50% | 0.693623 | 0.968237 | 0.950909 |
75% | 0.995240 | 1.000000 | 1.000000 |
max | 1.000000 | 1.000000 | 1.000000 |
Surprisingly, both methods improve the NDCG score of the search results ranking by a considerable margin: the mean NDCG rises from 0.719 for the relevance-based ranking to 0.849 with the linear model and 0.850 with the neural network. This suggests that while the local graph neighborhood structure cannot serve as a good predictor of the citation count, it can be quite useful for inferring the citation-count-based ranking.
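That a model can have a large prediction error and still produce a good ranking is not a contradiction: NDCG depends only on the order induced by the predictions, not on their scale. The toy sketch below, with hypothetical citation counts and predictions and a hypothetical helper `ndcg_of_ordering`, shows a predictor that is far off in absolute terms yet orders the results perfectly.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2(position + 1) discount (1-based positions)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_of_ordering(true_relevances, predicted_scores):
    """NDCG of the ordering induced by predicted_scores, judged by true_relevances."""
    order = sorted(range(len(true_relevances)), key=lambda i: predicted_scores[i], reverse=True)
    reranked = [true_relevances[i] for i in order]
    ideal = dcg(sorted(true_relevances, reverse=True))
    return dcg(reranked) / ideal if ideal > 0 else float('nan')

# Hypothetical citation counts and a predictor that gets the scale wrong but the order right.
true_citations = [100, 10, 1, 0]
predicted = [4.0, 3.0, 2.0, 1.0]

mse = sum((t - p) ** 2 for t, p in zip(true_citations, predicted)) / len(true_citations)
ranking_quality = ndcg_of_ordering(true_citations, predicted)
print(f"MSE = {mse}, NDCG = {ranking_quality}")
```

The mean squared error is in the thousands while the NDCG is exactly 1, because the predictions preserve the order of the true citation counts.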