If you use this package, please cite this manuscript:
Li, Y., Kind, T., Folz, J. et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat Methods 18, 1524–1531 (2021). https://doi.org/10.1038/s41592-021-01331-z
To search spectral files with entropy similarity, you can download the pre-compiled program from https://github.com/YuanyueLi/EntropySearch/releases.
Advanced users who want to calculate spectral entropy, entropy similarity, or other spectral similarities themselves can use the Python code below.
A jupyter notebook example is provided here: https://github.com/YuanyueLi/SpectralEntropy/blob/master/example.ipynb
A detailed reference for the 43 different algorithms for calculating spectral similarity can be found here: https://SpectralEntropy.readthedocs.io/en/master/
You might notice an entropy similarity score higher than 1 in self-implemented code. This is caused by a mistake in merging peaks within the MS2 tolerance; using the code implemented here avoids the problem. We are working on an R implementation of entropy similarity, which will be released soon.
Python 3.7, numpy>=1.17.4, scipy>=1.3.2
cython>=0.29.13 (Not required but highly recommended)
# The command below is not required but strongly recommended, as it will compile the cython code to run faster
python setup.py build_ext --inplace
To calculate spectral entropy, the spectrum needs to be centroided first. When you are focusing on fragment ion information, the precursor ion may need to be removed from the spectrum before calculating spectral entropy. If isotope peaks exist in the MS/MS spectrum, they should be removed first, as isotope peaks do not contain useful information for identifying a molecule.
Calculating spectral entropy for a centroided spectrum in Python is very simple (just one line with the scipy package).
import numpy as np
import scipy.stats
spectrum = np.array([[41.04, 37.16], [69.07, 66.83], [86.1, 999.0]], dtype=np.float32)
entropy = scipy.stats.entropy(spectrum[:, 1])
print("Spectral entropy is {}.".format(entropy))
# The output should be: Spectral entropy is 0.3737888038158417.
print('-' * 30)
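To show what the one-line call above does, here is a minimal sketch of the same calculation written out by hand. It assumes natural-log Shannon entropy on intensities normalized to sum to 1, which is the default behavior of scipy.stats.entropy. The helper name manual_spectral_entropy is hypothetical and is not part of this package.

```python
import numpy as np


def manual_spectral_entropy(spectrum):
    """Hypothetical helper for illustration; not part of spectral_entropy."""
    # Column 0 is m/z, column 1 is intensity.
    intensity = np.asarray(spectrum, dtype=np.float64)[:, 1]
    # Drop zero-intensity peaks so that log() is always defined.
    intensity = intensity[intensity > 0]
    # Normalize intensities so they sum to 1.
    p = intensity / intensity.sum()
    # Shannon entropy with the natural logarithm.
    return float(-np.sum(p * np.log(p)))


spectrum = np.array([[41.04, 37.16], [69.07, 66.83], [86.1, 999.0]])
print("Spectral entropy is {}.".format(manual_spectral_entropy(spectrum)))
```

Only the intensity column enters the calculation; the m/z values matter earlier, when peaks are centroided and merged.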
For a profile spectrum that has not been centroided, you can use clean_spectrum to centroid the spectrum, for example:
import numpy as np
import scipy.stats
import spectral_entropy
spectrum = np.array([[69.071, 7.917962], [86.066, 1.021589], [86.0969, 100.0]], dtype=np.float32)
spectrum = spectral_entropy.clean_spectrum(spectrum)
entropy = scipy.stats.entropy(spectrum[:, 1])
print("Spectral entropy is {}.".format(entropy))
# The output should be: Spectral entropy is 0.2605222463607788.
print('-' * 30)
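As a rough illustration of what merging peaks within an MS2 tolerance involves, the sketch below greedily merges neighboring peaks whose m/z values lie within ms2_da of each other. This is an assumption-laden simplification and is not the algorithm used by clean_spectrum; the helper name merge_close_peaks and the intensity-weighted m/z rule are hypothetical choices made for this example.

```python
import numpy as np


def merge_close_peaks(spectrum, ms2_da=0.05):
    """Hypothetical greedy merge; NOT the algorithm used by clean_spectrum."""
    spectrum = np.asarray(spectrum, dtype=np.float64)
    # Ignore zero-intensity peaks, then sort by m/z.
    spectrum = spectrum[spectrum[:, 1] > 0]
    spectrum = spectrum[np.argsort(spectrum[:, 0])]

    merged = []
    for mz, intensity in spectrum:
        if merged and mz - merged[-1][0] <= ms2_da:
            # Merge into the previous peak: sum the intensities and
            # use the intensity-weighted mean as the new m/z.
            prev_mz, prev_intensity = merged[-1]
            total = prev_intensity + intensity
            merged[-1] = [(prev_mz * prev_intensity + mz * intensity) / total, total]
        else:
            merged.append([mz, intensity])
    return np.array(merged).reshape(-1, 2)


spectrum = np.array([[69.071, 7.917962], [86.066, 1.021589], [86.0969, 100.0]])
merged = merge_close_peaks(spectrum, ms2_da=0.05)
print(merged)  # The two peaks near m/z 86 are combined into one.
```

After merging, no two peaks in a spectrum fall within the tolerance of each other, so each peak can be matched to at most one peak in the other spectrum.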
We provide a function, clean_spectrum, to help you remove the precursor ion, centroid the spectrum, and remove noise ions.
Please note that this function does not remove isotope peaks; you need to remove them yourself.
For example:
import numpy as np
import spectral_entropy
spectrum = np.array([[41.04, 0.3716], [69.071, 7.917962], [69.071, 100.], [86.0969, 66.83]], dtype=np.float32)
clean_spectrum = spectral_entropy.clean_spectrum(spectrum,
print("Clean spectrum will be:{}".format(clean_spectrum))
# The output should be: Clean spectrum will be:[[69.071 1. ]]
print('-' * 30)
Before calculating entropy similarity, the spectrum needs to be centroided first. Removing noise ions is highly recommended. Also, based on our tests on the NIST20 and Massbank.us databases, removing ions with m/z higher than the precursor ion's m/z - 1.6 greatly improves spectral identification performance.
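The two filters described here can be sketched with plain numpy as follows. This is a minimal illustration and not the package's implementation: the helper name remove_precursor_and_noise, its parameter names, and the precursor m/z of 87.0 are all hypothetical. The 1.6 margin is the value quoted above, and the 1% noise ratio is the threshold suggested elsewhere in this document.

```python
import numpy as np


def remove_precursor_and_noise(spectrum, precursor_mz, mz_margin=1.6, noise_ratio=0.01):
    """Hypothetical helper for illustration; not part of spectral_entropy."""
    spectrum = np.asarray(spectrum, dtype=np.float64)
    # Keep only peaks at or below (precursor m/z - margin).
    spectrum = spectrum[spectrum[:, 0] <= precursor_mz - mz_margin]
    if spectrum.shape[0] == 0:
        return spectrum
    # Drop peaks weaker than noise_ratio times the most intense remaining peak.
    threshold = noise_ratio * spectrum[:, 1].max()
    return spectrum[spectrum[:, 1] >= threshold]


# Assumed example values; the precursor m/z is made up for this sketch.
spectrum = np.array([[41.04, 0.3716], [69.071, 107.9], [86.0969, 66.83]])
filtered = remove_precursor_and_noise(spectrum, precursor_mz=87.0)
print(filtered)  # Only the peak at m/z 69.071 remains.
```

Note that the noise threshold is computed after the precursor region is removed, so an intense precursor peak does not inflate the cutoff.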
We provide the calculate_entropy_similarity function to calculate the entropy similarity of two spectra.
import numpy as np
import spectral_entropy
spec_query = np.array([[69.071, 7.917962], [86.066, 1.021589], [86.0969, 100.0]], dtype=np.float32)
spec_reference = np.array([[41.04, 37.16], [69.07, 66.83], [86.1, 999.0]], dtype=np.float32)
# Calculate entropy similarity.
similarity = spectral_entropy.calculate_entropy_similarity(spec_query, spec_reference, ms2_da=0.05)
print("Entropy similarity:{}.".format(similarity))
# The output should be: Entropy similarity:0.8984397722577456.
print('-' * 30)
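For readers who want to see the idea behind the score, the following sketch computes the unweighted form of entropy similarity from the cited paper, 1 - (2 * S_AB - S_A - S_B) / ln(4), where S_AB is the entropy of the 1:1 mixture of the two normalized spectra. It assumes the peaks have already been matched onto a common m/z axis, so both inputs are intensity vectors of equal length with zeros where a peak has no partner. It does not perform peak matching or the weighting that distinguishes the "entropy" method from "unweighted_entropy", so it will not reproduce the package's output. The function names are hypothetical.

```python
import numpy as np


def _shannon_entropy(p):
    # Natural-log Shannon entropy; zero entries contribute nothing.
    p = p[p > 0]
    return -np.sum(p * np.log(p))


def unweighted_entropy_similarity_aligned(intensity_a, intensity_b):
    """Hypothetical sketch for pre-aligned intensity vectors of equal length."""
    a = np.asarray(intensity_a, dtype=np.float64)
    b = np.asarray(intensity_b, dtype=np.float64)
    # Normalize each spectrum so its intensities sum to 1.
    a = a / a.sum()
    b = b / b.sum()
    # 1:1 mixture of the two normalized spectra.
    merged = (a + b) / 2.0
    s_ab = _shannon_entropy(merged)
    s_a = _shannon_entropy(a)
    s_b = _shannon_entropy(b)
    return float(1.0 - (2.0 * s_ab - s_a - s_b) / np.log(4.0))


identical = unweighted_entropy_similarity_aligned([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
disjoint = unweighted_entropy_similarity_aligned([1.0, 0.0], [0.0, 1.0])
print(identical, disjoint)  # Close to 1 for identical spectra, close to 0 for disjoint ones.
```

Because each spectrum is normalized before mixing, the score stays between 0 and 1 as long as no m/z position is counted twice, which is why correct peak merging matters.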
We also provide 43 different spectral similarity algorithms for MS/MS spectral comparison.
You can find the detailed reference here: https://SpectralEntropy.readthedocs.io/en/master/
Before calculating spectral similarity, it is highly recommended to remove spectral noise. For example, peaks with intensity less than 1% of the maximum intensity can be removed to improve identification performance.
import numpy as np
import spectral_entropy
spec_query = np.array([[69.071, 7.917962], [86.066, 1.021589], [86.0969, 100.0]], dtype=np.float32)
spec_reference = np.array([[41.04, 37.16], [69.07, 66.83], [86.1, 999.0]], dtype=np.float32)
# Calculate entropy similarity.
similarity = spectral_entropy.similarity(spec_query, spec_reference, method="entropy",
print("Entropy similarity:{}.".format(similarity))
# The output should be: Entropy similarity:0.8984397722577456.
print('-' * 30)
# Calculate unweighted entropy similarity.
similarity = spectral_entropy.similarity(spec_query, spec_reference, method="unweighted_entropy",
print("Unweighted entropy similarity:{}.".format(similarity))
# The output should be: Unweighted entropy similarity:0.9826668790176113.
print('-' * 30)
# Calculate all similarity.
all_dist = spectral_entropy.all_similarity(spec_query, spec_reference, ms2_da=0.05)
for dist_name in all_dist:
    method_name = spectral_entropy.methods_name[dist_name]
    print("Method name: {}, similarity score:{}.".format(method_name, all_dist[dist_name]))
# A list of different spectral similarity will be shown.
"entropy": Entropy distance
"unweighted_entropy": Unweighted entropy distance
"euclidean": Euclidean distance
"manhattan": Manhattan distance
"chebyshev": Chebyshev distance
"squared_euclidean": Squared Euclidean distance
"fidelity": Fidelity distance
"matusita": Matusita distance
"squared_chord": Squared-chord distance
"bhattacharya_1": Bhattacharya 1 distance
"bhattacharya_2": Bhattacharya 2 distance
"harmonic_mean": Harmonic mean distance
"probabilistic_symmetric_chi_squared": Probabilistic symmetric χ2 distance
"ruzicka": Ruzicka distance
"roberts": Roberts distance
"intersection": Intersection distance
"motyka": Motyka distance
"canberra": Canberra distance
"baroni_urbani_buser": Baroni-Urbani-Buser distance
"penrose_size": Penrose size distance
"mean_character": Mean character distance
"lorentzian": Lorentzian distance
"penrose_shape": Penrose shape distance
"clark": Clark distance
"hellinger": Hellinger distance
"whittaker_index_of_association": Whittaker index of association distance
"symmetric_chi_squared": Symmetric χ2 distance
"pearson_correlation": Pearson/Spearman Correlation Coefficient
"improved_similarity": Improved Similarity
"absolute_value": Absolute Value Distance
"dot_product": Dot-Product (cosine)
"dot_product_reverse": Reverse dot-Product (cosine)
"spectral_contrast_angle": Spectral Contrast Angle
"wave_hedges": Wave Hedges distance
"cosine": Cosine distance
"jaccard": Jaccard distance
"dice": Dice distance
"inner_product": Inner Product distance
"divergence": Divergence distance
"avg_l": Avg (L1, L∞) distance
"vicis_symmetric_chi_squared_3": Vicis-Symmetric χ2 3 distance
"ms_for_id_v1": MSforID distance version 1
"ms_for_id": MSforID distance
"weighted_dot_product": Weighted dot product distance