Similarity metrics (New) #53

Open
parantak opened this issue May 12, 2020 · 6 comments
Labels
question Further information is requested

Comments

@parantak
Contributor

  1. Needleman-Wunsch Algorithm
  2. Smith-Waterman Algorithm

These algorithms were originally developed for aligning biological sequences such as DNA, but I read on SO that they are sometimes used as string similarity metrics as well, since they account for mismatches and gaps (spaces). Moreover, we can penalize gaps and mismatches with values of the user's choice.
Should we implement these, @rajaswa and @someshsingh22? If yes, I'll do it in some time.

@rajaswa
Member

rajaswa commented May 12, 2020

It'd be great if we had a source for this. It won't make sense to put effort into something that won't be used eventually.

@rajaswa added the "question" (Further information is requested) label May 12, 2020

@someshsingh22
Member

@parantak As far as I remember, the Smith-Waterman algorithm in particular isn't relevant for strings. In the case of DNA it matches a particular sequence, so if the shorter sentence were a disjoint union of substrings of the other string, it would give 100% similarity:
"My name" and "My name is Somesh Singh" would be 100% the same (Smith-Waterman looks after exclusion), while Needleman-Wunsch looks for inclusion. So personally I don't think it would be that useful. Check what we used for BIOF110 here and you will notice the difference.
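
To make the "My name" example concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. It is not code from this project: the function names, the linear gap penalty, the default scores (`match=1`, `mismatch=-1`, `gap=-1`), and the normalization by the shorter string's length are all assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty).

    Hypothetical helper for illustration; the scoring values are assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets an alignment restart anywhere,
            # which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def local_similarity(a, b, match=1, mismatch=-1, gap=-1):
    """Score divided by the best score the shorter string could achieve.

    This normalization is one possible choice, not a standard definition.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman_score(a, b, match, mismatch, gap) / (match * shorter)


print(local_similarity("My name", "My name is Somesh Singh"))  # 1.0
```

Under these assumptions the shorter string is fully contained in the longer one, so the unmatched remainder is never penalized and the normalized score is 1.0.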

@parantak
Contributor Author

parantak commented May 13, 2020

@someshsingh22 Right, sorry. I only vaguely remembered them both. I searched a bit, and I believe Smith-Waterman is a local alignment algorithm whereas Needleman-Wunsch is a global alignment algorithm. So I guess Smith-Waterman might not be as relevant. However, I am sure Needleman-Wunsch should be a good metric because, unlike the fixed penalties in Levenshtein, the algorithm allows different scores for matches, mismatches, and gaps.
As for Smith-Waterman, I'll look into it soon to gain a better understanding and to make sure we aren't missing out on anything.
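
The configurable scoring can be sketched as follows. This is an illustrative implementation of Needleman-Wunsch global alignment scoring, not code from this project; the function name, the linear gap penalty, and the default scores are assumptions. With `match=0`, `mismatch=-1`, `gap=-1` the score equals the negative of the Levenshtein distance, which shows how Levenshtein is one fixed setting of the same scheme.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score between a and b (linear gap penalty).

    Hypothetical helper for illustration; the scoring values are assumptions.
    """
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


# Levenshtein-like setting: every edit costs 1, matches score nothing.
print(needleman_wunsch_score("kitten", "sitting", match=0, mismatch=-1, gap=-1))  # -3

# Global alignment penalizes the unmatched tail: 7 matches, 16 gaps.
print(needleman_wunsch_score("My name", "My name is Somesh Singh"))  # -9
```

Unlike the local case, the extra characters in the longer string lower the score here, so containment alone does not produce a perfect result.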

@someshsingh22
Member

Yes, that was my point. I don't remember Needleman-Wunsch well either. Do look for some supporting literature in similar domains of NLP before you move on, though.

@parantak
Contributor Author

@someshsingh22 Yeah, of course.

3 participants