
How to compute idfs for a custom set of packages #86

Open
Bisaloo opened this issue Nov 27, 2024 · 6 comments

Comments

@Bisaloo
Contributor

Bisaloo commented Nov 27, 2024

I am trying to compute idfs for a different corpus but I cannot figure out how.

The docs state that it is the output of pkgmatch_bm25()

#' @param idfs Inverse Document Frequency tables for all rOpenSci packages,
#' generated from \link{pkgmatch_bm25}. If not provided, pre-generated IDF
#' tables will be downloaded and stored in a local cache directory.

But the inputs of pkgmatch_bm25() don't match what I would expect here (I would expect the same inputs as pkgmatch_embeddings_from_pkgs()), and the output doesn't seem to match what pkgmatch_similar_pkgs() expects anyway.

In other words, if such a function doesn't exist yet, I would like a function pkgmatch_idfs_from_pkgs() which would be the equivalent of pkgmatch_embeddings_from_pkgs() for the idfs argument in pkgmatch_similar_pkgs().
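To illustrate the request, a minimal sketch of the desired interface — note that pkgmatch_idfs_from_pkgs() does not exist at this point, and its signature here is an assumption mirroring pkgmatch_embeddings_from_pkgs():

```r
library (pkgmatch)

# Hypothetical function, assumed to take the same input as
# pkgmatch_embeddings_from_pkgs(): a vector of paths to local packages.
idfs <- pkgmatch_idfs_from_pkgs (c ("/path/to/pkg1", "/path/to/pkg2"))

# The result could then be passed directly as the 'idfs' argument:
pkgmatch_similar_pkgs ("a package for outbreak analysis", idfs = idfs)
```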

@mpadge
Member

mpadge commented Nov 27, 2024

Yeah, that's a good point. If you actually want or need to do it quickly, the process is in https://github.com/ropensci-review-tools/pkgmatch/blob/main/data-raw/release-data-script.R. But it's definitely important for further pkg dev to properly expose this kind of functionality. It shall be done... 🚀

@mpadge
Member

mpadge commented Dec 5, 2024

@Bisaloo Those commits add a new function, pkgmatch_generate_data(), which accepts a path to a directory containing packages, plus a corpus parameter naming the corpus. That will generate all data needed for a custom corpus, and save the files to the same location as the standard cached data. Is that heading towards what you want? It won't yet work properly, because most functions check the corpus argument against a hard-coded set of accepted values, but that's easy to expand by searching the cache directory, parsing file names, and extending to any additional corpora for which local data have been generated.

But prior to that, you could just try running it on a local corpus to see what you get, and then passing the results as explicit idfs or embeddings parameters to main fns. I'll pause there to await any feedback you might have ... 👍
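A sketch of that workflow, assuming the parameter names described above; the return structure of pkgmatch_generate_data() is an assumption here, since the comment only specifies that it generates and caches the data:

```r
library (pkgmatch)

# Generate all data (embeddings + IDF tables) for a local corpus.
# 'path' and 'corpus' are the parameters described above; whether the
# function also returns the data invisibly is assumed for this sketch.
dat <- pkgmatch_generate_data (
    path = "/path/to/local/corpus",
    corpus = "my-corpus"
)

# Pass the generated data explicitly to the main functions:
pkgmatch_similar_pkgs (
    "find packages for outbreak analysis",
    embeddings = dat$embeddings,
    idfs = dat$idfs
)
```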

mpadge added a commit that referenced this issue Dec 5, 2024
@Bisaloo
Contributor Author

Bisaloo commented Dec 5, 2024

> But prior to that, you could just try running it on a local corpus to see what you get, and then passing the results as explicit idfs or embeddings parameters to main fns. I'll pause there to await any feedback you might have ... 👍

This is what I have tried here: https://github.com/epiverse-connect/epiverse-pkgmatch, with the corpus from https://epiverse-connect.r-universe.dev/.

It currently only uses embeddings, as I didn't know how to compute idfs on this corpus. I managed to adapt the script you shared above (thanks!) to compute idfs, but I haven't had time yet to check whether it gives better results than embeddings alone.

Overall, the results are reasonably good: there is nothing outrageous, and we often find the expected results, but we also miss some of them.

We will meet with my colleagues soon to determine if we go with this approach or something more custom.

@Bisaloo
Contributor Author

Bisaloo commented Dec 5, 2024

I think asking the users to generate their own embeddings & idfs is good enough. The main ask was to have a mirror function of pkgmatch_embeddings_from_pkgs() for the idfs argument.

The pkgmatch_generate_data() function could go one step further, but I would only go for it if the maintenance burden is low.

I think it's fair to expect users who want to use their own corpus to be able to jump through a couple of well-documented hoops.

@mpadge
Member

mpadge commented Dec 5, 2024

It will actually lower the maintenance burden, because it can simply be called to do complete local updates, replacing the current script. I think this new function is a sensible generalization of that. I ran it locally on < 10 repos, and the whole thing only took a minute or two, so it seems to work pretty well. I'll nevertheless also address more explicitly what you're asking for. Thanks!

@Bisaloo
Contributor Author

Bisaloo commented Dec 5, 2024

Would you like to open a PR from newdata for more specific feedback?
