
How to compute idfs for a custom set of packages #86

Open
Bisaloo opened this issue Nov 27, 2024 · 6 comments

Comments

@Bisaloo
Contributor

Bisaloo commented Nov 27, 2024

I am trying to compute idfs for a different corpus but I cannot figure out how.

The docs state that it is the output of pkgmatch_bm25()

#' @param idfs Inverse Document Frequency tables for all rOpenSci packages,
#' generated from \link{pkgmatch_bm25}. If not provided, pre-generated IDF
#' tables will be downloaded and stored in a local cache directory.

But the inputs of pkgmatch_bm25() don't match what I would expect here (I would expect the same inputs as pkgmatch_embeddings_from_pkgs()), and the output doesn't seem to match what pkgmatch_similar_pkgs() expects anyway.

In other words, if such a function doesn't exist yet, I would like a function pkgmatch_idfs_from_pkgs() which would be the equivalent of pkgmatch_embeddings_from_pkgs() for the idfs argument in pkgmatch_similar_pkgs().
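To illustrate the request, a minimal sketch of the desired interface — note that pkgmatch_idfs_from_pkgs() does not exist at this point, and its signature here is an assumption mirroring pkgmatch_embeddings_from_pkgs():

```r
library (pkgmatch)

# Hypothetical function, assumed to take the same input as
# pkgmatch_embeddings_from_pkgs(): a vector of paths to local packages.
idfs <- pkgmatch_idfs_from_pkgs (c ("/path/to/pkg1", "/path/to/pkg2"))

# The result could then be passed directly as the 'idfs' argument:
pkgmatch_similar_pkgs ("a package for outbreak analysis", idfs = idfs)
```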

@mpadge
Member

mpadge commented Nov 27, 2024

Yeah, that's a good point. If you actually want or need to do it quickly, the process is in https://github.com/ropensci-review-tools/pkgmatch/blob/main/data-raw/release-data-script.R. But it's definitely important for further pkg dev to properly expose this kind of functionality. It shall be done... 🚀

@mpadge
Member

mpadge commented Dec 5, 2024

@Bisaloo Those commits add a new function, pkgmatch_generate_data(), which accepts a path to a directory containing packages, plus a corpus parameter naming the corpus. That will generate all data needed for a custom corpus, and save the files to the same location as the standard cached data. Is that heading towards what you want? It won't yet work properly, because most functions check the corpus argument against a hard-coded set of accepted values, but that's easy to expand by searching the cache directory, parsing file names, and extending to any additional corpora for which local data have been generated.

But prior to that, you could just try running it on a local corpus to see what you get, and then passing the results as explicit idfs or embeddings parameters to main fns. I'll pause there to await any feedback you might have ... 👍
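A sketch of that workflow, assuming the parameter names described above; the return structure of pkgmatch_generate_data() is an assumption here, since the comment only specifies that it generates and caches the data:

```r
library (pkgmatch)

# Generate all data (embeddings + IDF tables) for a local corpus.
# 'path' and 'corpus' are the parameters described above; whether the
# function also returns the data invisibly is assumed for this sketch.
dat <- pkgmatch_generate_data (
    path = "/path/to/local/corpus",
    corpus = "my-corpus"
)

# Pass the generated data explicitly to the main functions:
pkgmatch_similar_pkgs (
    "find packages for outbreak analysis",
    embeddings = dat$embeddings,
    idfs = dat$idfs
)
```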

mpadge added a commit that referenced this issue Dec 5, 2024
@Bisaloo
Contributor Author

Bisaloo commented Dec 5, 2024

> But prior to that, you could just try running it on a local corpus to see what you get, and then passing the results as explicit idfs or embeddings parameters to main fns. I'll pause there to await any feedback you might have ... 👍

This is what I have tried here: https://github.com/epiverse-connect/epiverse-pkgmatch, with the corpus from https://epiverse-connect.r-universe.dev/.

It currently only uses embeddings, as I didn't know how to compute idfs on this corpus. I managed to adapt the script you shared above (thanks!) to compute idfs, but I haven't had time yet to check whether it gives better results than embeddings alone.

Overall, the results are reasonably good: there is nothing outrageous, and we often find the expected results, but we also miss some of them.

We will meet with my colleagues soon to determine if we go with this approach or something more custom.

@Bisaloo
Contributor Author

Bisaloo commented Dec 5, 2024

I think asking the users to generate their own embeddings & idfs is good enough. The main ask was to have a mirror function of pkgmatch_embeddings_from_pkgs() for the idfs argument.

The pkgmatch_generate_data() function could go one step further, but I would only go for it if the maintenance burden is low.

I think it's fair to expect users who want to use their own corpus to be able to jump through a couple of well-documented hoops.

@mpadge
Member

mpadge commented Dec 5, 2024

It will actually lower the maintenance burden, because it can simply be called to do complete local updates, replacing the current script. I think this new function is a sensible generalization of that. I ran it locally on < 10 repos, and the whole thing only took a minute or two, so it seems to work pretty well. I'll nevertheless also address more explicitly what you're asking for. Thanks!

@Bisaloo
Contributor Author

Bisaloo commented Dec 5, 2024

Would you like to open a PR from newdata for more specific feedback?
