pkgmatch: Find R Packages Matching Either Descriptions or Other R Packages #671

Open
14 of 30 tasks
mpadge opened this issue Nov 7, 2024 · 18 comments
@mpadge
Member

mpadge commented Nov 7, 2024

Submitting Author Name: Mark Padgham
Submitting Author Github Handle: @mpadge
Repository: https://github.com/ropensci-review-tools/pkgmatch
Version submitted: 0.4.2
Submission type: Standard
Editor: @MargaretSiple-NOAA
Reviewers: @agricolamz

Due date for @agricolamz: 2025-01-31

Archive: TBD
Version accepted: TBD
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: pkgmatch
Title:  Find R Packages Matching Either Descriptions or Other R Packages
Version: 0.4.2
Authors@R: c(
    person("Mark", "Padgham", , "[email protected]", role = c("aut", "cre"),
           comment = c(ORCID = "0000-0003-2172-5265")),
    person("Davis", "Vaughan", , "[email protected]", role = c("ctb"))
    )
Description: Find R packages matching either descriptions or other R packages.
License: MIT + file LICENSE
URL: https://docs.ropensci.org/pkgmatch/,
    https://github.com/ropensci-review-tools/pkgmatch
BugReports: https://github.com/ropensci-review-tools/pkgmatch/issues
Imports:
    brio,
    checkmate,
    cli,
    curl,
    dplyr,
    fs,
    httr2,
    memoise,
    pbapply,
    Rcpp,
    rvest,
    tibble,
    tidyr,
    tokenizers,
    treesitter,
    treesitter.r,
    vctrs
Suggests:
    gert,
    hms,
    httptest2,
    jsonlite,
    piggyback,
    pkgbuild,
    rappdirs,
    roxygen2,
    testthat (>= 3.0.0),
    withr,
    knitr,
    rmarkdown
LinkingTo:
    Rcpp
Depends: R (>= 3.5.0)
NeedsCompilation: yes
Encoding: UTF-8
Language: en-GB
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
Config/testthat/edition: 3
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
    • rOpenSci tools
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

Data retrieval, because the package includes code to generate language model (LM) embeddings from all R packages retrieved from both CRAN and rOpenSci package repositories. Wrapper, because LM embeddings are generated by wrapping an interface to the ollama software. Plus I've inserted a new, one-off category of "rOpenSci tools" for internal, staff-curated packages.

  • Who is the target audience and what are scientific applications of this package?

Beyond internal rOpenSci use, target audiences are (1) an entirely general audience of those interested in searching R packages using either text or code input, and (2) package developers, who can use this package to identify packages or functions similar to code they might be working on.

No, not at all. There are to my knowledge two other R packages for interfacing with LMs: tidyllm and elmer. Both of these are general interfaces to LM API endpoints, while this package specifically uses LM outputs to identify best-matching packages.
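
As a rough illustration of what using LM outputs to identify best-matching packages can look like, the sketch below ranks packages by cosine similarity between embedding vectors. The embeddings are invented three-dimensional toy values and the helper `cosine_sim()` is hypothetical. The bot report in this thread lists an internal function named `cosine_similarity`, but this is not pkgmatch's actual code.

```r
# Toy embeddings: one column per package (values invented for illustration).
pkg_emb <- cbind(
    pkgA = c(1, 0, 0),
    pkgB = c(0.8, 0.6, 0),
    pkgC = c(0, 0, 1)
)
# Embedding of the query text (also invented).
query <- c(1, 0.1, 0)

# Hypothetical helper: cosine similarity between a vector and each matrix column.
cosine_sim <- function(x, m) {
    as.vector(crossprod(m, x)) / (sqrt(colSums(m^2)) * sqrt(sum(x^2)))
}

sims <- cosine_sim(query, pkg_emb)
names(sims) <- colnames(pkg_emb)

# Package names ordered from most to least similar to the query.
ranked <- names(sort(sims, decreasing = TRUE))
print(ranked)
```

Real embeddings have hundreds of dimensions, but the ranking step would be the same in principle.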

Not applicable.

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

  • Explain reasons for any pkgcheck items which your package is unable to pass.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

@ropensci-review-bot
Collaborator

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for pkgmatch (v0.4.2)

git hash: f12ad732

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 79.9%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 545
internal pkgmatch 204
internal utils 25
internal stats 12
internal tools 5
imports fs 47
imports checkmate 20
imports dplyr 17
imports memoise 13
imports treesitter 8
imports httr2 7
imports pbapply 5
imports brio 2
imports rvest 2
imports tibble 2
imports tokenizers 2
imports treesitter.r 1
imports cli NA
imports curl NA
imports Rcpp NA
imports tidyr NA
imports vctrs NA
suggests gert 2
suggests jsonlite 2
suggests hms 1
suggests piggyback 1
suggests httptest2 NA
suggests pkgbuild NA
suggests rappdirs NA
suggests roxygen2 NA
suggests testthat NA
suggests withr NA
suggests knitr NA
suggests rmarkdown NA
linking_to Rcpp NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.
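
A minimal sketch of that kind of tally, using a small invented data frame in place of the real 'external_calls' table. The column names `package` and `call` are assumptions made for this example and may not match what pkgstats actually returns.

```r
# In practice the table would come from something like:
#   s <- pkgstats::pkgstats("<path/to/repo>")
#   calls <- s$external_calls
# A toy stand-in is used here so the example runs on its own.
calls <- data.frame(
    package = c("base", "base", "fs", "fs", "fs", "dplyr"),
    call    = c("lapply", "lapply", "path", "path", "dir_ls", "left_join")
)

# Number of calls per package, most-used first.
ncalls <- sort(table(calls$package), decreasing = TRUE)
print(ncalls)

# Number of calls per function within a single package.
fs_calls <- table(calls$call[calls$package == "fs"])
print(fs_calls)
```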

base

lapply (46), data.frame (31), which (27), names (24), vapply (23), length (19), nrow (17), grep (16), c (15), list (15), paste0 (14), seq_len (12), character (11), gsub (11), as.integer (10), by (10), ncol (10), tryCatch (10), unname (10), url (10), order (9), grepl (7), colnames (6), for (6), integer (6), readRDS (6), unlist (6), version (6), basename (5), colSums (5), format (5), ifelse (5), raw (5), seq (5), seq_along (5), tempdir (5), all (4), as.Date (4), asNamespace (4), do.call (4), is.null (4), strsplit (4), system (4), attr (3), difftime (3), getOption (3), log (3), matrix (3), proc.time (3), read.dcf (3), sqrt (3), table (3), unique (3), as.character (2), cbind (2), floor (2), is.na (2), ls (2), match (2), min (2), nzchar (2), options (2), regmatches (2), rowSums (2), sum (2), units (2), any (1), apply (1), as.matrix (1), class (1), cut (1), drop (1), file (1), gregexpr (1), list.files (1), logical (1), mean (1), new.env (1), parseNamespaceFile (1), paste (1), rank (1), rbind (1), readline (1), regexpr (1), rep (1), sort (1), switch (1), Sys.Date (1), Sys.getenv (1), Sys.time (1), system.file (1), tolower (1), unclass (1), vector (1)

pkgmatch

bm25_tokens_list (8), get_embeddings (7), not_null_index (7), bm25_idf (6), get_pkg_text (6), get_pkg_code (5), pkgmatch_bm25 (5), cosine_similarity (4), dl_prev_data (4), pkgmatch_embeddings_from_pkgs (4), rm_fns_from_pkg_txt (4), bm25_tokens (3), get_all_fn_descs (3), get_cache_file_name (3), get_embeddings_from_ollama (3), jina_model (3), pkgmatch_bm25_from_idf (3), pkgmatch_load_data (3), pkgmatch_treesitter_fn_tags (3), append_cols (2), attach_ns (2), bm25_idf_internal (2), bm25_tokens_internal (2), bm25_tokens_list_internal (2), days_in_this_month (2), dl_one_tarball (2), extract_tarball (2), get_calls (2), get_calls_in_functions (2), get_embeddings_intern (2), get_fn_defs_namespace (2), get_fn_descs_from_ns (2), get_local_pkg_dep_fns (2), get_local_pkg_deps (2), get_pkg_readme (2), get_pkg_text_internal (2), get_pkg_text_namespace (2), input_is_pkg (2), is_docker_sudo (2), is_windows (2), list_new_cran_updates (2), load_data_internal (2), m_list_remote_files (2), ollama_dl_jina_model (2), opt_is_quiet (2), pkg_fns_from_r_search (2), pkg_fns_from_r_search_internal (2), pkg_is_installed (2), pkg_name_from_path (2), pkgmatch_bm25_fn_calls (2), pkgmatch_bm25_fn_calls_internal (2), pkgmatch_bm25_from_idf_internal (2), pkgmatch_bm25_internal (2), pkgmatch_cache_path (2), pkgmatch_update_cran (2), append_data_to_bm25 (1), append_data_to_embeddings (1), append_data_to_fn_calls (1), apply_col_names (1), attach_base_rcmd_ns (1), attach_local_dep_namespaces (1), attach_this_pkg_namespace (1), convert_paths_to_pkgs (1), desc_template (1), extract_data_from_local_dir (1), fn_names_base (1), fn_names_rcmd (1), get_fn_defs_local (1), get_pkg_exported_fns (1), get_pkg_text_local (1), has_ollama (1), has_ollama_docker (1), has_ollama_local (1), head.pkgmatch (1), input_is_path (1), input_mentions_functions (1), make_cran_version_column (1), modify_by_lm_prop (1), ollama_check (1), ollama_has_jina_model (1), ollama_is_running (1), ollama_models (1), order_output (1), 
pkg_install_path (1), pkgmatch_browse (1), pkgmatch_cache_update_interval (1), pkgmatch_dl_data (1), pkgmatch_embeddings_from_text (1), pkgmatch_rerank (1), pkgmatch_similar_fns (1), pkgmatch_similar_pkgs (1), pkgmatch_update_data (1), pkgmatch_update_ropensci (1), rcmd_pkgs (1), rcpp_bm25 (1), registry_daily_chunk (1), rename_files_in_r (1), ros_registry (1), similar_pkgs_from_pkg (1), similar_pkgs_from_pkg_internal (1), similarity_embeddings (1), tok_lists_to_idfs (1), tressitter_calls_in_package (1)

fs

path (19), dir_ls (9), path_temp (7), dir_create (5), path_ext (3), file_exists (1), file_info (1), path_ext_set (1), path_real (1)

utils

installed.packages (4), lsf.str (4), data (3), packageDescription (3), prompt (3), tar (2), untar (2), browseURL (1), getFromNamespace (1), tail (1), timestamp (1)

checkmate

assert_character (7), assert_integerish (3), assert_matrix (2), assert_names (2), assert_numeric (2), check_file_exists (2), assert_list (1), assert_logical (1)

dplyr

left_join (8), rename (3), mutate (2), last_col (1), n (1), relocate (1), summarise (1)

memoise

memoise (13)

stats

dt (5), start (3), end (2), line (2)

treesitter

query_captures (3), node_text (2), parser (1), parser_parse (1), tree_root_node (1)

httr2

req_headers (2), request (2), resp_body_json (2), req_perform (1)

pbapply

pblapply (5)

tools

parse_Rd (2), Rd_db (2), CRAN_package_db (1)

brio

read_lines (2)

gert

git_clone (2)

jsonlite

read_json (2)

rvest

html_table (1), read_html (1)

tibble

new_tibble (2)

tokenizers

count_words (1), tokenize_words (1)

hms

hms (1)

piggyback

pb_download (1)

treesitter.r

language (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties

The package has:

  • code in C++ (5% in 2 files) and R (95% in 20 files)
  • 1 authors
  • 6 vignettes
  • no internal data file
  • 17 imported packages
  • 14 exported functions (median 14 lines of code)
  • 218 non-exported functions in R (median 12 lines of code)
  • 4 R functions (median 12 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.
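
The flagging rule can be sketched as follows, using four rows copied from the table below. This only illustrates the "upper or lower 5th percentile" rule as stated. The cut-offs of 5 and 95 are one reading of that sentence, and the actual pkgcheck logic may apply further conditions (for instance, files_vignettes sits at 96.8 in the table without being flagged).

```r
stats <- data.frame(
    measure    = c("loc_src", "n_fns_r", "num_vignettes", "doclines_per_fn_not_exp"),
    percentile = c(13.6, 90.5, 97.6, 0.0)
)

# Flag values lying in the lowest or highest 5% (assumed cut-offs).
stats$noteworthy <- stats$percentile < 5 | stats$percentile > 95
print(stats)
```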

measure value percentile noteworthy
files_R 20 79.8
files_src 2 79.5
files_vignettes 6 96.8
files_tests 10 87.4
loc_R 1905 81.1
loc_src 91 13.6
loc_vignettes 463 74.2
loc_tests 694 77.2
num_vignettes 6 97.6 TRUE
n_fns_r 232 90.5
n_fns_r_exported 14 56.0
n_fns_r_not_exported 218 93.0
n_fns_src 4 21.1
n_fns_per_file_r 7 79.5
n_fns_per_file_src 2 27.8
num_params_per_fn 2 8.2
loc_per_fn_r 12 36.8
loc_per_fn_r_exp 14 33.6
loc_per_fn_r_not_exp 12 39.8
loc_per_fn_src 12 38.9
rel_whitespace_R 24 85.6
rel_whitespace_src 26 21.8
rel_whitespace_vignettes 20 57.6
rel_whitespace_tests 21 77.1
doclines_per_fn_exp 28 29.2
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 187 87.0

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks

3a. Continuous Integration Badges

R-CMD-check

GitHub Workflow Results

id name conclusion sha run_number date
11727438106 docker skipped f12ad7 23 2024-11-07
11727438100 pkgcheck NA f12ad7 96 2024-11-07
11727438103 R-CMD-check success f12ad7 292 2024-11-07
11727438110 test-coverage success f12ad7 292 2024-11-07
11727438101 Update pkgmatch data NA f12ad7 66 2024-11-07

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 79.93

Cyclocomplexity with cyclocomp

The following function have cyclocomplexity >= 15:

function cyclocomplexity
get_pkg_readme 17

Static code analyses with lintr

lintr found no issues with this package!


Package Versions

package version
pkgstats 0.2.0.47
pkgcheck 0.1.2.63


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

@emilyriederer

@ropensci-review-bot assign @MargaretSiple-NOAA as editor

@ropensci-review-bot
Collaborator

Assigned! @MargaretSiple-NOAA is now the editor

@MargaretSiple-NOAA

@ropensci-review-bot seeking reviewers

@ropensci-review-bot
Collaborator

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/671_status.svg)](https://github.com/ropensci/software-review/issues/671)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

@MargaretSiple-NOAA

(just a note: I put 'seeking reviewers' before putting in my editor checks but I have not forgotten them. I started them today and will continue tomorrow.)

@MargaretSiple-NOAA

MargaretSiple-NOAA commented Dec 4, 2024

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users? - Yes but see notes below.
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling? Yes, but see notes below. I think I just can't fully evaluate this until I can install Docker and ollama on my personal computer and test the functions.
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments

Nice package, @mpadge! This will be useful for anyone working on package development or review. I wrote a few notes on what I saw during the editor check process. They're longer than I usually write, but I figure these will probably make the reviewers' lives easier, so it doesn't hurt to mention them early.

A few notes of mine from the editor check process:

  • The pkgdown page seems to indicate that the user can set the corpus to rOpenSci or to CRAN, but some functions only provide search results for rOpenSci, e.g., pkgmatch_bm25() requires the corpus to be rOpenSci-related. Some clarity about which functions apply to which package corpora might be helpful.

  • I got 1 note from running devtools::check(): "Package suggested but not available for checking: 'piggyback'". I don't worry about these too much; I just wanted to flag it in case.

  • The ollama connection gave me some trouble at the start, so I imagine future users might struggle with this. I recommend revising the documentation a little bit to make it more obvious what needs to happen with ollama before users can get started: a) Consider adding a sentence to the top of the "Get started" article that indicates that the docker and ollama installs need to happen first and foremost. Here, I would link to the "ollama" article. b) Change the title of the "ollama" article to say something like "Before you begin: ollama" or "Before you begin: ollama installation", so that when people are browsing, it's very clear to them that they'll need to read that article first.

  • The beginning of the Docker section in the ollama article should indicate that users need to have Docker installed as well. I learned from this process that my institution actually doesn't have a Docker license! This may impede editors/reviewers who work at NOAA. It sounds like they're working on getting a license but for now I don't have access to it.

  • I had some trouble getting test coverage my usual way (covr::package_coverage() fails with an error about a temp file), but devtools::test() works great. Some tests are failing, but I'm pretty sure that's because I don't have ollama properly installed on my machine. It might be nice to have a one-line check in one of the early vignettes that can show people whether they are missing any of the components needed to run the functions.
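
A minimal sketch of such a check, assuming the only requirement is that the `ollama` or `docker` executables can be found on the system PATH. The helper name `check_components()` is hypothetical and not part of pkgmatch, and it does not verify that an ollama server is actually running or that any model has been downloaded.

```r
# Hypothetical helper: report which command-line tools are found on the PATH.
check_components <- function(tools = c("ollama", "docker")) {
    found <- nzchar(Sys.which(tools))
    names(found) <- tools
    found
}

status <- check_components()
print(status)
if (!any(status)) {
    message("Neither ollama nor docker was found on the PATH.")
}
```

Calling something like this at the top of a vignette would make a missing installation visible before any of the embedding functions are tried.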


@mpadge
Member Author

mpadge commented Dec 5, 2024

Thanks @MargaretSiple-NOAA for really useful feedback! The issue linked above has details of changes made in response. I'd say the most important of those for reviewers of this package is that I've added tests/README explaining that tests can be run without having ollama installed or running anywhere.

@MargaretSiple-NOAA

@ropensci-review-bot Add @agricolamz as reviewer

@ropensci-review-bot
Collaborator

@agricolamz added to the reviewers list. Review due date is 2025-01-02. Thanks @agricolamz for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot
Collaborator

@agricolamz: If you haven't done so, please fill this form for us to update our reviewers records.

@MargaretSiple-NOAA

@ropensci-review-bot set due date for @agricolamz to 2025-01-31

@ropensci-review-bot
Collaborator

Review due date for @agricolamz is now 31-January-2025

@MargaretSiple-NOAA

Hi @mpadge -- I was just revising my editor checks above after your revisions, and everything is looking good except some issues I had with devtools::test(). These are documented in the issue I logged on the pkgmatch GitHub page (above). I don't know what they mean, but I think we should have a resolution before reviewers run similar checks.

@mpadge
Member Author

mpadge commented Jan 9, 2025

Thanks @MargaretSiple-NOAA, I've fixed the issue linked above, so devtools::test() should now work for everybody.

@MargaretSiple-NOAA

Thank you for addressing that, @mpadge! The tests now run without issues.

Two more quick things:

  • pkgmatch_similar_pkgs(..., corpus = "cran") always gives a warning: "Error in matrix(emb[[what]], nrow = nrow, ncol = npkgs) : non-numeric matrix extent". This renders fine on the pkgdown page here, so I suspect it is a computer-specific issue. Everything works as expected when corpus = "ropensci".
  • Several of the tests fail and trace back to parseNamespaceFile(pkg_name, package.lib = lp). I think it's fine to fix this when reviews are in but wanted to give you a heads-up while I'm still searching for a Reviewer 2.
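
For reference, base R raises the "non-numeric matrix extent" message whenever the `nrow` or `ncol` argument of matrix() is not numeric, for example when a count that was expected to be an integer turns out to be NULL. The sketch below reproduces the message with made-up values. `emb` and `npkgs` here are stand-in variables, and this is not a diagnosis of what actually happens inside pkgmatch.

```r
emb <- seq_len(6)   # stand-in for embedding values
npkgs <- NULL       # e.g. a package count that was never filled in

# Capture the error message rather than stopping the script.
msg <- tryCatch(
    {
        matrix(emb, nrow = 2, ncol = npkgs)
        NA_character_   # only reached if no error is raised
    },
    error = function(e) conditionMessage(e)
)
print(msg)
```

If something similar is happening with the "cran" corpus, checking whether the values passed as `nrow` and `ncol` are non-NULL numbers just before the matrix() call might help narrow it down.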
