Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce package(s) size #87

Closed
pawelru opened this issue Nov 14, 2024 · 12 comments
Closed

Reduce package(s) size #87

pawelru opened this issue Nov 14, 2024 · 12 comments
Assignees
Labels

Comments

@pawelru
Copy link
Collaborator

pawelru commented Nov 14, 2024

Have a look at the formatters:

❯ R CMD BUILD formatters
* checking for file ‘formatters/DESCRIPTION’ ... OK
* preparing ‘formatters’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘formatters_0.5.9.9003.tar.gz’

❯ tar -tzvf formatters_0.5.9.9003.tar.gz | sort -k3 -hr
-rw-r--r--  0 ruckip staff 1847812 Dec  6  2023 formatters/man/figures/logo.png
-rw-r--r--  0 ruckip staff  925622 Nov  6  2023 formatters/data/ex_advs.rda
-rw-r--r--  0 ruckip staff  629211 Nov  6  2023 formatters/data/ex_adqs.rda
-rw-r--r--  0 ruckip staff  493724 Nov  6  2023 formatters/data/ex_adlb.rda
-rw-r--r--  0 ruckip staff   58553 Jun 10 11:05 formatters/R/pagination.R
-rw-r--r--  0 ruckip staff   58457 Nov  6  2023 formatters/data/ex_adae.rda
-rw-r--r--  0 ruckip staff   56998 Jun 10 11:05 formatters/R/tostring.R
-rw-r--r--  0 ruckip staff   50367 Nov  6  2023 formatters/data/ex_admh.rda
-rw-r--r--  0 ruckip staff   48949 Nov  6  2023 formatters/data/ex_adcm.rda
-rw-r--r--  0 ruckip staff   45794 Oct 11 10:16 formatters/R/matrix_form.R
-rw-r--r--  0 ruckip staff   43801 Nov  6  2023 formatters/data/ex_adaette.rda
-rw-r--r--  0 ruckip staff   42991 Nov  6  2023 formatters/data/ex_adrs.rda
-rw-r--r--  0 ruckip staff   42770 Nov  6  2023 formatters/data/ex_adtte.rda
-rw-r--r--  0 ruckip staff   28158 Aug 13 14:33 formatters/tests/testthat/test-formatters.R
-rw-r--r--  0 ruckip staff   25210 Nov 14 18:02 formatters/inst/doc/formatters.html
-rw-r--r--  0 ruckip staff   23892 Jun 10 11:05 formatters/R/mpf_exporters.R
(...)

when untared:

❯ total_size=$(du -sk . | awk '{print $1}')
find . -type f -exec du -sk {} + | awk -v total_size=$total_size '{printf "%s %d KB (%.2f%%)\n", $2, $1, ($1/total_size)*100}' | sort -k2,2nr
./man/figures/logo.png 1808 KB (37.05%)
./data/ex_advs.rda 904 KB (18.52%)
./data/ex_adqs.rda 616 KB (12.62%)
./data/ex_adlb.rda 484 KB (9.92%)
./R/pagination.R 60 KB (1.23%)
  • 37% of package size is consumed by the logo alone!
  • looking into the codebase of that repo - we are not using ex_ datasets at all. But it might be used elsewhere. There are few packages that are using ex_adlb but I cannot find any usage of ex_advs and ex_adqs.

Adding this in general task repo to repeat the analysis for multiple packages and find a common solution.

It's important because of:

  • R CMD CHECK note about size
  • WebR with package downloads on the client (website user) side so we aim to have small and fast installable packages. I'm seeing two ways of doing this: limiting deps (not discussed here) and reducing the size (this issue). Currently it's neither good or bad. Have a look at the below extract from a random shinylive teal app in the tlg-catalog (devtools in chrome - network tab - open shinylive tab). This is filtered to downloaded .tgz files and order by size.
    image
@llrs-roche
Copy link

llrs-roche commented Dec 10, 2024

Reducing the logo file size is relatively easy (using an online tool reduced the size by 70% to ~500KB). I also think that the size of the logo should be checked, as I found that for formatters is quite big 1299x1501 (compared it with teal logo at teal/man/figures/teal.png of 278x321 it is >20 times bigger).

Dimensions of the logos found:

teal/man/figures/teal.png
[1] 278 321
rtables/man/figures/logo.png:
[1] 1501 1501
formatters/man/figures/logo.png:
[1] 1299 1501

> file.size("../teal/man/figures/teal.png")
[1] 79859
> file.size("../rtables/man/figures/logo.png")
[1] 1929788
> file.size("../formatters/man/figures/logo.png")
[1] 498770

I think that only compressing can be enough, but we should reduce the file size of the logos for rtables and formatters.
Note also that teal and formatters' logos have the same proportion but rtables one is a square.

Checking for objects/datasets not used can be more complicated because it is harder to automate as they might be used internally or by other repositories, and I don't know a good method to automatically extract which objects are datasets from an R package (checking https://pharmaverse.r-universe.dev/datasets, reports 33 datasets from teal.* packages but it is harder to explore) .

llrs-roche added a commit to insightsengineering/formatters that referenced this issue Dec 10, 2024
[It has been
detected](insightsengineering/nestdevs-tasks#87)
that the formatters logo is quite big in comparison to other packages.
This PR compress the logo so it no longer takes so much of the package
size.
@llrs-roche
Copy link

llrs-roche commented Dec 13, 2024

Logos were compressed to reduce size. After the two compressors only a third of the size was left ~500Kb. As described above, logos have different dimension size, reducing it to the minimum would help too to avoid bloating packages.

Manual check of datasets & usage

Go to r-universe.dev/datasets and filter by "teal." => There are 33 datasets
11 from formatters.
Those with an x are those already reviewed. If there is no usage at all edit the line to add "No usage" or something similar.
Note, Github search is not reliable when filtering out some code ("ex_advs -tmc_ex_advs"): some results that only have ex_advs (as per a search with just "ex_advs") will disappear when negating tmc_ex_advs.

From the datasets name (and size) there seems to be some data duplication or reexport (for example rADTTE is in 4 different packages). One way to reduce package size would be to move all these datasets to a single data package.

@m7pr m7pr self-assigned this Dec 13, 2024
@llrs-roche llrs-roche self-assigned this Dec 13, 2024
@m7pr
Copy link

m7pr commented Dec 13, 2024

From the datasets name (and size) there seems to be some data duplication or reexport (for example rADTTE is in 4 different packages). One way to reduce package size would be to move all these datasets to a single data package.

@llrs-roche in the past we had all datasets in one package, but we decided to delete this package and just move data directly to packages where the data is used. That's why you see some duplications

@m7pr
Copy link

m7pr commented Dec 13, 2024

However for cases like this, where multiple packages already depend on each other, and they all are using the same dataset we can delete rADAE from teal.modules.general and reuse this data from teal.data

image

@pawelru
Copy link
Collaborator Author

pawelru commented Dec 13, 2024

@llrs-roche thanks for your work on this!
Would you be able to check the logos for other packages as well? E.g. logo for rlistings is also quite big ~2MB. If we already started doing this - let's do this for all pkgs and close this topic for good.

@llrs-roche
Copy link

@pawelru there are ~40 logos to deal with on the organization, and I don't have a script to automate this...
I'll keep it in mind for next week as I want to work on insightsengineering/teal#1323 to give time to the team to give feedback before we take holidays.

@pawelru
Copy link
Collaborator Author

pawelru commented Dec 13, 2024

Yes - that's somewhat expected. I don't think it would be ~40 - rather closer to ~20 but still quite a lot of work. I'm aware. Hopefully not all of them would require action but I just wanted do this completely and never come back to it.

m7pr added a commit to insightsengineering/teal.modules.hermes that referenced this issue Dec 13, 2024
In the spirit of package size reduction
insightsengineering/nestdevs-tasks#87
I replaced all the calls for tmg datasets with their respective datasets
from teal.data.
This way we no longer need to have datasets copies in tmg.

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
m7pr added a commit to insightsengineering/teal.transform that referenced this issue Dec 13, 2024
In the spirit of package size reduction
insightsengineering/nestdevs-tasks#87
I replaced all the calls for tmg datasets with their respective datasets
from teal.data.
This way we no longer need to have datasets copies in tmg.
@llrs-roche llrs-roche mentioned this issue Dec 13, 2024
3 tasks
@m7pr
Copy link

m7pr commented Dec 16, 2024

Alrighty, @pawelru and @llrs-roche I think we can reduce logos during #93

And when it comes to datasets - 3 packages can be simplified and reuse data from teal.data #87 (comment)

Lastly @llrs-roche I need to know your recommendations on those formatters datasets
image

m7pr added a commit to insightsengineering/teal.modules.general that referenced this issue Dec 16, 2024
In the spirit of package size reduction
insightsengineering/nestdevs-tasks#87
I replaced all the calls for tmg datasets with their respective datasets
from teal.data.
This way we no longer need to have datasets copies in tmg.

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
@llrs-roche
Copy link

Those datasets are from formatters, and some overlap by name with those on tmc. However, between tmc and formatters there are some differences often tmc datasets have fewer rows and some have fewer columns. Sometimes a similar dataset is present on tern. formatters data is used by other packages like rlistings on the vignettes.

There is no dependency controlled by us shared between formatters and tmc, so I wouldn't remove the data from either of them to add a new dependency from one to the other. This would complicate the installation for minimal gains.

@m7pr
Copy link

m7pr commented Dec 16, 2024

Alrighty, thanks @llrs-roche for the review. I guess we can close this issue for now?

@llrs-roche
Copy link

Yes,, we can close and keep track of the logos size on #93

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants