Compress data files to save space #94

Open. jamestwebber wants to merge 1 commit into base: main

jamestwebber

This PR was motivated by a discussion about PEP 639 which might recommend using this package in build tools. In that context, package size is a big concern.

The package is about 1.2 MB installed, and the majority of that is due to scancode-licensedb-index.json. I gzipped the data file and modified the code to read the compressed version. The JSON compresses to <10% of its original size, and the tests all pass.
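
For readers who want to see the shape of this change, here is a minimal sketch of writing and reading a gzipped JSON file with the standard library. The function names and the sample records are hypothetical and are not taken from the PR diff.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_gzipped_json(data, path):
    """Serialize data as compact JSON and gzip it to path."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(data, f, separators=(",", ":"))


def read_gzipped_json(path):
    """Decompress and parse a gzipped JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Round-trip demo on made-up records (not the real license index).
sample = [{"license_key": "mit", "spdx_license_key": "MIT"}] * 100
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "sample.json.gz"
    write_gzipped_json(sample, gz_path)
    restored = read_gzipped_json(gz_path)
    compressed_size = gz_path.stat().st_size
plain_size = len(json.dumps(sample, separators=(",", ":")))
```

The ratio on this repetitive sample says nothing about the real index; the <10% figure above is the author's own measurement.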

Member

@pombredanne left a comment

Thanks!
The zip compression used by wheels is indeed pretty weak.
I wonder if we can do even better with lzma, which has been in the standard library since Python 3.3?
Also, I would prefer to avoid having the compressed JSON in Git if at all possible, to keep proper diffs and keep the repo as small as can be.

What about this:

  • modify the code to accept either the json or lzma compressed input
  • add the compressed version to .gitignore
  • update the build to use flot https://github.com/aboutcode-org/flot with a small prebuild script that will do the compression as part of the build
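
The first and third steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `compress_index` and `load_index` are hypothetical names, the `.xz` file is assumed to sit next to the plain JSON, and how flot would invoke the prebuild step is not shown because it is not described here.

```python
import json
import lzma
import tempfile
from pathlib import Path


def compress_index(json_path):
    """Prebuild step: write an .xz copy next to the plain JSON file."""
    json_path = Path(json_path)
    xz_path = json_path.with_name(json_path.name + ".xz")
    xz_path.write_bytes(lzma.compress(json_path.read_bytes(), preset=9))
    return xz_path


def load_index(json_path):
    """Load from the .xz copy if it exists, else from the plain JSON."""
    json_path = Path(json_path)
    xz_path = json_path.with_name(json_path.name + ".xz")
    if xz_path.exists():
        with lzma.open(xz_path, "rt", encoding="utf-8") as f:
            return json.load(f)
    with json_path.open(encoding="utf-8") as f:
        return json.load(f)


# Demo with a throwaway file: both code paths return the same data.
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "index.json"
    plain.write_text(json.dumps([{"license_key": "mit"}]), encoding="utf-8")
    from_plain = load_index(plain)   # only the plain file exists
    compress_index(plain)
    from_xz = load_index(plain)      # the .xz copy is now preferred
```

With this layout a Git checkout would keep working from the plain JSON, while a built wheel could ship only the `.xz` file.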

@jamestwebber
Author

That all sounds reasonable, but I don't have time at the moment (this version was super easy 😅). I can try to make those changes next week, or someone else can take over.

@pombredanne
Member

Actually in the context of https://discuss.python.org/t/pep-639-round-3-improving-license-clarity-with-better-package-metadata/53020/1 I think we can do better.

We can build a minimal license-expression-mini wheel that would contain a subset of the license data: say, just the essential license fields, stored as a list of tuples of values with no field names.

$ wget https://raw.githubusercontent.com/nexB/license-expression/c20b3f605daefc7cd9e4dc7b34e95280f206def3/src/license_expression/data/scancode-licensedb-index.json
$ ll
total 868
drwxrwxr-x  2 foobar foobar   4096 May 10 17:56 ./
drwxrwxrwx 84 foobar foobar  12288 May 10 17:56 ../
-rw-rw-r--  1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json
$ python
Python 3.10.13 (main, Jan  6 2024, 18:44:10) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> with open("scancode-licensedb-index.json") as f:
...     j = json.load(f)
...
>>> # Drop the json/yaml/html/license fields and keep the remaining values, without keys.
>>> skip = {"json", "yaml", "html", "license"}
>>> mini = [[v for k, v in l.items() if k not in skip] for l in j]
>>> with open("mini.json", "w") as o:
...     json.dump(mini, o, separators=(",", ":"))
...
>>>
$ ll
total 1056
drwxrwxr-x  2 foobar foobar   4096 May 10 18:00 ./
drwxrwxrwx 84 foobar foobar  12288 May 10 17:56 ../
-rw-rw-r--  1 foobar foobar 191708 May 10 18:00 mini.json
-rw-rw-r--  1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json
$ xz -z -k -9 mini.json 
$ ll
total 1080
drwxrwxr-x  2 foobar foobar   4096 May 10 18:01 ./
drwxrwxrwx 84 foobar foobar  12288 May 10 17:56 ../
-rw-rw-r--  1 foobar foobar 191708 May 10 18:00 mini.json
-rw-rw-r--  1 foobar foobar  23704 May 10 18:00 mini.json.xz
-rw-rw-r--  1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json

It would be down to 23K of compressed data :)
I still would want to use flot to generate multiple wheels from the same repo and keep the current wheel as-is.
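
Rows stored without keys can only be read back if the field order is fixed somewhere in code. A minimal sketch of that idea, using Python's lzma module rather than the xz command line: the `FIELDS` tuple, the function names, and the sample records are assumptions for illustration and do not reflect the actual index schema.

```python
import json
import lzma

# Hypothetical field order; the real index may contain other fields.
FIELDS = ("license_key", "spdx_license_key", "is_exception")


def pack(records):
    """Turn dict records into keyless rows, then xz-compress the compact JSON."""
    rows = [[rec[name] for name in FIELDS] for rec in records]
    payload = json.dumps(rows, separators=(",", ":")).encode("utf-8")
    return lzma.compress(payload, preset=9)


def unpack(blob):
    """Decompress the rows and rebuild dict records using the fixed field order."""
    rows = json.loads(lzma.decompress(blob).decode("utf-8"))
    return [dict(zip(FIELDS, row)) for row in rows]


# Round-trip demo on made-up records.
records = [
    {"license_key": "mit", "spdx_license_key": "MIT", "is_exception": False},
    {"license_key": "apache-2.0", "spdx_license_key": "Apache-2.0", "is_exception": False},
]
restored = unpack(pack(records))
```

The trade-off is that any change to the field order has to be made in the packing and unpacking code together.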
