Privacy preserving prediction of molecular properties #90

Closed
janweinreich opened this issue Dec 4, 2023 · 12 comments
Labels
📁 Concrete ML library targeted: Concrete ML 💵 Grant accepted This project received a grant from the Zama team

Comments

@janweinreich

janweinreich commented Dec 4, 2023

Category: Application

Overview

We at VaultChem, a startup combining encryption and chemistry, aim to use Zama's Concrete ML library for FHE inference of molecular properties. Consider a scenario where pharma company A is interested in predicting properties of candidate molecules before phase I clinical trials.

In particular, understanding the processes of absorption, distribution, metabolism, and excretion (ADME) is crucial for determining a drug candidate's concentration profile at its action site, significantly impacting the drug's effectiveness.

However, A does not have sufficient data available for reliable ML predictions. Instead, A will securely obtain predictions on its molecular data from an untrusted party B that owns a secret database and an ML model trained on sufficient data. This is only possible with FHE, which guarantees that party A's secret query is never revealed to party B.

We will simulate this scenario using open-source chemistry datasets. We will provide tools (based on the cheminformatics library RDKit and on Concrete ML) to give an end-to-end solution to the problem of privacy-preserving prediction. We will deploy the app to Hugging Face (similar to the FHE image filter) and provide detailed tutorials/notebooks that explain each step. Finally, by comparing against scikit-learn implementations, we will also investigate the accuracy versus computational cost trade-off, as computational screening in cheminformatics may require fast predictions on thousands of molecules. We provide an outlook on how to account for the increased computational cost of FHE inference in the case of molecular data.
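A minimal sketch of the toolchain described above, under stated assumptions: Morgan fingerprints (radius 2, 1024 bits) as features and Concrete ML's built-in `LinearRegression` as the model. Neither choice is confirmed by this proposal, and `featurize` and `train_and_predict` are hypothetical helper names. The third-party imports sit inside the functions so that the file still loads where RDKit or Concrete ML is not installed.

```python
from typing import Sequence

FP_RADIUS = 2    # assumed Morgan fingerprint radius (not specified in the proposal)
FP_BITS = 1024   # assumed fingerprint length (not specified in the proposal)


def featurize(smiles: Sequence[str]):
    """Turn SMILES strings into a 0/1 fingerprint matrix, one row per molecule."""
    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    rows = []
    for smi in smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # RDKit returns None for SMILES it cannot parse
            raise ValueError(f"invalid SMILES: {smi!r}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, FP_RADIUS, nBits=FP_BITS)
        row = np.zeros((FP_BITS,))
        DataStructs.ConvertToNumpyArray(fp, row)  # copies the bits into `row`
        rows.append(row)
    return np.stack(rows)


def train_and_predict(train_smiles, train_targets, query_smiles):
    """Fit a quantized linear model in the clear, then predict under FHE."""
    import numpy as np
    from concrete.ml.sklearn import LinearRegression

    x_train = featurize(train_smiles)
    model = LinearRegression(n_bits=8)      # n_bits controls quantization
    model.fit(x_train, np.asarray(train_targets))
    model.compile(x_train)                  # builds the FHE circuit from sample inputs
    # fhe="execute" encrypts, runs the circuit, and decrypts in one local process.
    return model.predict(featurize(query_smiles), fhe="execute")
```

Here `fhe="execute"` handles keys, encryption, and decryption in a single process, which is convenient for checking accuracy. A real two-party setup between A and B would separate the client and server roles, which this sketch does not attempt.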

  • Total Reward: 3500 € (split by milestones)

  • Description

    • The main goal is to implement a toolchain that allows users to input molecular data and use Zama's Concrete ML library to predict the properties of these molecules. We aim to show the feasibility of Concrete ML in the field of chemistry and pharmaceutical data and to make a demo available on Hugging Face.
  • Milestones

    1. Data processing and identification of best-performing model in Concrete ML
      1. Time estimation: 3 days
      2. Reward: 2000 €
      3. Tasks:
        1. Prepare processing of chemical data using RDKit for use with ML.
        2. Compare accuracy and prediction time for various built-in Concrete ML models, identify models suitable for deployment, and train them for FHE execution.
      4. Deliverables: Scripts for data preparation and training. Save the identified best model.
    2. Model deployment on Hugging Face
      1. Time estimation: 2 days
      2. Reward: 1000 €
      3. Tasks:
        1. Deploy the model to our Hugging Face space with simple user input following the same logic as in the FHE image filter. The user inputs the molecules in SMILES format ([Simplified molecular-input line-entry system](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system)).
        2. Add visual elements that explain FHE in the case of molecular property prediction.
      4. Deliverables: All files needed for the Hugging Face space, adapted to run the model identified in the previous step. Interface written in Gradio. Input molecules from the user will be displayed graphically. In addition, include a visual representation of the FHE execution analogous to the sketch on Zama's FHE image filter on Hugging Face.
    3. Documentation and Tutorials
      1. Time estimation: 1 day
      2. Reward: 500 €
      3. Task:
        1. A Jupyter notebook, containing code examples and illustrations, showcasing the use of FHE for chemical data.
        2. Discussion on model accuracy versus computational costs for predictions.
      4. Deliverables: Notebooks with mentioned content
  • References

@zama-bot

zama-bot commented Dec 4, 2023

Hello janweinreich,

Thank you for your bounty proposition! Our team will review and add comments in your issue! In the meantime:

  1. Join the FHE.org discord server for any questions (you’ll find a dedicated #zama-bounty-program channel).
  2. Ask questions privately: [email protected].

Talk soon,

@aquint-zama
Collaborator

Good news @janweinreich, this bounty is accepted!
You can start the work, and as soon as a milestone is complete, ping us on Discord to review it and reward the corresponding amount.

@janweinreich
Author

Thank you, getting started right away!

@janweinreich
Author

Work is in full swing! However, we wanted to let you know that we are considering pivoting to a different target:
Instead of predicting the toxicity of a compound, we would like to predict properties as published in this paper
(https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.3c00160)

For instance, how much of a substance is absorbed in the liver? The reasons we want to change the dataset are:

  1. Some concern about whether it is wise to build a model that can easily make a wrong prediction (nontoxic) for a toxic compound
  2. Poor quality of the dataset we were originally planning to use
  3. We can use linear models for the other dataset with reasonable accuracy (and they are much faster under FHE)

All other points of the proposal would remain unchanged!
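To put reason 3 in perspective, here is a rough back-of-the-envelope estimate of how per-prediction latency scales to a screening run. The latencies and the `screening_hours` helper are hypothetical placeholders, not measurements from this project; substitute timings measured on your own hardware.

```python
import math


def screening_hours(n_molecules: int, seconds_per_prediction: float,
                    workers: int = 1) -> float:
    """Wall-clock hours to score n_molecules, `workers` predictions at a time."""
    if n_molecules < 0 or seconds_per_prediction < 0 or workers < 1:
        raise ValueError("counts and latencies must be non-negative, workers >= 1")
    rounds = math.ceil(n_molecules / workers)  # sequential rounds of parallel work
    return rounds * seconds_per_prediction / 3600.0


# Placeholder latencies, chosen only to illustrate the scaling.
assumed_latencies = {
    "linear model (assumed 0.5 s)": 0.5,
    "tree ensemble (assumed 30 s)": 30.0,
}
for label, latency in assumed_latencies.items():
    hours = screening_hours(10_000, latency, workers=8)
    print(f"{label}: {hours:.2f} h for 10,000 molecules on 8 workers")
```

Because the cost is linear in the number of molecules, a constant-factor gap in per-prediction latency carries over unchanged to the whole screening run.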

@aquint-zama
Collaborator

We're OK with your updates. Could you update the main issue to reflect these changes?
Thank you 🙏

@janweinreich
Author

Thank you, I made the minor changes to the main post of this issue. Most of the code is there; we just need to clean it up and document it well. I will contact you as soon as this is done!

@bcm-at-zama

Hey Jan

Regarding the HF space: in general, it’s already very nice, very clear, very straight to the point, it’s an excellent bounty!

  • could you point to https://github.com/zama-ai/concrete-ml instead of docs.zama.ai/concrete-ml, when you speak about Concrete ML, please? (except when you explicitly want to link to the documentation)
  • if you want an explanation link for FHE, maybe https://fhe.org/resources/ would be a good link
  • I like your graphics! maybe they are a bit too small, the font size looks much smaller than the HF font size
  • I miss the “expected value”, to check that things happen correctly: would it be possible to add it somewhere, or to say whether we have the expected result? E.g., I tested HLM with input molecule CC(=O)OC1=CC=CC=C1C(=O)O [I am not even sure if it was Aspirin or Ibuprofen, maybe you should log it] and I get “The Molecule CC(=O)OC1=CC=CC=C1C(=O)O has a HLM value of 0.31 (mL/min/kg)”, but I don’t know if it’s good or not

Regarding https://github.com/vaultchem/molvault:

  • there are some “ZAMAs” or “concrete-ml” typos: it’s “Zama” and “Concrete ML” (or Concrete-ML if you prefer) please. In the README, in the notebooks, maybe a grep would be needed :D
  • could you point to https://github.com/zama-ai/concrete-ml instead of docs.zama.ai/concrete-ml, when you speak about Concrete ML, please? (except when you explicitly want to link to the documentation)
  • I like your https://github.com/vaultchem/molvault/blob/main/examples/tutorial/tutorial.ipynb ! I would link to it in the README (you currently mention it without a link)
  • you seem to be tied to concrete-ml==1.3.0 today; if I were you, I would try the fresh Concrete ML 1.4, since tree-based models are about twice as fast with it

Once again: very nice work that you’ve done, I can’t wait until our marketing team can publish about it!
Cheers

@janweinreich
Author

Thanks for your feedback!

  • I will get to the points as soon as possible. All the aspects about references and visual adjustments will be fixed.

  • About the expected value: this is difficult because the database is small. If a user tests an arbitrary molecule, we simply have no reference value to compare with. But I can add a few comparisons for molecules for which we do have data

  • About the graphics, it was suggested to me to simplify the graphics further and have it as the first element before the text because this is more likely to catch attention

  • Will add a reference to the tutorial and also mention in the respective section that the timings reported were with version 1.3.0 and it is recommended to update

@bcm-at-zama

Thanks!

  • but for the HF space, in the inputs, we can only put Aspirin or Ibuprofen, so you can already find expected values, no?
  • for the graphic: sure, if you want our input, just say so
  • for the timings and requirements.txt, don't you want to redo them with CML 1.4.0? does it take too long? things really should be much faster

@janweinreich
Author

janweinreich commented Feb 2, 2024

Sure, I can check whether these molecules are in the test set and add the reference values.

No problem, I am rerunning the models as I write with the new version of Concrete ML. Timings will be updated accordingly.

@janweinreich
Author

Thank you for your patience @bcm-at-zama !

Fixes

  1. Fixed spelling

  2. The "popular" molecules aspirin and ibuprofen are not in the dataset, so we cannot compare against reference values for them. Instead, I added a comparison of the predicted value with all the values in the dataset (see screenshot) to give a sense of whether the predicted value is large or small

Screenshot from 2024-02-12 10-04-36

  3. Added a timing comparison for CML 1.3 and 1.4 (see figure). Indeed, 1.4 provides a significant speedup for XGB; however, since a linear model is just as accurate in our case, we are staying with the linear model. The figure shows the per-element prediction time (averaged over 10 samples) as a function of depth for different numbers of estimators. requirements.txt has been updated

timing

(script for timing test,
https://github.com/vaultchem/molvault/blob/main/examples/huggingface/fit_fhe/timing_test/timing_FHE.py)

  4. Updated the HF space to CML 1.4. To allow users to test the XGB models, they were uploaded to the repo
    https://github.com/vaultchem/molvault/tree/main/examples/huggingface/app/models
    and the README was updated accordingly; it now also contains a link to the tutorial
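For readers who want to reproduce the kind of per-element timing mentioned in the fixes above, a generic harness could look like the sketch below. It is not the linked `timing_FHE.py` script: `mean_prediction_time` and `slow_stub` are hypothetical names, and the stub merely stands in for a compiled model's single-sample prediction call.

```python
import statistics
import time
from typing import Callable, Sequence


def mean_prediction_time(predict: Callable[[object], object],
                         samples: Sequence[object]) -> float:
    """Mean wall-clock seconds per call when predicting one sample at a time."""
    if not samples:
        raise ValueError("need at least one sample")
    durations = []
    for sample in samples:
        start = time.perf_counter()          # monotonic, high-resolution clock
        predict(sample)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


def slow_stub(sample: object) -> float:
    """Stand-in for an FHE prediction; sleeps instead of computing."""
    time.sleep(0.01)
    return 0.0


# Average over 10 samples, mirroring the averaging described in the thread.
average = mean_prediction_time(slow_stub, list(range(10)))
print(f"mean time per prediction: {average:.4f} s")
```

Timing each sample separately, rather than one batched call, keeps the number comparable across models whose batch handling differs.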

Questions

  1. The repo uses Zama's code, unmodified.
    Still, according to https://github.com/zama-ai/concrete-ml/blob/main/LICENSE,
    do we have to include a copyright notice from Zama in the repository?

The idea was to publish the repo of the bounty under CC-BY. We want to make sure we comply with Zama's policy, including for future developments that may lead to commercial use.

  2. The demo is a chance for the startup (VaultChem) to get in touch with potential customers. Assuming the bounty is approved, how and when, if at all, does Zama plan to share the repository and the link to the demo? If possible, could you share a draft of the post ahead of time?

Thank you!

@bcm-at-zama

  • For the expected value for Aspirin and Ibuprofen, sorry but it's still not clear to me. Can't you find the real values somewhere on the internet and print them for comparison with the prediction?
  • For the rest: thanks
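Both ideas raised in this thread, printing a known reference value next to the prediction and placing the prediction within the dataset's distribution, can be combined in a few lines. This is only a sketch: `percentile_rank` and `describe` are hypothetical helpers, and the dataset values below are made-up placeholders, not HLM measurements.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def percentile_rank(value: float, reference: Sequence[float]) -> float:
    """Percentage of reference values that are less than or equal to `value`."""
    if not reference:
        raise ValueError("reference values must not be empty")
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


def describe(name: str, predicted: float, dataset_values: Sequence[float],
             known_value: Optional[float] = None) -> str:
    """Build a one-line message that puts a prediction into context."""
    rank = percentile_rank(predicted, dataset_values)
    text = (f"{name}: predicted {predicted:.2f}, "
            f"at or above {rank:.0f}% of dataset values")
    if known_value is not None:  # only shown when a reference value exists
        error = abs(predicted - known_value)
        text += f" (reference {known_value:.2f}, absolute error {error:.2f})"
    return text


# Placeholder numbers for illustration only.
placeholder_dataset = [0.1, 0.2, 0.3, 0.4]
print(describe("query molecule", 0.31, placeholder_dataset))
```

When no trusted reference value is available, the percentile alone still tells the user whether the prediction is large or small relative to the training data.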

@github-project-automation github-project-automation bot moved this from Grants to Awarded Contributions in Zama Bounty and Grant Program Overview May 13, 2024
Projects
Status: Awarded Contributions

5 participants