LongCOVID

This repository shares code for long COVID prediction with a random forest classifier, as well as a univariate association analysis between proteomic features and long COVID labels.

Requirements

  • python 3.7.4
  • scikit-learn 1.1.3
  • pandas 1.5.2
  • scipy 1.9.3
  • numpy 1.23.5
  • shap 0.41.0
  • statsmodels 0.13.5
  • openpyxl 3.0.10
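
To compare an existing environment against the versions pinned above, a small check along these lines may help. It is only a sketch: the helper names `installed_version` and `find_mismatches` are hypothetical and not part of this repository, and the `pkg_resources` fallback assumes setuptools is installed (needed on Python 3.7, where `importlib.metadata` is not available).

```python
import sys

# Pinned versions copied from the Requirements list above.
REQUIRED = {
    "scikit-learn": "1.1.3",
    "pandas": "1.5.2",
    "scipy": "1.9.3",
    "numpy": "1.23.5",
    "shap": "0.41.0",
    "statsmodels": "0.13.5",
    "openpyxl": "3.0.10",
}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        # Available from Python 3.8 onwards.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for Python 3.7; assumes setuptools is installed.
        import pkg_resources

        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Map each package whose version differs to a (wanted, found) pair."""
    mismatches = {}
    for package, wanted in required.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "(pinned: 3.7.4)")
    for package, (wanted, found) in find_mismatches(REQUIRED).items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

An empty result from `find_mismatches(REQUIRED)` means every listed package is installed at the pinned version.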

Required input data

The input data required to execute these scripts can be obtained from image. Place these files in a folder named Data, which should contain:

  • Proteomics_Clinical_Data_220902_Acute_plus_healthy_v5.xlsx
  • Proteomics_Clinical_Data_220902_6M_timepoint_v4.xlsx
  • Proteomics_Clinical_Data_220902_Labels_v2.xlsx
  • Table S2 Biological protein cluster compositions.xlsx

Execution

The data splits we used are provided in partitions. The relevant label dictionaries must be generated from the label data file listed above. Run prediction_RF.py to generate model predictions. Association analyses for individual proteomic features and for clusters of features can be obtained with associationAnalysis.py and associationClusters.py, respectively. combineInterpreations.py combines the SHAP analysis results of multiple cross-validation folds.
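
This README does not state how the per-fold SHAP results are combined. One common approach, shown here purely as an illustrative assumption and not as the logic of combineInterpreations.py, is to compute the mean absolute SHAP value per feature within each fold and then average those values across folds. The function name `combine_fold_importances` and the toy inputs are hypothetical.

```python
import numpy as np


def combine_fold_importances(fold_shap_values, feature_names):
    """Rank features by mean |SHAP| averaged over cross-validation folds.

    fold_shap_values: one array per fold, shape (n_samples_in_fold, n_features).
    feature_names: sequence of n_features names, in column order.
    Returns (feature, importance) pairs sorted from most to least important.
    """
    # Mean absolute SHAP value per feature, computed separately for each fold.
    per_fold = np.stack(
        [np.abs(np.asarray(values, dtype=float)).mean(axis=0) for values in fold_shap_values]
    )
    # Average across folds so that every fold contributes equally.
    mean_importance = per_fold.mean(axis=0)
    order = np.argsort(mean_importance)[::-1]
    return [(feature_names[i], float(mean_importance[i])) for i in order]


if __name__ == "__main__":
    # Toy SHAP values for two folds and two hypothetical features.
    folds = [
        np.array([[1.0, -2.0], [3.0, -4.0]]),
        np.array([[0.0, 1.0]]),
    ]
    for name, importance in combine_fold_importances(folds, ["protein_a", "protein_b"]):
        print(name, importance)
```

Averaging per fold first gives each fold equal weight regardless of its test-set size. Pooling all samples before averaging is an equally defensible alternative.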

Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 813533 (K.B.).