Skip to content

JaneGovan/MEDSQ

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

41 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Dataset Sources

Our EduDiag and ScqTest datasets is constructed from two resources: MIMIC-CXR and Chest ImaGenome.

source

Statistics

Abnormal zones

The proportion of each abnormal zones in the entire datasets.

abnormal_zones

Reponses

The average number of words per response in each round for every template in the bilingual dataset EduDiag.

length_tokens

Options

The distribution of options for single-choice questions in the bilingual dataset ScqTest.

percent_options

Dataset Cases

EduDiag

English

EduDiag_case_en

Chinese

EduDiag_case_zh

ScqTest

English

ScqTest_case_en

Chinese

ScqTest_case_zh

Dataset Details

Each data in EduDiag contains image, report_en, qa_en, report_zh, qa_zh. image records the information contained in the patient's chest X-ray, image_path indicates the path of the image, reason_for_exam contains the patient's medical history and the purpose of the examination, bbox lists all anatomical locations with abnormalities and uses focuses to indicate specific abnormalities, and the remaining fields are directly derived from Chest ImaGenome. report_en and report_zh are cleaned English and Chinese medical reports respectively. qa_en and qa_zh contain multi-round question and answer of bilingual templates.

{
  "image": {
    "image_id": "19d2573b-bbbb5192-d992c5a2-7b72f28b-b6182646",
    "image_path": "img/19d2573b-bbbb5192-d992c5a2-7b72f28b-b6182646.jpg",
    "viewpoint": "AP",
    "patient_id": 19422157,
    "study_id": 53040876,
    "gender": "F",
    "patient_age": "40-50",
    "reason_for_exam": "A woman with severe upper abdominal pain s/p endoscopy.  // evaluate for free air.",
    "bbox": [
      {
        "bbox_name": "left lower lung zone",
        "original_x": 1364,
        "original_y": 1882,
        "original_width": 777,
        "original_height": 723,
        "x": 119,
        "y": 138,
        "width": 57,
        "height": 53,
        "focuses": [
          "atelectasis",
          ...
        ]
      },
      ...
    ]
  },
  "report_en": "...",
  "qa_en": [
    [
      {
        "Question": "...",
        "Answer": "..."
      },
      ...
    ],
    ...
  ],
  "report_zh": "...",
  "qa_zh": [...]
}

Each data of ScqTest contains image, report_en, test_en, report_zh, test_zh. image records the information contained in the patient's chest X-ray in the same way. test_en and test_zh are English and Chinese single-choice question banks.

{
  "image": {...},
  "report_en": "...",
  "test_en": [
    [
      {
          "Question": "...",
          "A": "...",
          "B": "...",
          "C": "...",
          "D": "...",
          "GT": "D"
      },
      ...
    ],
    ...
  ],
  "report_zh": "...",
  "test_zh": [...]
}

Access

Our datasets are available in the data directory. Both dataset EduDiag and dataset ScqTest are stored in json format. The loading and conversion methods are as follows:

from models.utils.convert_data import seed, read_json, convert_for_gen

# Set random seed
seed(42)

# Load data
data = read_json('data/EduDiag.json')

# Convert the original data into training format
train_dataset, _, _ = divide_data('data/EduDiag.json')
data = convert_for_gen(train_dataset, 'en')
to_trained_data(data, 'data/train_data.json')

Fine-tuning

Before running the code, you need to install the following dependencies:

pip install -r requirements.txt

Fine-tune using our dataset:

cd MEDSQ
bash models/scripts/finetune_lora.sh

Inference

Use the following code for inference, or run inference.py.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


instruction = 'FINDINGS:The lungs are clear of consolidation. Linear left basilar opacity is most likely atelectasis versus scarring. The cardiomediastinal silhouette is within normal limits. Median sternotomy wires are again noted. There is no free air below the diaphragm.IMPRESSION:No acute cardiopulmonary process. No free intraperitoneal air.\nBased on the above information, answer the question.\nQuestion: Please provide detailed and comprehensive diagnostic results.'
saved_model_path = 'MEDSQ'
tokenizer = AutoTokenizer.from_pretrained(saved_model_path)
model = AutoModelForCausalLM.from_pretrained(saved_model_path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)
response = model.chat(tokenizer, instruction, history=None, eos_token_id=2, pad_token_id=2, temperature=0.3, top_p=0.8, max_length=None, max_new_tokens=512)[0]
print(response)

Comparsion

Ablation study of filtering operations in abnormal area localization scenario.

human_score_detection

Assessment of medical students' satisfaction with individual model responses.

human_score_detection

Evaluation of the relevance of generated text answers to the Ground Truth.

human_score_answer

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published