Merge pull request #102 from biocypher/benchmark
Benchmark & RAG agent, architecture changes (potentially breaking → minor version increase)
Showing 64 changed files with 6,844 additions and 3,413 deletions.
Two files were deleted in this commit (their contents are not shown).
@@ -11,4 +11,5 @@ __pycache__/
.idea/
*.env
volumes/
benchmark/results/*.csv
benchmark/encrypted_llm_test_data.json
site/
@@ -0,0 +1,50 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
fail_fast: false
default_language_version:
  python: python3
default_stages:
  - commit
  - push
minimum_pre_commit_version: 2.7.1
repos:
  - repo: https://github.com/ambv/black
    rev: 23.7.0
    hooks:
      - id: black
  - repo: https://github.com/timothycrosley/isort
    rev: 5.12.0
    hooks:
      - id: isort
        additional_dependencies: [toml]
  - repo: https://github.com/snok/pep585-upgrade
    rev: v1.0
    hooks:
      - id: upgrade-type-hints
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-docstring-first
      - id: end-of-file-fixer
      - id: check-added-large-files
      - id: mixed-line-ending
      - id: trailing-whitespace
        exclude: ^.bumpversion.cfg$
      - id: check-merge-conflict
      - id: check-case-conflict
      - id: check-symlinks
      - id: check-yaml
        args: [--unsafe]
      - id: check-ast
      - id: fix-encoding-pragma
        args: [--remove]  # not necessary for a Python 3 codebase
      - id: requirements-txt-fixer
  - repo: https://github.com/pre-commit/pygrep-hooks
    rev: v1.10.0
    hooks:
      - id: python-no-eval
      - id: python-use-type-annotations
      - id: python-check-blanket-noqa
      - id: rst-backticks
      - id: rst-directive-colons
      - id: rst-inline-touching-normal
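This configuration only takes effect once the hooks are installed in a local clone. As a minimal sketch (not part of this commit), the standard pre-commit workflow can be driven from Python as below; the `pre-commit install` and `pre-commit run --all-files` commands are the tool's documented CLI, and the subprocess wrapper is purely illustrative.

# Illustrative only: invoke the pre-commit CLI for the config above.
# Assumes pre-commit is already installed (e.g. via `pip install pre-commit`).
import subprocess

# Register the hooks with git so they run on every commit in this clone.
subprocess.run(["pre-commit", "install"], check=True)

# Run every configured hook against the whole repository, as CI would.
subprocess.run(["pre-commit", "run", "--all-files"], check=True)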
@@ -0,0 +1,105 @@
import pytest

import pandas as pd


def benchmark_already_executed(
    model_name: str,
    task: str,
    subtask: str,
) -> bool:
    """
    Checks if the benchmark task and subtask test case for the model_name have
    already been executed.

    Args:
        task (str): The benchmark task, e.g. "biocypher_query_generation"
        subtask (str): The benchmark subtask test case, e.g. "0_entities"
        model_name (str): The model name, e.g. "gpt-3.5-turbo"

    Returns:
        bool: True if the benchmark task and subtask for the model_name have
        already been run, False otherwise
    """
    task_results = return_or_create_result_file(task)
    task_results_subset = (task_results["model_name"] == model_name) & (
        task_results["subtask"] == subtask
    )
    return task_results_subset.any()


def skip_if_already_run(
    model_name: str,
    task: str,
    subtask: str,
) -> None:
    """Helper function to check if the test case has already been executed.

    Args:
        model_name (str): The model name, e.g. "gpt-3.5-turbo"
        task (str): The benchmark task, e.g. "biocypher_query_generation"
        subtask (str): The benchmark subtask test case, e.g. "0_single_word"
    """
    if benchmark_already_executed(model_name, task, subtask):
        pytest.skip(
            f"benchmark {task}: {subtask} with {model_name} already executed"
        )


def return_or_create_result_file(
    task: str,
):
    """
    Returns the result file for the task or creates it if it does not exist.

    Args:
        task (str): The benchmark task, e.g. "biocypher_query_generation"

    Returns:
        pd.DataFrame: The result file for the task
    """
    file_path = get_result_file_path(task)
    try:
        results = pd.read_csv(file_path, header=0)
    except (pd.errors.EmptyDataError, FileNotFoundError):
        results = pd.DataFrame(
            columns=["model_name", "subtask", "score", "iterations"]
        )
        results.to_csv(file_path, index=False)
    return results


def write_results_to_file(
    model_name: str, subtask: str, score: str, iterations: str, file_path: str
):
    """Writes the benchmark results for the subtask to the result file.

    Args:
        model_name (str): The model name, e.g. "gpt-3.5-turbo"
        subtask (str): The benchmark subtask test case, e.g. "entities_0"
        score (str): The benchmark score, e.g. "1/1"
        iterations (str): The number of iterations, e.g. "1"
        file_path (str): The path to the result file
    """
    results = pd.read_csv(file_path, header=0)
    new_row = pd.DataFrame(
        [[model_name, subtask, score, iterations]], columns=results.columns
    )
    results = pd.concat([results, new_row], ignore_index=True).sort_values(
        by=["model_name", "subtask"]
    )
    results.to_csv(file_path, index=False)


# TODO should we use SQLite? An online database (REDIS)?
def get_result_file_path(file_name: str) -> str:
    """Returns the path to the result file.

    Args:
        file_name (str): The name of the result file

    Returns:
        str: The path to the result file
    """
    return f"benchmark/results/{file_name}.csv"