
Commit

test details
slobentanzer committed Feb 8, 2024
1 parent f1bb3c4 commit 488688b
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions content/20.results.md
@@ -77,7 +77,7 @@ For models that offer quantisation options, 4- and 5-bit models perform best, wh

To evaluate the benefit of BioChatter functionality, we compare the performance of models with and without the use of BioChatter's prompt engine for KG querying.
The models without prompt engine still have access to the BioCypher schema definition, which details the KG structure, but it does not use the multi-step procedure available through BioChatter.
- Consequently, the models without prompt engine show a lower performance in creating correct queries than the models with prompt engine (0.459±0.13 vs 0.813±0.15, p = 1.3e-20, Figure @fig:benchmark B).
+ Consequently, the models without prompt engine show a lower performance in creating correct queries than the same models with prompt engine (0.459±0.13 vs 0.813±0.15, unpaired t-test p = 1.3e-20, Figure @fig:benchmark B).

<!-- Figure 3 -->
![
@@ -87,7 +87,8 @@ While the closed-source models from OpenAI show consistently highest performance
However, the measured performance does not correlate intuitively with size (indicated by point size) and quantisation (bit-precision) of the models.
Some smaller models perform better than larger ones, even within the same model family; while very low bit-precision (2-bit) expectedly yields worse performance, the same is true for the high end (8-bit).
*: Of note, many characteristics of OpenAI models are not public, and thus their bit-precision (as well as the exact size of GPT4) is subject to speculation.
- B) Comparison of the two benchmark tasks for KG querying show the superior performance of BioChatter's prompt engine (0.813±0.15 vs 0.459±0.13, p = 1.3e-20).
+ B) Comparison of the two benchmark tasks for KG querying show the superior performance of BioChatter's prompt engine (0.813±0.15 vs 0.459±0.13, unpaired t-test p = 1.3e-20).
+ The test includes all models, sizes, and quantisation levels, and the performance is measured as the average of the two tasks.
The BioChatter variant involves a multi-step procedure of constructing the query, while the "naive" version only receives the complete schema definition of the BioCypher KG (which BioChatter also uses as a basis for the prompt engine).
The general instructions for both variants are the same, otherwise.
](images/biochatter_benchmark.png "Benchmark results"){#fig:benchmark}
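
For orientation, the comparison described in this hunk boils down to the two ways of generating a KG query sketched below. This is a minimal illustrative sketch, not the benchmark code: the `BioCypherPromptEngine` class and its `generate_query` method follow the BioChatter documentation, but the exact import path, argument names, file paths, and the placeholder LLM call are assumptions and may differ between BioChatter versions.

```python
# Minimal sketch of the two variants compared in panel B; not the benchmark
# implementation. Names and paths are placeholders/assumptions.

from biochatter.prompts import BioCypherPromptEngine  # assumed import path

question = "Which genes are associated with mucoviscidosis?"  # example question

# BioChatter variant: the multi-step prompt engine first selects relevant
# entities, relationships, and properties from the BioCypher schema, then
# asks the model to compose the query from that reduced context.
prompt_engine = BioCypherPromptEngine(
    schema_config_or_info_path="schema_config.yaml",  # BioCypher schema definition
)
biochatter_query = prompt_engine.generate_query(
    question=question,
    query_language="Cypher",
)

# "Naive" variant: a single prompt that pastes the complete schema definition
# next to the question and asks for a Cypher query directly, with otherwise
# the same general instructions.
with open("schema_config.yaml") as handle:
    schema_yaml = handle.read()

naive_prompt = (
    "Using only the entities, relationships, and properties defined in the "
    "following BioCypher schema, write a Cypher query that answers the "
    f"question.\n\nSchema:\n{schema_yaml}\n\nQuestion: {question}"
)
# naive_query = llm_client.chat(naive_prompt)  # hypothetical call; any LLM client
```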
