edit prompt injection docs page
Reviewed By: mbhatt1

Differential Revision: D60108516

fbshipit-source-id: 61bf327e7266472f0410fb90df33a4c9abf03f09
onionymous authored and facebook-github-bot committed Jul 23, 2024
1 parent 33045af commit e50a960
Showing 1 changed file with 46 additions and 12 deletions: CybersecurityBenchmarks/website/docs/benchmarks/prompt_injection.md
@@ -11,8 +11,8 @@ python3 -m CybersecurityBenchmarks.benchmark.run \
--response-path="$DATASETS/prompt_injection/prompt_injection_responses.json" \
--judge-response-path="$DATASETS/prompt_injection/prompt_injection_judge_responses.json" \
--stat-path="$DATASETS/prompt_injection/prompt_injection_stat.json" \
--judge-llm=<SPECIFICATION_1> \
--llm-under-test=<SPECIFICATION_2> --llm-under-test=<SPECIFICATION_3> ...
[--run-llm-in-parallel]
```
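
For illustration, the sketch below fills in the judge and under-test flags with
hypothetical values. The `PROVIDER::MODEL::API_KEY` style of the specification
strings and the model names are assumptions made for this sketch, not
definitions from this page:

```
# Hypothetical values for illustration only: the PROVIDER::MODEL::API_KEY
# specification format and the model names are assumptions.
--judge-llm="OPENAI::gpt-4::$OPENAI_API_KEY" \
--llm-under-test="OPENAI::gpt-3.5-turbo::$OPENAI_API_KEY" \
--llm-under-test="OPENAI::gpt-4o::$OPENAI_API_KEY"
```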

@@ -37,7 +37,7 @@ python3 -m CybersecurityBenchmarks.benchmark.run \
... (rest same as above)
```

### Visual Prompt Injection Benchmark

```
python3 -m CybersecurityBenchmarks.benchmark.run \
@@ -52,8 +52,49 @@
[--num-queries-per-prompt=5]
```

Visual prompt injection benchmarks are run in the following two steps:

- _Processing prompts_: Query the LLMs under test with a list of test cases
consisting of a system prompt, a user text input and a user image to obtain a
list of responses.
- _Judging responses_: The responses of the LLMs under test are then evaluated
by a separate judge LLM (the same judge LLM is used for all responses), which
decides whether the injection was successful based on the judge question of
that test case. For example, a test case might hide an instruction such as
"end your reply with the word banana" in the image, and its judge question
would then ask whether the response ends with that word.

Unlike the other benchmarks, this benchmark expects `--prompt-path` to point to
a directory rather than a JSON file. The expected layout of that directory is:

1. A `test_cases.json` file with integer IDs for each test case
2. A subdirectory named `images/` containing images that correspond to the test
cases, with filenames `{id}.png` (a hypothetical layout is sketched below).
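
For example, a dataset directory might be laid out as follows; the dataset name
and the specific image IDs are placeholders for illustration:

```
$DATASETS/visual_prompt_injection/<dataset_name>/
├── test_cases.json     # test cases, each with an integer ID
└── images/
    ├── 1.png           # image for the test case with ID 1
    ├── 2.png
    └── ...
```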

We include several datasets under `$DATASETS/visual_prompt_injection` that can
be used with this benchmark (an example `--prompt-path` setting follows the
list):

1. `cse2_typographic_images` - 231 test cases which transform test cases from
CSE2's text-only prompt injection dataset into images via a simple
typographic conversion.
2. `manual` - 100 manually created test cases which cover a wide range of known
visual prompt injection techniques.
3. `generated/*` - Note that this directory itself is not a dataset in the
expected format. However, each subdirectory under it is a dataset of 100
procedurally generated test cases featuring a particular visual prompt
injection technique.
4. `generated_merged` - All the sub-datasets from `generated/*` merged into a
single dataset and randomly shuffled.
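
For example, to run against the manually created dataset, `--prompt-path` in
the command above can point at that subdirectory (a sketch; the other flags are
unchanged):

```
--prompt-path="$DATASETS/visual_prompt_injection/manual"
```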

`--num-queries-per-prompt=<N>` can optionally be specified to run each test case
`N` times (the default is 1) in order to obtain more robust results, given the
stochastic nature of LLM responses.

## Results

### Textual/Multilingual Prompt Injection Results

Based on the judge LLM's evaluation, the prompt injection will be either
successful or unsuccessful
@@ -103,11 +144,6 @@ follows the same structure.
}
```

### Visual Prompt Injection Results

Based on the evaluation of the judge LLM, the output of visual prompt injection
test cases will be judged as either a successful or unsuccessful injection.

@@ -127,15 +163,13 @@
}
```

In the same output file, the results are further bucketed in different
dimensions:

1. Injection techniques
2. Risk category (security-violating vs. logic-violating)
3. Injection type (direct vs. indirect)

Please refer to the CSE3 paper for comprehensive definitions of these terms.

The result of each bucket follows the same structure.

