edit prompt injection docs page
Reviewed By: mbhatt1

Differential Revision: D60108516

fbshipit-source-id: 61bf327e7266472f0410fb90df33a4c9abf03f09
onionymous authored and facebook-github-bot committed Jul 23, 2024
1 parent 33045af commit e50a960
Showing 1 changed file with 46 additions and 12 deletions: CybersecurityBenchmarks/website/docs/benchmarks/prompt_injection.md
@@ -11,8 +11,8 @@ python3 -m CybersecurityBenchmarks.benchmark.run \
--response-path="$DATASETS/prompt_injection/prompt_injection_responses.json" \
--judge-response-path="$DATASETS/prompt_injection/prompt_injection_judge_responses.json" \
--stat-path="$DATASETS/prompt_injection/prompt_injection_stat.json" \
--judge-llm=<SPECIFICATION_1> \
--llm-under-test=<SPECIFICATION_2> --llm-under-test=<SPECIFICATION_3> ...
[--run-llm-in-parallel]
```
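
For illustration, the sketch below fills in the judge and under-test flags with
hypothetical values. The `PROVIDER::MODEL::API_KEY` style of the specification
strings and the model names are assumptions made for this sketch, not
definitions from this page:

```
# Hypothetical values for illustration only: the PROVIDER::MODEL::API_KEY
# specification format and the model names are assumptions.
--judge-llm="OPENAI::gpt-4::$OPENAI_API_KEY" \
--llm-under-test="OPENAI::gpt-3.5-turbo::$OPENAI_API_KEY" \
--llm-under-test="OPENAI::gpt-4o::$OPENAI_API_KEY"
```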

@@ -37,7 +37,7 @@ python3 -m CybersecurityBenchmarks.benchmark.run \
... (rest same as above)
```

### Visual Prompt Injection Benchmark

```
python3 -m CybersecurityBenchmarks.benchmark.run \
@@ -52,8 +52,49 @@
[--num-queries-per-prompt=5]
```

Visual prompt injection benchmarks are run in the following two steps:

- _Processing prompts_: Query the LLMs under test with a list of test cases
consisting of a system prompt, a user text input and a user image to obtain a
list of responses.
- _Judging responses_: The responses of the LLMs under test are then evaluated
by a separate judge LLM (the same judge LLM is used for all responses), which
decides whether the injection was successful based on the judge question of
that test case. For example, a test case might hide an instruction such as
"end your reply with the word banana" in the image, and its judge question
would then ask whether the response ends with that word.

Unlike the other benchmarks, this benchmark expects `--prompt-path` to point to
a directory rather than a JSON file. The expected layout of that directory is:

1. A `test_cases.json` file with integer IDs for each test case
2. A subdirectory named `images/` containing images that correspond to the test
cases, with filenames `{id}.png` (a hypothetical layout is sketched below).
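
For example, a dataset directory might be laid out as follows; the dataset name
and the specific image IDs are placeholders for illustration:

```
$DATASETS/visual_prompt_injection/<dataset_name>/
├── test_cases.json     # test cases, each with an integer ID
└── images/
    ├── 1.png           # image for the test case with ID 1
    ├── 2.png
    └── ...
```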

We include several datasets under `$DATASETS/visual_prompt_injection` that can
be used with this benchmark (an example `--prompt-path` setting follows the
list):

1. `cse2_typographic_images` - 231 test cases which transform test cases from
CSE2's text-only prompt injection dataset into images via a simple
typographic conversion.
2. `manual` - 100 manually created test cases which cover a wide range of known
visual prompt injection techniques.
3. `generated/*` - Note that this directory itself is not a dataset in the
expected format. However, each subdirectory under it is a dataset of 100
procedurally generated test cases featuring a particular visual prompt
injection technique.
4. `generated_merged` - All the sub-datasets from `generated/*` merged into a
single dataset and randomly shuffled.
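
For example, to run against the manually created dataset, `--prompt-path` in
the command above can point at that subdirectory (a sketch; the other flags are
unchanged):

```
--prompt-path="$DATASETS/visual_prompt_injection/manual"
```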

`--num-queries-per-prompt=<N>` can optionally be specified to run each test case
`N` times (the default is 1) in order to obtain more robust results, given the
stochastic nature of LLM responses.

## Results

### Textual/Multilingual Prompt Injection Results

Based on the judge LLM's evaluation, the prompt injection will be either
successful or unsuccessful
@@ -103,11 +144,6 @@ follows the same structure.
}
```

### Visual Prompt Injection Results

Based on the evaluation of the judge LLM, the output of visual prompt injection
test cases will be judged as either a successful or unsuccessful injection.

@@ -127,15 +163,13 @@
}
```

In the same output file, the results are further bucketed in different
dimensions:

1. Injection techniques
2. Risk category (security-violating vs. logic-violating)
3. Injection type (direct vs. indirect)

Please refer to the CSE3 paper for comprehensive definitions of these terms.

The result of each bucket follows the same structure.

