Refinements to prompt injection tests and benchmark
Summary:
- Refine tests to be more specific and accurate in some cases, based on an initial manual investigation of results.
- Use Llama-70b as the judge model for best results.
- Add example results to the internal datasets folder.
- Some logging and quality-of-life (QOL) changes.

Reviewed By: SimonWan

Differential Revision: D53904223

fbshipit-source-id: 2e2efff745d1f1ffb65640e3c35348f07861b953
cynikolai authored and facebook-github-bot committed Feb 22, 2024
1 parent e1f3762 commit e6f463e
Showing 1 changed file with 1 addition and 1 deletion.
CybersecurityBenchmarks/benchmark/llm.py (1 addition, 1 deletion)

@@ -74,7 +74,7 @@ def _query_with_retries(
             last_exception = exception
             sleep_time = backoff_factor * (2**retry)
             time.sleep(sleep_time)
-            LOG.warning(
+            LOG.debug(
                 f"LLM Query failed with error: {exception}. Sleeping for {sleep_time} seconds..."
             )
     raise RuntimeError(
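
For context, the hunk above sits inside a retry loop that backs off exponentially between attempts; the only change in this commit is demoting the per-retry log message from warning to debug. Below is a minimal, self-contained sketch of that pattern. The loop body and log message mirror the hunk, but the function name, signature, parameter defaults, and the final error message are assumptions for illustration, not the exact code in llm.py.

import logging
import time
from typing import Callable, TypeVar

LOG: logging.Logger = logging.getLogger(__name__)

T = TypeVar("T")


def query_with_retries(
    func: Callable[..., T],
    *args: object,
    retries: int = 3,
    backoff_factor: float = 0.5,
    **kwargs: object,
) -> T:
    """Call func, retrying with exponential backoff on any exception."""
    last_exception = None
    for retry in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception as exception:  # retry on any failure
            last_exception = exception
            # Exponential backoff: 0.5s, 1s, 2s, ... with the default factor.
            sleep_time = backoff_factor * (2**retry)
            time.sleep(sleep_time)
            # Demoted from warning to debug in this commit, so transient
            # per-retry failures no longer clutter benchmark output.
            LOG.debug(
                f"LLM Query failed with error: {exception}. "
                f"Sleeping for {sleep_time} seconds..."
            )
    raise RuntimeError(
        f"Unable to query LLM after {retries} retries: {last_exception}"
    )

For example, query_with_retries(call_my_llm, prompt) would retry a flaky call_my_llm up to three times, sleeping 0.5s, 1s, then 2s after each failure, and only surface per-retry failures at debug verbosity.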
