diff --git a/docs/approval.qmd b/docs/approval.qmd index 73af3f6d7..ad9af4909 100644 --- a/docs/approval.qmd +++ b/docs/approval.qmd @@ -33,13 +33,14 @@ You can chain together the `human` and `auto` approvers in an *approval polic ``` yaml approvers: - name: human - tools: ["web_browser_click", "web_browser_type*"] + tools: ["web_browser_click", "web_browser_type"] - name: auto tools: "*" ``` -Navigational web browser tool calls (e.g. `web_browser_go`) are approved automatically via the catch-all `auto` approver at the end of the chain. Note that when listing an approver in a policy you indicate which tools it should handle using a glob or list of globs. + +Navigational web browser tool calls (e.g. `web_browser_go`) are approved automatically via the catch-all `auto` approver at the end of the chain. Note that when listing an approver in a policy you indicate which tools it should handle using a glob or list of globs. These globs are prefix matched, so the `web_browser_type` glob matches both `web_browser_type` and `web_browser_type_submit`. To use this policy, pass the path to the policy YAML file as the approver. For example: @@ -47,6 +48,24 @@ To use this policy, pass the path to the policy YAML file as the approver. For e inspect eval browser.py --approval approval.yaml ``` +You can also match on tool arguments (for tools that dispatch many action types). For example, here is an approval policy for the [Computer Tool](tools.qmd#sec-computer) which allows typing and mouse movement but requires approval for key combos (e.g. Enter or a shortcut) and mouse clicks: + + +```{.yaml filename="approval.yaml"} +approvers: + - name: human + tools: + - computer(action='key' + - computer(action='left_click' + - computer(action='middle_click' + - computer(action='double_click' + + - name: auto + tools: "*" +``` + +Note that since this is a prefix match and there could be other arguments, we don't end the tool match pattern with a closing parenthesis.
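The prefix matching rule can be sketched in a few lines of stand-alone Python (this is an illustration of the matching semantics described above, not Inspect's actual implementation; the `tool_matches` helper is hypothetical):

```python
from fnmatch import fnmatchcase

def tool_matches(pattern: str, call: str) -> bool:
    # Hypothetical sketch: a tool call matches an approver pattern if the
    # glob matches some prefix of the call string (so "web_browser_type"
    # covers both "web_browser_type" and "web_browser_type_submit").
    return any(fnmatchcase(call[:i], pattern) for i in range(len(call) + 1))

assert tool_matches("web_browser_type", "web_browser_type_submit")
assert tool_matches("computer(action='key'", "computer(action='key', text='Enter')")
assert not tool_matches("web_browser_type", "web_browser_click")
```

Under this reading, the `human` approver above fields any `computer(action='key', ...)` call, while everything else falls through to the catch-all `auto` approver.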
+ ## Approvers in Code We've demonstrated configuring approvers via a YAML approval policy file—you can also provide a policy directly in code (useful if it needs to be more dynamic). Here's a pure Python version of the example from the previous section: @@ -152,7 +171,7 @@ Assuming we have properly [registered our approver](extensions.qmd#sec-extension ``` yaml approvers: - name: evaltools/bash_allowlist - tools: "*bash*" + tools: "bash" allowed_commands: ["ls", "echo", "cat"] - name: human diff --git a/docs/images/vnc-port-info.png b/docs/images/vnc-port-info.png new file mode 100644 index 000000000..0ca1b85de Binary files /dev/null and b/docs/images/vnc-port-info.png differ diff --git a/docs/images/vnc-view-only.png b/docs/images/vnc-view-only.png new file mode 100644 index 000000000..bbe1a580d Binary files /dev/null and b/docs/images/vnc-view-only.png differ diff --git a/docs/tools.qmd b/docs/tools.qmd index 1e031c6df..01065ff0c 100644 --- a/docs/tools.qmd +++ b/docs/tools.qmd @@ -6,7 +6,7 @@ title: Tools Many models now have the ability to interact with client-side Python functions in order to expand their capabilities. This enables you to equip models with your own set of custom tools so they can perform a wider variety of tasks. -Inspect natively supports registering Python functions as tools and providing these tools to models that support them (currently OpenAI, Claude 3, Google Gemini, and Mistral). Inspect also includes several built-in tools ([bash](#sec-bash-and-python), [python](#sec-bash-and-python), and [web_search](#sec-web-search)). +Inspect natively supports registering Python functions as tools and providing these tools to models that support them (currently OpenAI, Claude 3, Google Gemini, and Mistral). Inspect also includes several built-in tools ([bash](#sec-bash-and-python), [python](#sec-bash-and-python), [computer](#sec-computer), [web browser](#sec-web-browser), and [web_search](#sec-web-search)). 
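To make "registering Python functions as tools" concrete, here is a stand-alone sketch of how a typed function can be turned into a model-facing tool description (illustrative only; `tool_schema` is a hypothetical helper, not Inspect's registration code):

```python
import inspect
from typing import get_type_hints

def tool_schema(func) -> dict:
    # Describe a function to a model: its name, docstring, and the
    # declared type of each parameter.
    hints = get_type_hints(func)
    parameters = {
        name: hints.get(name, str).__name__
        for name in inspect.signature(func).parameters
    }
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": parameters,
    }

def add(x: int, y: int) -> int:
    """Add two numbers."""
    return x + y

schema = tool_schema(add)
```

The model is shown a description like this and replies with a structured call (e.g. `add(x=1, y=2)`), which the framework executes on the model's behalf, returning the result as a tool message.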
::: callout-note ### Tools and Agents @@ -22,6 +22,8 @@ Inspect has several built-in tools, including: - [Web Browser](#sec-web-browser), which provides the model with a headless Chromium web browser that supports navigation, history, and mouse/keyboard interactions. +- [Computer](#sec-computer), which provides the model with a desktop computer (viewed through screenshots) that supports mouse and keyboard interaction. + - [Web Search](#sec-web-search), which uses the Google Search API to execute and summarise web searches. If you are only interested in using the built-in tools, check out their respective documentation links above. To learn more about creating your own tools read on immediately below. @@ -371,16 +373,16 @@ Note that unlike some other tool functions like `bash()`, the `web_browser()` fu If you review the transcripts of a sample with access to the web browser tool, you'll notice that there are several distinct tools made available for control of the web browser. These tools include: -| Tool | Description | +| Tool | Description | |------------------------------------|------------------------------------| -| `web_browser_go(url)` | Navigate the web browser to a URL. | -| `web_browser_click(element_id)` | Click an element on the page currently displayed by the web browser. | -| `web_browser_type(element_id)` | Type text into an input on a web browser page. | +| `web_browser_go(url)` | Navigate the web browser to a URL. | +| `web_browser_click(element_id)` | Click an element on the page currently displayed by the web browser. | +| `web_browser_type(element_id)` | Type text into an input on a web browser page. | | `web_browser_type_submit(element_id, text)` | Type text into a form input on a web browser page and press ENTER to submit the form. | -| `web_browser_scroll(direction)` | Scroll the web browser up or down by one page. | -| `web_browser_forward()` | Navigate the web browser forward in the browser history. 
| -| `web_browser_back()` | Navigate the web browser back in the browser history. | -| `web_browser_refresh()` | Refresh the current page of the web browser. | +| `web_browser_scroll(direction)` | Scroll the web browser up or down by one page. | +| `web_browser_forward()` | Navigate the web browser forward in the browser history. | +| `web_browser_back()` | Navigate the web browser back in the browser history. | +| `web_browser_refresh()` | Refresh the current page of the web browser. | : {tbl-colwidths=\[35,65\]} @@ -420,6 +422,162 @@ CMD ["python3", "/app/web_browser/web_server.py"] Note that all of the Python files in the [\_resources](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/src/inspect_ai/tool/_tools/_web_browser/_resources/) directory alongside the `Dockerfile` need to be available for copying when building the container. +## Computer (Beta) {#sec-computer} + +::: {.callout-note appearance="simple"} +The beta version of the computer tool described below is currently available only in the development version of Inspect. To install the development version: + +``` bash +pip install git+https://github.com/UKGovernmentBEIS/inspect_ai +``` +::: + +The `computer()` tool provides models with a computer desktop environment along with the ability to view the screen and perform mouse and keyboard gestures. The computer tool is based on the Anthropic [Computer Use Beta](https://docs.anthropic.com/en/docs/build-with-claude/computer-use) reference implementation and works with any model that supports image input. + + The current release of the computer tool is a beta version (exported from the `inspect_ai.tool.beta` module). We expect to finalise the interface and move it into the main `inspect_ai.tool` module over the next several weeks. + +### Configuration + +The `computer()` tool runs within a Docker container. To use it with a task you need to reference the `inspect-computer-tool-beta` image in your Docker compose file. 
For example: + +``` {.yaml filename="compose.yaml"} +services: + default: + image: inspect-computer-tool-beta +``` + +You can configure the container to not have Internet access as follows: + +``` {.yaml filename="compose.yaml"} +services: + default: + image: inspect-computer-tool-beta + network_mode: none +``` + +Note that if you'd like to be able to view the model's interactions with the computer desktop in real time, you will also need to do some port mapping to enable a VNC connection with the container. See the [VNC Client](#vnc-client) section below for details on how to do this. + +The `inspect-computer-tool-beta` image is based on the [ubuntu:22.04](https://hub.docker.com/layers/library/ubuntu/22.04/images/sha256-965fbcae990b0467ed5657caceaec165018ef44a4d2d46c7cdea80a9dff0d1ea?context=explore) image and includes the following additional applications pre-installed: + +- Firefox +- VS Code +- Xpdf +- Xpaint +- galculator + +We'll be refining this list as well as publishing more information on creating custom containers for use with the computer tool soon. + +### Task Setup + +A task configured to use the computer tool might look like this: + +``` python +from inspect_ai import Task, task +from inspect_ai.scorer import match +from inspect_ai.solver import generate, use_tools +from inspect_ai.tool.beta import computer + +@task +def computer_task(): + return Task( + dataset=read_dataset(), + solver=[ + use_tools([computer()]), + generate(), + ], + scorer=match(), + sandbox=("docker", "compose.yaml"), + ) +``` + +Two of the Inspect examples demonstrate basic computer use: + +- [computer](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/examples/computer/computer.py) — Three simple computing tasks as a minimal demonstration of computer use.
+ + ``` bash + inspect eval examples/computer + ``` + +- [intervention](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/examples/intervention/intervention.py) — Computer task driven interactively by a human operator. + + ``` bash + inspect eval examples/intervention -T mode=computer --display conversation + ``` + +### VNC Client {#vnc-client} + +You can use a [VNC](https://en.wikipedia.org/wiki/VNC) connection to the container to watch computer use in real time. This requires some additional port mapping in the Docker compose file. You can define dynamic port ranges for VNC (5900) and a browser-based noVNC client (6080) with the following `ports` entries: + +``` {.yaml filename="compose.yaml"} +services: + default: + image: inspect-computer-tool-beta + ports: + - "5900" + - "6080" +``` + +To connect to the container for a given sample, locate the sample in the **Running Samples** UI and expand the sample info panel at the top: + +![](images/vnc-port-info.png){width=958 .lightbox} + +Click on the link for the noVNC browser client, or use a native VNC client to connect to the VNC port. Note that the VNC server takes a few seconds to start up, so give it some time and reconnect as required if the first connection fails. + +The browser-based client provides a view-only interface. If you use a native VNC client you should also set it to "view only" so as not to interfere with the model's use of the computer. For example, for Real VNC Viewer: + +![](images/vnc-view-only.png){width="549"} + +### Approval + +If the container you are using is connected to the Internet, you may want to configure human approval for a subset of computer tool actions. Here are the possible actions (specified using the `action` parameter to the `computer` tool): + +- `key`: Press a key or key combination on the keyboard. +- `type`: Type a string of text on the keyboard. +- `cursor_position`: Get the current (x, y) pixel coordinate of the cursor on the screen.
+- `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen (e.g. `action="mouse_move", coordinate=(100, 200)`). +- `left_click`: Click the left mouse button. +- `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen. +- `right_click`: Click the right mouse button. +- `middle_click`: Click the middle mouse button. +- `double_click`: Double-click the left mouse button. +- `screenshot`: Take a screenshot. + + +Here is an approval policy that requires approval for key combos (e.g. `Enter` or a shortcut) and mouse clicks: + +```{.yaml filename="approval.yaml"} +approvers: + - name: human + tools: + - computer(action='key' + - computer(action='left_click' + - computer(action='middle_click' + - computer(action='double_click' + + - name: auto + tools: "*" +``` + +Note that since this is a prefix match and there could be other arguments, we don't end the tool match pattern with a closing parenthesis. + +You can apply this policy using the `--approval` command line option: + +```bash
+inspect eval computer.py --approval approval.yaml
+```
+ +### Tool Binding + +The computer tool's schema is based on the standard Anthropic [computer tool-type](https://docs.anthropic.com/en/docs/build-with-claude/computer-use#computer-tool). When using Claude 3.5, the computer tool will automatically bind to the native Claude computer tool definition. This presumably provides improved performance due to fine-tuning on the use of the tool, but we have not verified this. + +If you want to experiment with bypassing the native Claude computer tool type and instead register the computer tool as a normal function-based tool, specify the `--no-internal-tools` generation option as follows: + +```bash
+inspect eval computer.py --no-internal-tools
+```
+ + ## Web Search {#sec-web-search} The `web_search()` tool provides models with the ability to enhance their context window by performing a search.
By default, web search retrieves 10 results from a provider, uses a model to determine whether the content is relevant, and then returns the top 3 relevant results to the main model. Here is the definition of the `web_search()` function: @@ -465,3 +623,4 @@ The `web_search()` tool uses [Google Programmable Search Engine](https://program - `GOOGLE_CSE_ID` — Google Custom Search Engine ID - `GOOGLE_CSE_API_KEY` — Google API key used to enable the Search API + diff --git a/examples/computer/compose.yaml b/examples/computer/compose.yaml new file mode 100644 index 000000000..7e2de2e8c --- /dev/null +++ b/examples/computer/compose.yaml @@ -0,0 +1,24 @@ +services: + default: + # Temporary internal image until the official one is available + image: inspect-computer-tool-beta + init: true + + # If you only launch a single container, you can vnc into that container by using + # the following port mapping + # ports: + # - "5900:5900" + # - "6080:6080" + + # If you launch multiple containers, you can vnc into each container by using the + # following port mapping which will dynamically bind to host ports. The specific + # bindings can be found by using `docker inspect `. This + # info is included in the Running Samples tab.
The output will look something like: + # + # service container port host port url + # VNC 5900 61029 vnc://localhost:61029 + # noVNC 6080 61030 http://localhost:61030?view_only=true&autoconnect=true + + ports: + - "5900" + - "6080" diff --git a/examples/computer/computer.py b/examples/computer/computer.py new file mode 100644 index 000000000..4abc07373 --- /dev/null +++ b/examples/computer/computer.py @@ -0,0 +1,50 @@ +from inspect_ai import Task, task +from inspect_ai.dataset import Sample +from inspect_ai.scorer import includes +from inspect_ai.solver import system_message +from inspect_ai.solver._basic_agent import basic_agent +from inspect_ai.tool.beta import computer + +SYSTEM_MESSAGE = """ +Before each step, please clearly explain your intent for performing a tool action: "I expect tool command X to ...". + +After each step, carefully evaluate the resulting screenshot to see if the command achieved the right outcome. + +Explicitly show your thinking: "I have evaluated step X..." If not correct, try again. Only when +you confirm a step was executed correctly should you move on to the next one. + +Note that launching applications from the bottom task bar requires a single left click. +""" + + +@task +def computer_task(): + return Task( + dataset=[ + Sample( + input="Use the computer tool to find the contents of the file '/tmp/flag.txt'. When you've found it, report its contents.", + target="Frunobulax", + files={"/tmp/flag.txt": "Frunobulax"}, + ), + Sample( + input="Use the computer tool and open the spreadsheet '/tmp/moonWeight.ods'. Enter your weight as 156 lbs. Report the result from cell B2. Important: If the 'Tip of the Day' dialog appears, you'll need to close it before proceeding. Important: You may need to install a spreadsheet using a command like 'sudo apt-get install -y libreoffice'.", + target="26", + files={"/tmp/moonWeight.ods": "moonWeight.ods"}, + ), + Sample( + input="Use the computer tool to launch a terminal.
Type 'Trudging across the tundra. Mile after Mile.' into the terminal. Important: Make sure that the terminal window is active before typing. When you are done, please use the submit tool to record the result of hitting enter in the terminal after entering that text.", + target="bash: Trudging: command not found", + ), + Sample( + input="Use the computer tool to launch a calculator. Calculate 123 x 456. Report the result.", + target="56088", + ), + ], + solver=basic_agent( + init=system_message(SYSTEM_MESSAGE), + tools=[computer()], + max_messages=100, + ), + scorer=includes(), + sandbox="docker", + ) diff --git a/examples/computer/moonWeight.ods b/examples/computer/moonWeight.ods new file mode 100644 index 000000000..067bcace4 Binary files /dev/null and b/examples/computer/moonWeight.ods differ diff --git a/examples/intervention/README.md b/examples/intervention/README.md index f62380871..55d19be50 100644 --- a/examples/intervention/README.md +++ b/examples/intervention/README.md @@ -1,164 +1,47 @@ -# Intervention Mode Demo +# Intervention Demo ## Introduction -This is a prototype of an Inspect agent with human intervention. It utilises Inspect's [Interactivity features](https://inspect.ai-safety-institute.org.uk/interactivity.html). This is meant to serve as a starting point for evaluations which need these features, such as manual open-ended probing. +This is a prototype of an Inspect agent running in a Linux sandbox with human intervention. It utilises Inspect's [Interactivity features](https://inspect.ai-safety-institute.org.uk/interactivity.html). This is meant to serve as a starting point for evaluations which need these features, such as manual open-ended probing. -This gives an overview of this task and how to customise it. Note that this task is intended to be run within Inspect's approval mode (so you can confirm tool calls) and conversation display mode (so you can see the conversation with the model). 
For example: +## Usage Modes -``` bash -inspect eval exmaples/intervention.py --approval human --display conversation -``` - - -## Inspect Code - -### Task Definition - -At the top of `intervention.py`, we have our Inspect task: - -``` python -@task -def intervention(): - return Task( - solver=[ - system_prompt(), - user_prompt(), - use_tools([bash(), python()]), - agent_loop(), - ], - sandbox="docker", - ) -``` - -As you can see, this is just a regular Inspect evaluation. It should work with any model, and similar features can be integrated into any Inspect evaluation that would benefit from human intervention. - -A few things to note here: - -- To add more tools, just pass them to the `agent_loop()`. -- You may want to customise the system prompt, which can be found in the `system_prompt()` solver. -- It uses an agent loop using the [lower level `model.generate()` API.](https://inspect.ai-safety-institute.org.uk/agents-api.html) - -### User Prompt - -First, we ask the user to enter a prompt for the model: +Two modes are supported: `shell` mode equips the model with bash and python tools, and `computer` mode provides it with a full desktop computer. To run in the (default) shell mode, use this (note we also specify `--display=conversation` to print all of the user and assistant messages to the terminal): -``` python -@solver -def user_prompt() -> Solver: - async def solve(state: TaskState, generate: Generate) -> TaskState: - with input_screen("User Prompt") as console: - state.user_prompt.content = Prompt.ask( - "Please enter your initial prompt for the model:\n\n", console=console - ) - return state - - return solve -``` - -If you have a static prompt you want to use, or a dataset of these, you could put those in your `Task` `Dataset` and then skip the use of this solver. - -### Agent Loop - -The agent loop is [just a plain Inspect solver.](https://inspect.ai-safety-institute.org.uk/solvers.html). 
It executes `generate()` which handles tool calls and updating the `TaskState` with new messages. Since we are running in trace mode all messages exchanged with the model will be printed. As a result of running with the human approver the user will be prompted to approve all tools calls. - -``` python -@solver -def agent_loop() -> Solver: - async def solve(state: TaskState, generate: Generate) -> TaskState: - while not state.completed: - # generate w/ tool calls, approvals, etc. - state = await generate(state) - - # prompt for next action - next_action = ask_for_next_action() - with input_screen(): - match next_action.strip().lower(): - case "exit": - break - case "": - state.messages.append( - ChatMessageUser( - content="Please continue working on this task." - ) - ) - continue - case _: - state.messages.append(ChatMessageUser(content=next_action)) - - return state - - return solve +``` bash +inspect eval examples/intervention.py --display conversation ``` -Once the model stops calling tools, the user will get a choice to terminate the conversation, force another generation, or send a new user message to the model. - -## Sandboxing - -This evaluation is sandboxed in a Docker container. We can customize the Docker container for many different types of evaluations as is discussed in the [Inspect Sandboxing documentation](https://inspect.ai-safety-institute.org.uk/sandboxing.html). - -### The Dockerfile +To run in computer mode, use the `mode` task parameter: -The Dockerfile is pretty simple for this demo. We use a Ubuntu base image with some common Python packages installed as well as cURL. Feel free to install other packages you want. 
- -``` dockerfile -FROM ubuntu:24.04 - -# Update the package lists -RUN apt-get update -y && apt-get upgrade -y - -# Install any necessary packages -RUN apt-get install -y curl python3 python3-pip python3-dev python3-venv +``` bash +inspect eval examples/intervention.py -T mode=computer --display conversation ``` -We then set up a virtual environment and add it to the `PATH` so that the agent doesn't need to worry about that. +See the documentation on the [Computer Tool](https://inspect.ai-safety-institute.org.uk/tools.html#sec-computer) for additional details on Inspect comptuer use. -``` dockerfile -# Virtual environment setup -ENV VIRTUAL_ENV=/opt/venv -RUN python3 -m venv $VIRTUAL_ENV -ENV PATH="$VIRTUAL_ENV/bin:$PATH" -``` +## Approval -We also mock `sudo` as it's unnecessary in Docker, and models tend to get confused by the error messages resulting from `sudo` not being installed. +You can add human approval to either mode, by specifying the `approval` task parameter. For example: -``` dockerfile -# Mock sudo to avoid permission issues -RUN echo -e '#!/bin/sh\nexec "$@"' > /usr/bin/sudo && chmod +x /usr/bin/sudo +``` bash +inspect eval examples/intervention.py -T mode=shell -T approval=true --display conversation ``` -### Docker Compose +For `shell` mode, this will result in each and every bash or python call requiring approval. For `computer` mode, this will result in only some actions requiring approval (e.g. clicks require approval, but mouse moves do not). Here is the approval.yaml file used for computer mode: -We have a simple Docker Compose file which starts up a container specified in the `Dockerfile`. +```{.yaml filename="approval.yaml"} +approvers: + - name: human + tools: + - computer(action='key' + - computer(action='left_click' + - computer(action='middle_click' + - computer(action='double_click' -``` docker -version: '3' -services: - default: - build: . 
- command: tail -f /dev/null - cpus: 1.0 - mem_limit: 0.5gb - network_mode: bridge + - name: auto + tools: "*" ``` -You can customise this for advanced behaviours, such as GPU passthrough or multiple containers. Here is a `Dockerfile` used for GPU passthrough (note that this needs a lot more setup, but I've provided this as an example.) - -``` docker -version: '3' -services: - default: - build: . - command: tail -f /dev/null - cpus: 7.0 - mem_limit: 28.0gb - network_mode: bridge - deploy: - resources: - reservations: - devices: - - driver: nvidia - count: all - capabilities: [gpu] -``` \ No newline at end of file +See the [Approval](https://inspect.ai-safety-institute.org.uk/approval.html) documentation for additional details on creating approval policies. diff --git a/examples/intervention/computer/approval.yaml b/examples/intervention/computer/approval.yaml new file mode 100644 index 000000000..f49f96cbd --- /dev/null +++ b/examples/intervention/computer/approval.yaml @@ -0,0 +1,10 @@ +approvers: + - name: human + tools: + - computer(action='key' + - computer(action='left_click' + - computer(action='middle_click' + - computer(action='double_click' + + - name: auto + tools: "*" diff --git a/examples/intervention/computer/compose.yaml b/examples/intervention/computer/compose.yaml new file mode 100644 index 000000000..2b304df89 --- /dev/null +++ b/examples/intervention/computer/compose.yaml @@ -0,0 +1,9 @@ +services: + default: + # Temporary internal image until the official one is available + image: inspect-computer-tool-beta + init: true + # Dynamically assign ports for vnc and novnc + ports: + - "5900" + - "6080" diff --git a/examples/intervention/intervention.py b/examples/intervention/intervention.py index ec896d72b..5749ed98b 100644 --- a/examples/intervention/intervention.py +++ b/examples/intervention/intervention.py @@ -1,4 +1,5 @@ from textwrap import dedent +from typing import Literal from rich.prompt import Prompt @@ -12,41 +13,65 @@ system_message, 
use_tools, ) +from inspect_ai.solver._chain import chain from inspect_ai.tool import bash, python +from inspect_ai.tool.beta import computer from inspect_ai.util import input_screen +Mode = Literal["shell", "computer"] + @task -def intervention(): - return Task( - solver=[ - system_prompt(), - user_prompt(), - use_tools([bash(), python()]), - agent_loop(), - ], - sandbox="docker", - ) +def intervention(mode: Mode = "shell", approval: bool = False) -> Task: + if mode == "shell": + return Task( + solver=intervention_agent(mode), + sandbox=("docker", "shell/compose.yaml"), + approval="human" if approval else None, + ) + else: + return Task( + solver=intervention_agent(mode), + sandbox=("docker", "computer/compose.yaml"), + approval="computer/approval.yaml" if approval else None, + ) @solver -def system_prompt(): - SYSTEM_PROMPT = dedent(""" - You are an extremely capable general purpose agent. Your task is to take instructions from the user and accomplish tasks without human input. +def intervention_agent(mode: Mode) -> Solver: + return chain( + system_message(SHELL_PROMPT if mode == "shell" else COMPUTER_PROMPT), + user_prompt(), + use_tools([bash(), python()] if mode == "shell" else [computer()]), + agent_loop(), + ) + - You will first take instructions from the user, make a plan, and then accomplish the plan autonomously. +SHELL_PROMPT = dedent(""" + You are an extremely capable general purpose agent. Your task is to take instructions from the user and accomplish tasks without human input. - You are in a full Ubuntu environment connected to the internet. You have access to a wide range of tools and libraries and can install more if necessary. You do not need access to sudo to run apt-get or anything else. + You will first take instructions from the user, make a plan, and then accomplish the plan autonomously. + + You are in a full Ubuntu environment connected to the internet. You have access to a wide range of tools and libraries and can install more if necessary. 
You do not need access to sudo to run apt-get or anything else. + + Your "python" tool will execute a given script. It isn't saved and you aren't able to interact with the script to provide input or anything like that. Note that the Python tool isn't like a notebook - you need to print out the values and cannot just return them to be printed. If you do this, nothing will be returned, so make sure you print whatever variables you want returned in the output. + + If you do want a Python script to be persistent, you'll need to save it to a file using the bash tool. The bash tool also is not interactive, so you will need to interact with it in successive turns. + + You won't have many common tools installed by default, so you may need to install them yourself. + + Please reflect on what actions to take before you take them. + """) - Your "python" tool will execute a given script. It isn't saved and you aren't able to interact with the script to provide input or anything like that. Note that the Python tool isn't like a notebook - you need to print out the values and cannot just return them to be printed. If you do this, nothing will be returned, so make sure you print whatever variables you want returned in the output. +COMPUTER_PROMPT = dedent(""" + You are an extremely capable general purpose agent. Your task is to take instructions from the user and accomplish tasks without human input. - If you do want a Python script to be persistent, you'll need to save it to a file using the bash tool. The bash tool also is not interactive, so you will need to interact with it in successive turns. + You are in a full Ubuntu environment connected to the internet. - You won't have many common tools installed by default, so you may need to install them yourself. + Please reflect on what actions to take before you take them. - Please reflect on what actions to take before you take them. + After each step carefully evaluate if you have achieved the right outcome. 
Explicitly show your thinking: "I have evaluated step X..." If not correct, try again. Only when you confirm a step was executed correctly should you move on to the next one. """) - return system_message(SYSTEM_PROMPT) @solver diff --git a/examples/intervention/Dockerfile b/examples/intervention/shell/Dockerfile similarity index 100% rename from examples/intervention/Dockerfile rename to examples/intervention/shell/Dockerfile diff --git a/examples/intervention/compose.yaml b/examples/intervention/shell/compose.yaml similarity index 100% rename from examples/intervention/compose.yaml rename to examples/intervention/shell/compose.yaml diff --git a/src/inspect_ai/_cli/eval.py b/src/inspect_ai/_cli/eval.py index 3500298c4..fbadb78e9 100644 --- a/src/inspect_ai/_cli/eval.py +++ b/src/inspect_ai/_cli/eval.py @@ -365,6 +365,14 @@ def eval_options(func: Callable[..., Any]) -> Callable[..., click.Context]: help="Whether to enable parallel function calling during tool use (defaults to True) OpenAI and Groq only.", envvar="INSPECT_EVAL_PARALLEL_TOOL_CALLS", ) + @click.option( + "--internal-tools/--no-internal-tools", + type=bool, + is_flag=True, + default=True, + help="Whether to automatically map tools to model internal implementations (e.g. 
'computer' for anthropic).", + envvar="INSPECT_EVAL_INTERNAL_TOOLS", + ) @click.option( "--max-tool-output", type=int, @@ -439,6 +447,7 @@ def eval_command( logprobs: bool | None, top_logprobs: int | None, parallel_tool_calls: bool | None, + internal_tools: bool | None, max_tool_output: int | None, cache_prompt: str | None, reasoning_effort: str | None, @@ -598,6 +607,7 @@ def eval_set_command( logprobs: bool | None, top_logprobs: int | None, parallel_tool_calls: bool | None, + internal_tools: bool | None, max_tool_output: int | None, cache_prompt: str | None, reasoning_effort: str | None, @@ -836,6 +846,9 @@ def config_from_locals(locals: dict[str, Any]) -> GenerateConfigArgs: if key == "parallel_tool_calls": if value is not False: value = None + if key == "internal_tools": + if value is not False: + value = None config[key] = value # type: ignore return config diff --git a/src/inspect_ai/_display/core/config.py b/src/inspect_ai/_display/core/config.py index 990154be6..2796753bf 100644 --- a/src/inspect_ai/_display/core/config.py +++ b/src/inspect_ai/_display/core/config.py @@ -13,14 +13,14 @@ def task_config( value = task_args[key] if is_registry_dict(value): task_args[key] = value["name"] - config = task_args | dict(profile.eval_config.model_dump(exclude_none=True)) + config = dict(profile.eval_config.model_dump(exclude_none=True)) | task_args if generate_config: - config = config | dict(profile.generate_config.model_dump(exclude_none=True)) + config = dict(profile.generate_config.model_dump(exclude_none=True)) | config if profile.tags: config["tags"] = ",".join(profile.tags) config_print: list[str] = [] for name, value in config.items(): - if name == "approval": + if name == "approval" and isinstance(value, dict): config_print.append( f"{name}: {','.join([approver['name'] for approver in value['approvers']])}" ) diff --git a/src/inspect_ai/_display/textual/widgets/port_mappings.py b/src/inspect_ai/_display/textual/widgets/port_mappings.py new file mode 100644 
index 000000000..10f850e02
--- /dev/null
+++ b/src/inspect_ai/_display/textual/widgets/port_mappings.py
@@ -0,0 +1,110 @@
+from typing import Literal
+
+from textual.app import ComposeResult
+from textual.containers import HorizontalScroll
+from textual.widget import Widget
+from textual.widgets import Link, Static
+
+from inspect_ai._util.port_names import get_service_by_port
+from inspect_ai.util._sandbox.environment import PortMapping
+
+
+class PortMappingsView(HorizontalScroll):
+    DEFAULT_CSS = """
+    PortMappingsView {
+        layout: grid;
+        height: auto;
+        grid-size: 4 3;
+        grid-columns: auto auto auto auto;
+        grid-gutter: 0 1;
+    }
+    """
+
+    def __init__(self, ports: list[PortMapping] | None) -> None:
+        super().__init__()
+        self.ports = ports
+
+    def compose(self) -> ComposeResult:
+        if not self.ports:
+            return
+        yield Static("service")
+        yield Static("sandbox")
+        yield Static("client")
+        yield Static("endpoint")
+        mappings_and_services = [
+            (mapping, get_service_by_port(mapping.container_port, mapping.protocol))
+            for mapping in self.ports
+        ]
+        remaining_widgets = [
+            widget
+            for mapping_and_service in mappings_and_services
+            for widget in widgets_from_port_mapping(mapping_and_service)
+        ]
+        for widget in remaining_widgets:
+            yield widget
+
+
+def widgets_for_port_mappings(
+    port_mappings: list[PortMapping] | None,
+) -> list[Widget]:
+    if port_mappings is None:
+        return []
+    return [
+        static
+        for mapping in [
+            (mapping, get_service_by_port(mapping.container_port, mapping.protocol))
+            for mapping in port_mappings
+        ]
+        for static in widgets_from_port_mapping(mapping)
+    ]
+
+
+def widgets_from_port_mapping(
+    mapping_service_tuple: tuple[PortMapping, str | None],
+) -> list[Widget]:
+    port_mapping, service = mapping_service_tuple
+    return [
+        widget
+        for host_mapping in port_mapping.mappings
+        for widget in get_row_widgets(
+            port_mapping.protocol,
+            host_mapping.host_port,
+            port_mapping.container_port,
+            service,
+        )
+    ]
+
+
+def get_row_widgets(
+    protocol: Literal["tcp", "udp"],
+    host_port: int,
+    container_port: int,
+    service: str | None,
+) -> list[Widget]:
+    url = get_url(
+        host_port,
+        service,
+    )
+    return [
+        Static(service if service is not None else protocol),
+        Static(str(container_port)),
+        Static(str(host_port)),
+        Link(url) if url is not None else Static(""),
+    ]
+
+
+def get_url(
+    host_port: int,
+    service: str | None,
+) -> str | None:
+    if service is not None:
+        if service == "noVNC":
+            return f"http://localhost:{host_port}?view_only=true&autoconnect=true&resize=scale"
+
+        if service.startswith("HTTP"):
+            return f"https://localhost:{host_port}"
+
+        if service.startswith("VNC"):
+            return f"vnc://localhost:{host_port}"
+
+    return None
diff --git a/src/inspect_ai/_display/textual/widgets/samples.py b/src/inspect_ai/_display/textual/widgets/samples.py
index 40b467e2d..943b00cfb 100644
--- a/src/inspect_ai/_display/textual/widgets/samples.py
+++ b/src/inspect_ai/_display/textual/widgets/samples.py
@@ -5,21 +5,10 @@
 from rich.table import Table
 from rich.text import Text
 from textual.app import ComposeResult
-from textual.containers import (
-    Horizontal,
-    HorizontalGroup,
-    Vertical,
-    VerticalGroup,
-)
+from textual.containers import Horizontal, HorizontalGroup, Vertical, VerticalGroup
 from textual.reactive import reactive
 from textual.widget import Widget
-from textual.widgets import (
-    Button,
-    Collapsible,
-    LoadingIndicator,
-    OptionList,
-    Static,
-)
+from textual.widgets import Button, Collapsible, LoadingIndicator, OptionList, Static
 from textual.widgets.option_list import Option, Separator
 
 from inspect_ai._util.format import format_progress_time
@@ -28,6 +17,7 @@ from inspect_ai.log._transcript import ToolEvent
 
 from .clock import Clock
+from .sandbox import SandboxView
 from .transcript import TranscriptView
 
@@ -218,6 +208,7 @@ class SampleInfo(Horizontal):
     def __init__(self) -> None:
         super().__init__()
         self._sample: ActiveSample | None = None
+        self._sandbox_count: int | None = None
 
     def compose(self) -> ComposeResult:
         with Collapsible(title=""):
@@ -233,12 +224,14 @@ async def sync_sample(self, sample: ActiveSample | None) -> None:
         limits = self.query_one(SampleLimits)
         await limits.sync_sample(sample)
 
+        new_sandbox_count = len(sample.sandboxes)
         # bail if we've already processed this sample
-        if self._sample == sample:
+        if self._sample == sample and self._sandbox_count == new_sandbox_count:
             return
 
         # set sample
         self._sample = sample
+        self._sandbox_count = new_sandbox_count
 
         # update UI
         self.display = True
@@ -295,6 +288,9 @@ class SandboxesView(Vertical):
         background: transparent;
         height: auto;
     }
+    #sandboxes-list {
+        height: auto;
+    }
     SandboxesView Static {
         background: transparent;
     }
@@ -312,16 +308,22 @@ def compose(self) -> ComposeResult:
 
     async def sync_sample(self, sample: ActiveSample) -> None:
         if len(sample.sandboxes) > 0:
+            multiple_sandboxes = len(sample.sandboxes) > 1
             self.display = True
             sandboxes_caption = cast(Static, self.query_one("#sandboxes-caption"))
-            sandboxes_caption.update("[bold]sandbox containers:[/bold]")
+            sandboxes_caption.update(
+                f"[bold]sandbox container{'s' if multiple_sandboxes else ''}:[/bold]"
+            )
 
             sandboxes_list = self.query_one("#sandboxes-list")
             await sandboxes_list.remove_children()
+
             await sandboxes_list.mount_all(
-                [Static(sandbox.command) for sandbox in sample.sandboxes.values()]
+                SandboxView(connection, name if multiple_sandboxes else None)
+                for name, connection in sample.sandboxes.items()
             )
-            sandboxes_list.mount(
+
+            await sandboxes_list.mount(
                 Static(
                     "[italic]Hold down Alt (or Option) to select text for copying[/italic]",
                     classes="clipboard-message",
diff --git a/src/inspect_ai/_display/textual/widgets/sandbox.py b/src/inspect_ai/_display/textual/widgets/sandbox.py
new file mode 100644
index 000000000..b7cd49dad
--- /dev/null
+++ b/src/inspect_ai/_display/textual/widgets/sandbox.py
@@ -0,0 +1,37 @@
+from textual.app import ComposeResult
+from textual.containers import Horizontal, Vertical
+from textual.widgets import Static
+
+from inspect_ai.util._sandbox.environment import SandboxConnection
+
+from .port_mappings import PortMappingsView
+
+
+class SandboxView(Vertical):
+    DEFAULT_CSS = """
+    .indent {
+        width: 2;
+    }
+    .no_indent {
+        width: 0;
+    }
+    """
+
+    def __init__(
+        self,
+        connection: SandboxConnection,
+        name: str | None,  # if None, no header or indent
+    ) -> None:
+        super().__init__()
+        self.sandbox_name = name
+        self.connection = connection
+
+    def compose(self) -> ComposeResult:
+        if self.sandbox_name:
+            yield Static(self.sandbox_name)
+        with Horizontal():
+            yield Static("", classes="indent" if self.sandbox_name else "no_indent")
+            with Vertical():
+                yield Static(self.connection.command)
+                if self.connection.ports:
+                    yield PortMappingsView(self.connection.ports)
diff --git a/src/inspect_ai/_eval/task/run.py b/src/inspect_ai/_eval/task/run.py
index 71fa8804b..d91f22e94 100644
--- a/src/inspect_ai/_eval/task/run.py
+++ b/src/inspect_ai/_eval/task/run.py
@@ -27,10 +27,7 @@
 from inspect_ai._util.datetime import iso_now
 from inspect_ai._util.error import exception_message
 from inspect_ai._util.hooks import send_telemetry
-from inspect_ai._util.registry import (
-    is_registry_object,
-    registry_log_name,
-)
+from inspect_ai._util.registry import is_registry_object, registry_log_name
 from inspect_ai._util.timeouts import Timeout, timeout, timeout_at
 from inspect_ai._view.notify import view_notify_eval
 from inspect_ai.dataset import Dataset, Sample
diff --git a/src/inspect_ai/_util/constants.py b/src/inspect_ai/_util/constants.py
index 0d90cf12e..55fe40ad6 100644
--- a/src/inspect_ai/_util/constants.py
+++ b/src/inspect_ai/_util/constants.py
@@ -37,3 +37,4 @@
 CONSOLE_DISPLAY_WIDTH = 120
 BASE_64_DATA_REMOVED = ""
 SANDBOX_SETUP_TIMEOUT = 300
+NO_CONTENT = "(no content)"
diff --git a/src/inspect_ai/_util/port_names.py b/src/inspect_ai/_util/port_names.py
new file mode 100644
index 000000000..20cbb4a53
--- /dev/null
+++ b/src/inspect_ai/_util/port_names.py
@@ -0,0 +1,61 @@
+from typing import Literal
+
+
+def get_service_by_port(port: int, protocol: Literal["tcp", "udp"]) -> str | None:
+    """
+    Returns the likely service running on a given port number.
+
+    Args:
+        port (int): The port number to look up
+        protocol (str): Either 'tcp' or 'udp'
+
+    Returns:
+        str: Description of the likely service, or None if not found
+    """
+    # Common port mappings based on IANA assignments and common usage
+    port_mappings = {
+        "tcp": {
+            20: "FTP (Data)",
+            21: "FTP (Control)",
+            22: "SSH",
+            23: "Telnet",
+            25: "SMTP",
+            53: "DNS",
+            80: "HTTP",
+            110: "POP3",
+            143: "IMAP",
+            443: "HTTPS",
+            445: "Microsoft-DS (SMB)",
+            587: "SMTP (Submission)",
+            993: "IMAPS",
+            995: "POP3S",
+            1433: "Microsoft SQL Server",
+            1521: "Oracle Database",
+            3306: "MySQL",
+            3389: "RDP (Remote Desktop)",
+            5432: "PostgreSQL",
+            5900: "VNC",
+            5901: "VNC Display :1",
+            5902: "VNC Display :2",
+            6080: "noVNC",
+            8080: "HTTP Alternate",
+            8443: "HTTPS Alternate",
+            27017: "MongoDB",
+            27018: "MongoDB Shard",
+            27019: "MongoDB Config Server",
+        },
+        "udp": {
+            53: "DNS",
+            67: "DHCP Server",
+            68: "DHCP Client",
+            69: "TFTP",
+            123: "NTP",
+            161: "SNMP",
+            162: "SNMP Trap",
+            514: "Syslog",
+            1194: "OpenVPN",
+            5353: "mDNS",
+        },
+    }
+
+    return port_mappings.get(protocol, {}).get(port, None)
diff --git a/src/inspect_ai/_view/www/log-schema.json b/src/inspect_ai/_view/www/log-schema.json
index e3b7a340c..f5b78e75b 100644
--- a/src/inspect_ai/_view/www/log-schema.json
+++ b/src/inspect_ai/_view/www/log-schema.json
@@ -1137,6 +1137,7 @@
       "logprobs": null,
       "top_logprobs": null,
       "parallel_tool_calls": null,
+      "internal_tools": null,
       "max_tool_output": null,
       "cache_prompt": null,
       "reasoning_effort": null
@@ -2190,6 +2191,18 @@
       "default": null,
       "title": "Parallel Tool Calls"
     },
+    "internal_tools": {
+      "anyOf": [
+        {
+          "type": "boolean"
+        },
+        {
+          "type": "null"
+        }
+      ],
+      "default": null,
+      "title": "Internal Tools"
+    },
    "max_tool_output": {
      "anyOf": [
        {
@@ -2258,6 +2271,7 @@
      "logprobs",
      "top_logprobs",
      "parallel_tool_calls",
+     "internal_tools",
      "max_tool_output",
      "cache_prompt",
      "reasoning_effort"
@@ -4207,6 +4221,7 @@
      "best_of": null,
      "cache_prompt": null,
      "frequency_penalty": null,
+     "internal_tools": null,
      "logit_bias": null,
      "logprobs": null,
      "max_connections": null,
diff --git a/src/inspect_ai/_view/www/src/types/log.d.ts b/src/inspect_ai/_view/www/src/types/log.d.ts
index 551055c9c..8c461ca6b 100644
--- a/src/inspect_ai/_view/www/src/types/log.d.ts
+++ b/src/inspect_ai/_view/www/src/types/log.d.ts
@@ -76,6 +76,7 @@
 export type NumChoices = number | null;
 export type Logprobs = boolean | null;
 export type TopLogprobs = number | null;
 export type ParallelToolCalls = boolean | null;
+export type InternalTools = boolean | null;
 export type MaxToolOutput = number | null;
 export type CachePrompt = "auto" | boolean | null;
 export type ReasoningEffort = ("low" | "medium" | "high") | null;
@@ -545,6 +546,7 @@ export interface GenerateConfig {
   logprobs: Logprobs;
   top_logprobs: TopLogprobs;
   parallel_tool_calls: ParallelToolCalls;
+  internal_tools: InternalTools;
   max_tool_output: MaxToolOutput;
   cache_prompt: CachePrompt;
   reasoning_effort: ReasoningEffort;
@@ -897,6 +899,7 @@ export interface GenerateConfig1 {
   logprobs: Logprobs;
   top_logprobs: TopLogprobs;
   parallel_tool_calls: ParallelToolCalls;
+  internal_tools: InternalTools;
   max_tool_output: MaxToolOutput;
   cache_prompt: CachePrompt;
   reasoning_effort: ReasoningEffort;
diff --git a/src/inspect_ai/approval/_policy.py b/src/inspect_ai/approval/_policy.py
index 8314934ba..b4625a352 100644
--- a/src/inspect_ai/approval/_policy.py
+++ b/src/inspect_ai/approval/_policy.py
@@ -1,13 +1,13 @@
 import fnmatch
-import re
+import sys
 from dataclasses import dataclass
 from pathlib import Path
-from re import Pattern
 from typing import Any, Generator, cast
 
 from pydantic import BaseModel, Field, model_validator
 
 from inspect_ai._util.config import read_config_object
+from inspect_ai._util.format import format_function_call
 from inspect_ai._util.registry import registry_create, registry_lookup
 from inspect_ai.solver._task_state import TaskState
 from inspect_ai.tool._tool_call import ToolCall, ToolCallView
@@ -30,17 +30,23 @@ def policy_approver(policies: str | list[ApprovalPolicy]) -> Approver:
         policies = approval_policies_from_config(policies)
 
     # compile policy into approvers and regexes for matching
-    policy_matchers: list[tuple[list[Pattern[str]], Approver]] = []
+    policy_matchers: list[tuple[list[str], Approver]] = []
     for policy in policies:
         tools = [policy.tools] if isinstance(policy.tools, str) else policy.tools
-        patterns = [re.compile(fnmatch.translate(tool)) for tool in tools]
-        policy_matchers.append((patterns, policy.approver))
+        globs = [f"{tool}*" for tool in tools]
+        policy_matchers.append((globs, policy.approver))
 
     # generator for policies that match a tool_call
     def tool_approvers(tool_call: ToolCall) -> Generator[Approver, None, None]:
         for policy_matcher in iter(policy_matchers):
+            function_call = format_function_call(
+                tool_call.function, tool_call.arguments, width=sys.maxsize
+            )
             if any(
-                [pattern.match(tool_call.function) for pattern in policy_matcher[0]]
+                [
+                    fnmatch.fnmatch(function_call, pattern)
+                    for pattern in policy_matcher[0]
+                ]
             ):
                 yield policy_matcher[1]
diff --git a/src/inspect_ai/model/_conversation.py b/src/inspect_ai/model/_conversation.py
index 6b3dd6aa4..44892347d 100644
--- a/src/inspect_ai/model/_conversation.py
+++ b/src/inspect_ai/model/_conversation.py
@@ -1,6 +1,7 @@
 from rich.console import RenderableType
 from rich.text import Text
 
+from inspect_ai._util.constants import NO_CONTENT
 from inspect_ai._util.rich import lines_display
 from inspect_ai._util.transcript import transcript_markdown
 from inspect_ai.util._conversation import conversation_panel
@@ -15,13 +16,16 @@ def conversation_tool_mesage(message: ChatMessageTool) -> None:
     if display_type() == "conversation":
         # truncate output to 100 lines
-        output = message.error.message if message.error else message.text.strip()
-        content = lines_display(output, 100)
-
-        conversation_panel(
-            title=f"Tool Output: {message.function}",
-            content=content,
+        output = (
+            message.error.message.strip() if message.error else message.text.strip()
         )
+        if output:
+            content = lines_display(output, 100)
+
+            conversation_panel(
+                title=f"Tool Output: {message.function}",
+                content=content,
+            )
 
 
 def conversation_assistant_message(
@@ -37,12 +41,15 @@ def conversation_assistant_message(
 
     # start with assistant content
     content: list[RenderableType] = (
-        [transcript_markdown(message.text, escape=True)] if message.text else []
+        [transcript_markdown(message.text, escape=True)]
+        if message.text and message.text != NO_CONTENT
+        else []
     )
 
     # print tool calls
     if message.tool_calls:
-        content.append(Text())
+        if content:
+            content.append(Text())
         content.extend(render_tool_calls(message.tool_calls))
 
     # print the assistant message
diff --git a/src/inspect_ai/model/_generate_config.py b/src/inspect_ai/model/_generate_config.py
index a29cc8ce4..3e931afdd 100644
--- a/src/inspect_ai/model/_generate_config.py
+++ b/src/inspect_ai/model/_generate_config.py
@@ -66,6 +66,9 @@ class GenerateConfigArgs(TypedDict, total=False):
     parallel_tool_calls: bool | None
     """Whether to enable parallel function calling during tool use (defaults to True). OpenAI and Groq only."""
 
+    internal_tools: bool | None
+    """Whether to automatically map tools to model internal implementations (e.g. 'computer' for anthropic)."""
+
     max_tool_output: int | None
     """Maximum tool output (in bytes). Defaults to 16 * 1024."""
 
@@ -136,6 +139,9 @@ class GenerateConfig(BaseModel):
     parallel_tool_calls: bool | None = Field(default=None)
     """Whether to enable parallel function calling during tool use (defaults to True). OpenAI and Groq only."""
 
+    internal_tools: bool | None = Field(default=None)
+    """Whether to automatically map tools to model internal implementations (e.g. 'computer' for anthropic)."""
+
     max_tool_output: int | None = Field(default=None)
     """Maximum tool output (in bytes). Defaults to 16 * 1024."""
 
diff --git a/src/inspect_ai/model/_model.py b/src/inspect_ai/model/_model.py
index d4ff06e82..c554dc602 100644
--- a/src/inspect_ai/model/_model.py
+++ b/src/inspect_ai/model/_model.py
@@ -165,7 +165,7 @@ def tools_required(self) -> bool:
         return False
 
     def tool_result_images(self) -> bool:
-        """Tool results can containe images"""
+        """Tool results can contain images"""
         return False
 
@@ -713,16 +713,19 @@ def tool_result_images_reducer(
     messages: list[ChatMessage],
     message: ChatMessage,
 ) -> list[ChatMessage]:
-    # append the message
-    messages.append(message)
-
     # if there are tool result images, pull them out into a ChatUserMessage
     if isinstance(message, ChatMessageTool) and isinstance(message.content, list):
+        tool_message = ChatMessageTool(
+            content=message.content.copy(), tool_call_id=message.tool_call_id
+        )
+        assert isinstance(tool_message.content, list)
+        messages.append(tool_message)
+
         user_content: list[Content] = []
-        for i in range(0, len(message.content)):
-            if isinstance(message.content[i], ContentImage):
+        for i in range(0, len(tool_message.content)):
+            if isinstance(tool_message.content[i], ContentImage):
                 user_content.append(message.content[i])
-                message.content[i] = ContentText(
+                tool_message.content[i] = ContentText(
                     text="Image content is in the message below."
                 )
 
         if len(user_content) > 0:
@@ -730,6 +733,9 @@ def tool_result_images_reducer(
                 ChatMessageUser(content=user_content, tool_call_id=message.tool_call_id)
             )
 
+    else:
+        messages.append(message)
+
     # return messages
     return messages
 
diff --git a/src/inspect_ai/model/_providers/anthropic.py b/src/inspect_ai/model/_providers/anthropic.py
index a3df84d10..2b4f77b79 100644
--- a/src/inspect_ai/model/_providers/anthropic.py
+++ b/src/inspect_ai/model/_providers/anthropic.py
@@ -1,8 +1,14 @@
 import functools
 import os
+import sys
 from copy import copy
 from logging import getLogger
-from typing import Any, Literal, Tuple, cast
+from typing import Any, Literal, Tuple, TypedDict, cast
+
+if sys.version_info >= (3, 11):
+    from typing import NotRequired
+else:
+    from typing_extensions import NotRequired
 
 from anthropic import (
     APIConnectionError,
@@ -27,7 +33,11 @@
 from pydantic import JsonValue
 from typing_extensions import override
 
-from inspect_ai._util.constants import BASE_64_DATA_REMOVED, DEFAULT_MAX_RETRIES
+from inspect_ai._util.constants import (
+    BASE_64_DATA_REMOVED,
+    DEFAULT_MAX_RETRIES,
+    NO_CONTENT,
+)
 from inspect_ai._util.content import Content, ContentImage, ContentText
 from inspect_ai._util.error import exception_message
 from inspect_ai._util.images import file_as_data_uri
@@ -35,20 +45,11 @@
 from inspect_ai._util.url import data_uri_mime_type, data_uri_to_base64
 from inspect_ai.tool import ToolCall, ToolChoice, ToolFunction, ToolInfo
 
-from .._chat_message import (
-    ChatMessage,
-    ChatMessageAssistant,
-    ChatMessageSystem,
-)
+from .._chat_message import ChatMessage, ChatMessageAssistant, ChatMessageSystem
 from .._generate_config import GenerateConfig
 from .._model import ModelAPI
 from .._model_call import ModelCall
-from .._model_output import (
-    ChatCompletionChoice,
-    ModelOutput,
-    ModelUsage,
-    StopReason,
-)
+from .._model_output import ChatCompletionChoice, ModelOutput, ModelUsage, StopReason
 from .util import environment_prerequisite_error, model_base_url
 
 logger = getLogger(__name__)
@@ -142,7 +143,7 @@ def model_call() -> ModelCall:
                 system_param,
                 tools_param,
                 messages,
-                cache_prompt,
+                computer_use,
             ) = await resolve_chat_input(self.model_name, input, tools, config)
 
             # prepare request params (assembled this way so we can log the raw model call)
@@ -158,13 +159,11 @@
             # additional options
             request = request | self.completion_params(config)
 
-            # caching header
-            if cache_prompt:
-                request["extra_headers"] = {
-                    "anthropic-beta": "prompt-caching-2024-07-31"
-                }
+            # computer use beta
+            if computer_use:
+                request["extra_headers"] = {"anthropic-beta": "computer-use-2024-10-22"}
 
-            # call model
+            # make request
             message = await self.client.messages.create(**request, stream=False)
 
             # set response for ModelCall
@@ -256,6 +255,9 @@ def handle_bad_request(self, ex: BadRequestError) -> ModelOutput | None:
         elif "content filtering" in error:
             content = "Sorry, but I am unable to help with that request."
             stop_reason = "content_filter"
+        else:
+            content = error
+            stop_reason = "unknown"
 
         if content and stop_reason:
             return ModelOutput.from_content(
@@ -268,12 +270,26 @@
             return None
 
 
+# native anthropic tool definitions for computer use beta
+# https://docs.anthropic.com/en/docs/build-with-claude/computer-use
+class ComputerUseToolParam(TypedDict):
+    type: str
+    name: str
+    display_width_px: NotRequired[int]
+    display_height_px: NotRequired[int]
+    display_number: NotRequired[int]
+
+
+# tools can be either a stock tool param or a special computer use tool param
+ToolParamDef = ToolParam | ComputerUseToolParam
+
+
 async def resolve_chat_input(
     model: str,
     input: list[ChatMessage],
     tools: list[ToolInfo],
     config: GenerateConfig,
-) -> Tuple[list[TextBlockParam] | None, list[ToolParam], list[MessageParam], bool]:
+) -> Tuple[list[TextBlockParam] | None, list[ToolParamDef], list[MessageParam], bool]:
     # extract system message
     system_messages, messages = split_system_messages(input, config)
@@ -286,14 +302,7 @@ async def resolve_chat_input(
     )
 
     # tools
-    tools_params = [
-        ToolParam(
-            name=tool.name,
-            description=tool.description,
-            input_schema=tool.parameters.model_dump(exclude_none=True),
-        )
-        for tool in tools
-    ]
+    tools_params, computer_use = tool_params_for_tools(tools, config)
 
     # system messages
     if len(system_messages) > 0:
@@ -343,10 +352,66 @@
             add_cache_control(cast(dict[str, Any], content[-1]))
 
     # return chat input
-    return system_param, tools_params, message_params, cache_prompt
+    return system_param, tools_params, message_params, computer_use
+
+
+def tool_params_for_tools(
+    tools: list[ToolInfo], config: GenerateConfig
+) -> tuple[list[ToolParamDef], bool]:
+    # tool params and computer_use bit to return
+    tool_params: list[ToolParamDef] = []
+    computer_use = False
+
+    # for each tool, check if it has a native computer use implementation and use that
+    # when available (noting that we need to set the computer use request header)
+    for tool in tools:
+        computer_use_tool = (
+            computer_use_tool_param(tool)
+            if config.internal_tools is not False
+            else None
+        )
+        if computer_use_tool:
+            tool_params.append(computer_use_tool)
+            computer_use = True
+        else:
+            tool_params.append(
+                ToolParam(
+                    name=tool.name,
+                    description=tool.description,
+                    input_schema=tool.parameters.model_dump(exclude_none=True),
+                )
+            )
+
+    return tool_params, computer_use
 
-def add_cache_control(param: TextBlockParam | ToolParam | dict[str, Any]) -> None:
+
+def computer_use_tool_param(tool: ToolInfo) -> ComputerUseToolParam | None:
+    # check for compatible 'computer' tool
+    if tool.name == "computer" and (
+        sorted(tool.parameters.properties.keys())
+        == sorted(["action", "coordinate", "text"])
+    ):
+        return ComputerUseToolParam(
+            type="computer_20241022",
+            name="computer",
+            # Note: The dimensions passed here for display_width_px and display_height_px should
+            # match the dimensions of screenshots returned by the tool.
+            # Those dimensions will always be one of the values in MAX_SCALING_TARGETS
+            # in _x11_client.py.
+            # TODO: enhance this code to calculate the dimensions based on the scaled screen
+            # size used by the container.
+            display_width_px=1366,
+            display_height_px=768,
+            display_number=1,
+        )
+    # not a computer_use tool
+    else:
+        return None
+
+
+def add_cache_control(
+    param: TextBlockParam | ToolParam | ComputerUseToolParam | dict[str, Any],
+) -> None:
     cast(dict[str, Any], param)["cache_control"] = {"type": "ephemeral"}
 
@@ -404,11 +469,6 @@ def message_tool_choice(tool_choice: ToolChoice) -> message_create_params.ToolCh
         return {"type": "auto"}
 
 
-# text we insert when there is no content passed
-# (as this will result in an Anthropic API error)
-NO_CONTENT = "(no content)"
-
-
 async def message_param(message: ChatMessage) -> MessageParam:
     # no system role for anthropic (this is more like an assertion,
     # as these should have already been filtered out)
diff --git a/src/inspect_ai/model/_providers/google.py b/src/inspect_ai/model/_providers/google.py
index c8227b40c..bbe90fa29 100644
--- a/src/inspect_ai/model/_providers/google.py
+++ b/src/inspect_ai/model/_providers/google.py
@@ -51,7 +51,7 @@
 from pydantic import JsonValue
 from typing_extensions import override
 
-from inspect_ai._util.constants import BASE_64_DATA_REMOVED
+from inspect_ai._util.constants import BASE_64_DATA_REMOVED, NO_CONTENT
 from inspect_ai._util.content import (
     Content,
     ContentAudio,
@@ -316,9 +316,6 @@ def consective_tool_message_reducer(
     return messages
 
 
-NO_CONTENT = "(no content)"
-
-
 async def content_dict(
     message: ChatMessageUser | ChatMessageAssistant | ChatMessageTool,
 ) -> ContentDict:
diff --git a/src/inspect_ai/model/_providers/mistral.py b/src/inspect_ai/model/_providers/mistral.py
index 95baf3339..fb69f673c 100644
--- a/src/inspect_ai/model/_providers/mistral.py
+++ b/src/inspect_ai/model/_providers/mistral.py
@@ -40,6 +40,7 @@
 # https://github.com/mistralai/client-python/blob/main/MIGRATION.md
 from inspect_ai._util.constants import (
     DEFAULT_TIMEOUT,
+    NO_CONTENT,
 )
 from inspect_ai._util.content import Content, ContentImage, ContentText
 from inspect_ai._util.images import file_as_data_uri
@@ -326,9 +327,6 @@ async def mistral_chat_message(
     )
 
 
-NO_CONTENT = "(no content)"
-
-
 async def mistral_message_content(
     content: str | list[Content],
 ) -> str | list[ContentChunk]:
diff --git a/src/inspect_ai/model/_providers/vertex.py b/src/inspect_ai/model/_providers/vertex.py
index 5eee17b7e..ca827c0be 100644
--- a/src/inspect_ai/model/_providers/vertex.py
+++ b/src/inspect_ai/model/_providers/vertex.py
@@ -23,7 +23,7 @@
 )
 from vertexai.generative_models import Content as VertexContent
 
-from inspect_ai._util.constants import BASE_64_DATA_REMOVED
+from inspect_ai._util.constants import BASE_64_DATA_REMOVED, NO_CONTENT
 from inspect_ai._util.content import (
     Content,
     ContentAudio,
@@ -250,9 +250,6 @@ def consective_tool_message_reducer(
     return messages
 
 
-NO_CONTENT = "(no content)"
-
-
 async def content_dict(
     message: ChatMessageUser | ChatMessageAssistant | ChatMessageTool,
 ) -> VertexContent:
diff --git a/src/inspect_ai/tool/beta/__init__.py b/src/inspect_ai/tool/beta/__init__.py
new file mode 100644
index 000000000..a4ea44dbb
--- /dev/null
+++ b/src/inspect_ai/tool/beta/__init__.py
@@ -0,0 +1,5 @@
+from ._computer import computer
+
+__all__ = [
+    "computer",
+]
diff --git a/src/inspect_ai/tool/beta/_computer/__init__.py b/src/inspect_ai/tool/beta/_computer/__init__.py
new file mode 100644
index 000000000..908766c87
--- /dev/null
+++ b/src/inspect_ai/tool/beta/_computer/__init__.py
@@ -0,0 +1,3 @@
+from ._computer import computer
+
+__all__ = ["computer"]
diff --git a/src/inspect_ai/tool/beta/_computer/_common.py b/src/inspect_ai/tool/beta/_computer/_common.py
new file mode 100644
index 000000000..63329fe69
--- /dev/null
+++ b/src/inspect_ai/tool/beta/_computer/_common.py
@@ -0,0 +1,134 @@
+import json
+from textwrap import dedent
+from typing import Literal
+
+from pydantic import BaseModel, Field
+
+from inspect_ai._util.content import ContentText
+from inspect_ai._util.error import PrerequisiteError
+from inspect_ai.model import ContentImage
+from inspect_ai.tool import ToolError, ToolResult
+from inspect_ai.util._sandbox.context import sandbox_with
+from inspect_ai.util._sandbox.environment import SandboxEnvironment
+
+Action = Literal[
+    "key",
+    "type",
+    "mouse_move",
+    "left_click",
+    "left_click_drag",
+    "right_click",
+    "middle_click",
+    "double_click",
+    "screenshot",
+    "cursor_position",
+]
+
+
+class ToolExecResult(BaseModel):
+    output: str | None = Field(default=None)
+    error: str | None = Field(default=None)
+    base64_image: str | None = Field(default=None)
+
+
+async def _send_cmd(cmdTail: list[str], timeout: int | None = None) -> ToolResult:
+    from inspect_ai.log._samples import sample_active
+
+    sample = sample_active()
+    assert sample
+    sample_id = sample.sample.id
+    assert sample_id
+
+    cmd = ["python3", "-m", "computer_tool.computer_tool", "--action"] + cmdTail
+
+    raw_exec_result = await (await computer_sandbox()).exec(cmd, timeout=timeout)
+
+    if not raw_exec_result.success:
+        raise RuntimeError(
+            f"Failure executing command: {cmd} {raw_exec_result.stderr}"
+        )
+
+    result = ToolExecResult(**json.loads(raw_exec_result.stdout))
+
+    if result.error:
+        raise ToolError(result.error)
+
+    image = (
+        ContentImage(image=f"data:image/png;base64,{result.base64_image}")
+        if result.base64_image
+        else None
+    )
+    text = result.output if result.output and len(result.output) > 0 else None
+
+    if text is not None and image is not None:
+        return [ContentText(text=text), image]
+
+    if text is not None:
+        return text
+
+    if image is not None:
+        return [image]
+
+    return "OK"
+
+
+async def cursor_position(timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(["cursor_position"], timeout=timeout)
+
+
+async def screenshot(timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(["screenshot"], timeout=timeout)
+
+
+async def mouse_move(x: int, y: int, timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(
+        ["mouse_move", "--coordinate", f"{x}", f"{y}"], timeout=timeout
+    )
+
+
+async def left_click(timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(["left_click"], timeout=timeout)
+
+
+async def left_click_drag(x: int, y: int, timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(
+        ["left_click_drag", "--coordinate", f"{x}", f"{y}"], timeout=timeout
+    )
+
+
+async def right_click(timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(["right_click"], timeout=timeout)
+
+
+async def middle_click(timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(["middle_click"], timeout=timeout)
+
+
+async def double_click(timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(["double_click"], timeout=timeout)
+
+
+async def press_key(key: str, timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(["key", "--text", key], timeout=timeout)
+
+
+async def type(text: str, timeout: int | None = None) -> ToolResult:
+    return await _send_cmd(["type", "--text", text], timeout=timeout)
+
+
+async def computer_sandbox() -> SandboxEnvironment:
+    sb = await sandbox_with("/opt/computer_tool/computer_tool.py")
+    if sb:
+        return sb
+    else:
+        raise PrerequisiteError(
+            dedent("""
+                The computer tool service was not found in any of the sandboxes for this sample. Please add the computer tool service to your configuration. For example, the following Docker compose file uses the (currently internal) inspect-computer-tool-beta image as its default sandbox:
+
+                services:
+                  default:
+                    # Temporary internal image until the official one is available
+                    image: "inspect-computer-tool-beta"
+                    init: true
+                """).strip()
+        )
diff --git a/src/inspect_ai/tool/beta/_computer/_computer.py b/src/inspect_ai/tool/beta/_computer/_computer.py
new file mode 100644
index 000000000..7e7ecfe99
--- /dev/null
+++ b/src/inspect_ai/tool/beta/_computer/_computer.py
@@ -0,0 +1,127 @@
+from typing import Awaitable, Callable
+
+from inspect_ai.tool import Tool, ToolResult, tool
+from inspect_ai.tool._tool import ToolParsingError
+
+from . import _common as common
+from ._common import Action
+
+ActionFunction = Callable[[str], ToolResult | Awaitable[ToolResult]]
+
+
+@tool()
+def computer(timeout: int | None = 180) -> Tool:
+    """
+    Computer interaction tool.
+
+    Args:
+      timeout (int | None): Timeout (in seconds) for command.
+
+    Returns:
+      Computer interaction tool.
+    """
+
+    async def execute(
+        action: Action,
+        text: str | None = None,
+        coordinate: list[int] | None = None,
+    ) -> ToolResult:
+        """
+        Use this tool to interact with a computer.
+
+        Use a mouse and keyboard to interact with a computer's desktop GUI.
+
+        Keep in mind that icons require double clicks to open while other UI affordances like menu items and buttons require a single click.
+
+        Args:
+          action (Action): The action to perform.
+            - `key`: Press a key or key-combination on the keyboard.
+ - Example: execute(action="key", text="ctrl+s") + - Text can be any key name supported by xdotool's `key` such as: + "Return", "Escape", "alt+Tab", "BackSpace", "Tab", "ctrl+s", "KP_0" (for the numpad 0 key), + "Insert", "Delete", "Home", "End", "Prior", "Next", "Left", "Up", "Right", "Down", + "F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10", "F11", "F12", + "Shift_L", "Shift_R", "Control_L", "Control_R", "Alt_L", "Alt_R", "Scroll_Lock", "Num_Lock", "Caps_Lock", "Pause", + "KP_Multiply", "KP_Home", "KP_Up", "KP_Prior", "KP_Subtract", "KP_Left", "KP_Begin", "KP_Right", "KP_Add", "KP_End", "KP_Down", + "KP_Next", "KP_Insert", "KP_Delete", "KP_Enter", "KP_Divide", "KP_Equal", "KP_Decimal", + - `type`: Type a string of text on the keyboard. If the text contains spaces, enclose it in quotes. + - Example: execute(action="type", text="The crux of the biscuit is the apostrophe!") + - `cursor_position`: Get the current (x, y) pixel coordinate of the cursor on the screen. + - `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen. + - Example: execute(action="mouse_move", coordinate=[100, 200]) + - `left_click`: Click the left mouse button. + - `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen. + - Example: execute(action="left_click_drag", coordinate=[150, 250]) + - `right_click`: Click the right mouse button. + - `middle_click`: Click the middle mouse button. + - `double_click`: Double-click the left mouse button. + - `screenshot`: Take a screenshot. + text (str | None): The text to type or the key to press. Required when action is "key" or "type". + coordinate (list[int] | None): The (x, y) pixel coordinate on the screen to which to move or drag. Required when action is "mouse_move" or "left_click_drag". + + Returns: + The output of the command. Many commands will include a screenshot reflecting the result of the command in their output. 
+ """ + if action in ("mouse_move", "left_click_drag"): + if coordinate is None: + raise ToolParsingError(f"coordinate is required for {action}") + if text is not None: + raise ToolParsingError(f"text is not accepted for {action}") + if not isinstance(coordinate, list) or len(coordinate) != 2: + raise ToolParsingError(f"{coordinate} must be a list of length 2") + if not all(isinstance(i, int) and i >= 0 for i in coordinate): + raise ToolParsingError( + f"{coordinate} must be a list of non-negative ints" + ) + + if action == "mouse_move": + return await common.mouse_move( + coordinate[0], coordinate[1], timeout=timeout + ) + elif action == "left_click_drag": + return await common.left_click_drag( + coordinate[0], coordinate[1], timeout=timeout + ) + + if action in ("key", "type"): + if text is None: + raise ToolParsingError(f"text is required for {action}") + if coordinate is not None: + raise ToolParsingError(f"coordinate is not accepted for {action}") + if not isinstance(text, str): + raise ToolParsingError(f"{text} must be a string") + + if action == "key": + return await common.press_key(text, timeout=timeout) + elif action == "type": + return await common.type(text, timeout=timeout) + + if action in ( + "left_click", + "right_click", + "double_click", + "middle_click", + "screenshot", + "cursor_position", + ): + if text is not None: + raise ToolParsingError(f"text is not accepted for {action}") + if coordinate is not None: + raise ToolParsingError(f"coordinate is not accepted for {action}") + + if action == "screenshot": + return await common.screenshot(timeout=timeout) + elif action == "cursor_position": + return await common.cursor_position(timeout=timeout) + elif action == "left_click": + return await common.left_click(timeout=timeout) + elif action == "right_click": + return await common.right_click(timeout=timeout) + elif action == "middle_click": + return await common.middle_click(timeout=timeout) + elif action == "double_click": + return await 
common.double_click(timeout=timeout) + + raise ToolParsingError(f"Invalid action: {action}") + + return execute diff --git a/src/inspect_ai/tool/beta/_computer/_computer_split.py b/src/inspect_ai/tool/beta/_computer/_computer_split.py new file mode 100644 index 000000000..0faab95dd --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_computer_split.py @@ -0,0 +1,198 @@ +""" +This module provides the same functionality as the computer tool, but via a list of per-action tools, e.g. computer_mouse_move(100, 100). + +The split version is not publicly exported, but is retained until we decide if it performs better than the monolithic computer tool. +""" + +from typing import Awaitable, Callable + +from inspect_ai.tool import Tool, ToolResult, tool + +from . import _common as common + +ActionFunction = Callable[[str], ToolResult | Awaitable[ToolResult]] + + +def computer_split(timeout: int | None = None) -> list[Tool]: + """ + Computer interaction tools. + + Args: + timeout (int | None): Timeout (in seconds) for command. + + Returns: + List of computer interaction tools. + """ + return [ + computer_cursor_position(timeout), + computer_screenshot(timeout), + computer_mouse_move(timeout), + computer_left_click(timeout), + computer_double_click(timeout), + computer_left_click_drag(timeout), + computer_right_click(timeout), + computer_key(timeout), + computer_type(timeout), + ] + + +@tool() +def computer_cursor_position(timeout: int | None = None) -> Tool: + async def execute() -> ToolResult: + """ + Get the current (x, y) pixel coordinate of the cursor on the screen. + + Args: + None + + Returns: + A `str` of the form "X=x,Y=y" where x and y are the current mouse coordinates. + """ + return await common.cursor_position(timeout=timeout) + + return execute + + +@tool() +def computer_screenshot(timeout: int | None = None) -> Tool: + async def execute() -> ToolResult: + """ + Take a screenshot. + + Args: + None + + Returns: + A `list` with a single `ContentImage` of the screen. 
+ """ + return await common.screenshot(timeout=timeout) + + return execute + + +@tool() +def computer_mouse_move(timeout: int | None = None) -> Tool: + async def execute(x: int, y: int) -> ToolResult: + """ + Move the cursor to a specified (x, y) pixel coordinate on the screen. + + Args: + x: X coordinate of the mouse destination. + y: Y coordinate of the mouse destination. + + Returns: + A `list` with a single `ContentImage` of the screen. + """ + return await common.mouse_move(x, y, timeout=timeout) + + return execute + + +@tool() +def computer_left_click(timeout: int | None = None) -> Tool: + async def execute() -> ToolResult: + """ + Click the left mouse button. + + Args: + None + + Returns: + A `list` with a single `ContentImage` of the screen. + """ + return await common.left_click(timeout=timeout) + + return execute + + +@tool() +def computer_double_click(timeout: int | None = None) -> Tool: + async def execute() -> ToolResult: + """ + Double-click the left mouse button. + + Args: + None + + Returns: + A `list` with a single `ContentImage` of the screen. + """ + return await common.double_click(timeout=timeout) + + return execute + + +@tool() +def computer_left_click_drag(timeout: int | None = None) -> Tool: + async def execute(x: int, y: int) -> ToolResult: + """ + Click and drag the cursor to a specified (x, y) pixel coordinate on the screen. + + Args: + x: X coordinate of the mouse destination. + y: Y coordinate of the mouse destination. + + Returns: + A `list` with a single `ContentImage` of the screen. + """ + return await common.left_click_drag(x, y, timeout=timeout) + + return execute + + +@tool() +def computer_right_click(timeout: int | None = None) -> Tool: + async def execute() -> ToolResult: + """ + Click the right mouse button. + + Args: + None + + Returns: + A `list` with a single `ContentImage` of the screen. 
+ """ + return await common.right_click(timeout=timeout) + + return execute + + +# keysym list is from https://gist.github.com/rvaiya/be31f42049a4b5ad46666a8e120d9843 +@tool() +def computer_key(timeout: int | None = None) -> Tool: + async def execute(key: str) -> ToolResult: + """ + Press a key or key-combination on the keyboard. + + Args: + key: The key or key-combination to press. Can be any key name supported by xdotool's `key` such as: + "Return", "Escape", "alt+Tab", "BackSpace", "Tab", "ctrl+s", "KP_0" (for the numpad 0 key), + "Insert", "Delete", "Home", "End", "Prior", "Next", "Left", "Up", "Right", "Down", + "F1", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10", "F11", "F12", + "Shift_L", "Shift_R", "Control_L", "Control_R", "Alt_L", "Alt_R", "Scroll_Lock", "Num_Lock", "Caps_Lock", "Pause", + "KP_Multiply", "KP_Home", "KP_Up", "KP_Prior", "KP_Subtract", "KP_Left", "KP_Begin", "KP_Right", "KP_Add", "KP_End", "KP_Down", + "KP_Next", "KP_Insert", "KP_Delete", "KP_Enter", "KP_Divide", "KP_Equal", "KP_Decimal" + + Returns: + A `list` with a single `ContentImage` of the screen. + """ + return await common.press_key(key, timeout=timeout) + + return execute + + +@tool() +def computer_type(timeout: int | None = None) -> Tool: + async def execute(text: str) -> ToolResult: + """ + Type a string of text on the keyboard. + + Args: + text: The text to type. If the text contains spaces, enclose it in quotes. + + Returns: + A `list` with a single `ContentImage` of the screen. 
+ """ + return await common.type(text, timeout=timeout) + + return execute diff --git a/src/inspect_ai/tool/beta/_computer/_resources/Dockerfile b/src/inspect_ai/tool/beta/_computer/_resources/Dockerfile new file mode 100644 index 000000000..cc851938a --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/Dockerfile @@ -0,0 +1,109 @@ +FROM docker.io/ubuntu:22.04 + +ENV DEBIAN_FRONTEND=noninteractive +ENV DEBIAN_PRIORITY=high + +# Core/system layer +RUN apt-get update && \ + apt-get -y upgrade && \ + apt-get -y install \ + # A virtual framebuffer for running GUI applications without a physical display. + xvfb \ + # A terminal emulator for X. + xterm \ + # A command-line tool for automating X11 applications (e.g., simulating keyboard/mouse inputs). + xdotool \ + # A command-line tool for taking screenshots. + scrot \ + # A suite for image manipulation — needed for scaling images. + imagemagick \ + sudo \ + # A lightweight window manager. + mutter \ + # A VNC server for sharing X11 desktops. + x11vnc \ + # A web based VNC client + novnc \ + # A WebSocket to TCP proxy/bridge for noVNC + websockify \ + # Python reqs + python3 \ + python3-pip \ + # Network tools + # Provides networking tools like ifconfig, netstat, etc. + net-tools \ + # A versatile networking tool for debugging, port scanning, and more. + netcat && \ + apt-get clean + +# Userland apps +RUN apt-get install -y --no-install-recommends \ + # A lightweight PDF viewer. + xpdf \ + # A simple image viewer. + xpaint \ + # A lightweight taskbar for graphical desktops. + tint2 \ + # A calculator application. + galculator \ + # A lightweight file manager. 
+ pcmanfm && \ + apt-get clean + +# install Firefox +RUN apt-get install -y software-properties-common && \ + add-apt-repository ppa:mozillateam/ppa && \ + apt-get update && \ + apt-get install -y --no-install-recommends firefox-esr && \ + apt-get clean + +# install VS Code +RUN apt-get install -y \ + gpg \ + wget \ + apt-transport-https \ + software-properties-common && \ + wget -qO- https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > packages.microsoft.gpg && \ + install -D -o root -g root -m 644 packages.microsoft.gpg /etc/apt/keyrings/packages.microsoft.gpg && \ + sh -c 'echo "deb [arch=amd64,arm64 signed-by=/etc/apt/keyrings/packages.microsoft.gpg] https://packages.microsoft.com/repos/code stable main" > /etc/apt/sources.list.d/vscode.list' && \ + apt-get update && \ + apt-get install -y code && \ + apt-get clean + +# configure noVNC +RUN ln -s /usr/share/novnc/vnc.html /usr/share/novnc/index.html + +# setup user +ENV USERNAME=computeruse +ENV HOME=/home/$USERNAME +RUN useradd -m -s /bin/bash -d $HOME $USERNAME +RUN echo "${USERNAME} ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers +USER computeruse +WORKDIR $HOME + +# configure Firefox to skip all 'first run' UI +RUN mkdir -p $HOME/.mozilla/firefox-esr/profile.default && \ + echo 'user_pref("browser.startup.homepage_override.mstone", "ignore");' >> $HOME/.mozilla/firefox-esr/profile.default/user.js && \ + echo 'user_pref("browser.aboutwelcome.enabled", false);' >> $HOME/.mozilla/firefox-esr/profile.default/user.js && \ + echo 'user_pref("datareporting.policy.firstRunURL", "");' >> $HOME/.mozilla/firefox-esr/profile.default/user.js + +# only reinstall if requirements.txt changes +COPY computer_tool/requirements.txt /opt/computer_tool/requirements.txt +RUN cd /opt/computer_tool && pip3 install --no-cache-dir -r requirements.txt + +COPY --chown=$USERNAME:$USERNAME image_home_dir/ $HOME +COPY computer_tool/ /opt/computer_tool +# This is needed if we want to use relative imports in the tool source files. 
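+# Illustrative note (not part of the upstream image): the tool CLI copied below is +# expected to be launched as a module, e.g. +#   PYTHONPATH=/opt python3 -m computer_tool.computer_tool --action screenshot +# which is why PYTHONPATH must point at the parent of /opt/computer_tool.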
+ENV PYTHONPATH=/opt + +EXPOSE 5900 +EXPOSE 6080 + +ARG DISPLAY_NUM=1 +ARG WIDTH=1920 +ARG HEIGHT=1080 +ENV DISPLAY_NUM=$DISPLAY_NUM +ENV HEIGHT=$HEIGHT +ENV WIDTH=$WIDTH + +ENTRYPOINT [ "./entrypoint.sh" ] diff --git a/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/__init__.py b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_logger.py b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_logger.py new file mode 100644 index 000000000..c3a3a42fe --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_logger.py @@ -0,0 +1,22 @@ +import logging + + +def setup_logger(level=logging.INFO): + """ + This logger emits all of its output to PID 1's stdout. + + This makes it so that logging from invocations of the computer_tool cli show up in `docker logs` output. + """ + new_logger = logging.getLogger("computer_tool") + new_logger.setLevel(level) + + stdout_handler = logging.FileHandler("/proc/1/fd/1", mode="w") + stdout_handler.setLevel(level) + stdout_handler.setFormatter( + logging.Formatter("%(name)s(pid=%(process)d) - %(levelname)s - %(message)s") + ) + + if not new_logger.handlers: + new_logger.addHandler(stdout_handler) + + return new_logger diff --git a/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_run.py b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_run.py new file mode 100644 index 000000000..89db980ac --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_run.py @@ -0,0 +1,42 @@ +"""Utility to run shell commands asynchronously with a timeout.""" + +import asyncio + +TRUNCATED_MESSAGE: str = "To save on context only part of this file has been shown to you. You should retry this tool after you have searched inside the file with `grep -n` in order to find the line numbers of what you are looking for." 
+MAX_RESPONSE_LEN: int = 16000 + + +def maybe_truncate(content: str, truncate_after: int | None = MAX_RESPONSE_LEN): + """Truncate content and append a notice if content exceeds the specified length.""" + return ( + content + if not truncate_after or len(content) <= truncate_after + else content[:truncate_after] + TRUNCATED_MESSAGE + ) + + +async def run( + cmd: str, + timeout: float | None = 120.0, # seconds + truncate_after: int | None = MAX_RESPONSE_LEN, +): + """Run a shell command asynchronously with a timeout.""" + process = await asyncio.create_subprocess_shell( + cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE + ) + + try: + stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=timeout) + return ( + process.returncode or 0, + maybe_truncate(stdout.decode(), truncate_after=truncate_after), + maybe_truncate(stderr.decode(), truncate_after=truncate_after), + ) + except asyncio.TimeoutError as exc: + try: + process.kill() + except ProcessLookupError: + pass + raise TimeoutError( + f"Command '{cmd}' timed out after {timeout} seconds" + ) from exc diff --git a/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_tool_result.py b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_tool_result.py new file mode 100644 index 000000000..138f85e4a --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_tool_result.py @@ -0,0 +1,33 @@ +from dataclasses import dataclass, fields, replace + + +@dataclass(kw_only=True, frozen=True) +class ToolResult: + """Represents the result of a tool execution.""" + + output: str | None = None + error: str | None = None + base64_image: str | None = None + + def __bool__(self): + return any(getattr(self, field.name) for field in fields(self)) + + def __add__(self, other: "ToolResult"): + def combine_fields( + field: str | None, other_field: str | None, concatenate: bool = True + ): + if field and other_field: + if concatenate: + return field + other_field + 
raise ValueError("Cannot combine tool results") + return field or other_field + + return ToolResult( + output=combine_fields(self.output, other.output), + error=combine_fields(self.error, other.error), + base64_image=combine_fields(self.base64_image, other.base64_image, False), + ) + + def replace(self, **kwargs): + """Returns a new ToolResult with the given fields replaced.""" + return replace(self, **kwargs) diff --git a/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_x11_client.py b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_x11_client.py new file mode 100644 index 000000000..48a72018f --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/_x11_client.py @@ -0,0 +1,262 @@ +"""Based on https://github.com/anthropics/anthropic-quickstarts/blob/main/computer-use-demo/computer_use_demo/tools/computer.py""" + +import asyncio +import base64 +import logging +import os +import shlex +from pathlib import Path +from typing import Literal, TypedDict +from uuid import uuid4 + +from ._run import run +from ._tool_result import ToolResult + +OUTPUT_DIR = "/tmp/outputs" + +TYPING_DELAY_MS = 12 +TYPING_GROUP_SIZE = 50 + +ColorCount = Literal[4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4] + +Action = Literal[ + "key", + "type", + "mouse_move", + "left_click", + "left_click_drag", + "right_click", + "middle_click", + "double_click", + "screenshot", + "cursor_position", +] + + +class ToolError(Exception): + def __init__(self, message): + self.message = message + + +class Resolution(TypedDict): + width: int + height: int + + +# sizes above XGA/WXGA are not recommended (see README.md) +# scale down to one of these targets if ComputerTool._scaling_enabled is set +MAX_SCALING_TARGETS: dict[str, Resolution] = { + "XGA": Resolution(width=1024, height=768), # 4:3 + "WXGA": Resolution(width=1280, height=800), # 16:10 + "FWXGA": Resolution(width=1366, height=768), # ~16:9 +} + + +ScalingSource = Literal["computer", "api"] + + +class 
ComputerToolOptions(TypedDict): + display_height_px: int + display_width_px: int + display_number: int | None + + +def chunks(s: str, chunk_size: int) -> list[str]: + return [s[i : i + chunk_size] for i in range(0, len(s), chunk_size)] + + +class X11Client: + """ + A tool that allows the agent to interact with the screen, keyboard, and mouse of the current computer. + + The tool parameters are defined by Anthropic and are not editable. + """ + + width: int + height: int + display_num: int | None + # TODO: Complete plumbing this or remove it + color_count: ColorCount | None = 256 + + _screenshot_delay = 2.0 + _scaling_enabled = True + + @property + def options(self) -> ComputerToolOptions: + width, height = self.scale_coordinates("computer", self.width, self.height) + return { + "display_width_px": width, + "display_height_px": height, + "display_number": self.display_num, + } + + def __init__(self): + super().__init__() + + self.width = int(os.getenv("WIDTH") or 0) + self.height = int(os.getenv("HEIGHT") or 0) + assert self.width and self.height, "WIDTH, HEIGHT must be set" + if (display_num := os.getenv("DISPLAY_NUM")) is not None: + self.display_num = int(display_num) + self._display_prefix = f"DISPLAY=:{self.display_num} " + else: + self.display_num = None + self._display_prefix = "" + + self.xdotool = f"{self._display_prefix}xdotool" + + async def __call__( + self, + *, + action: Action, + text: str | None = None, + coordinate: tuple[int, int] | None = None, + **kwargs, + ): + if action in ("mouse_move", "left_click_drag"): + if coordinate is None: + raise ToolError(f"coordinate is required for {action}") + if text is not None: + raise ToolError(f"text is not accepted for {action}") + if not isinstance(coordinate, list) or len(coordinate) != 2: + raise ToolError(f"{coordinate} must be a tuple of length 2") + if not all(isinstance(i, int) and i >= 0 for i in coordinate): + raise ToolError(f"{coordinate} must be a tuple of non-negative ints") + + x, y = 
self.scale_coordinates("api", coordinate[0], coordinate[1]) + + if action == "mouse_move": + return await self.shell(f"{self.xdotool} mousemove --sync {x} {y}") + elif action == "left_click_drag": + return await self.shell( + f"{self.xdotool} mousedown 1 mousemove --sync {x} {y} mouseup 1" + ) + + if action in ("key", "type"): + if text is None: + raise ToolError(f"text is required for {action}") + if coordinate is not None: + raise ToolError(f"coordinate is not accepted for {action}") + if not isinstance(text, str): + raise ToolError(output=f"{text} must be a string") + + if action == "key": + return await self.shell( + f"{self.xdotool} key -- {' '.join(shlex.quote(part) for part in text.split())}" + ) + elif action == "type": + results: list[ToolResult] = [] + for chunk in chunks(text, TYPING_GROUP_SIZE): + cmd = f"{self.xdotool} type --delay {TYPING_DELAY_MS} -- {shlex.quote(chunk)}" + results.append(await self.shell(cmd, take_screenshot=False)) + + screenshot_base64 = await self.take_screenshot_after_delay() + return ToolResult( + output="".join(result.output or "" for result in results), + error="".join(result.error or "" for result in results), + base64_image=screenshot_base64, + ) + + if action in ( + "left_click", + "right_click", + "double_click", + "middle_click", + "screenshot", + "cursor_position", + ): + if text is not None: + raise ToolError(f"text is not accepted for {action}") + if coordinate is not None: + raise ToolError(f"coordinate is not accepted for {action}") + + if action == "screenshot": + return await self.screenshot() + elif action == "cursor_position": + result = await self.shell( + f"{self.xdotool} getmouselocation --shell", + take_screenshot=False, + ) + output = result.output or "" + x, y = self.scale_coordinates( + "computer", + int(output.split("X=")[1].split("\n")[0]), + int(output.split("Y=")[1].split("\n")[0]), + ) + return result.replace(output=f"X={x},Y={y}") + else: + click_arg = { + "left_click": "1", + "right_click": "3", + 
"middle_click": "2", + "double_click": "--repeat 2 --delay 500 1", + }[action] + return await self.shell(f"{self.xdotool} click {click_arg}") + + raise ToolError(f"Invalid action: {action}") + + async def screenshot(self): + """Take a screenshot of the current screen and return the base64 encoded image.""" + output_dir = Path(OUTPUT_DIR) + output_dir.mkdir(parents=True, exist_ok=True) + path = output_dir / f"screenshot_{uuid4().hex}.png" + + result = await self.shell( + f"{self._display_prefix}scrot --silent -p {path}", take_screenshot=False + ) + if self._scaling_enabled: + x, y = self.scale_coordinates("computer", self.width, self.height) + convert_cmd = f"convert {path} -resize {x}x{y}!" + if self.color_count is not None: + convert_cmd += f" -colors {self.color_count}" + convert_cmd += f" {path}" + await self.shell(convert_cmd, take_screenshot=False) + + if path.exists(): + return result.replace( + base64_image=base64.b64encode(path.read_bytes()).decode() + ) + raise ToolError(f"Failed to take screenshot: {result.error}") + + async def shell(self, command: str, take_screenshot=True) -> ToolResult: + """Run a shell command and return the output, error, and optionally a screenshot.""" + logging.debug(f"running shell command {command}") + _, stdout, stderr = await run(command) + logging.debug(f"shell command returned stdout: {stdout}, stderr: {stderr}") + return ToolResult( + output=stdout, + error=stderr, + base64_image=(await self.take_screenshot_after_delay()) + if take_screenshot + else None, + ) + + async def take_screenshot_after_delay(self) -> str: + # delay to let things settle before taking a screenshot + await asyncio.sleep(self._screenshot_delay) + return (await self.screenshot()).base64_image + + def scale_coordinates(self, source: ScalingSource, x: int, y: int): + """Scale coordinates to a target maximum resolution.""" + if not self._scaling_enabled: + return x, y + ratio = self.width / self.height + target_dimension = None + for dimension in 
MAX_SCALING_TARGETS.values(): + # allow some error in the aspect ratio - not all ratios are exactly 16:9 + if abs(dimension["width"] / dimension["height"] - ratio) < 0.02: + if dimension["width"] < self.width: + target_dimension = dimension + break + if target_dimension is None: + return x, y + # should be less than 1 + x_scaling_factor = target_dimension["width"] / self.width + y_scaling_factor = target_dimension["height"] / self.height + if source == "api": + if x > self.width or y > self.height: + raise ToolError(f"Coordinates {x}, {y} are out of bounds") + # scale up + return round(x / x_scaling_factor), round(y / y_scaling_factor) + # scale down + return round(x * x_scaling_factor), round(y * y_scaling_factor) diff --git a/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/computer_tool.py b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/computer_tool.py new file mode 100644 index 000000000..1927d16c1 --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/computer_tool.py @@ -0,0 +1,85 @@ +import argparse +import asyncio +import json +import logging +import os +import sys +import time + +from ._logger import setup_logger +from ._tool_result import ToolResult +from ._x11_client import X11Client + +# This is a bit sketchy. We really want to use relative imports here. Using absolute imports +# works at runtime, but it prevents intellisense from working. However, when this folder is +# copied to the container, by default relative imports won't work if this file is launched +# normally. To overcome this, two things need to happen: +# 1. PYTHONPATH must be set to the parent of the container folder. `PYTHONPATH=/opt` +# 2. The program must be launched with the -m flag. `python3 -m computer_tool.computer_tool` +# +# TODO: There's got to be a cleaner way. 
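+# For reference, an illustrative invocation from inside the container (the +# --action/--text/--coordinate arguments are defined in parse_arguments below): +# +#   PYTHONPATH=/opt python3 -m computer_tool.computer_tool --action mouse_move --coordinate 100 200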
+ +my_logger = setup_logger(logging.INFO) + + +def main(): + try: + args = parse_arguments() + my_logger.info(f"({args})") + result = asyncio.run(execute_action(args)) + + print( + json.dumps( + { + "output": result.output, + "error": result.error, + "base64_image": result.base64_image, + } + ) + ) + my_logger.debug("SUCCESS") + except Exception as e: + my_logger.warning(f"An error occurred: {e}") + print(f"An error occurred: {e}", file=sys.stderr) + sys.exit(1) + + +def parse_arguments(): + parser = argparse.ArgumentParser(description="Execute computer tool action") + parser.add_argument("--action", type=str, required=True, help="Action to perform") + parser.add_argument("--text", type=str, help="Optional text parameter") + parser.add_argument( + "--coordinate", + type=int, + nargs=2, + help="Optional coordinate parameter as a list of two integers", + ) + return parser.parse_args() + + +async def execute_action(args) -> ToolResult: + # we can't do anything until X11 is ready to go. + await wait_for_file("/tmp/mutter_started") + + computer = X11Client() + return await computer( + action=args.action, + text=args.text, + coordinate=args.coordinate if args.coordinate else None, + ) + + +async def wait_for_file(file_path, check_interval=1): + if os.path.exists(file_path): + return + my_logger.info(f"Waiting for {file_path}") + start_time = time.time() + while not os.path.exists(file_path): + await asyncio.sleep(check_interval) + my_logger.info( + f"Done waiting for {file_path} after {time.time() - start_time:.1f} seconds" + ) + + +if __name__ == "__main__": + main() diff --git a/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/requirements.txt b/src/inspect_ai/tool/beta/_computer/_resources/computer_tool/requirements.txt new file mode 100644 index 000000000..e69de29bb diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/applications/code.desktop 
b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/applications/code.desktop new file mode 100755 index 000000000..a8ef853f0 --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/applications/code.desktop @@ -0,0 +1,8 @@ +[Desktop Entry] +Name=VS Code +Comment=Open VS Code +Exec=code +Icon=/usr/share/code/resources/app/resources/linux/code.png +Terminal=false +Type=Application +Categories=TextEditor; diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/applications/firefox-custom.desktop b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/applications/firefox-custom.desktop new file mode 100755 index 000000000..123f09aa0 --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/applications/firefox-custom.desktop @@ -0,0 +1,8 @@ +[Desktop Entry] +Name=Firefox Custom +Comment=Open Firefox with custom URL +Exec=firefox-esr -new-window -profile /home/computeruse/.mozilla/firefox-esr/profile.default +Icon=firefox-esr +Terminal=false +Type=Application +Categories=Network;WebBrowser; diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/applications/terminal.desktop b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/applications/terminal.desktop new file mode 100644 index 000000000..0c2d45d4d --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/applications/terminal.desktop @@ -0,0 +1,8 @@ +[Desktop Entry] +Name=Terminal +Comment=Open Terminal +Exec=xterm +Icon=utilities-terminal +Terminal=false +Type=Application +Categories=System;TerminalEmulator; diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/tint2rc b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/tint2rc new file mode 100644 index 000000000..cab44bc01 --- /dev/null +++ 
b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/.config/tint2/tint2rc @@ -0,0 +1,100 @@ +#------------------------------------- +# Panel +panel_items = TL +panel_size = 100% 60 +panel_margin = 0 0 +panel_padding = 2 0 2 +panel_background_id = 1 +wm_menu = 0 +panel_dock = 0 +panel_position = bottom center horizontal +panel_layer = top +panel_monitor = all +panel_shrink = 0 +autohide = 0 +autohide_show_timeout = 0 +autohide_hide_timeout = 0.5 +autohide_height = 2 +strut_policy = follow_size +panel_window_name = tint2 +disable_transparency = 1 +mouse_effects = 1 +font_shadow = 0 +mouse_hover_icon_asb = 100 0 10 +mouse_pressed_icon_asb = 100 0 0 +scale_relative_to_dpi = 0 +scale_relative_to_screen_height = 0 + +#------------------------------------- +# Taskbar +taskbar_mode = single_desktop +taskbar_hide_if_empty = 0 +taskbar_padding = 0 0 2 +taskbar_background_id = 0 +taskbar_active_background_id = 0 +taskbar_name = 1 +taskbar_hide_inactive_tasks = 0 +taskbar_hide_different_monitor = 0 +taskbar_hide_different_desktop = 0 +taskbar_always_show_all_desktop_tasks = 0 +taskbar_name_padding = 4 2 +taskbar_name_background_id = 0 +taskbar_name_active_background_id = 0 +taskbar_name_font_color = #e3e3e3 100 +taskbar_name_active_font_color = #ffffff 100 +taskbar_distribute_size = 0 +taskbar_sort_order = none +task_align = left + +#------------------------------------- +# Launcher +launcher_padding = 4 8 4 +launcher_background_id = 0 +launcher_icon_background_id = 0 +launcher_icon_size = 48 +launcher_icon_asb = 100 0 0 +launcher_icon_theme_override = 0 +startup_notifications = 1 +launcher_tooltip = 1 + +#------------------------------------- +# Launcher icon +launcher_item_app = /usr/share/applications/pcmanfm.desktop +launcher_item_app = /home/computeruse/.config/tint2/applications/terminal.desktop +launcher_item_app = /home/computeruse/.config/tint2/applications/firefox-custom.desktop +launcher_item_app = /usr/share/applications/xpaint.desktop +launcher_item_app 
= /usr/share/applications/xpdf.desktop +launcher_item_app = /home/computeruse/.config/tint2/applications/code.desktop +launcher_item_app = /usr/share/applications/galculator.desktop + +#------------------------------------- +# Background definitions +# ID 1 +rounded = 0 +border_width = 0 +background_color = #000000 60 +border_color = #000000 30 + +# ID 2 +rounded = 4 +border_width = 1 +background_color = #777777 20 +border_color = #777777 30 + +# ID 3 +rounded = 4 +border_width = 1 +background_color = #777777 20 +border_color = #ffffff 40 + +# ID 4 +rounded = 4 +border_width = 1 +background_color = #aa4400 100 +border_color = #aa7733 100 + +# ID 5 +rounded = 4 +border_width = 1 +background_color = #aaaa00 100 +border_color = #aaaa00 100 diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/README.md b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/README.md new file mode 100644 index 000000000..571a2a8f5 --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/README.md @@ -0,0 +1,28 @@ +# About This Image + +This image is based heavily on the image from Anthropic's Computer Use Demo [here](https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo/image). + +It has been adapted to launch only those tools required for the Inspect `computer_tool` to interact with the computer via X11 and `xdotool`. + +## Tools Launched + +1. **Xvfb (X Virtual Framebuffer)** + - **Script:** `xvfb_startup.sh` + - **Description:** Xvfb is a display server implementing the X11 display server protocol. It runs in memory and does not require a physical display. This is useful for running graphical applications in a headless environment. + +2. **tint2** + - **Script:** `tint2_startup.sh` + - **Description:** tint2 is a lightweight panel/taskbar. It provides a taskbar, system tray, and application launcher. It is highly configurable and is used to manage and display open applications. + +3. 
**Mutter** + - **Script:** `mutter_startup.sh` + - **Description:** Mutter is a window manager for the X Window System. It is used to manage windows and provide compositing effects. In this setup, it is used to replace the default window manager and provide a graphical environment. + +4. **x11vnc** + - **Script:** `x11vnc_startup.sh` + - **Description:** x11vnc is a VNC server that allows remote access to the X11 display. It enables users to connect to the virtual display environment from a remote machine using a VNC client. + +## `.config/tint2` Directory + +The `.config/tint2` directory contains configuration files for tint2. These files define the appearance and behavior of the tint2 panel, including the taskbar, system tray, and application launcher. You can customize the tint2 panel by modifying the configuration files in this directory. + diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/entrypoint.sh b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/entrypoint.sh new file mode 100755 index 000000000..f79b44d87 --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/entrypoint.sh @@ -0,0 +1,17 @@ +#!/bin/bash +set -e + +export DISPLAY=:${DISPLAY_NUM} + +# remove marker files +rm -f /tmp/.X${DISPLAY_NUM}-lock +rm -f /tmp/mutter_started + +./xvfb_startup.sh +./mutter_startup.sh +./tint2_startup.sh +./x11vnc_startup.sh +./novnc_startup.sh + +# Keep the container running +tail -f /dev/null diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/mutter_startup.sh b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/mutter_startup.sh new file mode 100755 index 000000000..17c2795fc --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/mutter_startup.sh @@ -0,0 +1,21 @@ +echo "starting mutter" +XDG_SESSION_TYPE=x11 mutter --replace --sm-disable 2>/tmp/mutter_stderr.log & + +# Wait for the mutter window to appear +timeout=30 +while [ $timeout -gt
0 ]; do + if xdotool search --class "mutter" >/dev/null 2>&1; then + break + fi + sleep 1 + ((timeout--)) +done + +if [ $timeout -eq 0 ]; then + echo "mutter stderr output:" >&2 + cat /tmp/mutter_stderr.log >&2 + exit 1 +fi + +touch /tmp/mutter_started +rm /tmp/mutter_stderr.log diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/novnc_startup.sh b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/novnc_startup.sh new file mode 100755 index 000000000..6acee6b4b --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/novnc_startup.sh @@ -0,0 +1,20 @@ +#!/bin/bash +echo "starting noVNC" + +# Start noVNC with explicit websocket settings +websockify \ + --web=/usr/share/novnc/ \ + 6080 localhost:5900 \ + > /tmp/novnc.log 2>&1 & + +# Wait for noVNC to start +timeout=10 +while [ $timeout -gt 0 ]; do + if netstat -tuln | grep -q ":6080 "; then + break + fi + sleep 1 + ((timeout--)) +done + +echo "noVNC started successfully" diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/tint2_startup.sh b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/tint2_startup.sh new file mode 100755 index 000000000..34f39a18b --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/tint2_startup.sh @@ -0,0 +1,24 @@ +#!/bin/bash +echo "starting tint2 on display :$DISPLAY_NUM ..." 
+ +# Start tint2 and capture its stderr +tint2 -c $HOME/.config/tint2/tint2rc 2>/tmp/tint2_stderr.log & + +# Wait for tint2 window properties to appear +timeout=30 +while [ $timeout -gt 0 ]; do + if xdotool search --class "tint2" >/dev/null 2>&1; then + break + fi + sleep 1 + ((timeout--)) +done + +if [ $timeout -eq 0 ]; then + echo "tint2 stderr output:" >&2 + cat /tmp/tint2_stderr.log >&2 + exit 1 +fi + +# Remove the temporary stderr log file +rm /tmp/tint2_stderr.log diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/x11vnc_startup.sh b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/x11vnc_startup.sh new file mode 100755 index 000000000..ccb2fa7a3 --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/x11vnc_startup.sh @@ -0,0 +1,48 @@ +#!/bin/bash +echo "starting vnc" + +(x11vnc -display $DISPLAY \ + -forever \ + -shared \ + -wait 50 \ + -cursor most \ + -cursor arrow \ + -rfbport 5900 \ + -nopw \ + 2>/tmp/x11vnc_stderr.log) & + +x11vnc_pid=$! + +# Wait for x11vnc to start +timeout=10 +while [ $timeout -gt 0 ]; do + if netstat -tuln | grep -q ":5900 "; then + break + fi + sleep 1 + ((timeout--)) +done + +if [ $timeout -eq 0 ]; then + echo "x11vnc failed to start, stderr output:" >&2 + cat /tmp/x11vnc_stderr.log >&2 + exit 1 +fi + +: > /tmp/x11vnc_stderr.log + +# Monitor x11vnc process in the background +( + while true; do + if ! kill -0 $x11vnc_pid 2>/dev/null; then + echo "x11vnc process crashed, restarting..." 
>&2 + if [ -f /tmp/x11vnc_stderr.log ]; then + echo "x11vnc stderr output:" >&2 + cat /tmp/x11vnc_stderr.log >&2 + rm /tmp/x11vnc_stderr.log + fi + exec "$0" + fi + sleep 5 + done +) & diff --git a/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/xvfb_startup.sh b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/xvfb_startup.sh new file mode 100755 index 000000000..9b9ae5852 --- /dev/null +++ b/src/inspect_ai/tool/beta/_computer/_resources/image_home_dir/xvfb_startup.sh @@ -0,0 +1,48 @@ +#!/bin/bash +set -e # Exit on error + +DPI=96 +RES_AND_DEPTH=${WIDTH}x${HEIGHT}x24 + +# Function to check if Xvfb is already running +check_xvfb_running() { + if [ -e /tmp/.X${DISPLAY_NUM}-lock ]; then + return 0 # Xvfb is already running + else + return 1 # Xvfb is not running + fi +} + +# Function to check if Xvfb is ready +wait_for_xvfb() { + local timeout=10 + local start_time=$(date +%s) + while ! xdpyinfo >/dev/null 2>&1; do + if [ $(($(date +%s) - start_time)) -gt $timeout ]; then + echo "Xvfb failed to start within $timeout seconds" >&2 + return 1 + fi + sleep 0.1 + done + return 0 +} + +# Check if Xvfb is already running +if check_xvfb_running; then + echo "Xvfb is already running on display ${DISPLAY}" + exit 0 +fi + +# Start Xvfb +Xvfb $DISPLAY -ac -screen 0 $RES_AND_DEPTH -retro -dpi $DPI -nolisten tcp -nolisten unix & +XVFB_PID=$! 
+ +# Wait for Xvfb to start +if wait_for_xvfb; then + echo "Xvfb started successfully on display ${DISPLAY}" + echo "Xvfb PID: $XVFB_PID" +else + echo "Xvfb failed to start" + kill $XVFB_PID + exit 1 +fi diff --git a/src/inspect_ai/util/_sandbox/docker/docker.py b/src/inspect_ai/util/_sandbox/docker/docker.py index 876a557c2..223471a16 100644 --- a/src/inspect_ai/util/_sandbox/docker/docker.py +++ b/src/inspect_ai/util/_sandbox/docker/docker.py @@ -1,4 +1,5 @@ import errno +import json import os import tempfile from logging import getLogger @@ -7,9 +8,11 @@ from typing_extensions import override -from inspect_ai.util._subprocess import ExecResult +from inspect_ai.util._subprocess import ExecResult, subprocess from ..environment import ( + HostMapping, + PortMapping, SandboxConnection, SandboxEnvironment, SandboxEnvironmentConfigType, @@ -439,6 +442,7 @@ async def connection(self) -> SandboxConnection: "remote-containers.attachToRunningContainer", container, ], + ports=await get_ports_info(container), container=container, ) # error (not currently running) @@ -468,3 +472,62 @@ async def container_working_dir( + f"{result.stderr}" ) return default + + +async def get_ports_info(container: str) -> list[PortMapping] | None: + try: + result = await subprocess( + [ + "docker", + "inspect", + container, + "--format", + "{{json .NetworkSettings.Ports}}", + ], + timeout=60, + ) + + if not result.success: + raise RuntimeError(result.stderr) + + return parse_docker_inspect_ports(result.stdout) + + # It's currently a policy decision to let docker timeouts be silent. + except TimeoutError: + return None + + +def parse_docker_inspect_ports(json_str: str) -> list[PortMapping] | None: + """ + Parses the JSON output from `docker inspect {container_name} --format='{{json .NetworkSettings.Ports}}'` to extract port mappings. + + Args: + json_str (str): A JSON string representing the `NetworkSettings.Ports` output of `docker inspect`. e.g.
+ ``` + { + "5900/tcp": [{"HostIp": "0.0.0.0", "HostPort": "54023"}], + "8080/tcp": [{"HostIp": "0.0.0.0", "HostPort": "54024"}] + } + ``` + + Returns: + list[PortMapping] | None: A list of PortMapping objects if any port mappings are found, + otherwise None. + """ + data = json.loads(json_str) + port_mappings = [] + for port_protocol, mappings in data.items(): + if mappings is None: + continue + container_port, protocol = port_protocol.split("/") + host_mappings = [ + HostMapping(host_ip=mapping["HostIp"], host_port=int(mapping["HostPort"])) + for mapping in mappings + ] + port_mapping = PortMapping( + container_port=int(container_port), + protocol=protocol, + mappings=host_mappings, + ) + port_mappings.append(port_mapping) + return port_mappings if port_mappings else None diff --git a/src/inspect_ai/util/_sandbox/docker/internal.py b/src/inspect_ai/util/_sandbox/docker/internal.py index 4a4108b86..2fe96102d 100644 --- a/src/inspect_ai/util/_sandbox/docker/internal.py +++ b/src/inspect_ai/util/_sandbox/docker/internal.py @@ -6,13 +6,19 @@ INSPECT_WEB_BROWSER_IMAGE_DOCKERHUB = "aisiuk/inspect-web-browser-tool" INSPECT_WEB_BROWSER_IMAGE = "inspect_web_browser" +INSPECT_COMPUTER_BETA_IMAGE = "inspect-computer-tool-beta" INTERNAL_IMAGES = { INSPECT_WEB_BROWSER_IMAGE: PKG_PATH / "tool" / "_tools" / "_web_browser" - / "_resources" + / "_resources", + INSPECT_COMPUTER_BETA_IMAGE: PKG_PATH + / "tool" + / "beta" + / "_computer" + / "_resources", } diff --git a/src/inspect_ai/util/_sandbox/environment.py b/src/inspect_ai/util/_sandbox/environment.py index 641798680..749f054e3 100644 --- a/src/inspect_ai/util/_sandbox/environment.py +++ b/src/inspect_ai/util/_sandbox/environment.py @@ -28,6 +28,17 @@ ] +class HostMapping(BaseModel): + host_ip: str + host_port: int + + +class PortMapping(BaseModel): + container_port: int + protocol: Literal["tcp", "udp"] + mappings: list[HostMapping] + + class SandboxConnection(BaseModel): """Information required to connect to sandbox.""" @@ 
-40,6 +51,9 @@ class SandboxConnection(BaseModel): vscode_command: list[Any] | None = Field(default=None) """Optional vscode command (+args) to connect to sandbox.""" + ports: list[PortMapping] | None = Field(default=None) + """Optional list of port mappings into container""" + container: str | None = Field(default=None) """Optional container name (does not apply to all sandboxes).""" diff --git a/tools/vscode/src/@types/log.d.ts b/tools/vscode/src/@types/log.d.ts index 551055c9c..8c461ca6b 100644 --- a/tools/vscode/src/@types/log.d.ts +++ b/tools/vscode/src/@types/log.d.ts @@ -76,6 +76,7 @@ export type NumChoices = number | null; export type Logprobs = boolean | null; export type TopLogprobs = number | null; export type ParallelToolCalls = boolean | null; +export type InternalTools = boolean | null; export type MaxToolOutput = number | null; export type CachePrompt = "auto" | boolean | null; export type ReasoningEffort = ("low" | "medium" | "high") | null; @@ -545,6 +546,7 @@ export interface GenerateConfig { logprobs: Logprobs; top_logprobs: TopLogprobs; parallel_tool_calls: ParallelToolCalls; + internal_tools: InternalTools; max_tool_output: MaxToolOutput; cache_prompt: CachePrompt; reasoning_effort: ReasoningEffort; @@ -897,6 +899,7 @@ export interface GenerateConfig1 { logprobs: Logprobs; top_logprobs: TopLogprobs; parallel_tool_calls: ParallelToolCalls; + internal_tools: InternalTools; max_tool_output: MaxToolOutput; cache_prompt: CachePrompt; reasoning_effort: ReasoningEffort;