Initial implementation of a computer tool. #1063

Merged
merged 44 commits into from
Jan 16, 2025
f641e1b
SQUASHED & CHERRY PICKED from feature/anthropic-native-bash-tool
jjallaire Jan 3, 2025
abc3cf6
FIX FOR feature/anthropic-native-bash-tool
Jan 9, 2025
38414f5
Provisional implementation of computer tool.
Jan 8, 2025
349f808
Merge remote-tracking branch 'upstream/main' into computer
Jan 13, 2025
0576646
feedback
Jan 13, 2025
021a8f4
feedback
Jan 13, 2025
e8c0d2c
raise RuntimeError rather than Exception
Jan 13, 2025
2d32bb4
Use sandbox_with.
Jan 13, 2025
de503a0
ruff
Jan 14, 2025
780c54c
first draft of computer use docs
jjallaire Jan 14, 2025
59e0919
First steps in making port mappings available in SandboxConnection
Jan 14, 2025
ae858e9
Plumb the actual port mappings
Jan 14, 2025
060f5eb
remove stray whitespace
Jan 14, 2025
3fae390
whoops
Jan 14, 2025
4463525
Surface port mappings in SandboxesView
Jan 14, 2025
f55382e
approval: glob match full function call
jjallaire Jan 14, 2025
1c89263
port mappings step
Jan 14, 2025
47fe15a
propagate --display cli option
jjallaire Jan 14, 2025
f9c30a8
make no content a global constant
jjallaire Jan 14, 2025
0d33f52
task config: custom args have display priority over config
jjallaire Jan 14, 2025
833f4ec
tighten up conversation mode
jjallaire Jan 14, 2025
0262cf7
tweaks to intervention mode example
jjallaire Jan 14, 2025
b7a6a06
docs on approval
jjallaire Jan 14, 2025
8cd193e
Iterate on the port mapping ui.
Jan 14, 2025
1005992
tweak docstring
Jan 14, 2025
a00d8a0
docs on new vnc port mappings ui
jjallaire Jan 15, 2025
6eeffa9
Put noVNC back in the image.
Jan 15, 2025
e8abac6
Merge remote-tracking branch 'upstream/main' into computer
Jan 15, 2025
e4a157c
tweak doc
Jan 15, 2025
c9ff17d
Update the port mappings style to conserve horizontal space and avoid…
Jan 15, 2025
a518404
reorder mutter and tint2 to dramatically speed up container startup
Jan 15, 2025
2716c0d
Add proper formatting for multiple containers.
Jan 15, 2025
211b4cc
Replace gedit with VSCode in the image.
Jan 15, 2025
9323455
Configure firefox to skip first run UI.
Jan 15, 2025
5262cff
minor doc tweaks
jjallaire Jan 15, 2025
fe38540
Bump the resolution up to 1920x1080.
Jan 15, 2025
5f92889
Iterate on how VS Code is installed hoping to better support both amd…
Jan 16, 2025
f81345b
configure novnc url to scale the screen to fit the browser window
Jan 16, 2025
319fe3a
Update the screen dimensions and clarify w/comment.
Jan 16, 2025
e787676
Models sometimes send multiple keys with the key command. e.g. "ctrl+…
Jan 16, 2025
19532c7
Move computer tool into .tool.beta namespace and rename internal imag…
Jan 16, 2025
ffbc940
doc tweaks
jjallaire Jan 16, 2025
17a510d
Merge branch 'main' into computer
jjallaire Jan 16, 2025
88622a1
ruff
jjallaire Jan 16, 2025
25 changes: 22 additions & 3 deletions docs/approval.qmd
Expand Up @@ -33,20 +33,39 @@ You can chain to together the `human` and `auto` approvers in an *approval polic
``` yaml
approvers:
  - name: human
    tools: ["web_browser_click", "web_browser_type*"]
    tools: ["web_browser_click", "web_browser_type"]

  - name: auto
    tools: "*"
```

Navigational web browser tool calls (e.g. `web_browser_go`) are approved automatically via the catch-all `auto` approver at the end of the chain. Note that when listing an approver in a policy you indicate which tools it should handle using a glob or list of globs.

Navigational web browser tool calls (e.g. `web_browser_go`) are approved automatically via the catch-all `auto` approver at the end of the chain. Note that when listing an approver in a policy you indicate which tools it should handle using a glob or list of globs. These globs are prefix matched so the `web_browser_type` glob matches both `web_browser_type` and `web_browser_type_submit`.
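
As a rough illustration of prefix matching, the sketch below appends `*` to each glob before comparing. This is an assumption about the behaviour described above, not Inspect's actual implementation, and `tool_matches` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def tool_matches(tool_name: str, pattern: str) -> bool:
    """Return True if `pattern` matches `tool_name` when treated as a prefix glob."""
    # Appending "*" lets the glob match any trailing characters.
    # Assumption: illustrative only, not how Inspect implements matching.
    return fnmatchcase(tool_name, pattern + "*")


for name in ["web_browser_type", "web_browser_type_submit", "web_browser_go"]:
    print(name, tool_matches(name, "web_browser_type"))
```

Under this reading, a bare tool name behaves like an explicit `name*` glob.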

To use this policy, pass the path to the policy YAML file as the approver. For example:

``` bash
inspect eval browser.py --approval approval.yaml
```

You can also match on tool arguments (for tools that dispatch many action types). For example, here is an approval policy for the [Computer Tool](tools.qmd#sec-computer) which allows typing and mouse movement but requires approval for key combos (e.g. Enter or a shortcut) and mouse clicks:


```{.yaml filename="approval.yaml"}
approvers:
  - name: human
    tools:
      - computer(action='key'
      - computer(action='left_click'
      - computer(action='middle_click'
      - computer(action='double_click'

  - name: auto
    tools: "*"
```

Note that since this is a prefix match and there could be other arguments, we don't end the tool match pattern with a closing parenthesis.
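
To make the argument matching concrete, here is a minimal sketch of how such a policy could be evaluated. It assumes a tool call is rendered as text in the form `computer(action='key', text='Return')` with `action` as the first argument, and that the first approver with a matching pattern handles the call. `render_call`, `first_approver`, and `POLICY` are hypothetical names for illustration, not Inspect APIs.

```python
from fnmatch import fnmatchcase
from typing import Optional


def render_call(name: str, arguments: dict) -> str:
    """Render a tool call as text (assumed format, for illustration only)."""
    args = ", ".join(f"{key}={value!r}" for key, value in arguments.items())
    return f"{name}({args})"


def first_approver(call: str, policy: list) -> Optional[str]:
    """Return the name of the first approver whose patterns prefix-match `call`."""
    for approver, patterns in policy:
        if any(fnmatchcase(call, pattern + "*") for pattern in patterns):
            return approver
    return None


# Hypothetical in-memory form of the YAML policy shown above.
POLICY = [
    (
        "human",
        [
            "computer(action='key'",
            "computer(action='left_click'",
            "computer(action='middle_click'",
            "computer(action='double_click'",
        ],
    ),
    ("auto", ["*"]),
]

call = render_call("computer", {"action": "key", "text": "Return"})
print(call, "->", first_approver(call, POLICY))
```

Because each pattern includes the closing quote after the action name, `computer(action='left_click'` does not match a `left_click_drag` call in this sketch; that call falls through to the `auto` approver.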

## Approvers in Code

We've demonstrated configuring approvers via a YAML approval policy file—you can also provide a policy directly in code (useful if it needs to be more dynamic). Here's a pure Python version of the example from the previous section:
Expand Down Expand Up @@ -152,7 +171,7 @@ Assuming we have properly [registered our approver](extensions.qmd#sec-extension
``` yaml
approvers:
  - name: evaltools/bash_allowlist
    tools: "*bash*"
    tools: "bash"
    allowed_commands: ["ls", "echo", "cat"]

  - name: human
Expand Down
Binary file added docs/images/vnc-port-info.png
Binary file added docs/images/vnc-view-only.png
177 changes: 168 additions & 9 deletions docs/tools.qmd
Expand Up @@ -6,7 +6,7 @@ title: Tools

Many models now have the ability to interact with client-side Python functions in order to expand their capabilities. This enables you to equip models with your own set of custom tools so they can perform a wider variety of tasks.

Inspect natively supports registering Python functions as tools and providing these tools to models that support them (currently OpenAI, Claude 3, Google Gemini, and Mistral). Inspect also includes several built-in tools ([bash](#sec-bash-and-python), [python](#sec-bash-and-python), and [web_search](#sec-web-search)).
Inspect natively supports registering Python functions as tools and providing these tools to models that support them (currently OpenAI, Claude 3, Google Gemini, and Mistral). Inspect also includes several built-in tools ([bash](#sec-bash-and-python), [python](#sec-bash-and-python), [computer](#sec-computer), [web browser](#sec-web-browser), and [web_search](#sec-web-search)).

::: callout-note
### Tools and Agents
Expand All @@ -22,6 +22,8 @@ Inspect has several built-in tools, including:

- [Web Browser](#sec-web-browser), which provides the model with a headless Chromium web browser that supports navigation, history, and mouse/keyboard interactions.

- [Computer](#sec-computer), which provides the model with a desktop computer (viewed through screenshots) that supports mouse and keyboard interaction.

- [Web Search](#sec-web-search), which uses the Google Search API to execute and summarise web searches.

If you are only interested in using the built-in tools, check out their respective documentation links above. To learn more about creating your own tools read on immediately below.
Expand Down Expand Up @@ -371,16 +373,16 @@ Note that unlike some other tool functions like `bash()`, the `web_browser()` fu

If you review the transcripts of a sample with access to the web browser tool, you'll notice that there are several distinct tools made available for control of the web browser. These tools include:

| Tool | Description |
|------------------------------------|------------------------------------|
| `web_browser_go(url)` | Navigate the web browser to a URL. |
| `web_browser_click(element_id)` | Click an element on the page currently displayed by the web browser. |
| `web_browser_type(element_id)` | Type text into an input on a web browser page. |
| `web_browser_type_submit(element_id, text)` | Type text into a form input on a web browser page and press ENTER to submit the form. |
| `web_browser_scroll(direction)` | Scroll the web browser up or down by one page. |
| `web_browser_forward()` | Navigate the web browser forward in the browser history. |
| `web_browser_back()` | Navigate the web browser back in the browser history. |
| `web_browser_refresh()` | Refresh the current page of the web browser. |

: {tbl-colwidths=\[35,65\]}

Expand Down Expand Up @@ -420,6 +422,162 @@ CMD ["python3", "/app/web_browser/web_server.py"]

Note that all of the Python files in the [\_resources](https://github.com/UKGovernmentBEIS/inspect_ai/blob/main/src/inspect_ai/tool/_tools/_web_browser/_resources/) directory alongside the `Dockerfile` need to be available for copying when building the container.

## Computer (Beta) {#sec-computer}

::: {.callout-note appearance="simple"}
The beta version of the computer tool described below is currently available only in the development version of Inspect. To install the development version:

``` bash
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
```
:::

The `computer()` tool provides models with a computer desktop environment along with the ability to view the screen and perform mouse and keyboard gestures. The computer tool is based on the Anthropic [Computer Use Beta](https://docs.anthropic.com/en/docs/build-with-claude/computer-use) reference implementation and works with any model that supports image input.

The current release of the computer tool is a beta version (exported from the `inspect_ai.tool.beta` module). We expect to finalise the interface and move it into the main `inspect_ai.tool` module over the next several weeks.

### Configuration

The `computer()` tool runs within a Docker container. To use it with a task you need to reference the `inspect-computer-tool-beta` image in your Docker compose file. For example:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: inspect-computer-tool-beta
```

You can configure the container to not have Internet access as follows:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: inspect-computer-tool-beta
    network_mode: none
```

Note that if you'd like to be able to view the model's interactions with the computer desktop in realtime, you will need to also do some port mapping to enable a VNC connection with the container. See the [VNC Client](#vnc-client) section below for details on how to do this.

The `inspect-computer-tool-beta` image is based on the [ubuntu:22.04](https://hub.docker.com/layers/library/ubuntu/22.04/images/sha256-965fbcae990b0467ed5657caceaec165018ef44a4d2d46c7cdea80a9dff0d1ea?context=explore) image and includes the following additional applications pre-installed:

- Firefox
- VS Code
- Xpdf
- Xpaint
- galculator

We'll be refining this list as well as publishing more information on creating custom containers for use with the computer tool soon.

### Task Setup

A task configured to use the computer tool might look like this:

``` python
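from inspect_ai import Task, task
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool.beta import computer

@task
def computer_task():
    return Task(
        # read_dataset() is a placeholder for your own dataset loader
        dataset=read_dataset(),
        solver=[
            # make the computer tool available, then let the model respond
            use_tools([computer()]),
            generate(),
        ],
        scorer=match(),
        # run each sample in the container defined by compose.yaml
        sandbox=("docker", "compose.yaml"),
    )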
```

Two of the Inspect examples demonstrate basic computer use:

- [computer](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/examples/computer/computer.py) — Three simple computing tasks as a minimal demonstration of computer use.

``` bash
inspect eval examples/computer
```

- [intervention](https://github.com/UKGovernmentBEIS/inspect_ai/tree/main/examples/intervention/intervention.py) — Computer task driven interactively by a human operator.

``` bash
inspect eval examples/intervention -T mode=computer --display conversation
```

### VNC Client {#vnc-client}

You can use a [VNC](https://en.wikipedia.org/wiki/VNC) connection to the container to watch computer use in real-time. This requires some additional port-mapping in the Docker compose file. You can define dynamic port ranges for VNC (5900) and a browser based noVNC client (6080) with the following `ports` entries:

``` {.yaml filename="compose.yaml"}
services:
  default:
    image: inspect-computer-tool-beta
    ports:
      - "5900"
      - "6080"
```
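
When ports are published without a fixed host port, Docker assigns the host side at container start, and the assignments can also be read from `docker inspect`. The sketch below is illustrative only: `vnc_urls` is a hypothetical helper, its input is assumed to have the shape of the `NetworkSettings.Ports` object printed by `docker inspect`, and the URL formats (including the `view_only` and `autoconnect` query parameters) are copied from the comments in this PR's example `compose.yaml`.

```python
import json


def vnc_urls(ports: dict) -> dict:
    """Build VNC and noVNC URLs from a Docker port-binding mapping."""

    def host_port(container_port: str) -> str:
        # Each container port maps to a list of bindings; take the first one.
        return ports[container_port][0]["HostPort"]

    return {
        "VNC": f"vnc://localhost:{host_port('5900/tcp')}",
        "noVNC": (
            f"http://localhost:{host_port('6080/tcp')}"
            "?view_only=true&autoconnect=true"
        ),
    }


# Sample data in the shape of `NetworkSettings.Ports` (host ports are made up).
sample = json.loads(
    """
    {
      "5900/tcp": [{"HostIp": "0.0.0.0", "HostPort": "61029"}],
      "6080/tcp": [{"HostIp": "0.0.0.0", "HostPort": "61030"}]
    }
    """
)

for name, url in vnc_urls(sample).items():
    print(f"{name}: {url}")
```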

To connect to the container for a given sample, locate the sample in the **Running Samples** UI and expand the sample info panel at the top:

![](images/vnc-port-info.png){width=958 .lightbox}

Click on the link for the noVNC browser client, or use a native VNC client to connect to the VNC port. Note that the VNC server will take a few seconds to start up so you should give it some time and attempt to reconnect as required if the first connection fails.
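
If you script the connection, the retry advice above can be sketched as a small polling loop. This is an illustrative helper rather than part of Inspect: `vnc_ready` and `wait_for_vnc` are hypothetical names, and readiness is judged by whether the server sends the `RFB ` protocol-version banner that VNC servers emit on connect. Checking only that the TCP port is open may not be enough, since a port-forwarding proxy can accept connections before the server inside the container is up.

```python
import socket
import time


def vnc_ready(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the server at host:port greets us with an RFB banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(12)  # e.g. b"RFB 003.008\n"
    except OSError:
        # Connection refused, reset, or timed out: not ready yet.
        return False
    return banner.startswith(b"RFB ")


def wait_for_vnc(host: str, port: int, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll until the VNC server answers or the attempts run out."""
    for _ in range(attempts):
        if vnc_ready(host, port):
            return True
        time.sleep(delay)
    return False
```

For example, `wait_for_vnc("localhost", 61029)` would poll the host port shown in the sample info panel (the port number here is made up).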

The browser based client provides a view-only interface. If you use a native VNC client you should also set it to "view only" so as to not interfere with the model's use of the computer. For example, for Real VNC Viewer:

![](images/vnc-view-only.png){width="549"}

### Approval

If the container you are using is connected to the Internet, you may want to configure human approval for a subset of computer tool actions. Here are the possible actions (specified using the `action` parameter to the `computer` tool):

- `key`: Press a key or key-combination on the keyboard.
- `type`: Type a string of text on the keyboard.
- `cursor_position`: Get the current (x, y) pixel coordinate of the cursor on the screen.
- `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.
    - Example: `execute(action="mouse_move", coordinate=(100, 200))`
- `left_click`: Click the left mouse button.
- `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.
- `right_click`: Click the right mouse button.
- `middle_click`: Click the middle mouse button.
- `double_click`: Double-click the left mouse button.
- `screenshot`: Take a screenshot.
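
The action list above can be captured as a small type plus an argument check. This is a sketch under stated assumptions: the `coordinate` parameter appears in the example above, but the `text` parameter name and the grouping of which actions need which argument are inferred from the descriptions, and `check_call` is a hypothetical helper, not the tool's real validation code.

```python
from typing import Literal, Optional, Sequence, get_args

Action = Literal[
    "key",
    "type",
    "cursor_position",
    "mouse_move",
    "left_click",
    "left_click_drag",
    "right_click",
    "middle_click",
    "double_click",
    "screenshot",
]

# Assumption: these groupings are inferred from the action descriptions.
TEXT_ACTIONS = {"key", "type"}
COORDINATE_ACTIONS = {"mouse_move", "left_click_drag"}


def check_call(
    action: str,
    text: Optional[str] = None,
    coordinate: Optional[Sequence[int]] = None,
) -> None:
    """Raise ValueError if the arguments do not fit the requested action."""
    if action not in get_args(Action):
        raise ValueError(f"unknown action: {action!r}")
    if action in TEXT_ACTIONS and text is None:
        raise ValueError(f"{action!r} requires text")
    if action in COORDINATE_ACTIONS and (coordinate is None or len(coordinate) != 2):
        raise ValueError(f"{action!r} requires an (x, y) coordinate")


check_call("mouse_move", coordinate=(100, 200))  # passes silently
```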


Here is an approval policy that requires approval for key combos (e.g. `Enter` or a shortcut) and mouse clicks:

```{.yaml filename="approval.yaml"}
approvers:
  - name: human
    tools:
      - computer(action='key'
      - computer(action='left_click'
      - computer(action='middle_click'
      - computer(action='double_click'

  - name: auto
    tools: "*"
```

Note that since this is a prefix match and there could be other arguments, we don't end the tool match pattern with a closing parenthesis.

You can apply this policy using the `--approval` command line option:

```bash
inspect eval computer.py --approval approval.yaml
```

### Tool Binding

The computer tool's schema is based on the standard Anthropic [computer tool-type](https://docs.anthropic.com/en/docs/build-with-claude/computer-use#computer-tool). When using Claude 3.5 the computer tool will automatically bind to the native Claude computer tool definition. This presumably provides improved performance due to fine tuning on the use of the tool but we have not verified this.

If you want to experiment with bypassing the native Claude computer tool type and just register the computer tool as a normal function-based tool, specify the `--no-internal-tools` generation option as follows:

```bash
inspect eval computer.py --no-internal-tools
```


## Web Search {#sec-web-search}

The `web_search()` tool provides models the ability to enhance their context window by performing a search. By default, a web search retrieves 10 results from a provider, uses a model to determine whether the contents are relevant, and then returns the top 3 relevant search results to the main model. Here is the definition of the `web_search()` function:
Expand Down Expand Up @@ -465,3 +623,4 @@ The `web_search()` tool uses [Google Programmable Search Engine](https://program
- `GOOGLE_CSE_ID` — Google Custom Search Engine ID

- `GOOGLE_CSE_API_KEY` — Google API key used to enable the Search API

24 changes: 24 additions & 0 deletions examples/computer/compose.yaml
@@ -0,0 +1,24 @@
services:
  default:
    # Temporary internal image until the official one is available
    image: inspect-computer-tool-beta
    init: true

    # If you only launch a single container, you can vnc into that container by using
    # the following port mapping
    # ports:
    #   - "5900:5900"
    #   - "6080:6080"

    # If you launch multiple containers, you can vnc into each container by using the
    # following port mapping which will dynamically bind to host ports. The specific
    # bindings can be found by using `docker inspect <container_id_or_name>`. This
    # info is included in Running Samples tab. The output will look something like:
    #
    #   service  container port  host port  url
    #   VNC      5900            61029      vnc://localhost:61029
    #   noVNC    6080            61030      http://localhost:61030?view_only=true&autoconnect=true

    ports:
      - "5900"
      - "6080"
50 changes: 50 additions & 0 deletions examples/computer/computer.py
@@ -0,0 +1,50 @@
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import system_message
from inspect_ai.solver._basic_agent import basic_agent
from inspect_ai.tool.beta import computer

SYSTEM_MESSAGE = """
Before each step, please clearly explain your intent for performing a tool action: "I expect tool command X to ...".

After each step, carefully evaluate the resulting screenshot to see if the command achieved the right outcome.

Explicitly show your thinking: "I have evaluated step X..." If not correct, try again. Only when
you confirm a step was executed correctly should you move on to the next one.

Note that launching applications from the bottom task bar requires a single left click.
"""


@task
def computer_task():
    return Task(
        dataset=[
            Sample(
                input="Use the computer tool to find the contents of the file '/tmp/flag.txt'. When you've found it, report its contents.",
                target="Frunobulax",
                files={"/tmp/flag.txt": "Frunobulax"},
            ),
            Sample(
                input="Use the computer tool and open the spreadsheet '/tmp/moonWeight.ods'. Enter your weight as 156 lbs. Report the result from cell B2. Important: If the 'Tip of the Day' dialog appears, you'll need to close it before proceeding. Important: You may need to install a spreadsheet using a command like 'sudo apt-get install -y libreoffice'.",
                target="26",
                files={"/tmp/moonWeight.ods": "moonWeight.ods"},
            ),
            Sample(
                input="Use the computer tool to launch a terminal. Type 'Trudging across the tundra. Mile after Mile.' into the terminal. Important: Make sure that the terminal window is active before typing. When you are done, please use the submit tool to record the result of hitting enter in the terminal after entering that text.",
                target="bash: Trudging: command not found",
            ),
            Sample(
                input="Use the computer tool to launch a calculator. Calculate 123 x 456. Report the result.",
                target="56088",
            ),
        ],
        solver=basic_agent(
            init=system_message(SYSTEM_MESSAGE),
            tools=[computer()],
            max_messages=100,
        ),
        scorer=includes(),
        sandbox="docker",
    )
Binary file added examples/computer/moonWeight.ods
Binary file not shown.