Creating A Module
Workbench comes with many modules for loading data, cleaning it, visualizing it, etc. But it's also a "package manager" for all those little pieces of code that are necessary to do data work. You can create your own modules with Python, and they can optionally include JavaScript to produce embeddable visualizations or custom UI elements.
- Clone the Hello Workbench module
- Clone the main workbench repo into a sibling directory and set up a Workbench development environment
- Fire up Workbench with `CACHE_MODULES=false bin/dev start`
- Watch the module directory with `bin/dev develop-module hello-workbench`. This will re-import the module whenever you make any changes.
- Browse to `127.0.0.1:8000` to use Workbench and try out your module
Workbench loads custom modules from GitHub on production, or from a local copy of the repo when developing. There must be at least two files in your repo: a JSON configuration file which defines module metadata and parameters, and a Python script which does the actual data processing. You can also add a JavaScript file which produces output in the right pane, as Workbench's built-in charts do.
We recommend you also write tests for your new module.

Once you've checked your module into GitHub, you can add it with the Import Module from GitHub command in Workbench.
Here are some examples of existing Workbench modules:
The JSON file is required for every module and defines metadata including the module name, an internal unique identifier, and most importantly all the parameters that are displayed in the module UI.
For example, a simple search-and-replace module's configuration could look something like this:
```json
{
  "name": "Search and Replace",
  "id_name": "S&R",
  "category": "Clean",
  "description": "Search for text and replace it with something else",
  "help_url": "https://mymodules.com/docs/search-and-replace",
  "parameters": [
    {
      "name": "Search for",
      "id_name": "search",
      "type": "string"
    },
    {
      "name": "Replace with",
      "id_name": "replace",
      "type": "string"
    },
    {
      "name": "Column (or None for all)",
      "id_name": "column",
      "type": "column"
    }
  ]
}
```
This module has three parameters: two strings and a column selector.
All modules must define the following keys:

- `name` - The user-visible name of the module
- `id_name` - An internal unique identifier. It must never change, or currently applied modules will break.
- `category` - The category the module appears under in the Module Library
The following keys are optional but recommended:

- `description` - A one-line description used to help users search for modules
- `help_url` - A link to a help page describing how to use the module
- `icon` - Must be one of a set of internal icons; see other module JSON files for options.
Each parameter must define the following keys:

- `name` - The user-visible name
- `id_name` - Internal unique identifier. Must not change, or Workbench will think it's a brand-new parameter. However, different modules can use the same `id_name`.
They can have several optional keys:

- `default` - The initial value of the parameter.
- `placeholder` - The text that appears when the parameter field is empty, or when no column is selected.
- `visible_if` - Hides or shows this parameter based on the value of a menu or checkbox parameter.
The `visible_if` key is a JSON object which itself has the following keys:

- `id_name` - Which parameter controls the visibility of this parameter. It must be a menu or checkbox.
- `values` - A list of menu values separated by `|`, or `true` or `false` for a checkbox
- `invert` - Optional. If set to true, the parameter is visible if the controlling parameter does not have one of the `values`.
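For instance, a parameter shown only when a hypothetical `advanced` checkbox is checked might be declared like this (the `id_name`s here are invented for illustration; check existing module JSON files for the exact `values` convention for checkboxes):

```json
{
  "name": "Regular expression",
  "id_name": "regex",
  "type": "checkbox",
  "visible_if": {
    "id_name": "advanced",
    "values": "true"
  }
}
```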
Some parameter types also support custom flags; see below.
Workbench currently supports the following parameter types:

- `string` - A text value. Can have `multiline` set to true if you want an expandable text field.
- `integer` - An integer value
- `float` - A decimal value
- `column` - Allows the user to select a column. The `placeholder` value appears when no column is selected.
- `multicolumn` - Allows the user to select multiple columns. (Bug: `id_name` must be `colname` for this to work)
- `menu` - A fixed list of menu items, which must be listed in the `menu_items` key separated by the pipe character (`|`). The zero-based index of the selected item is passed to the `render` function, and the `default` key is also zero-based.
- `checkbox` - A simple boolean control.
- `statictext` - Shows the `name` as text and has no parameter value. Useful for explaining to the user what to do.
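For instance, a `menu` parameter might be declared like this (names invented for illustration); `render` would receive `0`, `1`, or `2` as the parameter's value:

```json
{
  "name": "Output format",
  "id_name": "outputformat",
  "type": "menu",
  "menu_items": "CSV|JSON|Excel",
  "default": 0
}
```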
The Python file must contain a function called `render`.

Write a `render` function that accepts two arguments: a pandas `DataFrame` and a dictionary of parameter values. It should return a `DataFrame`. For instance:
```python
def render(table, params):
    s = params['search']
    r = params['replace']
    col = params['column']
    if col is None:
        return table.replace(s, r)
    else:
        table[col] = table[col].replace(s, r)
        return table
```
Tips:

- Your module will be rendered as soon as the user adds it. It's a good idea to return the input unchanged while required parameters are still unset, so the user isn't greeted with an error message.
- You may optionally use a third argument: `def render(table, params, *, fetch_result)`. `fetch_result` will be the value returned by your `fetch()` method. (Read below....)
- You can produce an error message by returning a `str`. You can produce a warning by returning a tuple of `(pd.DataFrame(...), str)`.
- The null table (`pd.DataFrame()`, or `None`, or just returning a `str`) is special: Workbench won't render it and won't feed it to any other modules in the workflow. This is different from a zero-row table (e.g., an empty "filter" module result), which Workbench treats as a normal table.
- It is safe for your module to modify the input `table`.
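To illustrate these return conventions, here is a sketch (the parameter name and messages are hypothetical):

```python
import pandas as pd


def render(table, params):
    col = params['column']
    if col is None:
        # Required parameter not set yet: return the input unchanged.
        return table
    if col not in table.columns:
        # A bare str is an error message; Workbench renders no table.
        return 'There is no column named "%s"' % col
    if table[col].isnull().any():
        # A (DataFrame, str) tuple shows the table plus a warning.
        return (table, 'Some values in "%s" were empty' % col)
    return table
```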
Don't query remote APIs in your `render` function. Instead, do it in a `fetch` function. The user controls when fetches happen; Workbench can email the user when fetch results change, and it stores old versions so the user can revisit them.
For instance:
```python
import datetime

import pandas as pd


def fetch(params):
    return pd.DataFrame({'time': [datetime.datetime.now()]}, dtype='datetime64[ns]')
```
Tips:

- Fetch happens when the user requests it. The user may set up a timer to fetch periodically.
- Workbench keeps previous fetch results, as long as they fit the user's storage quota.
- You may return `None` to tell Workbench not to store any result.
- Since fetch is usually waiting for input, Workbench lets you make it async: `async def fetch(params)`.
- You may accept some keyword arguments: for instance, `async def fetch(params, *, workflow_id)`. The full listing:
  - `workflow_id`: the workflow ID
  - `get_input_dataframe`: async callback that returns the output of the previous module, or `None` if the previous module isn't rendered or if we're the first module. (Example usage: `input_dataframe = await get_input_dataframe()` in an `async def fetch(params, *, get_input_dataframe)`)
  - `get_stored_dataframe`: async callback that returns the previously-fetched DataFrame, or `None`. (Example usage: `stored_dataframe = await get_stored_dataframe()` in an `async def fetch(params, *, get_stored_dataframe)`)
  - `get_workflow_owner`: async callback that reveals the owner of this module's workflow. (experimental)
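As a sketch of the keyword-argument API (the `fetched_at` column name is invented for illustration), here is a fetch that stamps the previous module's output with a fetch time:

```python
import datetime

import pandas as pd


async def fetch(params, *, get_input_dataframe):
    # Ask Workbench for the previous module's rendered output (may be None).
    table = await get_input_dataframe()
    if table is None:
        # Nothing to work with; returning None tells Workbench to store nothing.
        return None
    table = table.copy()  # don't mutate the input
    table['fetched_at'] = pd.Timestamp(datetime.datetime.now())
    return table
```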
Your module is popular -- very popular. Now you want to address a feature or a bug. The parameters you chose for version 1 won't work any more. How do you deploy version 2, with version-2 parameters? You'll need a way to "migrate" the version-1 parameters that are out in the wild. Enter `migrate_params()`.

The first thing to know about `migrate_params()` is that you may be able to skip it. Workbench's default `migrate_params()` supports adding a parameter and changing a parameter's type. (If a version-1 module instance has missing or incompatible values, they'll be replaced with `default`s.)

If the default conversion won't do, then write `migrate_params(params)`: it takes a `dict` argument and returns a `dict`. Think long-term: you'll probably want something like:
```python
def migrate_params(params):
    if _are_params_v1(params):
        params = _migrate_params_v1_to_v2(params)
    if _are_params_v2(params):
        params = _migrate_params_v2_to_v3(params)
    ...
    return params


# The helper functions might look like this:
def _are_params_v1(params):
    # In this example, we're nixing an old parameter and replacing it with
    # something new.
    return 'param_from_v1_but_not_v2' in params


def _migrate_params_v1_to_v2(params):
    # v1 had a 'param_from_v1_but_not_v2' that was an int.
    #
    # v2 has 'v2_value' instead, which is boolean.
    ret = dict(params)  # copy
    del ret['param_from_v1_but_not_v2']  # delete param that isn't in v2
    ret['v2_value'] = params['param_from_v1_but_not_v2'] != 0  # add v2 param
    return ret
```
This param-migration code must last forever, and it must handle every set of parameters that users may ever have produced. Code your helpers such that you won't need to modify them later; unit-test them so you won't feel scared when you add another migration later. And most of all, add comments in each migration describing the old format and the new format.
Better yet: choose ideal parameters in the first place to avoid needing migrations.
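Unit tests for migrations can be plain assertions; here is a minimal sketch, using the hypothetical v1/v2 parameter names from the example above (with the migration inlined for brevity):

```python
def migrate_params(params):
    # v1 had an int 'param_from_v1_but_not_v2'; v2 has a boolean 'v2_value'.
    if 'param_from_v1_but_not_v2' in params:
        params = dict(params)  # copy before mutating
        params['v2_value'] = params.pop('param_from_v1_but_not_v2') != 0
    return params


def test_migrate_v1_to_v2():
    assert migrate_params({'param_from_v1_but_not_v2': 3}) == {'v2_value': True}
    assert migrate_params({'param_from_v1_but_not_v2': 0}) == {'v2_value': False}


def test_v2_params_pass_through_unchanged():
    assert migrate_params({'v2_value': True}) == {'v2_value': True}
```

Run these with any test runner (e.g. pytest). Because migrations must handle every historical parameter set, each new migration should ship with tests like these.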
`migrate_params()` is run whenever the user views a module. If it raises an exception or returns a `dict` that doesn't match `parameters` in your JSON module description, the `render()` method will never be called; the user will see a Python-esque error message (`"ValueError: ..."`) and the user's parameters form will only contain default values.
Set `"html_output": true` in your module JSON file to create an HTML output pane. Add a `[modulename].html` file to your module's directory, and it will appear in that output pane.
Workbench will display your HTML page in an iframe whenever your module is selected in a workflow. The most common use is to render a chart.
Your HTML page can include inline JavaScript.
Every Python module produces "embed data": JSON destined for the embedded iframe. By default, that data is `null`.

To produce non-`null` embed data, make your Python `render` function return a triplet in this exact order: `(dataframe, error_str, json_dict)`. For instance:
```python
def render(table, params):
    return (table, 'Code not yet finished', {'foo': 'bar'})
```
Workbench will encode `json_dict` as JSON, so it must be a `dict` that is compatible with `json.dumps()`.

On page load: Workbench will inject a `<script>` tag with a global variable at the top of your HTML's `<head>`. You can access it by reading `window.workbench.embeddata`. For instance:
```html
<!DOCTYPE html>
<html>
  <head><!-- You _must_ have a <head> element -->
    <title>Embeddata is set</title>
  </head>
  <body>
    <main></main>
    <script>
      document.querySelector('main').textContent = JSON.stringify(window.workbench.embeddata)
    </script>
  </body>
</html>
```
After page load: Workbench adds a `#revision=N` hash to your iframe's URL. That means the `hashchange` event will fire every time the JSON data is recomputed. You can query the `embeddata` API endpoint to load the new data:
```html
<!DOCTYPE html>
<html>
  <head>
    <title>Let's query embeddata from the server</title>
  </head>
  <body>
    <main></main>
    <script>
      function renderData (data) {
        document.querySelector('main').textContent = JSON.stringify(data)
      }

      function reloadEmbedData () {
        const url = String(window.location).replace(/\/output.*/, '/embeddata')
        fetch(url, { credentials: 'same-origin' })
          .then(function (response) {
            if (!response.ok) {
              throw new Error('Invalid response code: ' + response.status)
            }
            return response.json()
          })
          .then(renderData)
          .catch(console.error)
      }

      // Reload data whenever it may have changed
      window.addEventListener('hashchange', reloadEmbedData)

      // Don't forget to render the data on page load, _before_ the first change
      renderData(window.workbench.embeddata)
      // (alternatively: `reloadEmbedData()`)
    </script>
  </body>
</html>
```
Simply click "Import from GitHub" to add a module from GitHub. Workbench will check that your module is ready to load and let you know if it runs into any trouble. Once you fix the issue and commit the changes to GitHub, you can attempt the import again.

All imported modules are versioned by tying the imported code to GitHub revisions. Currently applied modules are automatically updated to new module code versions (which can involve adding, removing, and resetting parameters).
First, set up a development environment.

Start it, but disable the part that caches compiled modules: `CACHE_MODULES=false bin/dev start`
Next, create a new directory (at the same level as the `cjworkbench` directory, a sibling to it) called `modulename`. Add these files:

- `README.md` -- optional but highly recommended
- `LICENSE` -- optional but highly recommended
- `[modulename].py` -- Python code, including a `def render(table, params)` function
- `[modulename].json` -- JSON file
- `[modulename].html` -- if outputting a custom iframe
In a shell in the `cjworkbench` directory, start a process that watches that directory for changes and auto-imports the module into the running Workbench: `bin/dev develop-module modulename`

Now, edit the module's code. Every time you save, the module will reload in Workbench. To see changes to HTML and JSON, refresh the page. To see changes to Python, refresh the page and trigger a `render()` by changing a parameter.