Skip to content

Exported tables and their columns

Adam Hooper edited this page Aug 2, 2021 · 6 revisions

Workbench deals in tables. When you download a table from Workbench, it can be in one of these formats:

Refer to Column Types for Workbench's internal specifications.

Parquet

Parquet is how Workbench stores tables. It stores data by-column, not by-row: this makes math remarkably efficient.

The tables Workbench exports are identical to the tables Workbench uses itself.

Here are the only logical types you should expect -- and the only metadata Workbench defines for them:

  • Signed Integers (up to INT64); FLOAT; DOUBLE: Python treats all these as "number".
    • Don't expect Workbench to always return "number" in the same format: it often changes a "number" column's storage format (as Excel, JavaScript or Python do).
    • The only metadata is format, a UTF-8 string like ${,.2f}, compatible with both Python PEP3101 and d3-format.
    • Workbench does not produce the special floating-point values NaN, Infinity or -Infinity.
  • DATE
    • The only metadata is unit, with ASCII value day, week, month, quarter, year. The metadata determines which values are valid: for instance, week means all values are Mondays; quarter means they're all Jan. 1, Apr. 1, July 1 or Oct. 1.
  • TIMESTAMP(isAdjustedToUTC=true, unit=NANOS), always UTC nanoseconds, with no metadata.
  • STRING, with no metadata.
    • Each column may be dictionary-encoded. Don't expect Workbench to always or never dictionary-encode a column: its decision-making process is complex and subject to change.

Beware: all columns allow nulls. Each column can be stored as two arrays: an array of values, and an array of "is-null" markers. This can be awkward in some tools, such as Numpy.

JSON

JSON is a convenient means of sharing structured data. Workbench stores a table as an Array of row Objects, keyed by column name.

Here are the only logical types you should expect:

  • Numbers: int64 or double-precision.
    • Workbench outputs exact 64-bit integers in JSON; but JavaScript's JSON.parse() will round any number above 9,007,199,254,740,991. You can work around this with json-bigint.
    • Unlike Parquet, there is no format metadata.
  • Dates: YYYY-MM-DD Strings.
    • Unlike Parquet, there is no unit metadata.
  • Timestamps: YYYY-MM-DDTHH:MM[:SS[.sssssssss]]Z (RFC3339) Strings.
    • The timezone is always Z (UTC).
  • Strings.

Any value may be null.

CSV

Comma-Separated Values (CSV, RFC4180) is a ubiquitous -- but lossy -- means of transmitting rows of text values.

CSV only transmits text; here's how Workbench maps its values to text:

  • Numbers: int64 or double-precision.
    • Large floating-point numbers can use decimal exponents: for instance, 2.3e52. This notation is compatible with JavaScript's Number() or C's atof().
    • Unlike Parquet, there is no format metadata.
  • Dates: YYYY-MM-DD Strings.
    • Unlike Parquet, there is no unit metadata.
  • Timestamps: YYYY-MM-DDTHH:MM[:SS[.sssssssss]]Z (RFC3339) Strings.
    • The timezone is always Z (UTC).
  • Strings.

CSV does not transmit null. Any value may be an empty string. In Number/Date/Timestamp columns you can infer that empty-text = null; but in String columns there is no way to distinguish between empty string and null.