CSON grammar

This is a formal grammar for the language parsed by pycson. It uses the standard PEG syntax with an extension to support indent sensitivity: for a PEG expression E, the expression E{I=e} will change the meaning of the identifier I to the expression e while matching E.

In CSON, whitespace may contain only spaces and tabs. This is stricter than CoffeeScript, where any [^\n\S] will match. The symbol nl matches one or more newlines with only whitespace or comments in between. A match for nl also includes any whitespace preceding the first newline. ews is the "extended whitespace", which includes newlines. Note, however, that an ews ending in a comment must be terminated by a newline character.

ws <- [ \t]*
nl <- (ws ('#' [^\n]*)? '\r'? '\n')+
ews <- nl? ws
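
As a rough illustration, the three whitespace rules can be transliterated into Python regular expressions. This is only a sketch: the names `WS`, `NL`, `EWS` and the helper `match_len` are made up for this example and are not part of pycson, and regular expressions backtrack where a PEG does not, although that makes no difference for these particular rules.

```python
import re

# Transliteration of the ws / nl / ews rules (illustrative names, not pycson API).
WS = r"[ \t]*"
NL = rf"(?:{WS}(?:#[^\n]*)?\r?\n)+"
EWS = rf"(?:{NL})?{WS}"


def match_len(pattern: str, text: str) -> int:
    """Return how many characters the pattern consumes at the start of text, or -1."""
    m = re.compile(pattern).match(text)
    return m.end() if m else -1
```

With these definitions, `match_len(EWS, "  # c")` consumes only the two leading spaces: the comment is not followed by a newline, so nl does not match and only the trailing ws applies.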

Atomic values of type null and bool.

null <- 'null'
bool <- 'false' / 'true'

Numbers can be decimal, binary, hexadecimal or octal. Decimal numbers must not have any leading zeros. The octal prefix is 0o, and therefore numbers like 0755 are not allowed (use 0o755 instead). Hex digits are case-insensitive, but the 0x prefix (and 0o and 0b) must be lowercase. There is no way to make a non-decimal number negative.

number <- '0b'[01]+ / '0o'[0-7]+ / '0x'[0-9a-fA-F]+
    / '-'?([1-9][0-9]* / '0')?'.'[0-9]+('e'[+-]?[0-9]+)?
    / '-'?([1-9][0-9]* / '0')('.'[0-9]+)?('e'[+-]?[0-9]+)?
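
The number rule can be sketched as a Python regular expression plus a small conversion function. The names `NUMBER` and `parse_number` are hypothetical, and the choice to return a float whenever a fraction or exponent is present is an assumption of this sketch, not something the grammar specifies.

```python
import re

# One alternative per branch of the number rule (illustrative, not pycson's code).
NUMBER = re.compile(
    r"0b[01]+|0o[0-7]+|0x[0-9a-fA-F]+"
    r"|-?(?:[1-9][0-9]*|0)?\.[0-9]+(?:e[+-]?[0-9]+)?"
    r"|-?(?:[1-9][0-9]*|0)(?:\.[0-9]+)?(?:e[+-]?[0-9]+)?"
)


def parse_number(text: str):
    """Convert a complete number token to int or float, or raise ValueError."""
    if NUMBER.fullmatch(text) is None:
        raise ValueError(f"not a valid CSON number: {text!r}")
    if text[:2] in ("0b", "0o", "0x"):
        # Base 0 lets int() pick the base from the lowercase prefix.
        return int(text, 0)
    if "." in text or "e" in text:
        # Assumption: fractions and exponents produce floats.
        return float(text)
    return int(text)
```

For example, `parse_number("0o755")` returns 493, while `parse_number("0755")` and `parse_number("-0x10")` raise `ValueError`.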

Strings are delimited by one of ', ", ''' or """. There is no difference between apostrophes and double quotes, since string interpolation is treated literally. This means that "#{var}" is the same as "\#{var}". Even more importantly, "#{"test"}" is not a valid CSON string.

All escapes are treated literally, except for the escapes r, n, t, f, and b, which have their conventional meanings (carriage return, newline, tab, form feed, and backspace), and for an escaped newline character. Escaping a newline character is equivalent to removing the newline and any following whitespace.

Strings delimited by a single quote character (' or "), called single-quoted strings here, treat newlines and any following whitespace as a single space. Lines containing only whitespace are ignored. Leading and trailing whitespace is ignored.

For block strings (triple-quoted strings) that contain a newline, the first line is stripped if it only contains whitespace. Similarly for the last line. Escaped newline is treated the same way as for single-quoted strings (removes the newline and any following whitespace). Once assembled, a maximal prefix of whitespace characters that occurs at the beginning of each line is found and stripped from all lines.
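
The stripping steps for block strings can be sketched in Python as follows. The function name `dedent_block` is hypothetical, escaped newlines are not handled, and the sketch takes the description literally by counting every remaining line (including empty ones) when computing the common prefix. That last point is an assumption that may differ from what pycson actually does.

```python
import os


def dedent_block(body: str) -> str:
    """Apply the block-string stripping rules to the text between the triple quotes."""
    lines = body.split("\n")
    if len(lines) > 1:
        # Drop the first and last lines if they contain only whitespace.
        if not lines[0].strip(" \t"):
            lines = lines[1:]
        if lines and not lines[-1].strip(" \t"):
            lines = lines[:-1]
    # Leading whitespace of each line, then the longest prefix common to all of them.
    indents = [line[: len(line) - len(line.lstrip(" \t"))] for line in lines]
    prefix = os.path.commonprefix(indents) if indents else ""
    return "\n".join(line[len(prefix):] for line in lines)
```

Note that the prefix is compared character by character, so a line indented with a tab and a line indented with spaces share no common prefix and nothing is stripped.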

string <-
    "'" !"''" string_tail{X="'"} /
    "'''" string_tail{X="'''"} /
    '"' !'""' string_tail{X='"'} /
    '"""' string_tail{X='"""'}
string_tail <- (!X ('\\'. / .))* X
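
To illustrate how string_tail is parameterized by the delimiter X, here is a minimal Python sketch that finds the end of a string literal. The function `match_string` is a hypothetical helper written for this example. It only locates the closing delimiter and does not decode escapes or fold newlines.

```python
def match_string(text: str, pos: int = 0) -> int:
    """Return the index just past the string literal starting at pos, or -1."""
    # Try the triple delimiters first, so a lone quote is only chosen when it
    # is not the start of a triple quote (the !"''" and !'""' lookaheads).
    for delim in ("'''", '"""', "'", '"'):
        if text.startswith(delim, pos):
            break
    else:
        return -1
    i = pos + len(delim)
    while i < len(text):
        if text.startswith(delim, i):
            return i + len(delim)
        # A backslash consumes the following character, so an escaped
        # delimiter does not terminate the string.
        i += 2 if text[i] == "\\" else 1
    return -1
```

Because interpolation is not interpreted, `match_string('"#{var}"')` simply scans to the closing double quote and returns 8.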

Identifiers may be used instead of strings as keys in objects.

id <- [$a-zA-Z_][$0-9a-zA-Z_]*

Arrays are delimited by brackets. Whitespace is insignificant and the current indent level is reset.

array <-
    '[' (array_value (ews ',' array_value / nl (object / ews simple_value))* (ews ',')?)?{I=} ews ']'
array_value <- nl object / ws line_object / ews simple_value

This rule matches a brace-delimited object. Whitespace is handled the same way as for arrays, and the indent is reset.

flow_kv <- (id / string) ews ':'
    (nl object / ws line_object / ews simple_value)
flow_obj_sep <- ews ',' ews / nl ws
flow_object <- '{' ews (flow_kv  (flow_obj_sep flow_kv)* ews (',' ews)?)?{I=} '}'

A simple value is one which is not sensitive to the position within the document or to the current indent level.

simple_value <- null / bool / number / string / array / flow_object

A line object is an unbraced object that doesn't start on its own line. For example, in [a:1, b:2], the array contains one line_object. Note that a line object will never span multiple lines. Line objects have no indent, but they propagate the current indent level to their child objects.

line_kv <- (id / string) ws ':' ws
    (nl I indented_object / line_object / simple_value / nl I [ \t] ws simple_value)
line_object <- line_kv (ws ',' ws line_kv)*?

This is the unbraced object that starts on its own line. It detects its indent level and requires that all lines have this indent. The previous indent level must be a string prefix of the newly detected one.

object <- ' ' object{I=I ' '} / '\t' object{I=I '\t'} / line_object (ws ','? nl I line_object)*
indented_object <- ' ' object{I=I ' '} / '\t' object{I=I '\t'}
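
The prefix requirement on indents can be illustrated with a short Python sketch. The helper names `leading_ws` and `check_nested_indent` are invented for this example. It covers only the nested case (indented_object), where the new indent must extend the enclosing indent I by at least one space or tab.

```python
def leading_ws(line: str) -> str:
    """Return the run of spaces and tabs at the start of line."""
    return line[: len(line) - len(line.lstrip(" \t"))]


def check_nested_indent(parent_indent: str, child_line: str) -> str:
    """Return the child's indent, or raise if it does not strictly extend the parent's."""
    indent = leading_ws(child_line)
    if indent == parent_indent or not indent.startswith(parent_indent):
        raise ValueError(
            f"indent {indent!r} does not extend parent indent {parent_indent!r}"
        )
    return indent
```

The comparison is on the exact characters, so a child indented with two spaces under a parent indented with one tab is rejected even though it might look deeper in an editor.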

A CSON document consists of a single value (either an unbraced object or a simple_value). The value can be preceded and followed by whitespace. Note that a comment on the last line must be terminated by a newline.

root <- nl? (object{I=} ws ','? / ws simple_value) ews !.