Lexical

Jiří Fatka edited this page Apr 29, 2018 · 13 revisions

The first step in processing a Shard source file is lexical analysis, which splits the source code into a sequence of tokens.

Source code

Shard supports only 7-bit ASCII or UTF-8 encoded source files without a BOM (byte order mark). Other encodings are not supported.
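As an illustration of the encoding rule, the following Python sketch decodes raw file bytes strictly as UTF-8 and rejects a leading BOM. The function name `decode_source` and the choice to raise `ValueError` are assumptions made for this example, not part of the Shard implementation.

```python
import codecs

def decode_source(data: bytes) -> str:
    """Decode raw source bytes, accepting only BOM-less ASCII or UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        # Assumption: a BOM is reported as an error rather than skipped.
        raise ValueError("UTF-8 byte order mark is not supported")
    # 7-bit ASCII is a subset of UTF-8, so one strict decode covers both.
    # Invalid input raises UnicodeDecodeError, a subclass of ValueError.
    return data.decode("utf-8")
```

Because decoding is strict, input in any other encoding fails early instead of producing garbled tokens later.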

End of File

The lexer stops processing source code when one of the following conditions is met:

  1. Physical end of source file.
  2. U+0000 character is found.
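The two conditions can be combined into a single check. This is a minimal sketch that assumes the source is held as a decoded string and the lexer tracks an integer position; the name `at_end` is hypothetical.

```python
def at_end(source: str, pos: int) -> bool:
    """Return True when the lexer should stop at position pos."""
    # Condition 1: physical end of the source.
    # Condition 2: a U+0000 character is found.
    return pos >= len(source) or source[pos] == "\u0000"
```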

End of Line

An end of line is a sequence of one or two characters, which supports the EOL marks used on different platforms (Linux, macOS and Windows).

  1. U+000A (Linux & macOS)
  2. U+000D U+000A (Windows)
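A possible way to recognize both forms is to test for the two-character mark first. The sketch below uses a hypothetical helper `match_eol`; since only the two sequences above are listed, it does not treat a lone U+000D as an end of line.

```python
def match_eol(source: str, pos: int) -> int:
    """Return the length of the EOL mark at pos, or 0 if there is none."""
    if source.startswith("\r\n", pos):  # U+000D U+000A
        return 2
    if source.startswith("\n", pos):    # U+000A
        return 1
    return 0
```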

Whitespace

Whitespace is a sequence of characters that separates other tokens; the separation is optional. Whitespace is probably not important for higher layers, so no token is generated for these characters.

Any sequence of the following characters is whitespace:

  1. U+0020 (space)
  2. U+0009 (horizontal tab)
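Since no token is produced, a lexer can simply advance past whitespace. A minimal sketch, with `skip_whitespace` as a hypothetical helper name:

```python
def skip_whitespace(source: str, pos: int) -> int:
    """Return the position of the first non-whitespace character at or after pos."""
    # Only U+0020 (space) and U+0009 (horizontal tab) count as whitespace.
    while pos < len(source) and source[pos] in " \t":
        pos += 1
    return pos
```

End of line characters are not skipped here because they are described separately from whitespace.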

Comment

There are two types of comments: line and block.

Line comment

A line comment starts with the // sequence; anything following this sequence is ignored until an End of Line is found.
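The rule can be sketched as a scan that stops at the end of line or at the end of the input. The helper name `skip_line_comment` is an assumption, as is the decision to leave the end of line itself unconsumed.

```python
def skip_line_comment(source: str, pos: int) -> int:
    """Return the position just past the line comment starting at pos.

    If no line comment starts at pos, pos is returned unchanged.
    """
    if not source.startswith("//", pos):
        return pos
    pos += 2
    # Stop in front of U+000A or U+000D U+000A, or at the end of the input.
    while (pos < len(source)
           and source[pos] != "\n"
           and not source.startswith("\r\n", pos)):
        pos += 1
    return pos
```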

Block comment

A block comment starts with /* and ends with */. Anything in between is taken as a comment. Any /* inside the block is ignored, so block comments do not nest and the comment ends at the first */.

/* block /* block */ end */
                    ^
                    not a comment
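Because nesting is not supported, finding the end of a block comment amounts to searching for the first */ after the opener. The following sketch uses a hypothetical `skip_block_comment` helper; raising an error for an unterminated comment is an assumption, since that case is not described here.

```python
def skip_block_comment(source: str, pos: int) -> int:
    """Return the position just past the block comment starting at pos.

    If no block comment starts at pos, pos is returned unchanged.
    """
    if not source.startswith("/*", pos):
        return pos
    # Search after the opener so that "/*/" is not read as a full comment.
    end = source.find("*/", pos + 2)
    if end == -1:
        # Assumption: an unterminated comment is an error.
        raise ValueError("unterminated block comment")
    return end + 2
```

Applied to the example above, the function returns 20, so the remaining text ` end */` is not part of the comment.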

Identifier

Any sequence of characters that matches the following rules is an identifier:

  1. Starts with an alphabetic character (a - z, A - Z) or _.
  2. Contains alphanumeric characters (a - z, A - Z, 0 - 9) or _.
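The two rules translate into a start set and a continuation set. The sketch below uses explicit ASCII sets rather than `str.isalpha`, which would also accept non-ASCII letters; the name `scan_identifier` is hypothetical.

```python
import string

IDENT_START = set(string.ascii_letters + "_")
IDENT_PART = IDENT_START | set(string.digits)

def scan_identifier(source: str, pos: int):
    """Return the identifier starting at pos, or None if there is none."""
    if pos >= len(source) or source[pos] not in IDENT_START:
        return None
    end = pos + 1
    # Consume letters, digits and underscores for as long as they continue.
    while end < len(source) and source[end] in IDENT_PART:
        end += 1
    return source[pos:end]
```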

Literal

Literals are special tokens that represent an immutable value.

Number literal

A sequence of characters that can represent a number.

  1. Starts with a numeric character (0 - 9).
  2. Contains alphanumeric characters (a - z, A - Z, 0 - 9).
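Under these two rules the lexer only delimits the token; interpreting prefixes or suffixes would be left to a later stage. A minimal sketch with a hypothetical `scan_number` helper:

```python
import string

NUMBER_PART = set(string.ascii_letters + string.digits)

def scan_number(source: str, pos: int):
    """Return the number literal text starting at pos, or None if there is none."""
    if pos >= len(source) or source[pos] not in string.digits:
        return None
    end = pos + 1
    # Letters are accepted after the first digit, e.g. for forms like 0x1F.
    while end < len(source) and source[end] in NUMBER_PART:
        end += 1
    return source[pos:end]
```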

Character literal

Represented by any Unicode character surrounded by single quote ' (U+0027) characters.

String literal

A string literal is a sequence of characters surrounded by double quote " (U+0022) characters.
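To find the closing quote, a scanner has to step over backslash-escaped characters so that an escaped double quote does not end the literal. The sketch below returns the raw, still-escaped body; the name `scan_string` and the error raised for an unterminated literal are assumptions, and no special handling of line breaks inside the literal is attempted.

```python
def scan_string(source: str, pos: int):
    """Return (raw_body, end_pos) for the string literal at pos, or None."""
    if not source.startswith('"', pos):
        return None
    i = pos + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
        elif ch == '"':
            return source[pos + 1:i], i + 1
        else:
            i += 1
    # Assumption: a literal without a closing quote is an error.
    raise ValueError("unterminated string literal")
```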

Escape sequence

Within a character or string literal, an escape sequence can be used. This is handy when the required character cannot be specified directly, such as a single or double quote.

  1. Special characters prefixed by the backslash \ (U+005C) character: \0 (null character), \\ (backslash), \t (horizontal tab), \n (line feed), \r (carriage return), \" (double quote) and \' (single quote).
  2. Unicode code point value as \un, where n is the hexadecimal number of the code point. The number must be a value in the range supported by UTF-8 encoding (U+0000 - U+10FFFF).
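Both kinds of escape sequence can be decoded with a lookup table plus a small hexadecimal parser. The number of hexadecimal digits after \u is not specified here, so this sketch assumes one to six digits read greedily; that choice, the `decode_escape` name and the error handling are all assumptions.

```python
import string

SIMPLE_ESCAPES = {
    "0": "\0", "\\": "\\", "t": "\t", "n": "\n",
    "r": "\r", '"': '"', "'": "'",
}
HEX_DIGITS = set(string.hexdigits)

def decode_escape(text: str, pos: int):
    """Decode the escape sequence whose backslash is at pos.

    Returns (decoded_character, position_after_the_sequence).
    """
    if not text.startswith("\\", pos) or pos + 1 >= len(text):
        raise ValueError("expected an escape sequence")
    kind = text[pos + 1]
    if kind in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[kind], pos + 2
    if kind == "u":
        end = pos + 2
        # Assumption: read at most six hexadecimal digits, greedily.
        while end < len(text) and end < pos + 8 and text[end] in HEX_DIGITS:
            end += 1
        if end == pos + 2:
            raise ValueError("\\u needs at least one hexadecimal digit")
        value = int(text[pos + 2:end], 16)
        if value > 0x10FFFF:
            raise ValueError("code point out of range")
        return chr(value), end
    raise ValueError(f"unknown escape sequence \\{kind}")
```

For example, `decode_escape(r"\u41", 0)` yields the character `A` and the position 4.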

Others

Other tokens have no special meaning from the tokenizer's point of view but might have one for the tokenizer's user. The resulting token is one printable character long.
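As a fallback, such a token can be produced by taking a single character once no other rule applies. This sketch assumes that "printable" matches Python's `str.isprintable` and that the function is only tried after whitespace, comments, identifiers and literals have been ruled out; `scan_other` is a hypothetical name.

```python
def scan_other(source: str, pos: int):
    """Return a one-character token at pos, or None if it is not printable."""
    if pos >= len(source):
        return None
    ch = source[pos]
    # Assumption: printability is decided by str.isprintable.
    return ch if ch.isprintable() else None
```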