Lexical
The first step in processing a Shard source file is lexical analysis, which splits the source code into a sequence of tokens.
Shard supports only source files encoded as 7-bit ASCII or UTF-8 without a BOM (byte order mark). Other encodings are not supported.
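As a rough illustration of the encoding rule, a loader could reject a BOM and decode everything else strictly as UTF-8. The helper name `load_source` is hypothetical and is not taken from the Shard implementation; this is only a sketch of the rule as stated.

```python
import codecs


def load_source(data: bytes) -> str:
    """Decode raw source bytes, rejecting a BOM and non-UTF-8 input.

    Hypothetical helper for illustration; not part of Shard itself.
    """
    # A UTF-8 BOM (EF BB BF) at the start of the file is not allowed.
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError("source file must not start with a BOM")
    # 7-bit ASCII is a subset of UTF-8, so one strict decode covers both.
    # Invalid byte sequences raise UnicodeDecodeError, a ValueError subclass.
    return data.decode("utf-8", errors="strict")
```

UTF-16 and UTF-32 byte order marks are not valid UTF-8, so the strict decode rejects them as well.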
The lexer stops processing the source code when one of the following conditions is met:
- The physical end of the source file is reached.
- A U+0000 character is found.
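The two stop conditions can be sketched as a single function that computes where reading ends. The name `effective_length` is made up for this example.

```python
def effective_length(source: str) -> int:
    """Return the index at which the lexer would stop reading.

    Hypothetical helper: the input ends at the first U+0000 character,
    or at the physical end of the text if there is none.
    """
    nul = source.find("\x00")
    return len(source) if nul == -1 else nul
```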
An end of line can be a sequence of one or two characters, in order to support EOL marks from different platforms (Linux, Windows and Mac):
- U+000A (Linux & macOS)
- U+000D U+000A (Windows)
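A minimal sketch of end-of-line recognition under these two rules follows. The function name `match_eol` is hypothetical. Treating a lone U+000D as an ordinary character is an assumption, made only because it does not appear in the list above.

```python
def match_eol(source: str, pos: int) -> int:
    """Return the length of the end-of-line sequence at pos, or 0 if none."""
    # Check the two-character form first so U+000D U+000A counts as one EOL.
    if source.startswith("\r\n", pos):
        return 2
    if source.startswith("\n", pos):
        return 1
    # Assumption: a lone U+000D is not listed, so it is not an end of line.
    return 0
```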
Whitespace is a sequence of characters that can be used to separate other tokens; it is not required. Higher layers are not expected to need it, so no token is generated for these characters.
Any sequence of the following characters is whitespace:
- U+0020 (space)
- U+0009 (horizontal tab)
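Skipping whitespace without emitting a token might look like the sketch below. The name `skip_whitespace` is hypothetical.

```python
def skip_whitespace(source: str, pos: int) -> int:
    """Return the position of the first non-whitespace character at or after pos."""
    # Only U+0020 (space) and U+0009 (horizontal tab) count as whitespace;
    # end-of-line characters are handled separately.
    while pos < len(source) and source[pos] in " \t":
        pos += 1
    return pos
```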
There are two types of comments: line and block.
A line comment starts with the // sequence, and anything following it is ignored until an end of line is found.
A block comment starts with /* and ends with */. Anything in between is taken as a comment. Block comments do not nest: any /* inside a block comment is ignored, and the comment ends at the first */.
/* block /* block */ end */
                     ^
                     not a comment
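The comment rules can be sketched as follows. The function name `skip_comment` is hypothetical. Raising an error on an unterminated block comment is an assumption, since the text does not say what happens in that case.

```python
def skip_comment(source: str, pos: int) -> int:
    """Return the position just past a comment starting at pos, or pos if none."""
    if source.startswith("//", pos):
        # Line comment: runs up to, but does not include, the end of line.
        end = source.find("\n", pos)
        if end == -1:
            return len(source)
        # Leave a preceding U+000D in place so that the two-character
        # end-of-line sequence stays intact.
        if source[end - 1] == "\r":
            end -= 1
        return end
    if source.startswith("/*", pos):
        # Block comment: no nesting, so the first "*/" closes it.
        end = source.find("*/", pos + 2)
        if end == -1:
            # Assumption: an unterminated block comment is an error.
            raise ValueError("unterminated block comment")
        return end + 2
    return pos
```

For the example above, `skip_comment("/* block /* block */ end */", 0)` stops right after the first `*/`, leaving ` end */` to be tokenized as ordinary input.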
Any sequence of characters which matches the following rules is an identifier:
- Starts with an alphabetic character (a-z, A-Z) or _.
- Contains only alphanumeric characters (a-z, A-Z, 0-9) or _.
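The two identifier rules translate directly into a regular expression. The names `IDENTIFIER` and `match_identifier` are chosen for this sketch only.

```python
import re

# First character: a letter or underscore; the rest: letters, digits or underscore.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def match_identifier(source: str, pos: int):
    """Return the identifier starting at pos, or None if there is none."""
    m = IDENTIFIER.match(source, pos)
    return m.group() if m else None
```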
Literals are special tokens which represent an immutable value.
A number literal is a sequence of characters which can represent a number:
- Starts with a numeric character (0-9).
- Contains only alphanumeric characters (a-z, A-Z, 0-9).
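As a sketch, the number rules can also be written as a regular expression. The names `NUMBER` and `match_number` are hypothetical. This only delimits the token; interpreting the characters as a value is left to a later stage.

```python
import re

# First character: a digit; the rest: letters or digits (no underscore, no dot).
NUMBER = re.compile(r"[0-9][A-Za-z0-9]*")


def match_number(source: str, pos: int):
    """Return the number token starting at pos, or None if there is none."""
    m = NUMBER.match(source, pos)
    return m.group() if m else None
```

Under these rules an input such as `0x1F` is a single number token, because letters are allowed after the leading digit.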
A character literal is any Unicode character surrounded by single quote ' (U+0027) characters.
A string literal is a sequence of characters surrounded by double quote " (U+0022) characters.
Within a character or string literal, an escape sequence can be used. This is handy when the required character cannot be specified directly, such as a single or double quote.
- Special characters prefixed by a backslash \ (U+005C) character: \0 (null character), \\ (backslash), \t (horizontal tab), \n (line feed), \r (carriage return), \" (double quote) and \' (single quote).
- A Unicode code point value written as \un, where n is the hexadecimal number of the code point. The number must be in the range supported by UTF-8 encoding (U+0000-U+10FFFF).
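The escape rules could be decoded roughly as below. The names `SIMPLE_ESCAPES` and `decode_escapes` are hypothetical. The text does not say how many hexadecimal digits follow \u, so this sketch assumes that all consecutive hexadecimal digits belong to the code point. Rejecting unknown escapes is likewise an assumption.

```python
SIMPLE_ESCAPES = {
    "0": "\0", "\\": "\\", "t": "\t", "n": "\n", "r": "\r", '"': '"', "'": "'",
}
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_escapes(body: str) -> str:
    """Decode escape sequences in the body of a character or string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("backslash at end of literal")
        kind = body[i + 1]
        i += 2
        if kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
        elif kind == "u":
            start = i
            # Assumption: take every hexadecimal digit that follows \u.
            while i < len(body) and body[i] in HEX_DIGITS:
                i += 1
            if i == start:
                raise ValueError("\\u needs at least one hexadecimal digit")
            value = int(body[start:i], 16)
            if value > 0x10FFFF:
                raise ValueError("code point out of range")
            out.append(chr(value))
        else:
            # Assumption: escapes not listed in the text are rejected.
            raise ValueError("unknown escape: " + kind)
    return "".join(out)
```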
Other tokens have no special meaning to the tokenizer, but may have one for the user of the tokenizer. The resulting token is one printable character long.