Draft String Format
Lib - Workgroup: 0x10c Standards Committee
Authors
This document presents a general format for strings in libraries intended to be shared across programs and with other users.
Shared libraries depend on agreed data formats. Strings, being a common argument type for cross-library calls, should therefore be standardized. Length-prefixed strings are to be used for security and efficiency reasons.
- word: 16 bits. This is the smallest unit addressable by the DCPU.
- character: A group of bits representing a glyph or control sequence. A control sequence is a non-printing character that influences the text.
- string: A sequence of characters. The length of the sequence can vary at runtime.
- P-string: A string whose length in words is indicated by prefixing a word containing that length to the character string. The prefix word itself MUST NOT be included when determining the length.
- C-string: A string whose length is indicated by suffixing a NUL character to the character string. The first occurrence of such NUL character terminates the string. The NUL character is the character whose bits are all zero.
- library: A group of functions used by different programs.
- program: A sequence of machine instructions for the DCPU.
- user: Any person using or creating a program.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
A P-string is composed of a 16-bit length prefix and a sequence of n words where n is the value of the prefix.
The prefix word MUST be present and MUST represent the exact number of following words.
Note that the empty P-string consists entirely of a length word containing 0x0000.
P-STRING = LENGTH BODY
LENGTH = n
BODY = nWORD
WORD = %x0000-ffff ; any 16-bit value
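As an illustration of the layout described by the grammar, the following Python sketch models DCPU memory as a list of 16-bit integers. The helper names `encode_pstring` and `decode_pstring` are hypothetical and not defined by this document, and the sketch assumes one character per word.

```python
WORD_MAX = 0xFFFF  # largest value a 16-bit word can hold


def encode_pstring(text):
    """Return the words of a P-string holding one character per word."""
    body = [ord(ch) for ch in text]
    if len(body) > WORD_MAX:
        raise ValueError("body does not fit a 16-bit length prefix")
    if any(word > WORD_MAX for word in body):
        raise ValueError("character does not fit in one 16-bit word")
    # LENGTH counts the body words only, never the prefix word itself.
    return [len(body)] + body


def decode_pstring(memory, address=0):
    """Read the P-string that starts at `address` in a list of words."""
    if not 0 <= address < len(memory):
        raise ValueError("address is outside memory")
    length = memory[address]
    start = address + 1
    body = memory[start:start + length]
    if len(body) != length:
        raise ValueError("length prefix runs past the end of memory")
    return "".join(chr(word) for word in body)


print([hex(w) for w in encode_pstring("Hi")])  # ['0x2', '0x48', '0x69']
print(encode_pstring(""))                      # [0] (the empty P-string)
print(decode_pstring([0x2, 0x48, 0x69]))       # Hi
```

The decoder rejects a prefix that claims more words than are available, which is one way to enforce the requirement that the prefix represent the exact number of following words.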
This section is not normative.
The rationale for using P-strings is a simple matter of weighing benefits against disadvantages.
- Accessing the length of the string is O(1): it is a single read of the prefix word.
- Buffer overflows can be prevented because the required space is known and can be allocated ahead of time. This also improves the security of programs by making it harder to get arbitrary shell code executed.
- The NUL character can be used inside a P-string, while in a C-string it would terminate the string.
- Appending a string of m words to a P-string of n words costs O(m), because the end of the destination follows directly from its length. The same operation on C-strings costs O(n+m), because the destination must first be scanned for its terminator.
- Relative to the address of the string, character indexing begins at 1 instead of 0, because word 0 holds the length.
- Fixed-length strings still require a prefixed length word. The same would be true of fixed-length C-strings. Both formats can be abbreviated by omitting the known prefix/terminator at the cost of ceasing to be a P- or C-string per the definitions given here.
- Since the smallest addressable size in the DCPU is a word (16 bits), this effectively allows a maximum string length of 65535 characters (0xFFFF, the largest value the prefix word can hold) without any loss in efficiency.
- Cutting off the beginning of a string is more expensive for P-strings, but cutting off the end is more expensive for C-strings. (Other solutions are possible, but introduce memory leaks.) Therefore, the two arguments negate each other.
- The names P-string and C-string come from Pascal and C, which use the respective format.
- The current implementation of characters in the DCPU effectively leaves the high 8 bits empty. It is therefore possible to store two characters in one word. Such packed strings are outside the scope of this RFC.
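The trade-offs listed above can be made concrete with a small Python sketch that treats strings as lists of 16-bit words. All function names are hypothetical and chosen only for this illustration. The P-string append touches only the words being appended, while the C-string append first has to walk the destination to find its terminator.

```python
def pstring_length(p):
    """Constant time: the length is stored in the first word."""
    return p[0]


def cstring_length(c):
    """Linear time: scan until the first NUL word."""
    n = 0
    while c[n] != 0:
        n += 1
    return n


def pstring_append(dest, src):
    """Append the body of P-string `src` to P-string `dest` in place."""
    total = dest[0] + src[0]
    if total > 0xFFFF:
        raise ValueError("result does not fit a 16-bit length prefix")
    dest.extend(src[1:1 + src[0]])  # copy only the appended words
    dest[0] = total                 # update the prefix
    return dest


def cstring_append(dest, src):
    """Append C-string `src` to C-string `dest` in place."""
    end = cstring_length(dest)      # must scan the destination first
    m = cstring_length(src)
    dest[end:] = src[:m] + [0]      # overwrite the old terminator, add a new one
    return dest


# A NUL word inside the body is ordinary data for a P-string,
# but cuts a C-string short.
print(pstring_length([3, 0x41, 0x00, 0x42]))        # 3
print(cstring_length([0x41, 0x00, 0x42, 0x00]))     # 1

print(pstring_append([2, 0x48, 0x69], [1, 0x21]))   # [3, 72, 105, 33]
print(cstring_append([0x48, 0x69, 0], [0x21, 0]))   # [72, 105, 33, 0]
```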
Using P-strings reduces the risk of arbitrary shell code being executed by overflowing the input buffer. However, instructions hidden in a string may still be executed later if the memory holding them is not cleared and the stack or instruction pointer is moved there. Measures should be taken by the user to ensure this does not happen.
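One possible measure is sketched below in Python, again modelling memory as a list of 16-bit words. The function `copy_pstring` and its `capacity` parameter are assumptions made for this example and are not prescribed by this document. The sketch checks the length prefix against the destination size before copying anything, and then zeroes the unused part of the destination so that stale words do not linger.

```python
def copy_pstring(memory, src, dst, capacity):
    """Copy the P-string at `src` into a buffer at `dst`.

    `capacity` is the number of body words the destination can hold,
    not counting its own length prefix. Returns the copied length.
    """
    length = memory[src]
    if length > capacity:
        raise ValueError("source string does not fit the destination buffer")
    if src + 1 + length > len(memory) or dst + 1 + capacity > len(memory):
        raise ValueError("string or buffer extends past the end of memory")
    # Copy the prefix and body in one step; the size is known up front.
    memory[dst:dst + 1 + length] = memory[src:src + 1 + length]
    # Clear the unused remainder of the destination buffer.
    for i in range(dst + 1 + length, dst + 1 + capacity):
        memory[i] = 0
    return length


mem = [2, 0x48, 0x69, 0xAAAA, 0xAAAA, 0xAAAA, 0xAAAA]
copy_pstring(mem, src=0, dst=3, capacity=3)
print(mem)  # [2, 72, 105, 2, 72, 105, 0]
```

Because the check happens before any word is written, an oversized input leaves memory untouched instead of overwriting whatever follows the buffer.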