The pawn language
An embedded scripting language
Language features
- pawn is a deterministic language: it runs at the same speed on every launch, because there is no garbage collector that kicks in. The compiler can estimate the memory footprint of the script (unless the script contains a recursive procedure), and that memory footprint is fixed and stable as well.
- The pawn language is derived from C, and in some ways I have tried to “fix” the parts of C's syntax that I see as counter-intuitive or error-prone. The cases in a “switch” statement do not fall through, for example.
- pawn supports pass-by-value (like C) and pass-by-reference for function arguments.
- pawn supports default function arguments, for arrays and simple variables. The arguments with a default value do not have to be at the end of the argument list.
- pawn supports named parameters together with the conventional positional parameters.
- pawn supports states and automatons (state machines) directly in the language, including state-local variables. Doing this in the compiler allows for flexibility and optimal performance, as well as having the compiler verify the constraints of the automaton.
- pawn has no “struct”s, but it extends arrays so that it can mimic light-weight structures with arrays; as an example of those extensions, pawn supports array assignment, and array indices can be assigned an individual “identity” (or “purpose”) and even form a sub-array.
- Array sizes are deterministic, but array declaration is flexible. All dimensions in a multi-dimensional table may be variable size, for example.
- pawn supports symbolic constants, conditional compilation and assertions, as well as text substitution via a kind of “pre-processor”. The advantage of unifying the pre-processor with the compiler is that the “#if” knows about variable and “const” declarations (which is not the case in standard C).
- If you want to say that something costs 3 Euro, you might write “printf("price \2003\n");” in C (where octal 200 is decimal 128, which is the position in the updated ANSI character set that the Euro symbol resides in). This works for 8-bit characters, but what about 16-bit/32-bit Unicode/UCS-4 characters? In pawn, the numeric character code is optionally explicitly terminated with a semicolon, just to avoid this problem. So in pawn, you would write: “printf("price \128;3\n");” (in pawn, numeric character codes are in decimal, rather than octal). This is just one example of those little and minor annoyances that exist in C/C++ and that pawn addresses.
- With packed and unpacked strings, pawn is able to bridge ASCII and Unicode subsystems. The compiler accepts source files in 8-bit ASCII and UTF-8. When running in ASCII mode, the compiler can translate extended ASCII characters to Unicode, based on codepage tables; this even works for MBCS codepages.
- The integer division and modulus operators ("/" and "%") are well defined for negative operands, unlike C (ISO C89) and C++. The ISO C99 standard (finally) defines “truncated division” as standard; Java also uses truncated division. pawn uses “floored division”, as defined by Donald Knuth and as is also used by Haskell and Python.
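The difference between truncated and floored division only shows when the operands have opposite signs and the division is inexact. The following is a minimal sketch in C99 (not pawn) of how floored results can be derived from C's truncated operators; the function names are made up for this illustration and are not part of any pawn toolkit.

```c
#include <assert.h>

/* Floored division: round the quotient toward negative infinity.
 * C99's "/" truncates toward zero, so a correction is needed only when
 * the operands have opposite signs and the division is inexact.
 * (INT_MIN / -1 overflows here just as it does with plain "/".) */
int floored_div(int a, int b)
{
    assert(b != 0);                 /* division by zero is undefined */
    int q = a / b;                  /* truncated quotient (C99) */
    if ((a % b != 0) && ((a < 0) != (b < 0)))
        q -= 1;                     /* step down to the floor */
    return q;
}

/* Floored modulus: a non-zero result takes the sign of the divisor. */
int floored_mod(int a, int b)
{
    assert(b != 0);
    int r = a % b;                  /* truncated remainder: sign of a */
    if (r != 0 && ((r < 0) != (b < 0)))
        r += b;
    return r;
}

/* Self-check comparing the two conventions for -7 and 2. */
void check_floored_division(void)
{
    assert(-7 / 2 == -3 && -7 % 2 == -1);       /* truncated (C99) */
    assert(floored_div(-7, 2) == -4);           /* floored */
    assert(floored_mod(-7, 2) == 1);
}
```

With floored division, the modulus of a negative number by a positive divisor is never negative, which is convenient for things like wrapping an index around a table.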
Toolkit features
- pawn comes with an implementation of an abstract machine in portable C (C90 standard). The abstract machine is a set of C functions that you can easily link to an application or function library. By compiling the source code to P-code for an abstract machine (or “virtual machine”), pawn is much faster than pure interpreters.
- pawn is certainly among the fastest of the scripting languages, especially when using an abstract machine in hand-crafted optimized assembler or a JIT. These optimized abstract machines are not very portable, but versions for Windows, Linux (running on x86 architecture) and ARM architectures exist.
- The abstract machine for pawn lends itself well to embedding: the abstract machine has low overhead and multiple abstract machines may run concurrently in a process; the interface to native functions (in C/C++) is flexible and also has low overhead; no components other than a few functions from the standard C library are required to build the abstract machine. The abstract machine does not require dynamic memory allocation (or garbage collection) or file I/O. The abstract machine itself is ROM-able and it requires very little RAM.
- For memory-constrained devices, pawn scripts can run directly from ROM. Alternatively, pawn has compiler support for “code overlays” where chunks of code are read from a storage (for example a SD/MMC card, or Flash ROM) on an as-needed basis.
- For a little language, the pawn compiler has a pretty good error system. I dare to compare it with commercial level C/C++ compilers with all warnings enabled.
- Although the debugger for the pawn abstract machine is primitive, it is an order of magnitude better than the “printf” style of debugging that is so common for little languages.
- For a little language, I would also say that pawn is nicely documented. Many people learn by example, so the August 2011 edition of the “Language Guide” contains 17 complete pawn programs that cover a variety of topics and 21 pawn code snippets that form complete functions. Various extension modules come with their own documentation with more example programs. The “Implementers Guide” contains 10 C/C++ code snippets (for embedding the abstract machine in your application) that build a complete program (with various options) when put together. pawn is also one of the very few scripting languages that documents the abstract machine: the pseudo-instructions, memory layout and pseudo-registers.
- pawn runs as a 32-bit language on 8-bit and 16-bit processors; it supports fixed-point or floating-point arithmetic (when available on the platform). You can also build pawn as a 64-bit language.
- The security model of pawn is based on the robustness of the tools and on execution of the P-code in a “sandbox” environment. There is a caveat: native functions (provided by the host application or executing environment) can do anything; the implementer of such native functions should address the appropriate security issues.
- pawn scripts can be encrypted, to avoid them being decompiled or “patched”.
Creating a robust language
The pawn language is designed with an audience of “non-expert” programmers in mind. Beginning programmers make particular kinds of errors frequently. Expert programmers do not just make fewer errors; their most frequent errors are also different from those made by beginning programmers.
The C language provided the basis for pawn; the ambition was to design a simplified and modified C in such a way that pawn would avoid or circumvent the mistakes that novice C programmers make most frequently. The beauty of C is that the language has come of age and that it has been extremely popular. There is, hence, a wealth of experience, both favourable and unfavourable, with the language. Many of the “holes” and pitfalls in the language are known. Make no mistake, designing a new language is no mean feat. The number of “deprecated” features in modern languages like C++, Java or even Rust shows that even seasoned language designers sometimes come to the conclusion that a feature once reckoned useful turned out to be seriously problematic.
What are frequent (beginner) errors, and what does pawn do about them?
- Forgetting to close a /* ... */ style comment - Allowing nested comments can hide a bug of this kind for a long time, leaving the programmer to guess where the comment started. pawn does not allow nested comments and, although it is a frequently requested feature, I do not intend to add it. In addition, the pawn compiler issues a warning when it encounters a /* token inside a comment. pawn also supports single line comments, even though these have their own problem: see below.
- Forgetting a “break” as the last statement below a “case” - pawn does not need a “break” to avoid dropping into the next “case”. Case statements in pawn are not drop through.
- Using octal radix constants by accident - I have never used octal radix for any of my programs and I know of no strong argument in favour of it. pawn does not support octal radix. (As surprising as it may sound, I have seen beginning C programmers prefixing decimal values with one or more zeros on several occasions.)
- Typing a ';' behind an if, for or while statement - pawn forbids the semicolon as the empty statement. An empty statement in pawn must be written as “{}”. The chance that this is typed by accident is much lower than that of a semicolon.
- Forgetting the semicolon to end an instruction - In pawn, semicolons to end a line are optional. The syntax and most of the rules were copied from BCPL. It is common practice to write no more than one instruction per line, but anyone who does put several instructions on a line must still use the semicolon to separate them.
- Misspelling a variable name - This is mostly caught easily by the compiler. The biggest problem is when the misspelled variable name happens to be the name of another variable that happens to be in scope. Like C, pawn supports block-local variables, so that the scope rules may make this condition less probable. pawn's “tagname” system can also help catch this error automatically.
- Forgetting to declare a new variable that you use in an expression - pawn helps in this case by allowing you to declare a new variable where you need it. In old-style C, you must add the declaration at the beginning of the block. In Pascal, you must declare variables at the beginning of a function.
- Re-using a variable for a different purpose, while its original value is needed later on - This mostly happens when one needs a temporary variable for a simple loop or to store an intermediate result in. pawn helps, again, by allowing you to declare a new (temporary) variable where it is actually needed and by allowing a loop variable to be declared in the first expression of a “for” statement.
- Forgetting to encapsulate multiple statements (of an “if” or a loop) in a compound block - pawn is a free format language, but it warns if the indentation level of statements in the same block changes. This catches most instances of this kind of error.
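Two of the errors listed above are easy to reproduce in C itself. The sketch below is plain C (not pawn), with function names invented for this illustration; it shows what a C compiler accepts, which is exactly the behaviour that pawn rules out. Depending on the warning level, a C compiler may or may not flag the missing “break”.

```c
#include <assert.h>

/* A leading zero makes a C constant octal: 010 is eight, not ten. */
int leading_zero_value(void)
{
    return 010;
}

/* The "break" after case 1 is missing, so execution continues into
 * case 2 and the first assignment is overwritten. */
int missing_break(int n)
{
    int result = 0;
    switch (n) {
    case 1:
        result = 10;
    case 2:
        result = 20;
        break;
    default:
        result = -1;
        break;
    }
    return result;
}

/* Self-check: both functions give the "surprising" C result. */
void check_beginner_pitfalls(void)
{
    assert(leading_zero_value() == 8);
    assert(missing_break(1) == 20);     /* not 10 */
}
```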
A note on single line comments
When you end a single line comment with a backslash in C, what happens? The backslash at the end of a line is the “line continuation” character of the pre-processor of C/C++. Two possibilities would be:
- the pre-processor treats the backslash as part of the comment and ignores it; line continuation does not occur
- the pre-processor extends the single line comment over the next line, meaning that the “single line comment” actually comments out two lines
Two recent compilers for Microsoft Windows that I tried this on used the latter approach. In “The New C: About // Comments” (C/C++ Users Journal, July 2003), Randy Meyers writes that this issue is indeed ambiguous in C, but that C++ standardizes the second approach. pawn uses the first approach and also issues a warning message.
If you read my note above on programming languages being difficult to design because features sometimes “bite” each other, I think this is an example of two such features. Both are useful in their own right, but combining them can lead to much confusion. This is not just a hypothetical, theoretical issue either: a colleague once spent a few hours tracking a bug where a single line comment happened to be extended over the next line (due to exactly this issue). He claimed that the compiler did not issue any warning (when set at the highest warning level) and that the syntax highlighting in the editor did not recognize this special case. I have heard a similar story from another programmer.
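The second approach can be reproduced in a few lines of C. This is a minimal sketch, assuming a compiler that joins continued lines before it strips comments (the approach that C++ standardizes, and the one that C99 compilers follow as well); the function names are made up for this illustration. Some compilers warn about the multi-line comment at higher warning levels, others stay silent.

```c
#include <assert.h>

/* The single line comment below ends in a backslash, so the line
 * after it is joined onto the comment and never compiled. */
int value_after_spliced_comment(void)
{
    int x = 1;
    // is the next line still code? \
    x = 2;
    return x;
}

/* Self-check: the assignment "x = 2" was commented out. */
void check_spliced_comment(void)
{
    assert(value_after_spliced_comment() == 1);
}
```

Under pawn's rule (the first approach), the equivalent assignment would stay part of the program, and the compiler would issue a warning about the trailing backslash.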
What about pointers?
I read in more than one book on Java and in more than one description of other scripting languages that the language was “robust” or that it avoided bug-prone practices by virtue of not having pointers. Since I have experience in both teaching the C language (with pointers) at a novice level and in programming C at an advanced level, I think I can comment on this. Indeed, in my experience, novice programmers make a lot of pointer errors. To test myself in this aspect, I noted serious errors that I made during a two-day implementation of new code that would be embedded into an existing application. I wrote over 500 lines of C code in those days. By “serious errors” I mean that I only counted those errors that caused invalid memory accesses. Logical errors and syntax errors were waived (the compiler catches syntax errors anyway).
In those two days, I made several array indexing errors, where the index was out of range, negative, or just garbage (uninitialized). I did not make any pointer mistake (the code has now been in use for years, and no bug was ever reported). There were approximately as many pointer operations as array operations in the code. An interview with a colleague (an experienced programmer) revealed that pointers were not a major source of errors in his code either. So based on an amateurish empirical study with a population of a mere two persons, I dare to challenge the statement that, in general, pointers should be considered more harmful than arrays.
The pawn language has no pointers, but you won't catch me saying that, therefore, pawn is a robust language. The current implementation for the abstract machine for pawn performs bounds checking on array indices (if it knows the bounds of the array) and that makes it a little robust. The abstract machine also verifies that every indirect memory access stays within the address range defined for each program, which also gives pawn some robustness. But making a logical error in a mortgage calculation that'll cause you great misery is as easy in pawn as in any other language... with or without pointers.
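The two run-time checks mentioned above can be pictured with a toy model in C. This is only a sketch under stated assumptions: the “ToyVm” structure, the error codes and the function names are hypothetical and invented for this illustration; they are not the data structures or the API of the actual pawn abstract machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t cell;               /* one 32-bit memory cell */

/* Hypothetical error codes for this sketch. */
enum { VM_OK = 0, VM_ERR_BOUNDS = 1, VM_ERR_MEMACCESS = 2 };

/* Hypothetical data area that belongs to one script. */
typedef struct {
    cell  *data;
    size_t count;                   /* number of cells in data */
} ToyVm;

/* Indirect load: refuse any address outside the script's own data. */
int vm_load(const ToyVm *vm, size_t addr, cell *out)
{
    if (addr >= vm->count)
        return VM_ERR_MEMACCESS;
    *out = vm->data[addr];
    return VM_OK;
}

/* Array load: check the index against the known array size first
 * (a negative index is rejected too), then do the checked load. */
int vm_load_indexed(const ToyVm *vm, size_t base, cell index, cell size,
                    cell *out)
{
    if (index < 0 || index >= size)
        return VM_ERR_BOUNDS;
    return vm_load(vm, base + (size_t)index, out);
}

/* Self-check with an 8-cell data area and a 4-cell array at cell 2. */
void check_toy_vm(void)
{
    cell mem[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    ToyVm vm = { mem, 8 };
    cell v = -1;
    assert(vm_load_indexed(&vm, 2, 3, 4, &v) == VM_OK && v == 5);
    assert(vm_load_indexed(&vm, 2, 4, 4, &v) == VM_ERR_BOUNDS);
    assert(vm_load(&vm, 8, &v) == VM_ERR_MEMACCESS);
}
```

Both checks turn an invalid memory access into a reported error; neither of them says anything about whether the value that was read is the one the calculation needed.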