The Virtual Machine (VM)

The VM lives in execute.c, in the function mawk_execute_().

In principle it takes an array of instructions (INST) and executes them one by one, operating on the current evaluation stack (or stack for short). Most of the execution really does proceed in order, as the compiler arranges the code to achieve linearity. This makes the execution loop relatively simple and efficient: it is just a large while(dont_need_to_exit) execute_next_instruction;. The actual instruction execution is one very large switch on the instruction opcode.

A typical example of how this works in practice can be obtained by running lmawk -Wdump -f test.awk on a simple arithmetic script:
awk:
BEGIN { a = 3 + 4 * 5 }

asm (VM instructions):
BEGIN
000 pusha	a
002 pushd	3
004 pushd	4
006 pushd	5
008 mul
009 add
010 assign
011 pop
012 exit0

First the lvalue (target variable, left side) of the assignment is pushed, then the expression (right side). After the last pushd the stack is, from top to bottom: {5, 4, 3, a}. By the time mul runs, the top of the stack is 5 and the second element is 4. mul replaces these two elements with 20, decreasing the stack size by one and leaving the result on top. Next, add does a similar job, replacing the top two items of the stack {20, 3} with their sum, 23. Then assign runs on the stack {23, a}: it removes both items, copies the value 23 to the global variable a and finally puts the result back on top of the stack, leaving the stack as {23}. This is the result (output) of the assignment operation. Since the script doesn't use the result, it runs a pop that removes and discards the top item, leaving the stack empty. Since the script has no main or END parts, it can quit at this point by executing the exit0 instruction (exiting with value 0, the implicit exit).
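The walkthrough above can be sketched as a toy stack machine in C. This is only an illustration and not libmawk's code: the opcode names, the toy_inst layout and the single-slot toy_globals array are all assumptions made for this example, and values are plain doubles rather than awk cells.

```c
/* Hypothetical opcodes mirroring the dump above; not libmawk's INST encoding. */
enum { OP_PUSHA, OP_PUSHD, OP_MUL, OP_ADD, OP_ASSIGN, OP_POP, OP_EXIT0 };

typedef struct { int op; double arg; } toy_inst;

double toy_globals[1];          /* slot 0 stands in for the awk variable a */

/* Runs code until OP_EXIT0 and returns the final stack depth. */
int toy_run(const toy_inst *code)
{
    double stack[16];
    int sp = 0;                 /* number of items currently on the stack */
    const toy_inst *ip = code;  /* next instruction to execute */

    for (;;) {
        const toy_inst *i = ip++;
        switch (i->op) {
        case OP_PUSHA: stack[sp++] = i->arg; break;   /* push variable slot (lvalue) */
        case OP_PUSHD: stack[sp++] = i->arg; break;   /* push numeric constant */
        case OP_MUL:   stack[sp - 2] *= stack[sp - 1]; sp--; break;
        case OP_ADD:   stack[sp - 2] += stack[sp - 1]; sp--; break;
        case OP_ASSIGN: {
            double v = stack[sp - 1];
            int slot = (int)stack[sp - 2];
            toy_globals[slot] = v;
            stack[sp - 2] = v;  /* the assignment's result stays on the stack */
            sp--;
            break;
        }
        case OP_POP:   sp--; break;                   /* discard unused result */
        case OP_EXIT0: return sp;
        }
    }
}

/* BEGIN { a = 3 + 4 * 5 } in the toy encoding. */
const toy_inst toy_prog[] = {
    { OP_PUSHA, 0 }, { OP_PUSHD, 3 }, { OP_PUSHD, 4 }, { OP_PUSHD, 5 },
    { OP_MUL, 0 }, { OP_ADD, 0 }, { OP_ASSIGN, 0 }, { OP_POP, 0 },
    { OP_EXIT0, 0 }
};
```

Calling toy_run(toy_prog) leaves 23 in toy_globals[0] and returns an empty stack depth, matching the steps described above.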

NOTE: currently there's absolutely no optimization in the parser: everything is calculated as written in the script and some values are saved just to be discarded by the next instruction.

An interesting and important feature of execute_() is that it can save all states and return to the caller at any point of the execution, i.e. between any two instructions in the code. It can also resume execution from the next instruction. This gives the host application full control over scheduling the script, while the script itself can be built of sequential, blocking instructions.
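A minimal sketch of this suspend/resume idea, assuming a toy VM whose whole state (instruction index and stack) lives in a struct owned by the caller. The names r_vm, r_run and the instruction budget parameter are hypothetical and only stand in for whatever limit or blocking condition interrupts the real VM.

```c
enum { R_PUSHD, R_ADD, R_EXIT0 };          /* toy opcodes */
enum { R_SUSPENDED, R_FINISHED };          /* why r_run() returned */

typedef struct {
    const int *code;   /* opcode stream; R_PUSHD is followed by its operand */
    int ip;            /* index of the next instruction */
    int stack[16];
    int sp;            /* number of items on the stack */
} r_vm;

/* Executes at most budget instructions. All state is kept in *vm, so a
 * later call continues exactly where this one stopped. */
int r_run(r_vm *vm, int budget)
{
    while (budget-- > 0) {
        switch (vm->code[vm->ip++]) {
        case R_PUSHD:
            vm->stack[vm->sp++] = vm->code[vm->ip++];
            break;
        case R_ADD:
            vm->stack[vm->sp - 2] += vm->stack[vm->sp - 1];
            vm->sp--;
            break;
        case R_EXIT0:
            return R_FINISHED;
        }
    }
    return R_SUSPENDED;    /* budget used up between two instructions */
}

/* 1 + 2 + 4 in the toy encoding. */
const int r_prog[] = { R_PUSHD, 1, R_PUSHD, 2, R_ADD, R_PUSHD, 4, R_ADD, R_EXIT0 };
```

A host would call r_run() repeatedly with whatever budget it can afford and do its own work between calls. The script code itself needs no awareness of being interrupted.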

Jumps and conditions

There are a few instructions that have to break linear execution flow, though:

Some of the above are implemented using conditional and unconditional jumps to direct addresses (the first column in the asm). For example, a simple if is compiled to contain 2 jumps:
awk:
BEGIN {
	if (bool)
		a = 6
	else
		a = 7
}

asm (VM instructions):
BEGIN
000 pushi	bool
002 jz		012
004 pusha	a
006 pushd	6
008 assign
009 pop
010 jmp		018
012 pusha	a
014 pushd	7
016 assign
017 pop
018 exit0

The first one is a conditional jump, "jump if [top of the stack is] zero" (jz): it makes the VM jump to the else branch at address 12. The then branch ends in an unconditional jump (jmp) to the first instruction after the if (which is the implicit exit in this example), bypassing the code of the else branch.

A jump is carried out by a simple modification of the "next instruction" pointer before running the next iteration of the execution loop.
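The following sketch shows a jump as nothing more than an overwrite of the next-instruction index. It is a toy model, not libmawk's implementation: the opcode names, the flat int encoding and the j_vars array are assumptions, and jz is assumed to pop the value it tests. The program is laid out so that the array indexes match the addresses in the dump above.

```c
enum { J_PUSHI, J_PUSHA, J_PUSHD, J_JZ, J_JMP, J_ASSIGN, J_POP, J_EXIT0 };

int j_vars[2];                 /* hypothetical variable slots: 0 = bool, 1 = a */

void j_run(const int *code)
{
    int stack[8], sp = 0;
    int ip = 0;                /* index of the next instruction */

    for (;;) {
        switch (code[ip++]) {
        case J_PUSHI: stack[sp++] = j_vars[code[ip++]]; break; /* variable value */
        case J_PUSHA: stack[sp++] = code[ip++]; break;         /* variable slot */
        case J_PUSHD: stack[sp++] = code[ip++]; break;         /* constant */
        case J_JZ: {
            int target = code[ip++];
            if (stack[--sp] == 0)
                ip = target;   /* conditional jump: overwrite ip */
            break;
        }
        case J_JMP:
            ip = code[ip];     /* unconditional jump: overwrite ip */
            break;
        case J_ASSIGN:
            j_vars[stack[sp - 2]] = stack[sp - 1];
            stack[sp - 2] = stack[sp - 1];
            sp--;
            break;
        case J_POP:   sp--; break;
        case J_EXIT0: return;
        }
    }
}

const int j_prog[] = {
    /* 000 */ J_PUSHI, 0,
    /* 002 */ J_JZ, 12,
    /* 004 */ J_PUSHA, 1,
    /* 006 */ J_PUSHD, 6,
    /* 008 */ J_ASSIGN,
    /* 009 */ J_POP,
    /* 010 */ J_JMP, 18,
    /* 012 */ J_PUSHA, 1,
    /* 014 */ J_PUSHD, 7,
    /* 016 */ J_ASSIGN,
    /* 017 */ J_POP,
    /* 018 */ J_EXIT0
};
```

Setting j_vars[0] to a non-zero value before j_run(j_prog) takes the then branch and stores 6 in j_vars[1]. With j_vars[0] set to 0 the jz fires and the else branch stores 7.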

Recursion: function calls

A slightly more complicated mechanism is used when jumps are of a recursive nature: the code has to jump somewhere to do some work, then return and continue execution from the next instruction. A typical example of this is executing user functions.

The original mawk implementation simply called mawk_execute_() recursively. This meant the C compiler took care of saving all internal states on the C stack for the detour. However, this wouldn't allow the code to be suspended during such a detour, as it would be problematic to rebuild the C stack on a resume.

Thus libmawk's mawk_execute_() does not recurse at the C level but at the VM level. For example when a function is called (using the call instruction):

Upon a ret instruction from the function:

additional cases of recursion

A range pattern is recursive as well: it needs to evaluate one or two pattern matches before it decides whether to execute the action and/or update the state. The range check starts with the instruction _RANGE_CHK, which encodes expression code offsets and state in the next few instruction slots. It recurses to evaluate expressions, each of which is terminated by the _RANGE_STOP command. Entering an expression evaluation is similar to a function call, while _RANGE_STOP is very similar to a ret.

deep recursion

At any time the eval stack has to have enough space after sp for evaluating the longest awk expression. Any user function recursion bumps sp, leaving less room for expressions and further recursion. Relocating the stack (with a realloc()) is not a good idea, as there might be cell pointers pointing to stack elements all around. Instead, mawk limits expression length at compile time to a fixed maximum. If entering a new function would not leave at least this amount of eval stack above sp, "deep recursion" is performed.

Deep recursion starts by allocating an entire new stack for the call. Call stacking saves enough pointers so that the code can switch back to the previous stack easily. The allocation is done using zmalloc(), so the overhead is minimal. Since the original stack or stacks are kept intact, any pointer stays valid. sp points into the new stack block and will increase there until another deep recursion.

This wastes some stack space on the old stack (potentially max expression length minus one slot) but guarantees that:
- checks and special handling are needed only when entering or leaving functions
- even that happens rarely, as a stack block is large enough to host many functions besides the longest expression
- the stack can grow as big as it needs to, without having to allocate one large block of memory
- all allocation is done from normal instance memory, so the allocation limit and automatic cleanup at the end are guaranteed
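The block switching can be sketched as a chain of fixed-size stack blocks. This is a simplified model under several assumptions: the names d_block, d_stack, d_enter and d_leave are invented for this example, both size constants are arbitrary, plain malloc() stands in for zmalloc(), and the space needed by the function's own frame is ignored in the headroom check.

```c
#include <stdlib.h>

#define D_BLOCK_CELLS 64   /* hypothetical size of one stack block */
#define D_MAX_EXPR    16   /* hypothetical compile-time expression limit */

typedef struct d_block {
    double cells[D_BLOCK_CELLS];
    struct d_block *prev;   /* block that was active before the switch */
    double *saved_sp;       /* sp inside prev at the time of the switch */
} d_block;

typedef struct {
    d_block *cur;           /* block that sp points into */
    double *sp;             /* next free cell */
    int blocks;             /* live block count, for demonstration only */
} d_stack;

/* Allocates the first block; returns 0 on success, -1 on failure. */
int d_init(d_stack *s)
{
    s->cur = malloc(sizeof *s->cur);
    if (s->cur == NULL)
        return -1;
    s->cur->prev = NULL;
    s->cur->saved_sp = NULL;
    s->sp = s->cur->cells;
    s->blocks = 1;
    return 0;
}

/* Function entry: returns 1 if a new block was started, 0 if the current
 * block still has room, -1 if allocation failed. */
int d_enter(d_stack *s)
{
    size_t used = (size_t)(s->sp - s->cur->cells);
    d_block *nb;

    if (D_BLOCK_CELLS - used >= D_MAX_EXPR)
        return 0;           /* enough room for the longest expression */
    nb = malloc(sizeof *nb);
    if (nb == NULL)
        return -1;
    nb->prev = s->cur;      /* old block is kept, so pointers into it stay valid */
    nb->saved_sp = s->sp;
    s->cur = nb;
    s->sp = nb->cells;
    s->blocks++;
    return 1;
}

/* Function return: undoes the switch if d_enter() reported one. */
void d_leave(d_stack *s, int switched)
{
    if (switched == 1) {
        d_block *done = s->cur;
        s->sp = done->saved_sp;
        s->cur = done->prev;
        free(done);
        s->blocks--;
    }
}

/* Frees every remaining block. */
void d_destroy(d_stack *s)
{
    while (s->cur != NULL) {
        d_block *prev = s->cur->prev;
        free(s->cur);
        s->cur = prev;
    }
    s->blocks = 0;
}
```

The only bookkeeping happens in d_enter() and d_leave(). Expression evaluation between them can push and pop without any bounds checks, which is the property the list above relies on.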

Resuming execute_()

Since by far the most common operation in an embedded setup is resuming code that was interrupted by an execution limit or a blocking getline, mawk_execute_() does that by default. The top few slots of the eval stack are always a full state dump, the same thing used in recursion. Entering mawk_execute_() pops this section and initializes all internal states from it. When execution needs to be interrupted, mawk_execute_() saves its internal states onto the top of the stack.

Entering execute_() (fresh start)

Entering in run state involves setting up internal states pointing to the beginning of the code in question, pushing these states on top of the stack and calling mawk_execute_(), which will "resume" from these states. A similar thing happens when the application calls an awk function.

It may be that the execution is interrupted in the middle of running a large block of code, for example in BEGIN. The top of the stack holds the current execution state so that mawk_execute_() will be able to continue execution. The application may decide to run an awk function before resuming the code: this operation would push a new set of execution state on top of the stack and call mawk_execute_(). When the current state finishes at the _RET instruction, mawk_execute_() would take the next frame from the stack and would automatically resume execution of the interrupted BEGIN block. This would cause the return value of the function to be lost and would attempt to resume BEGIN as a side effect of the function call!

To avoid such confusion, any new entry into mawk_execute_() is required to push two sets of states: an EXEST_EXIT and the actual state it wants to "resume" at (start execution at). When mawk_execute_() hits the _RET instruction in the above example, it does pop the next frame, but that frame is the EXEST_EXIT, which causes it to interrupt immediately. This leaves the stack exactly as it looked before the function call, and the application may later decide to resume execution.
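The effect of the exit frame can be sketched with a toy frame stack. Everything here is hypothetical: the e_ names, the two-opcode instruction set and the separate frame array are assumptions for illustration, whereas libmawk keeps its states on the eval stack itself. E_FRAME_EXIT plays the role of EXEST_EXIT.

```c
enum { E_NOP, E_RET };                   /* toy opcodes */
enum { E_FRAME_EXIT, E_FRAME_RUN };      /* E_FRAME_EXIT stands in for EXEST_EXIT */
enum { E_INTERRUPTED, E_RETURNED };      /* why e_execute() came back */

typedef struct { int kind; int ip; } e_frame;

typedef struct {
    const int *code;
    e_frame frames[16];
    int nframes;
    int executed;      /* instructions executed so far, for demonstration */
} e_vm;

/* Fresh start: push an exit frame first, then the state to "resume" at. */
void e_enter(e_vm *vm, int start_ip)
{
    vm->frames[vm->nframes++] = (e_frame){ E_FRAME_EXIT, 0 };
    vm->frames[vm->nframes++] = (e_frame){ E_FRAME_RUN, start_ip };
}

/* Always resumes from the top frame; the caller must have pushed one. */
int e_execute(e_vm *vm, int budget)
{
    e_frame f = vm->frames[--vm->nframes];

    for (;;) {
        if (f.kind == E_FRAME_EXIT)
            return E_RETURNED;              /* stop; frames below stay untouched */
        if (budget-- <= 0) {
            vm->frames[vm->nframes++] = f;  /* save state for a later resume */
            return E_INTERRUPTED;
        }
        vm->executed++;
        if (vm->code[f.ip++] == E_RET)
            f = vm->frames[--vm->nframes];  /* ret: take the next frame */
    }
}

/* Address 0: a "BEGIN" block of four NOPs; address 5: a short "function". */
const int e_prog[] = { E_NOP, E_NOP, E_NOP, E_NOP, E_RET, E_NOP, E_RET };
```

In this model, interrupting the block at address 0 after two instructions, then calling e_enter(vm, 5) and e_execute(), runs only the function: its ret pops the exit frame and returns, leaving the saved state of the interrupted block on top for a later resume.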

Fresh start entries: