Programming Language Translator

This project was started to translate Basic programs into Delphi automatically. It should not be restricted to these two programming languages, however, so provisions are made to hook in more languages.

The central data structure describes an existing program in an abstract way. The top level of the hierarchy consists of symbol tables, one for each symbol type (variables, subroutines...). Every subroutine entry contains lists of arguments, local variables and statements. A "main" procedure is created for Basic programs, which don't have a dedicated main procedure.
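The hierarchy described above can be pictured with a small sketch. It is written in Python purely for illustration and is not the project's Delphi code; all class and field names (Program, Subroutine, Symbol, local_vars...) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    name: str
    type_name: str = ""  # hypothetical: the type is kept as plain text here


@dataclass
class Subroutine:
    name: str
    arguments: list = field(default_factory=list)
    local_vars: list = field(default_factory=list)
    statements: list = field(default_factory=list)


@dataclass
class Program:
    # One symbol table per symbol type, keyed by symbol name.
    variables: dict = field(default_factory=dict)
    subroutines: dict = field(default_factory=dict)

    def main(self):
        # Basic code outside of any subroutine is collected in a synthetic "main".
        return self.subroutines.setdefault("main", Subroutine("main"))


program = Program()
program.variables["counter"] = Symbol("counter", "Integer")
program.main().statements.append(("tkLet", "counter", "0"))
```

The "main" entry is created on first use, so a translated program always has exactly one place for top-level statements.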

First the program is parsed, either by a source text parser or, in the case of tokenized source files, by a token extractor. A general parser then converts the tokens into symbol declarations and code trees. The parsed tokens have several attributes, which are specified in formatted text files. Such a file is translated into a Delphi array of constant records, so the corresponding Delphi conventions apply:

Every token is described on a single line, and all fields are separated by a comma ",". The first few fields of every token record have a fixed meaning and contain only a value. Optional extensions can be added as <name>:<value> pairs.

All string arguments must be enclosed in single quotes "'", and quotes inside a string must be doubled "''". All values which contain spaces " ", tabs or commas "," must be enclosed in double quotes. These double quotes are removed when the file is read.

The order of the fields is important, but fields can be missing, which simplifies the creation of the tables. The fixed fields are described at the beginning of the file, by their name, kind and default values.

The kind can be one of "fkHide", "fkStr", "fkLit". Fields of kind fkHide are descriptive (token numbers...), and are not added to the internal tables. Fields of kind fkStr are automatically enclosed in single quotes.

Default values can be defined for empty fields and for fields which contain a single question mark "?". Empty fields make it possible to simply omit the default value from the input file, and question marks indicate unknown tokens, which are typically converted into some error values.
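The rules above can be sketched as a small reader for one token line. This is an illustration in Python, not the project's converter. Several points are assumptions: the field list (a hidden "Num" field plus "Srcname", "Disp" and "Token"), the kinds assigned to them, the default values "dispNone" and "dispErr", a separate default for "?", and the rule that a field counts as an extension when it starts with an identifier followed by a colon.

```python
import re
from collections import namedtuple

# name, kind, default for an empty field, default for a "?" field (all assumed)
FieldSpec = namedtuple("FieldSpec", "name kind default unknown")

SPECS = [
    FieldSpec("Num", "fkHide", "", ""),
    FieldSpec("Srcname", "fkStr", "", ""),
    FieldSpec("Disp", "fkLit", "dispNone", "dispErr"),
    FieldSpec("Token", "fkLit", "tkCall", "tkErr"),
]

EXTENSION = re.compile(r"^([A-Za-z_]\w*):(.*)$", re.S)


def split_fields(line):
    """Split at commas outside double quotes; the double quotes are dropped."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and ch == ",":
            fields.append("".join(current))
            current = []
        elif not in_quotes and ch in " \t":
            continue  # unquoted whitespace is only padding
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def _store(record, spec, value):
    if spec.kind == "fkHide":
        return  # descriptive only, not added to the table
    if spec.kind == "fkStr":
        value = "'" + value + "'"  # inner quotes are expected to be doubled already
    record[spec.name] = value


def parse_token_line(line, specs=SPECS):
    record, position = {}, 0
    for value in split_fields(line):
        match = EXTENSION.match(value)
        if match:  # optional <name>:<value> extension
            record[match.group(1)] = match.group(2)
            continue
        if position >= len(specs):
            raise ValueError("too many fixed fields: " + line)
        spec = specs[position]
        position += 1
        if value == "":
            value = spec.default
        elif value == "?":
            value = spec.unknown
        _store(record, spec, value)
    for spec in specs[position:]:  # fixed fields omitted at the end of the line
        _store(record, spec, spec.default)
    return record


print(parse_token_line("12, \"ON ERROR\", ?, tkCall, name:'ON_ERROR'"))
print(parse_token_line("7, REM"))
```

Because values with spaces or commas must be double-quoted, the splitter never needs to look inside single-quoted strings.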
 

The first set of tokens describes "statements", which are typically followed by more "argument" tokens.

Every statement token has the fixed fields "Srcname", "Disp" and "Token". Srcname is the visible representation of the token in the source file, and Disp indicates how to handle the (possibly missing) arguments of the token. These two attributes are used by the scanner or tokenizer. The Token value and all other attributes are handled by the parser.

The following general Token values are defined:
 
tkErr
    Not translatable token. The whole line is converted into text, which typically is not accepted by a compiler.
tkEOF
    The end of file marker, terminating the source file.
tkEnd
    An unspecific end marker for any structure which doesn't have a more specific token.
tkRem
    A comment line, inserted into the converted program.
tkData
    A Basic data statement. Such statements typically require modifications, depending on the usage of the following values.
tkProc, tkEndProc
    These tokens begin and end a procedure declaration.
tkFunc, tkEndFunc
    These tokens begin and end a function declaration.
tkDefFn
    A Basic specific, single line function declaration.
tkDim
    Dimension for a dynamic array. Such statements can be translated into static array declarations in other languages.
tkLocal
    Local variables of a subroutine.
tkAbsolute
    Declaration of a symbol (variable) at a given place (address, variable).
tkCall
    Any subroutine call statement. Most Basic statements must be converted into appropriate subroutines, which must be implemented in the runtime system for the translated program.
    Nonstandard names, which include spaces or other non-alphabetic characters, must be redefined in an optional "name" field, like "name:'ON_ERROR'". Please note that the single quotes around the replacement must be present in the source file!
tkPre
    A prefix, which is not relevant for the abstract representation. Such tokens can be eliminated by the scanner.
tkLet
    An assignment statement, followed by a symbol token for the target of the assignment, and the source expression, but containing NO further assignment operator token.
tkLetOp
    C style assignment operators (+=, -= etc.), equivalent to INC or DEC in other languages.
tkLval
    An assignment statement, starting with the tokens for the target, followed by an assignment operator and the source expression.
    The type of the target can be added in a "vt" attribute.
tkDo, tkLoop
    Begin and end of a loop. Loops can also have a condition at the beginning or end, indicated by added operator tokens op:opWhile or op:opUntil.
tkWhile
    Begin of a conditional (while) loop.
tkUntil
    End of a conditional (repeat) loop.
tkBreak
    Jump out of (endless) loops and other compound statements.
tkExitIf
    A special test-and-break statement in GFA Basic.
tkFor, tkNext
    A counted loop. The first line is expected to contain:
    FOR <loop var> [=] <from expression> TO/DOWNTO <end expression> [STEP <step expression>] [DO]
tkIf, tkElseIf, tkElse, tkEndIf
    Conditional statements. Multiple statements can occur between these tokens, therefore a tkEndIf token must always be present!
tkSwitch, tkCase, tkDefault, tkEndSwitch
    A multiway branch, with a single expression, one or more cases, and an optional default branch.
tkReturn
    Exit from a function, returning the function result.
tkGoTo
    Unconditional jump.
tkHalt
    Dynamic end of the program, optionally returning a value(?)
tkOnGo
    A multiway branch, followed by an expression, a GOTO or GOSUB token, and a list of targets.
tkMid
    A Basic specific substring target.
tkLSet, tkRSet
    Basic specific string assignments, with left or right justification in an existing string variable.
    The length of the target is not changed by the assignment.
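As an illustration of how the header of a counted loop (tkFor) might be split into its parts, here is a minimal Python sketch. It assumes the line has already been broken into plain text tokens and that keywords are matched case-insensitively; the function name and the result layout are hypothetical and not taken from the project.

```python
def parse_for_header(tokens):
    """Split FOR <var> [=] <from> TO/DOWNTO <end> [STEP <step>] [DO] into parts."""
    if not tokens or tokens[0].upper() != "FOR":
        raise ValueError("not a FOR statement")
    rest = tokens[1:]
    if rest and rest[-1].upper() == "DO":  # optional trailing DO
        rest = rest[:-1]
    if not rest:
        raise ValueError("missing loop variable")
    loop_var, rest = rest[0], rest[1:]
    if rest and rest[0] == "=":  # optional assignment sign
        rest = rest[1:]
    upper = [t.upper() for t in rest]
    direction = next((k for k in ("TO", "DOWNTO") if k in upper), None)
    if direction is None:
        raise ValueError("missing TO or DOWNTO")
    split = upper.index(direction)
    start, rest, upper = rest[:split], rest[split + 1:], upper[split + 1:]
    if "STEP" in upper:
        step_at = upper.index("STEP")
        end, step = rest[:step_at], rest[step_at + 1:]
    else:
        end, step = rest, None
    return {"var": loop_var, "from": start, "direction": direction,
            "end": end, "step": step}


print(parse_for_header(["FOR", "i", "=", "1", "TO", "n", "-", "1", "STEP", "2"]))
print(parse_for_header(["for", "k", "10", "downto", "1", "do"]))
```

The expressions are returned as token lists, so they can be handed to the general expression parser afterwards.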

The remaining tokens describe arguments, operators or operations, delimiters, separators etc.
The general token classes are:
 
tkBinOp
    A traditional binary operator (+, -, *, / etc.).
    The operation is specified in an "op:" attribute.
tkMonOp
    A traditional prefix operator (-, NOT etc.).
    The operation is specified in an "op:" attribute.
opFunc
    A function operator, with one or two arguments.
    The operation is specified in an "op:" attribute.
argFunc
    A function call, followed by an argument list, and terminated by a ")".
argFuncN
    A function call with No arguments.
argConst
    A constant literal, for numeric values or constant symbols.
opCast
    A type cast or conversion. The target type is specified in a "vt:" attribute.
tkLPar, tkRPar
    "(" and ")" parentheses.
    Other parentheses like "[" and "]" also map to these tokens.
tkSep
    Separator tokens, like "," or ";".
    The exact separator is specified in the "op:" attribute.
    Separators include "AS", "STEP" etc. keywords.
tkEOL
    End of line marker, optionally including a remark which is appended to the line.
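To show how these token classes could be turned into a code tree, the following Python sketch parses a list of (class, attribute) pairs by precedence climbing. It is an illustration only: the tuple layout of tokens and tree nodes and the precedence table are assumptions, and only tkBinOp, tkMonOp, argConst, tkLPar and tkRPar are handled.

```python
# Assumed binding strength of the binary operators (higher binds tighter).
PRECEDENCE = {"opOr": 1, "opAnd": 2, "opAdd": 3, "opSub": 3, "opMul": 4, "opDiv": 4}


def parse_operand(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("unexpected end of expression")
    kind, value = tokens[pos]
    if kind == "argConst":
        return ("const", value), pos + 1
    if kind == "tkMonOp":  # prefix operator applies to the next operand
        operand, pos = parse_operand(tokens, pos + 1)
        return (value, operand), pos
    if kind == "tkLPar":
        node, pos = parse_expression(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos][0] != "tkRPar":
            raise ValueError("missing tkRPar")
        return node, pos + 1
    raise ValueError("unexpected token " + kind)


def parse_expression(tokens, pos=0, min_prec=1):
    """Return (tree, next position); binary operators are left associative."""
    node, pos = parse_operand(tokens, pos)
    while (pos < len(tokens) and tokens[pos][0] == "tkBinOp"
           and PRECEDENCE[tokens[pos][1]] >= min_prec):
        op = tokens[pos][1]
        right, pos = parse_expression(tokens, pos + 1, PRECEDENCE[op] + 1)
        node = (op, node, right)
    return node, pos


# Token list for: -2 * (3 + 4) - 1
tokens = [
    ("tkMonOp", "opNeg"), ("argConst", "2"), ("tkBinOp", "opMul"),
    ("tkLPar", "("), ("argConst", "3"), ("tkBinOp", "opAdd"),
    ("argConst", "4"), ("tkRPar", ")"), ("tkBinOp", "opSub"), ("argConst", "1"),
]
tree, end = parse_expression(tokens)
print(tree)
```

The operator of every node is the "op:" attribute of its token, so a back end can map it to the target language without looking at the source text again.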

Operator attributes are:
 
opAdd
    +
opSub
    -
opMul
    *
opDiv
    /
opIDiv
    Integer division with an integer result.
opMod
    Modulo operator.
opRem
    Remainder operator.
opExp
    Exponentiation operator (^).
opLet
    Assignment operator.
opShl, opShr, opRol, opRor
    Shift and rotate operators.
    The first argument is the value expression, the second argument the shift count.
opAnd, opOr, opXor
    Logical operators.
opEqv
    NOT XOR.
opImp
    Logical implication.
opNeg, opNot
    Arithmetic (-) and logical (NOT) negation.
opLT, opLO
    <
    The second name stands for unsigned comparison.
opLE, opLS
    <=
opEQ
    =
opLike
    Near equality, useful for the equality of floating point values.
opGT, opHI
    >
opGE, opHS
    >=
opAddr
    Address of (a variable).
opDeref
    Pointer evaluation. The target type is given in a "vt:" attribute.
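Some of these operators have no direct counterpart in every target language. The following Python sketch shows one possible reading of opLT versus opLO, opEqv, opImp and opRol. It assumes 32-bit operands and bitwise evaluation of the logical operators, as is common in Basic dialects; neither assumption is stated in the tables above, and the function names are hypothetical.

```python
MASK = 0xFFFFFFFF  # assumption: operands are 32 bits wide


def to_signed(value):
    """Interpret the low 32 bits as a two's complement number."""
    value &= MASK
    return value - 0x100000000 if value & 0x80000000 else value


def op_lt(a, b):  # signed comparison
    return to_signed(a) < to_signed(b)


def op_lo(a, b):  # unsigned comparison
    return (a & MASK) < (b & MASK)


def op_eqv(a, b):  # NOT XOR
    return ~(a ^ b) & MASK


def op_imp(a, b):  # implication: (NOT a) OR b
    return (~a | b) & MASK


def op_rol(value, count):  # rotate left within 32 bits
    count %= 32
    value &= MASK
    return ((value << count) | (value >> (32 - count))) & MASK


print(op_lt(0xFFFFFFFF, 1), op_lo(0xFFFFFFFF, 1))
print(hex(op_eqv(0b1100, 0b1010)), hex(op_imp(0b1100, 0b1010)))
print(hex(op_rol(0x80000001, 1)))
```

The same bit pattern 0xFFFFFFFF is -1 for the signed comparison and the largest value for the unsigned one, which is why the two token names must be kept apart during translation.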