Decompiler Token File Format

This document describes an approach to storing program-related information in a general way. Such information can be used by any tool that describes or interprets the contents of a program.

General Format Specifications

The files are in text format, for system-independent data storage. Currently only single-byte character sets are used, following ANSI Latin-1. For other character sets, the file should be in Unicode (2-byte character) format. Line separators can be LF (Unix) or CR/LF (MS-DOS).

Numeric values are stored in decimal format, as integers or floating-point values. As in most programming languages, the decimal separator is a dot "."; thousands separators are not allowed.

Literals (strings) are enclosed in double quotes, following the ANSI C++ convention. The escape character "\" can be used to insert special characters into literals.

Every line in the file describes a single item.
Comment lines can be inserted, preceded by "//".
Header lines can be inserted, enclosed in "[" and "]". Header strings with a fixed meaning may be defined in the future.

All other token lines consist of multiple fields, separated by commas ",". The first two fields contain the numeric token class and subclass values. Currently these values are in the range 0..255.
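The line rules above can be sketched as a small classifier. This is only an illustration under stated assumptions: the names `Token`, `split_fields` and `parse_line` are hypothetical, blank lines are skipped (the format does not say how to treat them), and the leading version line is expected to be handled separately by the caller.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    cls: int            # token class, 0..255
    subclass: int       # token subclass, 0..255
    fields: List[str]   # remaining fields, kept as raw text


def split_fields(line: str) -> List[str]:
    """Split on commas that lie outside double-quoted literals."""
    fields, current = [], []
    in_quotes = escaped = False
    for ch in line:
        if escaped:                      # character after a backslash
            current.append(ch)
            escaped = False
        elif ch == "\\" and in_quotes:   # escape character inside a literal
            current.append(ch)
            escaped = True
        elif ch == '"':                  # literal starts or ends
            current.append(ch)
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields


def parse_line(line: str):
    """Classify one line as blank, comment, header or token."""
    text = line.strip()
    if not text:
        return ("blank", None)           # assumption: blank lines are ignored
    if text.startswith("//"):
        return ("comment", text[2:].strip())
    if text.startswith("[") and text.endswith("]"):
        return ("header", text[1:-1])
    fields = split_fields(text)
    if len(fields) < 2:
        raise ValueError("token line needs class and subclass fields")
    cls, subclass = int(fields[0]), int(fields[1])
    if not (0 <= cls <= 255 and 0 <= subclass <= 255):
        raise ValueError("class and subclass must be in the range 0..255")
    return ("token", Token(cls, subclass, fields[2:]))
```

Literals are kept with their quotes and escapes intact, so a later stage can decide how to decode them.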

Every file starts with a version indication, currently defined as "TF00". The digits are intended to indicate a major and a minor version number. The major version defines unique values for the token classes; the minor version defines unique subclasses and formats. Adding new subclasses or fields to records increments the minor version number. A file reader of version XY can therefore process all files of major version X: without restrictions for all minor versions up to Y, and by ignoring added subclasses or fields for minor versions higher than Y. The first major version, "0", is intended to cover Basic program descriptions.
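The compatibility rule can be expressed as a short check. This is a minimal sketch, assuming the version tag is exactly "TF" followed by one major and one minor digit; the function names and the result strings "full", "partial" and "incompatible" are hypothetical labels chosen for illustration.

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, int]:
    """Return (major, minor) from a tag such as "TF00"."""
    digits = tag[2:]
    if len(tag) != 4 or not tag.startswith("TF") \
            or not (digits.isascii() and digits.isdigit()):
        raise ValueError("not a version tag: %r" % tag)
    return int(tag[2]), int(tag[3])


def compatibility(reader_tag: str, file_tag: str) -> str:
    """How well a reader of one version can process a file of another."""
    reader_major, reader_minor = parse_version(reader_tag)
    file_major, file_minor = parse_version(file_tag)
    if reader_major != file_major:
        return "incompatible"   # token class values may differ
    if file_minor <= reader_minor:
        return "full"           # every subclass and field is known
    return "partial"            # unknown subclasses and fields are ignored
```

A reader would typically call `compatibility` once, after reading the first line of the file, and switch to a lenient mode for the "partial" case.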

Every file should end with an EOF character (^Z), which can be followed by the recommended (minimum) sizes of the various tables needed to hold all the information. These sizes can be found by reading the last few (max. 255) bytes of the file and searching for an EOF character.
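Reading the trailer might look like the following sketch. The layout of the size list after the EOF character is not specified here, so the code assumes comma-separated decimal integers in the table order given under "Tables"; the names `read_table_sizes` and `TABLE_NAMES` are hypothetical.

```python
import io

EOF_CHAR = b"\x1a"  # ^Z
# Assumed order of the sizes, taken from the table list in this document.
TABLE_NAMES = ("symbol", "type", "token", "line")


def read_table_sizes(stream):
    """Return recommended table sizes from the end of a binary stream.

    Returns None if no EOF character is found in the last 255 bytes,
    and an empty dict if nothing follows the EOF character.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    tail_len = min(size, 255)
    stream.seek(size - tail_len)
    tail = stream.read(tail_len)

    pos = tail.rfind(EOF_CHAR)
    if pos < 0:
        return None
    trailer = tail[pos + 1:].decode("latin-1").strip()
    if not trailer:
        return {}
    # Assumption: sizes are comma-separated decimal integers.
    values = [int(v) for v in trailer.split(",")]
    return dict(zip(TABLE_NAMES, values))
```

The stream must be opened in binary mode (for example `open(path, "rb")`), since text mode may treat ^Z specially on some systems.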

Tables

The information is organized into several tables. The items in each table are indexed from 1 up to the table size; index 0 is reserved for NULL references (no value applicable or assigned). Items are added to the tables by corresponding declarations, each of which automatically assigns the next higher index to the new item. A declaration can occur before or after the first reference to an item.
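A table with these indexing rules can be sketched as follows. The class name `Table` and its methods are hypothetical; the sketch only shows the 1-based indexing, the NULL index and a way to test whether a referenced item has been declared yet.

```python
class Table:
    """Items are indexed 1..len(table); index 0 is the NULL reference."""

    NULL = 0

    def __init__(self, name):
        self.name = name
        self._items = []

    def declare(self, item):
        """Add an item and return its automatically assigned index."""
        self._items.append(item)
        return len(self._items)

    def is_declared(self, index):
        """True if the index refers to an item that already exists."""
        return 1 <= index <= len(self._items)

    def get(self, index):
        """Return the item at a 1-based index, or None for NULL."""
        if index == self.NULL:
            return None
        if not self.is_declared(index):
            raise IndexError("%s table has no item %d" % (self.name, index))
        return self._items[index - 1]

    def __len__(self):
        return len(self._items)
```

Because a reference may precede its declaration, a reader would usually store the index while reading and call `get` only after the whole file has been processed.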

Currently the following tables are defined, with the sizes occurring in this order:

  • Symbol table
  • Type table
  • Token table
  • Line table
More tables may be added as required. This (first) specification covers Basic-specific topics, so only a very restricted set of symbols and types is defined.

Symbols

In most Basic dialects the only named items are variables and procedures, the latter at least for the built-in (intrinsic) procedures. Other languages also have names for types and constants, with various attributes such as scope. String constants can also be added to the symbol table, so that all names and literals in a program can be collected in a single table.

So a symbol definition consists of at least:

  • name
  • symbol type
  • location (in source code, or memory address) or value
  • data type (in most cases)
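The fields listed above could be held in a record like the one below. This is a sketch only: the names `Symbol` and `SymbolKind`, the set of kinds (taken from the preceding paragraph) and the rule that a symbol has either a location or a value, but not both, are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class SymbolKind(Enum):
    # Hypothetical symbol types, derived from the kinds mentioned above.
    VARIABLE = "variable"
    PROCEDURE = "procedure"
    TYPE = "type"
    CONSTANT = "constant"
    STRING = "string"


@dataclass
class Symbol:
    name: str
    kind: SymbolKind
    location: Optional[int] = None             # source position or address
    value: Union[int, float, str, None] = None
    type_index: int = 0                        # type table index, 0 = NULL

    def __post_init__(self):
        # Assumption: "location or value" is read as mutually exclusive.
        if self.location is not None and self.value is not None:
            raise ValueError("symbol has either a location or a value")
```

The data type is stored as an index into the type table, so a symbol without a known type simply keeps the NULL index 0.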

Types

Most integral data types can be described by their byte size and signed/unsigned property.
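As a small illustration, the two properties are enough to derive the value range of an integral type. The sketch assumes 8-bit bytes and two's complement representation for signed types; the class name `IntegralType` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegralType:
    byte_size: int
    signed: bool

    @property
    def min_value(self):
        # Two's complement assumed for signed types.
        return -(1 << (8 * self.byte_size - 1)) if self.signed else 0

    @property
    def max_value(self):
        bits = 8 * self.byte_size
        return (1 << (bits - 1)) - 1 if self.signed else (1 << bits) - 1
```

For example, a 2-byte signed type describes the usual Basic integer, and a 1-byte unsigned type describes a byte.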

Real types are implementation specific, and can often be mapped to the ANSI types of 4, 8 or 10 bytes (single, double, extended). Other sizes (6 or 16 bytes) and variations in the format can be added as necessary.

Character and string types exist in many variations, with different character sizes and representations, and different length indications for strings. Characters are not very difficult to handle: Unicode can be used to store any string literal. Strings can have a static or dynamic allocation, and an actual size. Usually strings are handled by specific subroutines, so a classification of these subroutines is more important than the specification of the storage format. Operators like "[]" can be defined for strings as well as for other data types, which allows for a general description of string operations, regardless of the implementation of the actual string type.

Enumerations, structures and unions are also commonly used data types. Variants can be defined as structures, or as another data type.

Bitfields are another class of types, usually based on some integral type, and can be described like other data structures.

All data types can be organized into arrays and referenced by pointers. Pointers can occur in several types (flat, segmented, short...) and sizes. Arrays can have fixed or variable index bounds, and various memory layouts.

Procedure types can be used on their own, or as descriptions of individual procedures.

Object-oriented data types (classes) can be treated as extended data structures, with various attributes being added to the class members.

More data types can be added, when a special handling of specific data structures is required.
 

Lines

In Basic, line numbers are equivalent to labels, so a line number table is equivalent to a label table. Note that line numbers are not necessarily contiguous, so a line number differs from its index in the line number table.

Every entry in the line number table contains:

  • line number
  • token index
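Because a line number is not the same as its table index, resolving a jump target needs a lookup step. The following sketch assumes the line table is held as a list of (line number, token index) pairs; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

LineEntry = Tuple[int, int]  # (line number, token index)


def build_line_lookup(line_table: List[LineEntry]) -> Dict[int, int]:
    """Map each Basic line number to its 1-based line table index."""
    return {line_no: index
            for index, (line_no, _token) in enumerate(line_table, start=1)}


def token_for_line(line_table: List[LineEntry],
                   lookup: Dict[int, int],
                   line_no: int) -> int:
    """Return the token index of a line, or 0 (NULL) if it is unknown."""
    index = lookup.get(line_no, 0)
    if index == 0:
        return 0
    return line_table[index - 1][1]
```

With a table holding lines 10, 20 and 100, line 100 has table index 3, and a jump to line 100 resolves to the token index stored in that entry.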


Other locations are relative to the stack or based on registers.
Stack variables (procedure arguments...) can be represented by their base location and offset, where the offsets can be organized like data structures.
Register variables can reside in one or more system-specific registers. Some registers can have a special and fixed meaning, like the instruction pointer and the stack pointer.
 

Tokens

All items in a token file are described by tokens. Every token begins with its class and subclass index, followed by zero or more attributes. The attributes depend on the specific kind of token. Every token is described in a single line, so the line number can be used to reference every single token.

The following description is not yet complete.

Classes

The first classification was found in a Basic decompiler:

0 General
1 Globals (code)
2 Variables (data, bss)
3 (unknown variables)
4 Arguments
5 Code (executable)
-
6 Compound Statements and Labels (Basic specific)

This classification was found in a C decompiler:

(0..4 same as above?)
-
5 Compound Statements
6 Simple Statements
7 Labels
8 Conditional Statement
9 Argument
10 Complex Expression
11 Multiplicative Expression
12 Additive Expression
13 Shift Expression
14 Logic Expression
15 Monop
16 Auto-Expression (assignment operators)
17 Indirect Expression

This is the intended new general classification:

0 General
1 Symbol Declarations
2 Type Declarations
3 Variables and Constants (data)
4 Procedures and Labels (code)
5 Statements
6 Operators
7 Operands