Decompiler Token File Format

This document describes an approach to storing program-related information in a general way. Such information can be used by any tool that describes or interprets the contents of a program.

General Format Specifications

The files are in text format, for system-independent data storage. Currently only single-byte character sets are used, according to ANSI Latin-1. For other character sets, the file should be in Unicode (2-byte character) format. Line separators can be LF (Unix) or CR/LF (MS-DOS).

Numeric values are stored in decimal format, as integers or floating-point values. As in most programming languages, the decimal separator is a dot "."; no thousands separators are allowed. Literals (strings) are enclosed in double quotes, following the ANSI C++ convention. The escape character "\" can be used to insert special characters into literals. Every line in the file describes a single item.
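The value rules above can be illustrated with a small reader. The following is a minimal sketch, not part of the format definition: the function names `split_fields` and `parse_value` are hypothetical, the set of recognized escape sequences is an assumed subset of the C++ ones, and fields are assumed to be separated by commas as described for token lines.

```python
# Assumed subset of C++ escape sequences; the format text does not list them.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def split_fields(line):
    """Split one line on commas that lie outside double-quoted literals."""
    fields, buf, in_quotes, i = [], [], False, 0
    while i < len(line):
        ch = line[i]
        if in_quotes and ch == "\\" and i + 1 < len(line):
            # Keep an escaped character (including \") inside the literal.
            buf.append(ch + line[i + 1])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "," and not in_quotes:
            fields.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append("".join(buf).strip())
    return fields


def parse_value(text):
    """Convert one field to str (quoted literal), int, or float."""
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        body, out, i = text[1:-1], [], 0
        while i < len(body):
            if body[i] == "\\" and i + 1 < len(body):
                # Unknown escapes fall back to the escaped character itself.
                out.append(ESCAPES.get(body[i + 1], body[i + 1]))
                i += 2
            else:
                out.append(body[i])
                i += 1
        return "".join(out)
    try:
        return int(text)
    except ValueError:
        return float(text)  # raises ValueError if the field is not numeric


values = [parse_value(f) for f in split_fields('1,2,"a,b",3.5')]
```

Here `values` holds two integers, one string that contains a comma, and one floating-point number. A stricter reader would also reject anything beyond plain decimal digits and a single dot.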
All other token lines consist of multiple fields, separated by commas ",". The first two fields contain the numeric token class and subclass values. Currently these values are in the range 0..255.

Every file starts with a version indication, currently defined as "TF00". The two digits indicate a major and a minor version number. The major version defines unique values for the token classes; the minor version defines unique subclasses and formats. Adding new subclasses or fields to records results in an incremented minor version number. A file reader of version XY can therefore process all files of major version X: without restrictions for all minor versions up to Y, and by ignoring added subclasses or fields for files of minor versions higher than Y. The first major version "0" is intended to cover Basic program descriptions.

Every file should end with an EOF character (^Z), which can be followed by the recommended (minimum) sizes of the various tables needed to hold all of the information. This information can be found by reading the last few (max. 255) bytes of the file and searching for an EOF character.

Tables

The information is organized into several tables. The items in these tables are indexed from 1 up to the table size; index 0 is reserved for NULL references (no value applicable or assigned). Items are added to the various tables by corresponding declarations, which automatically assign the next higher index to the new item. The declarations can occur before or after the first reference to an item.

Currently the following tables are defined, with the sizes occurring in this order:
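As an illustration of the version rule, the EOF trailer, and the 1-based table indexing described in this section, here is a minimal sketch. All names (`parse_version`, `can_read`, `must_skip_unknown`, `find_trailer`, `Table`) are hypothetical. The sketch assumes that the version indication is exactly the four characters "TF" plus two digits, that ^Z is the byte 0x1A, and that the trailer starts after the last such byte. The layout of the trailer is not interpreted.

```python
def parse_version(header):
    """Return (major, minor) from a version indication such as 'TF00'."""
    digits = header[2:]
    if (len(header) != 4 or not header.startswith("TF")
            or any(c not in "0123456789" for c in digits)):
        raise ValueError("not a token file version: %r" % header)
    return int(digits[0]), int(digits[1])


def can_read(reader, found):
    """A reader handles every file with the same major version."""
    return reader[0] == found[0]


def must_skip_unknown(reader, found):
    """True if the file may contain subclasses or fields the reader must ignore."""
    return reader[0] == found[0] and found[1] > reader[1]


def find_trailer(data):
    """Return the bytes after the EOF character in the last 255 bytes, or None."""
    tail = data[-255:]
    pos = tail.rfind(b"\x1a")  # assumption: the last ^Z starts the trailer
    return None if pos < 0 else tail[pos + 1:]


class Table:
    """Item table with 1-based indices; index 0 is the NULL reference."""

    def __init__(self):
        self._items = [None]  # slot 0 stands for "no value"

    def add(self, item):
        """Declare a new item and return its automatically assigned index."""
        self._items.append(item)
        return len(self._items) - 1

    def get(self, index):
        """Return the item at index, or None for the NULL reference 0."""
        return self._items[index]

    def __len__(self):
        return len(self._items) - 1


symbols = Table()
first = symbols.add("A$")
trailer = find_trailer(b'0,0\n\x1a100,20')
```

Reserving slot 0 in the backing list keeps file indices and list indices identical, so no offset arithmetic is needed when references are resolved.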
Symbols

In most Basic languages only named variables exist, and also named procedures, at least for the built-in (intrinsic) procedures. Other languages also have names for types and constants, with various attributes such as scopes. String constants can also be added to the symbol table, so that all names and literals in a program can be collected in a single table.

A symbol definition therefore consists of at least:
Types

Most integral data types can be described by their byte size and signed/unsigned property.

Real types are implementation specific, and can often be mapped to the ANSI types of 4, 8 or 10 bytes (single, double, extended). All other types (6 or 16 bytes) and variations in the format can be added as necessary.

Character and string types exist in many variations, with different character sizes and representations, and different length indications for strings. Characters are not very difficult to handle; Unicode can be used to store any string literal. Strings can have a static or dynamic allocation, and an actual size. Strings are usually handled by specific subroutines, so a classification of these subroutines is more important than the specification of the storage format. Operators like "[]" can be defined for strings as well as for other data types, which allows for a general description of string operations, regardless of the implementation of the actual string type.

Enumerations, structures and unions are also commonly used data types. Variants can be defined as structures, or as another data type. Bitfields are another class of types, usually based on some integral type, and can be described like other data structures.

All data types can be organized into arrays and referenced by pointers. Pointers can occur in several types (flat, segmented, short...) and sizes. Arrays can have fixed or variable index bounds, and various memory layouts.

Procedure types can be used on their own, or as descriptions for every single procedure.

Object-oriented data types (classes) can be treated as extended data structures, with various attributes added to the class members.

More data types can be added when special handling of specific data structures is required.
Lines

In Basic, line numbers are equivalent to labels, so a line number table is equivalent to a label table. Please note that the line numbers are not necessarily contiguous, so a line number generally differs from its index in the line number table.

Every entry in the line number table contains:
Tokens

All items in a token file are described by tokens. Every token begins with a class and subclass index, followed by zero or more attributes. The attributes depend on the specific kind of token. Every token is described in a single line, so the line number can be used to reference every single token.

The following description is not yet complete.

Classes

The first classification was found in a Basic decompiler:
0 General
This classification was found in a C decompiler: (0..4 same as above?)
This is the intended new general classification:
0 General