Sergey Polak
Pace University,
Department of Computer Science
Advisor: Dr. Allen Stix
It is a mistake to consider the prime characteristic of high-level languages to be that they allow us to express programs merely in their shortest possible form and in terms of letters, words, and mathematical symbols instead of coded numbers. Instead, the language is to provide a framework of abstractions and structures that are appropriately adapted to our mental habits, capabilities, and limitations.Niklaus Wirth
It should be obvious to anyone engaged either in the academic aspects of computer science or in real-world programming that there is no such thing as the ultimate programming language. There are hundreds, if not thousands of languages out there, and while one may argue a certainly language is better than another, none of the languages could be called perfect. Whether the language be relatively obscure or extraordinarily popular, it has many redeeming qualities as well as a large number of flaws. While the obscure languages may not get much attention, the more popular ones, such as C, C++ and Pascal get a great deal of attention from both the programmers and the computer scientists alike. The many flaws of the language are brought out and the various features that make it the language of choice for one task or another are illustrated in articles and books. Pascal's many strengths and weaknesses are well documented in the scientific community, and most programmers who have had experience with Pascal have their own views. The immense popularity of C and C++ both in the academic community and in the business world have made it a subject of many a praise and even more critiques.
The main point I am trying to illustrate is that no language, no matter how popular, is universally accepted as the perfect programming language. My own experience in programming over the years has brought me in contact with features that made me almost giddy with excitement and with features that drove me to pulling my hair out. When Pace University switched to C++ as its teaching language in the department of Computer Science, I couldn't wait to learn the language that is so widely used in the business world. My excitement was short-lived, however, and complaints about many aspects of C++ poured forth in my programming classes. After one such complaint, Dr. Mary Courtney challenged me to write my own language, and the idea was deeply ingrained in my head. Actually writing a programming language is a task far too immense for me to undertake at this point. What I will do in this paper, however, is move in the general direction of suggesting a new language. I will suggest features of a language that are useful to programmers, features that make a good programming language. In discussing these features I will examine existing languages, point out their flaws, make suggestions for new features and gather good features from many languages together. It is my hope that this work can eventually lead to an actual programming language.
Before going further it is necessary to define the aspects of a programming language that make it good. I will focus on five such areas that I have identified:
1) A language must be expressive. The programmer can not waste time inventing clever ways of solving trivial problems, nor can he be forced to deal with mundane issues. The language must provide a rich set of features, both as language constructs and as library routines, or the task of writing a program becomes too cumbersome to get any real work done.
2) A language must be extensible. It must be easy and convenient for a programmer to extend the language by defining his own libraries. A language that can not be easily extended is entirely useless to the business community, for nearly every software company has its own custom library or two. More importantly, however, these extensions must blend into the language and appear indistinguishable from native language constructs, because their use has to be as natural to the programmer as the use of built-in features.
3) A language must be readable. Common wisdom in the programming community holds that time spent maintaining code is immensely greater than the time spent actually writing it. When one considers this little fact, for it truly is a fact, the value of readability in a language becomes apparent. Those computer scientists who have tried to read other people's code will agree in a heartbeat that it is hard enough to decipher another person's thinking, let alone when it is obfuscated by an unreadable language.
4) A language must promote correct, safe programs. Program verification is still very much a theoretical topic, yet the programs we write must be correct and must operate predictably in all cases. Without a fool-proof way to verify programs, programmers must rely on various techniques to ensure the correctness of their programs. A programmer writing in a language that promotes the writing of a correct program has a very big advantage over someone programming in a language that does not, because, as I pointed out above, the time spent maintaining code is much greater than the time spent writing it. Debugging any program, even a small one, is a task of Herculean proportions, and any language that reduces the possibility of programmer error and allows a programmer to build safety features into the program is a great language.
5) A language must be portable. The need for portability may not seem obvious to someone who is faced with one type of computer in the office or at school, but the business world craves portability. Even considering that the Intel microprocessor is the de-facto world standard, the market for software written for other types of processor is huge. A language must shield the programmer from the various issues that arise in porting a program from one platform to another, so he can focus on improving the program rather than trying to figure out how to deal with minute differences in machine architecture.
While I am certain the above list of features is by no means exhaustive, it seems to me to cover the most important aspects of programming.
With the key areas outlined, the only thing that is left to decide before going on is what type of language is best. Over the years a number of different kinds of programming languages have emerged: imperative, functional and object-oriented, among others. The type of language fundamentally alters the structure and the features that correspond to each of the five areas I outlined, so it is impossible to talk about these features without knowing what kind of language they will go into. Each of the language types has qualities that speak in its favor and features that detract from the language's usefulness.
Imperative programming is one of the more natural ways to program, because humans tend to think in terms of actions and commands. Object-oriented programming allows for easy simulation of real-world objects, and makes many concepts that require additional structure in imperative programming seem natural and obvious. Functional programming advocates a strict, mathematical approach to programming and allows for extraordinarily elegant and simple solutions to some common problems, such as quicksort. My experience working with the three types of languages seems to suggest that the ideal language would be a good balance of imperative and object-oriented programming with a few elements of functional programming mixed in. None of the types provide a good working environment for the programmer separately.
A purely imperative language makes the task of hiding data difficult, and makes it more cumbersome to bind code to data. In C or the many extensions to standard Pascal, including Borland Pascal, information hiding must be accomplished with libraries. This introduces additional structure into the language that is not needed for any other purpose than to hide the information. The necessary syntax would be too cumbersome to demonstrate. Hiding information in standard Pascal is simply not possible, because Wirth did not add the module concept to the language. The alternative syntax in an object-oriented language, however, is easy and straightforward. The following code is in Object Pascal:
TExample = CLASS
PRIVATE
MyVar1: INTEGER;
PUBLIC
MyVar2: INTEGER;
END;
The syntax very clearly and easily states that MyVar1 is a private variable that belongs to objects of type TExample, whereas MyVar2 is public. No separate units, libraries or files are needed to prevent any code not belonging to TExample from modifying the private variable.
A language that is purely object-oriented, while allowing for easy and powerful simulation of the real world with objects and classes, is not suitable for every programming task. The contortions one has to go through in Java, for example, to write the famous "Hello, world!" program would make any self-respecting programmer cringe with disgust. Witness the code:
public class HelloWord {
public static void main(String[] args) {
System.out.println("Hello, world!");
}
}
Not only does one have to create a custom class to print out a mere thirteen characters, but the peculiar concept of static functions has to be introduced (static functions in Java are functions that can only operate on the class itself, not any instance of it; they can also only use local or static variables). When compared to the same code written in standard Pascal or in C, the Java code looks even uglier:
| (C) | (Pascal) |
|---|---|
#include<stdio.h>
void main() {
printf("Hello, world!\n");
}
|
PROGRAM HelloWord;
BEGIN
WRITELN('Hello, world!');
END.
|
The code is clean, simple and to the point. No extraneous concepts need to be introduced, at least in the Pascal version. The C version becomes slightly more complicated because of the inclusion of a header file, but that issue is not nearly as complex as the discussion of classes and static functions that must take place before the program can be written in a purely object-oriented language such as Java.
A purely object-oriented language simply takes things too far. In Smalltalk, for example, everything is an object, from the integer 1 to the boolean TRUE. Even loops are done in object-oriented style. The following code performs a simple copy of input data to output in Smalltalk:
[inputStream atEnd]
whileFalse:
[outputStream nextPut: input Stream next]
The way to understand it is that the code to write the next character from the input to the output is passed as the argument to the message whileFalse which is being sent to the block that determines if the input stream has ended. The code seems odd when compared to the similar code in standard Pascal:
WHILE NOT EOF(Input) DO BEGIN READ(Input, C); WRITE(Output, C); END;
The above code assumes that a character variable C has been declared. Even with that assumption, it is much clearer and easier to understand: until the input file has reached the end-of-file, read a character from the input and then write it to the output.
Functional languages, in their strife for a mathematical and organized approach to programming often require a technique that is not natural. Functional programs emphasize recursion and introduce concepts that make the languages of this type difficult to program in and unsuitable for most applications. Some problems, however, can be solved with remarkable elegance. The following example code in ML, which checks to see if an element is in a list, would be nearly impossible to write in an imperative language, and would require advanced knowledge of object-oriented design to accomplish in an object-oriented language (though it would still not be as simple and elegant):
fun Member(_, []) = false
| Member(x, n::L) =
if x = n then true
else Member(x, L);
The elegance of the code is even more astounding when one considers that the code would work on any list. That is, as long as the list is of items of the same type as the item one is looking for, this code would work perfectly. To accomplish this in an imperative language the programmer would have to play with fire and use pointers, while an object-oriented programmer would have to rely on inheritance and a list class to solve the problem.
Being that no language type is best suited for all problems, the only natural thing to do is to combine aspects of the three types of programming languages into one. A language such as Object Pascal or C++, which provides object-oriented tools on a solid foundation of an imperative language is very close to being ideal. Adding some of the features of functional programming without detracting from the power of imperative and object-oriented style brings the language even closer to being perfect.
Now that the type of language is settled, it is time to turn to discussing the five areas that make or break the language. In the following discussion the order in which I talk about certain features is in no way related to their importance, but is purely arbitrary. Furthermore, examples of features that would go into a new language are given in a Pascal-like syntax. It is my firm belief that the Pascal syntax is better than other possibilities, and this is a topic that I will discuss at length further.
The expressiveness of a language is one of its most important features. If the programmer has to spend time writing code to deal with a deficiency in the language, the underlying program suffers. It is simply imperative that the language give the programmer the tools he needs to easily implement everything from the simplest to the most advanced programming concepts.
As my first example of language deficiency in terms of expressiveness, I suggest one of the more common data structures: the linked list. On occasion the programmer has to deal with complicated implementations, such as doubly-linked lists, but the plain and simple one-way linked list, whether sorted or not, is a data structure that's highly useful. Dealing with a linked list is equally cumbersome whether one is using an imperative or an object-oriented language. The following example declares a linked list data type in Pascal:
TYPE
PList = ^List;
List = RECORD
Data: SomeType;
Next: PList;
END;
If the programmer wants to use linked lists in a program, he must use pointers. That alone isn't too bad, since Pascal, being a strongly typed language, puts some restrictions on what one can do with pointers. The problem, however, is that the programmer simply can not write any standard set of routines that would handle linked lists. Since each list is a different type, a new set of routines must be created to handle the list.
Someone programming in an object-oriented language would probably jump up and exclaim that they have a wonderful solution to this problem. It is indeed possible to create one set of linked-list routines that would handle all linked lists. To do this, a standard linked list class, say TLinkedList, is created. Routines that operate on this class are then written. Any time the programmer needs a new linked list, he would derive a new class from TLinkedList and, through the virtue of inheritance and polymorphism, still be able to use routines written for the original TLinkedList. The trick here is that all classes derived from TLinkedList have all the same properties as TLinkedList itself, and can be used anywhere TLinkedList is expected.
To someone whose experience has been only with imperative and object-oriented programming, this approach would seem like a wonderful solution. And indeed it is a marked improvement on the situation one faces when programming in a language like Pascal. But the contortions one has to go through to get the desired effect far outweigh the benefit, especially when one considers how functional programmers deal with linked lists. The ML list construct simply makes the list type a part of the language. No custom functions or data types are needed to traverse the list or add an element to the list. A simple declaration, like:
TYPE ListOfIntegers = LIST OF INTEGER 1
would be all that's needed to create a linked list that holds integers. No record declarations and no class derivations are needed. The programmer is free from having to worry about memory allocation, about writing code for initialization of list head pointers and next pointers in the structure. With the list a data type integral to the language, a simple statement, like:
A := [1, 5, 99, 30];
would be all that's needed to create a linked list of four integers on the fly and assign it to a list variable A. This example is extremely simplistic, yet it illustrates that far more powerful applications of such a tool are possible.
With multiprocessor computers becoming cheaper and more commonplace, and the mainstream operating systems (read as "Microsoft Windows") moving towards multiprocessing, the need for a language that gives the programmer the basic tools needed for writing software that can take advantage of more than one processor has never been greater. Some languages have attempted to tackle the problem by building complicated concepts directly into the language, as in Ada. But the way multiprocessing is handled is highly dependent on the operating system. No matter what anyone says, people who write real software that will be used in the real world have to deal with the operating system the software will run on. Cross-platform portability, when it comes to multiprocessing, has to be the task of the people who write the compiler and operating-system specific libraries. What the language must provide is the basic tools with which the people who design the compiler packages can create libraries and routines that will help the programmer. "Elaborate tasking models have no place in a systems-implementation language. A few simple mechanisms are needed, not a complicated structure." (1) I believe the statement holds true for a general-purpose language as much as for a systems-implementation language.
The are a number of concepts that have been developed for multiprocessing. Among those are the few simple mechanisms that allow for a good foundation. A construct to allow for parallel processing of statements is one such mechanism. A construct similar to:
PARABEGIN A := 125; B := 99; C := 'Hello, world!'; PARAEND;
would be all that's needed for a programmer to indicate that statements can be executed in parallel on multiple processors. The syntax would fit in quite nicely in a block-structured language and nesting of parallel and sequential blocks would look natural.
Semaphores are needed in a language to go with the PARABEGIN construct. While it certainly is possible to simulate semaphores, a language that is serious about multiprocessing would have a dedicated semaphore type and the implementation would take all the appropriate steps when dealing with semaphores, taking advantage of special hardware or operating system instructions. The semaphore type that seems most useful to me is the Proberen-Verhogen (PV) semaphore. By issuing simple P() and V() instructions on these semaphores the programmer can control multiple processes and their access to critical sections. Due to their nature, it may be a good idea to make the semaphore type compatible with integer variables, but that raises a type-compatibility issue the discussion of which is beyond the scope of this paper.
Another tool the language must provide is the ability to start and control threads. A facility that includes built-in procedures for starting a thread, checking a thread's state and for killing a thread must be provided. Myself having very limited experience with writing code for multiple threads, I can't say much about the specifics of such facilities. The exposure to the Java implementation of threads, where the programmer is forced to derive a new class from the library class Thread, left me with a desire for a more elegant solution. Regretfully I don't yet have the background to suggest one.
Different languages have different approaches to arrays, with some being more expressive than others. The array being a highly useful construct, I often found the facilities provided by a language to be too constricting. My frustration is probably the greatest with the approach C took to arrays. First the mere notion that an array is a pointer is enough to make one's hair stand up. The following sentence from the Kernighan and Ritchie manual for the C language illustrates this repulsive idea: "The expression E1[E2] is identical (by definition) to *((E1) + (E2))." (2) Whether done for the sake of efficiency or for some other reasons, this awful concept isn't the only thing that destroys the usefulness of arrays in C and makes them highly dangerous constructs to use. What makes matters much worse is that in C, and subsequently in C++, all array subscripts begin at 0. If the application calls for an array with subscripts ranging from -15 to 134, the programmer must write his own code to do the necessary arithmetic. Forcing the programmer to deal with this artificially imposed limitation takes up valuable time and introduces new sources for error in code, sources that simply would not be there in a language such as Pascal. R.P. Moody, though talking about the use of C as an educational medium, hit the nail on the head regarding the C language: "When C is used as a medium of instruction, irrelevant issues become overwhelmingly large in number and significance. When the irrelevant becomes significant, the essentials become obscured and incomprehensible." (3) The restriction on array subscripts is just one example of an irrelevant issue that obscures the essentials.
There is nothing, absolutely nothing, wrong with allowing the programmer to define array subscripts that do not begin with 0. The compiler should be the one doing the necessary arithmetic if the application calls for an initial subscript of -15 or of 65, not the programmer. Nor should the programmer be limited to only using numbers as his subscripts. Pascal takes the right approach by allowing booleans and enumerated types to serve as array subscripts. The only reason this is allowed in C is that a character is really an integer and not a separate data type in and of itself.
Array subscripts have been much debated in the programming community because of the need for dynamic arrays. The C approach to dynamic arrays is through the use of the aforementioned "array is a pointer" pun. By dynamically allocating memory and then using the array notation with a pointer variable, the programmer can effectively create an array of any size at any time it is needed in the code. An alternative solution, as in Modula-2 and some implementations of Pascal, is to provide for open-array parameters. That is, a parameter can be declared as ARRAY OF sometype, and an array of that base type, of any size, can be passed to the procedure.
The Modula-2 solution is not a solution at all, even though it is a highly useful facility in any programming language and should by no means be discarded. The only modification I would like to see is the change in parameter notation to allow specification of the number of dimensions, and two built-in functions, perhaps called LBound and UBound, that would return the lower and upper bounds of the subscripts, as opposed to reverting to the annoying C way of starting the subscripts off at 0. The C solution is certainly viable, but so much has been said about the danger of pointers that using them where other means of solving the problem can be employed seems to be a crime. I see no reason why a language should not allow the programmer to specify the type of an array, including the number of dimensions, without explicitly specifying the size. A call to some built-in procedure, perhaps even the same NEW procedure used for dynamic memory allocation, would then be made to allocate the storage for the array once the exact size is known. While underneath this would undoubtedly be implemented with the same pointers C uses, there is no good reason whatsoever why the programmer should be made painfully aware of it. The compiler can just as easily take care of the details, and the programmer would be free to use the variable as if it was any old array.
The discussion of arrays also brings to mind the subject of strings. No matter what anyone says, it is my firm belief that any language, regardless of its purpose, must have a powerful and flexible string-handling facility built-in. A program is very rare if it has no need for string handling, and I myself have had to write a great deal of programs, both at work and for my own uses, that depended heavily on strings. Some languages put strings in as an afterthought, and others put in some very basic features and leave the rest to library routines. That just can not be. The text string is a fundamentally important data type and can not be ignored, nor can it be relegated to blatant impersonation by some other type, such as array of characters. A string data type is required in a good language.
The very popular language C, and C++ as well, have horrendous string-handling facilities. Not only is the programmer required to declare his strings as character arrays, but there simply is no way to deal with strings as entities in the language. An assignment of one string to another is not possible. What is needed is the following piece of highly obfuscated code, taken directly out of the Kernighan and Ritchie C manual:
while (*s++ = *t++); (4)
Luckily this has been put into a convenient function that goes by the name strcpy() (the "o" must have been omitted for the sake of efficiency). The comparison operators can not be used on strings, not even on string constants, and another library routine must be used. Of course the routine returns either a negative, a 0 or a positive number, so code such as:
if(strcmp("String 1", "String 2") == 0)
DoSomething();
is needed to check if two strings are identical. To complicate matters further, the unsuspecting programmer who compares two string literals or two string variables directly is doomed to suffer from the fact that arrays are pointers, and even if the strings are the same, the comparison will fail, since they would not be stored at the same location. To someone writing a virus designed to destroy the boot sector, this lack of convenient string-handling wouldn't seem like a problem at all. The programmer writing the word processor that is going to be released to the public, however, must concern himself with string-handling a great deal.
Luckily other languages provide better facilities for dealing with strings. The Borland Pascal approach, while it does indeed emulate strings with arrays, hides the true nature of the string from the programmer by providing him with a STRING data type. Assignment, comparison and even concatenation of strings and string constants is allowed using the operators defined in the language, so the two pieces of code illustrated in C above can be written in Borland Pascal as follows:
s := t; IF 'String 1' = 'String 2' THEN DoSomething;
The code is plain, simple, natural and to the point. It does not involve learning any library procedures, nor does it require the inclusion of any header files to use strings. The drawback, of course, is that these strings are limited to a maximum length of 255 characters, whereas the C strings have no such limitation. The reason for the limit, while it may not be apparent, lies with the way Borland is storing the length of the string. As I mentioned, the string is an array, so the length is being stored as the very first element of the array. Because arrays have elements that are uniform in size, and a character only requires one byte, the same one byte is allocated for the length. The maximum value that a byte can have is 255, hence the seemingly arbitrary limitation. In most ordinary situations the 255-character string is more than enough to handle the needs of the programmer, but times arise when longer strings have to be manipulated. If the programmer is using Borland Pascal, he is forced to resort to the same trickery as the C programmer, using library routines and fooling with arrays of characters directly.
I see no good reason, other than making the job of the compiler writer easier, to not include a rich and powerful set of tools for dealing with strings as part of the language. It is the job of the compiler to jump through the proverbial hoops for the benefit of the programmers. When the programmer has to use awkward notation or, worse yet, write extra code, because the language designer decided not to put certain important features into the language to make the job of writing the compiler easier, the language becomes a problem, rather than a solution. The string data type must be included as a basic and fundamental type in the language, and it must allow for strings longer than 255 characters.
The easy solution to this problem is to use an array with each element the size of two bytes to represent the string. This way the length of a string up to 65,535 characters can be stored in the first element, and the null terminator stored in the last. While this approach would indeed double the space required to hold any given string, this problem seems not to have hampered the excitement in the programming and business community over Java. Because Java strings are made up of Unicode characters rather than ASCII characters, they require two bytes. The main advantage of this is that it allows for easy internationalization of programs, since Unicode includes characters for more languages than I can count. There are additional advantages to this approach, one of which is the easy and efficient representation of long strings with arrays, and the other is increased safety, a topic I will discuss later.
When talking about expressiveness of a language, one would be wrong not to mention loops. Various languages provide different loop facilities, but none are as powerful as they could be. The loops present in Pascal are adequate, but just that. The Pascal FOR loop gives the programmer the ability to go either forward or backward, with the increment or decrement of 1, and nothing else. If the programmer needs to iterate through some code with the increment of 2, he is forced to use a WHILE loop. The C for() loop is much more generic, but in providing this generality it introduces several unnecessary steps if all one needs is a simple loop with an increment of 1. It also introduces room for programmer error, because the programmer now has to carefully think whether the loop is to go from 1 to 10 inclusive, and if it is, how the condition should be written. The for() loop in C is very powerful when one needs a loop with an initialization, a check of some sort and a statement of some sort performed once at the end of each iteration. It does, however, give "the appearance of the for-statement being actually a compound of assignment, GOTO, and conditional GOTO, instead of being one of the elements of structured programming." (5)
I see no reason to sacrifice the power of the C for() loop just because it gives the wrong impression. It should not be used, however, when the programmer really does need a loop with a counter and a constant increment or decrement. Instead, the C for() loop should be used in those instances when there is an initialization to be performed and when there is a standard set of statements that must be executed at the end of each iteration of the loop. The C for() loop should be present in a good programming language, though under a different name, perhaps something as generic as just LOOP. The Pascal FOR loop should also be present, but in a slightly altered form. The alteration should make it similar to the FOR loop found in Basic, where the programmer can specify the value of the increment by adding a STEP clause to the loop.
Another rather useful construct that is applicable to loops, a construct which I have not actually seen in any of the languages I have used, is the situation case statement proposed by C.T. Zahn in 1974. This statement, the syntax of which is demonstrated below, allows the programmer to execute some code which could, but does not have to, be a loop, until a situation has occurred, and then, through the extension of the CASE syntax, to determine which situation caused the code to terminate and act accordingly. Any termination of the code other than by signaling a situation is illegal. This device is highly useful in handling things such as table searches and the like. The following example searches through the array A until it finds element E. TableFound and NoMatch are the signals, identifiers whose scope is only the situation case statement and which are implicitly declared in the UNTIL clause. X is a simple integer variable, and N holds the number of elements in the array. On the right is the same code in standard Pascal:
UNTIL TableFound OR NoMatch DO
FOR X := 1 TO N
IF A[X] = E THEN BEGIN
MatchIndex := X;
TableFound;
END;
NoMatch;
THEN CASE
TableFound: DoSomeFoundCode;
NoMatch: DoSomeNotFoundCode;
END; |
X := 1;
Found := FALSE;
WHILE NOT Found DO
IF A[X] = E THEN BEGIN
MatchIndex := X;
Found := TRUE;
END
ELSE
X := X + 1;
IF Found THEN
DoSomeFoundCode
ELSE
DoSomeNotFoundCode;
|
The code using the situation case statement is clearer and easier to follow. It states, right at the beginning, that the code will be executed until one of two conditions. It can use a standard FOR loop. At the end of the search process it has a case for each of the situations that can cause it to terminate. The code on the right, written in Pascal, requires the use of a WHILE loop (the C for() loop could be used, confusing the code further). It does not list all the possible cases at the top, but only specifies that the loop will continue until something is not found. The code used to analyze what happened is an IF-THEN-ELSE structure in a separate statement. The Found variable has to be declared as a boolean, and its scope is greater than the statement itself. But what is even more fundamentally important is that the situation case statement can handle more than two cases, as in the search example! The programmer can have 3, 4 or even 20 situations occur in the code. He is also guaranteed by the syntax of the construct that all possible causes for the code's termination are listed upfront and that one of the situations must happen if the code is to terminate. He is also guaranteed a handler for all possible situations, whereas someone simulating this could easily forget to write a handler. The situation case statement would clearly be a valuable addition to a language.
One final language construct I would like to mention is the conditional statement. This statement is a very useful way of specifying which of two expressions is to be evaluated based on the truth value of a boolean expression. Pascal lacks such a statement, while the syntax of the C version could use some improvement. Nonetheless, the conditional statement is a tool of significant convenience and ought to be included. Observe the extra code needed to simulate it:
A = (X < Y ? X : Y); |
IF X < Y THEN A := X ELSE A := Y; |
The above code does the same thing. It assigns the smaller of two variables, X and Y, to the variable A. The code on the left, written in C, is not just shorter than the Pascal code on the right, but allows for more possibilities, such as putting the conditional expression in the list of actual arguments to a function. Pascal would require an explicit temporary in this case, or the writing of a custom function. Doing either introduces issues that are not relevant to the overall program, takes time and opens up additional possibilities for the programmer to make a mistake. An expressive language would minimize these openings.
Even when the language provides a rich set of tools for writing a program, it is sometimes necessary to go around the compiler, to go around the language and deal with the basic underlying machine. Talk of high-level languages and of machine abstraction makes a great deal of sense in the classroom, except "in the real world you often find yourself forced to access actual machine addresses, or call subroutines contained in ROM at some machine address." (6) For that reason the language should have an escape facility through which the programmer could write assembly code when necessary. Nothing elaborate is needed in the language, short of a block structure that would instruct the compiler that enclosed statements are to be interpreted as assembler instructions, rather than statements in a high-level language. Implementing a language without this facility would be difficult, and certain system-level programming problems could not be solved without a way to include assembler code. While linking of externally compiled routines is a solution, it is not an acceptable one. Introduction of separate compilers into a programming project only opens the door for additional problems. The facility for linking of external code should be there, without a doubt, but it should not be the sole means of adding instructions in assembler to a program.
As great as assembler is, a language would be entirely useless if that was the only way to go beyond the facilities provided by the language itself. The programmer must have means for extending the language to suit his own needs. Even the casual programmer has his own set of routines that he uses in programs here and there, and software development companies have custom libraries developed by in-house programmers to solve the specific problems faced when writing software in their company. The perfect programming language will give the programmer the ability to write his own functions and procedures, to store them in his own libraries, and to blend them into the language so they seem natural to use and don't stick out as a sore thumb.
Of the language facilities that fit the criteria I outlined above, operator overloading is definitely an important one. The designers of Java, while making the language very similar to C++, decided that operator overloading was a "poorly understood, confusing and rarely used feature of C++." (7) I will not argue with their claim that it's a poorly understood or rarely used feature, for I have no evidence one way or another. I must, however, argue the confusing point. I see nothing the least bit confusing about operator overloading. The facility that allows the programmer to define the behavior of an operator on a data type for which the language does not define the operator is a very powerful and straightforward feature of a language, provided it is used correctly.
Since the language can't anticipate every data type that every programmer will ever create, it is perfectly natural that the language only defines operators on the types that are built into the language. When the programmer decides he needs to extend the language by creating a new data type, whether it be a class in object-oriented programming, or a structure, he should be given the ability to define how built-in operators behave when applied to this data type. To take that power away from the programmer, as Java did, or simply not give it to him at all, as Pascal does, forces the programmer to use awkward and unnatural syntax when dealing with custom data types. I will now demonstrate with an example.
As a programmer dealing with real-world problems, I've ran into situations where the numeric variable facilities provided by the language were simply not sufficient. The solution, of course, was to write my own set of routines to deal with integers larger than those provided by the language. Working in Object Pascal, it made perfect sense to build a class to bind the code with the data into one useful little package. The code written, however, I couldn't simply use my new integers as if they were indeed integers. The code that relied on them wouldn't even look like it was using integers, but would blatantly display itself as an add-on to the language, because every time I needed to perform even a simple arithmetic operation, such as add two large integers together, I had to call a procedure. If I had the operator overloading facility, however, I could have written the arithmetic operations so that the standard operators could be used to perform the standard procedures. I could use such notational conveniences as A + B, rather than resorting to A.Add(B). There would be no need for me to even know how this large integer type was implemented, because it would not only behave, but even look like, a built-in type. As a programmer I could happily forget about the inadequacy of the integer variables of the language, and continue using my custom type in the most natural way.
Any programmer who takes the language he programs in seriously and is serious about being able to extend it to suit his own needs should find operator overloading the most useful of features, not something that is rarely used. The inclusion of operator overloading in a language brings some important issues, however. One such issue is whether the programmer should be able to redefine the behavior of operators on existing data types. Another issue is whether the programmer should be allowed to create new operators. Another thorny issue is how the assignment operator is to be handled. An in-depth discussion of these issues is beyond the scope of this paper. I will say that redefinition of operators on existing data types is too good of a way to introduce confusion and should not be allowed, nor should the creation of new operators be allowed. The appropriate way to deal with assignment would probably involve a special syntax.
A wonderful feature that would go hand-in-hand with operator overloading has to do with object oriented programming. I have only seen the facility in Borland Delphi, and I don't know if it was developed by Borland specifically for Delphi, or if they borrowed the concept from elsewhere. My attempts to reach the Delphi development team at Borland were unsuccessful. Delphi allows the programmer to define properties in classes. A property is a special field name that has read and write attributes. Each of the attributes can specify either a member variable, or it can specify a method to be executed. Through this facility it is possible to create read-only or write-only field variables, among other useful applications. I will demonstrate by an example of a simple Delphi class:
TExample = CLASS
PRIVATE
FMyInt: INTEGER;
FMyString: STRING[50];
PROCEDURE SetString(S: STRING);
PUBLIC
PROPERTY MyInt: INTEGER READ FMyInt;
PROPERTY MyString: STRING[50] READ FMyString WRITE SetString;
END;
The class defines two private fields, FMyInt and FMyString. It also defines a private method SetString. The only public elements of the class, that is elements that can be seen and used by programmers that are working with the class, are MyInt and MyString, both properties. Properties can be used as if they were fields of an instance object, so something like I := E.MyInt, where I in an integer and E is an object of the class TExample, is a perfectly valid piece of code.
MyInt is a read-only property, since it does not define a WRITE attribute. This means that while MyInt can be used on the right-hand side of an assignment, or even as a parameter to a procedure, it can not be assigned to. This is a convenient way for the programmer to specify that he wants outside users to know the value of the field FMyInt, but doesn't want them to be able to change it. If a C++ or Java programmer wanted to do the same thing, he would have to write a function whose sole purpose would be to return the value of the field MyInt. The MyString property is far more interesting, however. It defines a property that, when read, simply returns the value of the field variable FMyString, but when written to, through the use of an assignment statement, executes a procedure. This marvelous way of setting the value of a field variable lets the programmer retain the convenience of the standard assignment statement while actually executing code in the background to do whatever is necessary when the field is being written to. While nothing would prevent the programmer from executing such code in C++ or Java, the user of the class would need to explicitly call a function to assign the value rather than use the good old assignment statement. Here is the declaration of the same class in Java:
class TExample {
private int MyInt;
private String MyString;
public int getMyInt() {
return MyInt;
}
public String getMyString() {
return MyString;
}
public void setMyString(String S) {
//some code goes here
}
}
In addition to explicitly calling a function to either read or write a field variable, the programmer needs to remember the different function names to be used for different things (an easy task, granted, if everyone sticks to the conventional way of naming the access methods). What follows is a comparison of how the TExample class can be used in Delphi and Java. Assume that I is an integer variable, S is a string, and E is an object of class TExample:
(Delphi) I := E.MyInt; E.MyString := E.MyString + 'stuff';
(Java)
I = E.getMyInt();
E.setMyString(E.getMyString().concat("stuff"));
I doubt anyone would argue with me when I say the Delphi code is shorter, cleaner, simpler and easier to understand.
The examples of Delphi properties above are very rudimentary. Delphi allows the programmer to do much more with properties. One can, for example, create array properties. Then the subscripts of these arrays can be any valid data type, from string to pointer to a floating point number, if that's what the application calls for. How the class stores the values and resolves the subscripts is of no concern to the user, because the property looks and behaves as a true array. An array property can also be declared as default, so that the object itself can be used as if it were an array, without explicitly naming the property.
A truly extensible language will let the programmer write procedures and functions that are as powerful as the library routines. That is why the library routines themselves (I am referring to the system library of the so-called built-in routines) should be written in the language. The Pascal WRITELN procedure, for example, is a wonderfully powerful tool for producing output because it can not only take a variable number of parameters, it can take parameters of any built-in type. The programmer can't write a procedure like that in the language, though. This makes the writing of truly generic routines difficult, and makes the implementation of WRITELN in Pascal itself impossible. If the language provided the programmer with the facility to dynamically determine the type and number of parameters, however, most, if not all, built-in procedures and functions could be written in terms of the language itself and the programmer would have an extremely powerful tool for extending the language to suit his own needs.
A possible way to implement this facility would be to provide a special type that could only be used in the formal parameter list of a function or procedure. Let us call this type PARAMETERS. Declaring a formal parameter of this type would let the caller specify any number of parameters of any type. The code for the procedure would then examine the actual parameters, because the formal parameter would have an attribute with the count of actual parameters, and would provide array-like access to them. An extension to the CASE structure could then be used to perform a different action based on the type of the parameter. The compiler would even be able to gather all the types for which a case has been defined, and disallow the passing of a type for which no action has been defined. The syntax could look something like this:
PROCEDURE WriteLN(P: PARAMETERS);
VAR
X: INTEGER;
BEGIN
FOR X := 1 TO P.Count DO
CASE P[X].Type OF
INTEGER: (* Do some code for integer parameters *)
STRING: (* Do some code for string parameters *)
END;
END;
Expressiveness and extensibility of a language are important features. Its readability and maintainability, however, is a facet of a language that is no less important. Common wisdom suggests that far more time is spent maintaining the code than is spent originally writing it. In order to maintain code, it must be easily readable, not only by the person who wrote it in the first place, but by others. It's hard enough to figure out what someone was thinking when he wrote the code, so the language itself should not present any additional difficulties in reading it. Sadly this fact has not always been recognized by language designers: "The objective of readability by human beings has sometimes been denied in favor of readability by a machine; and sometimes it has even been denied in favor of abbreviation of writing, achieved by a wealth of default conventions and implicit assumptions." (8) Anyone who disagrees with this point needs only to look at the famous language C, or its successors C++ and Java. Kernighan and Ritchie themselves don't even try to hide the fact. According to Kernighan "C and Ratfor programmers find begin and end bulky compared to { and }." (9) A quote from the C manual by the two summarizes the C syntax philosophy quite well: "The double equals sign == is the C notation for 'is equal to'. This symbol is used to distinguish the equality test from the single = used for assignment. Since assignment is about twice as frequent as equality testing in typical C programs, it's appropriate that the operator be half as long." (10)
This philosophy is fundamentally incorrect, the views of numerous die-hard C programmers notwithstanding. While brevity may have been a real concern in the days when programmers used two fingers and a card-punch machine to enter code, the modern environment presents an entirely different situation. Programmers can type fast, and the code is typed directly into the computer on the keyboard, without such intermediaries as punch cards. It takes but a millisecond longer, if that at all, to type out the word BEGIN as opposed to the symbol {. The difference in reading the code, however, is much greater. The words BEGIN and END stand out, jump at the reader from the page, indicating a block, whereas the tiny {} braces can get lost in volumes of code, even with proper indentation and syntax highlighting. I would be the first to say that a concise language is easier to program in than a wordy one, but concise does not mean cryptic, and that is precisely what C and C++ are.
The readability of a language depends on three things: syntax, syntax and syntax. A badly designed syntax not only increases the learning curve for the language, but also greatly increases the amount of time spent maintaining the code. Reading a program in a cryptic language is significantly harder than reading the same program in a concise while readable language. An important thing for the language designer to keep in mind is that his language may influence many others. One badly designed syntax can corrupt several generations of programmers and a number of good languages. C++ is badly hampered by C's ridiculous syntax, and Java has been irreparably affected as well, because Java designers fashioned Java "as closely to C++ as possible. This was done in order to make the system more familiar, more comprehensible, and to shorten the time necessary to learn the new language." (11) While the goal is admirable, the decision only proliferates the bad syntax of C, thereby making more programmers learn it and think it is the only way to program.
One aspect of a syntax that I have been particularly upset with is case sensitivity. C is a case-sensitive language, and so is C++ and now Java. Case sensitivity introduces a problem into a language on two levels. First, it dictates to the programmer the case of reserved words as well as built-in and library routines. Regardless of the programming style dictated by personal preferences or company policy, the programmer is required to use the exact same case for a great deal of the language as was used by the language designers. My personal programming style leans towards capitalizing reserved words and built-in functions (typing them in all caps) and typing my own function and variable names in mixed case. This is something I am not allowed to do when I program in C, C++ or Java. If I need to use the if construct, I must type it in lower case if I want the program to compile. This restriction certainly makes the job of writing the compiler easier by about one iota, but accomplishes nothing. The following comment about Modula-2 by David Spector sums my sentiments up: "Keywords must be capitalized, and case is significant in user identifiers. These limitations appear to offer little reward in return for the bother they cause." (12)
Not only do these limitations offer little reward for the bother, and it is a bother indeed, for a typing mistake such as writing a capital "A" instead of a lower case "a" produces a compile-time fatal error, but the introduction of case-sensitivity into a language opens the door for havoc. The programmer is now free to have identifiers with the exact same spelling, but different capitalization! Proper variable naming is essential in writing understandable and correct code. When the programmer is allowed to create situations where ABC is a class and abc is an integer, nothing but trouble can happen. The programmer will need to keep things straight in his own head, and anyone who reads the program will need to do so as well. Having the variable names sound the same also makes discussion of the program difficult. When speaking, it is necessary to not only say the name of the variable, but specify which variable one means.
A good programming language will not be case-sensitive. It will allow the programmer to use whatever capitalization conventions he prefers or are dictated by the project. It will also disallow the dangerous practice of using case to distinguish between variables, making the language safer and reader-friendly.
I could nitpick ad-infinitum, discussing the use of && and ! symbols instead of the more traditional AND and NOT forms and talking about redundant type specifiers in parameter lists. There are many aspects of the syntax that affect how the programmer works in the language, how readable it is and how easy it is to maintain. All of these must be taken into account when designing a language. The designer must focus on readability, rather than abbreviation of writing. Programmers can type fast enough to type out those extraordinarily long BEGINs and ENDs, and if they can't, they should learn -- it would make them more productive regardless of how wordy the language.
Maintaining a program is a task as difficult, if not more so, than writing the program in the first place. When bugs are found and corrected, programmers often introduce new bugs into the code. A programming language, while it can not completely eliminate it, must minimize the possibility of programmer error. After all, "making mistakes is human (particularly in programming), and we need all the help possible to avoid committing them." (13) It is to this effect that several features should be present in a good programming language, features that are sadly lacking in the popular languages such as Pascal and C.
One of these features is the statement-specific terminal. Both Pascal and C simply dispense with terminals for anything except a block, thus requiring statements such as if-then and loops to create a new block of code if more than one statement is to be executed. This not only makes it difficult to match up the ENDs or the closing braces } with the BEGINs and opening braces {, but can cause headaches with the most rudimentary of mistakes that even a seasoned programmer can fall victim to. The following code illustrates the problem of inserting additional statements into a pre-existing IF in Pascal:
IF BoolExpr THEN DoSomething; DoSomethingToo; (* Inserted during debugging *)
The indentation makes it painfully clear that the programmer intended for DoSomethingToo to be executed only if BoolExpr is TRUE, as part of the IF statement. Since the compiler does not read indentation, however, the resulting code executes DoSomethingToo in all cases, since the semicolon after DoSomething terminates the IF. To fix it the programmer must add a BEGIN and an END to the IF statement. The same problem plagues loops and cases.
An additional and even nastier mistake involves the so-called dangling else problem. This code illustrates:
IF BoolExpr1 THEN
IF BoolExpr2 THEN
DoSomething
ELSE
DoSomethingElse;
Again, the indentation makes it painfully clear that DoSomethingElse is to be executed when BoolExpr1 is FALSE, and only then. What the code does, however, is execute DoSomethingElse only if BoolExpr1 is TRUE and BoolExpr2 is FALSE, an entirely undesired effect. Neither Pascal nor C offer an easy way to solve this problem. The programmer either has to include a blank ELSE clause, or create a block for the inner IF statement. Java, with all its improvements to C++ resolves the dangling else problem "by connecting the else with the next most recent 'elseless' if statement in the block," (14) which is another way of saying that it does absolutely nothing to help solve it.
The only sensible solution to this problem is to require that statements such as IF-THEN and FOR have a terminator. ALGOL 68 provides exactly this, by closing IF statements with a FI and FOR-DO statements with an OD. This solution, while solving the problem, looks just as ugly as the empty ELSE clause or the BEGIN-END block, and on top of that sounds extremely funny. A pronounceable terminal is both more understandable and easier to pick out in code. An IF should be terminated with an ENDIF, a FOR with an ENDFOR, a WHILE with an ENDWHILE, etc. By terminating these types of statements the language eliminates the dangling else ambiguity (because the inner IF will always be closed with an ENDIF, the following ELSE can not possibly be associated with it), and solves the problem of inserting extra statements, since the BEGIN-END block will become unnecessary.
A problem that is somewhat similar in nature arises when programmers use the BREAK statement (while it doesn't exist in standard Pascal, Borland added it to its Pascal versions beginning with Turbo Pascal v7.0) to exit from loops. The lack of a BREAK statement in standard Pascal is one of the reasons Kernighan dislikes the language (9[b]), but its presence in C introduces as many problems as it solves. The biggest problem with the BREAK statement as it exists in current implementations of Pascal, C and C++ is that it only exits the innermost loop. There is no way to exit out of multiple enclosing loops. Problems can also arise with otherwise correct code if the programmer forgets about a BREAK and imbeds a loop in another.
The obvious solution to this problem is to allow the programmer to specify with a label what is to be broken. That is the attitude Java designers took by allowing the break to be used with a label, indicating that control is to be transferred to the place in the code where the label is. This is a surprisingly silly, not to mention bad, way to implement BREAK, despite the fact that some try to make light of the situation: "Those who mock good programming practices will be pleased to note that the labeled break statement can be made to act as the expunged-from-Java goto statement." (15) Java designers took out the goto with good reason, and yet they put it back in with this implementation of break, making Java more powerful than other languages and making it an invalid at the same time.
The more appropriate approach to handling the BREAK statement is requiring a label, and for the label to be placed on the loop to be broken, not the statement to which control is to pass. (1[b]) Control would then pass, quite naturally, to the statement following the loop, regardless of how it is terminated. This would not only remove the possibility of error when loops are inserted within others, but would allow an easy way to break out of several nested loops at once and would give advance warning to anyone reading the code that a loop can be broken out of. This requirement might put some extra burden on the programmer, but the cost is minimal when considering the benefits.
Readability, maintainability, extensibility and expressiveness of a language are all admirable features, but they are worthless if the language provides the programmer with large chasms into which he can fall with only a minor mistake. In the above discussion on readability and maintainability of code I mentioned features that help reduce the typical mistakes programmers make when writing code, such as forgetting a block enclosure or inserting a loop within another one without realizing the consequences. These kinds of mistakes can be avoided if the language provides the correct syntactic constructs. There are other types of common programming errors that arise not out of syntactic inadequacies of a language, but because the language simply does not do all that it could in helping prevent them. There is no or little new syntax to be added, but there is plenty of new code that has to be added to the compiler. A good programming language provides the programmer with an environment in which the programs he writes are more likely to be correct and to execute safely, and in which a program with a logical mistake will terminate with a run-time error, rather than run and produce mysterious results.
One of the most common mistakes programmers make is forgetting to initialize variables. Some languages, such as Basic, conveniently initialize all variables to 0 (or empty string). Others don't do anything to the variables. Sometimes the programmer is very lucky to spot the problem early on, when he gets peculiar-looking results, or when, perhaps by pure chance, the garbage value in a variable is such that it produces an error during computation. More often than not, however, the program appears to run just fine. Tracking down the problem is a tedious process, especially in a large program.
There have been some attempts to solve this problem by disallowing the programmer to use variables that have not been initialized. One method involves carrying run-time type information for the variable, and including a field in the type information that would keep track of the status of a variable. This approach is rather costly, since at least an additional byte of information is required for each and every variable. Furthermore, it slows down execution, since this method would require a check of the status bit whenever the variable was used. Another method involves compile-time checking of variable status. Through some simple flow analysis and an additional status bit or two in the name table, it is possible to know at compile time whether a variable has a garbage value or an actual usable value (16). This type of flow analysis can get extraordinarily complicated, however, in a large language such as Pascal or C.
The optimal solution to disallowing the use of uninitialized variables in a program, and thus saving the programmer a great deal of time is a combination of the two methods. Some simple compile-time flow analysis must be performed to determine at which point in code every variable is guaranteed to either have been assigned to or its value used. No compile- or run-time checking is required in code beyond that point. At the same time, every variable will be initialized by the compiler to a known garbage value. In the portion of code where a variable is not guaranteed to have been assigned to or read from, every use of the variable will require a simple equality check to see if its value is the known garbage value, and if it is, a run-time error or an exception is generated. This solution is not quite as outlandish as it may seem on the surface, as nearly every variable type can be fitted with a garbage value without sacrificing its usefulness.
String variables would, most likely, be the easiest to deal with. Providing strings use Unicode characters, a sensible approach as described earlier, all string variables can be set to have the value FFFF in their first character. The Unicode standard guarantees that FFFF is not a valid Unicode character (17), so none of the functionality of the string is lost. No more than several machine language instructions would need to be added to test for this condition when the string was being used, and since the test would not have to be performed every single time, the burden on the code would not be great at all. Boolean variables are just as easy to handle as strings, being that out of the whole byte (or however much more space they take up) only two values are being used for the type itself: 0 representing FALSE and 1 representing TRUE. The C approach to having any non-zero value represent TRUE would certainly not be allowed, both because it would interfere with assigning a known garbage value to a boolean, and because it is a generally bad practice to have several actual values mean the same thing. Any value other than 0 and 1, then, could be used to represent the known garbage value. FF seems as good a choice as any.
Integer variables are much more difficult to deal with than strings or booleans, however, because every single bit pattern is used to represent some number. Signed integer variables usually have a range that is larger by one value on the negative side than on the positive side. A 16-bit integer, for example, can hold values from -32768 to 32767. The reason for this is that with the sign bit being either a 0 or a 1, there seem to be two values for 0, a -0 and a +0 when the rest of the bits are 0. The solution to the problem is to use two's compliment notation, in which a negative variable is inverted and a 1 is added to it to obtain its value. Two's compliment notation interprets the bit string 1000000000000000 as -32768. This is not only unsightly, because there is no symmetry, but also takes away a wonderful method for spotting garbage values in integers. The easiest solution would have been to have the hardware, the CPU, to treat that value as a garbage value, throwing some kind of garbage flag, or at least indicate an underflow/overflow if it was obtained through computation. Instead CPU manufacturers chose to treat it as a valid numeric value. That was a fundamental mistake.
The loss of one possible value in an integer variable would not bring great pains to the programmer. The enormous advantage gained by having the hardware detect the use of a garbage value would by far outweigh any minor difficulties that may arise in some software projects. Until Intel and Motorolla come to their senses, however, this detection must be done in software. Checking for the required bit pattern is relatively painless, and requires only a few additional instructions. The trick is in utilizing the CPU flags. Most processors have a flag that indicates carry and a zero as a result of the last arithmetic operation. After any arithmetic operation the carry must be taken care of. Next, the value that was just obtained is shifted left by one bit. If the value in the register was in fact a 1 with all following bits 0, the value that is used to indicate garbage and that, as a consequence, can not be used in any arithmetic, then the carry and the zero flags will be set. Two successive conditional jumps take care of skipping over the instruction that raises an exception or causes a run-time error. The value can then be restored to whatever it was by rotating right with carry. The following example of Intel 8088 assembly code illustrates. Assume that the value being tested is in the AX register:
SHL AX,1
JNC ValueOk
JNZ ValueOk
CALL GarbageValue
ValueOk: RCR AX,1
After the shift, the zero and carry flags are set if the only bit in the AX register that was set to 1 was the sign bit. If both flags are set the value is not valid, and the subroutine GarbageValue is called. If at least one of the flags is not set, the code jumps to the label ValueOk, where the original value of the AX register is restored and execution can continue as if nothing has happened. It is true that this type of check will have to be performed after every arithmetic operation to enforce the restriction on the range, and it will slow execution down. Just how much it will slow down a compiled program is not certain, and significant tests would need to be performed. But I feel the gain far outweighs the loss in speed. A few extra clock cycles here and there are not going to be problematic. Efficiency in a program is not gained by saving a couple of clock cycles every couple of lines: ". . .in spite of the tremendous increase in machine power and in spite of overwhelming evidence that software quality and programmer productivity are severely impaired by a misguided concern for efficiency, programmers compulsively continue to count clock-cycles." (18)
Uninitialized values in pointers are also a bit tricky to handle. The null pointer is certainly an indication of an invalid address, and any attempts to dereference that should be caught. It is, however, a very valid value for the variable that has uses in programming. The trick is to eliminate pointers as we know them from a language. Use of pointers in high-level languages has been debated for many years. Kernighan and Ritchie admit they're bad: "Pointers have been lumped with the goto statement as a marvelous way to create impossible-to-understand programs." (19) C.A.R. Hoare agrees: "CITerences are like jumps, leading wildly from one part of a data structure to another. Their introduction into high-level languages has been a step backwards from which we may never recover." (8[b])
But one can not simply take away pointers, sit back and say that a great service to the programming community has just been performed. Pointers, no matter how dangerous, are an integral part of programming. Pointers are used by application programmers for dynamic memory allocation and by systems programmers for communication with the operating system and the hardware. It is necessary to examine just how pointers are used and what language features could be implemented to take the place of pointers, while giving the programmer optimum safety and convenience.
I will not look at all the things C programmers do with pointers, because that would give anyone the wrong impression that these things should be done with pointers. Despite the explanation Kernighan and Ritchie give for the extensive use of pointers in C: "Pointers are very much used in C, partly because they are sometimes the only way to express a computation, and partly because they usually lead to more compact and efficient code than can be obtained in other ways," (19[b]) there are only a few spots where pointer use can truly be justified.
One such use is for dynamic memory allocation in linked lists. If the language provides the programmer with the list facility that I discussed earlier, however, the need for explicit pointers in linked lists disappears. Even doubly-linked lists can easily be implemented with a built-in list system, where the compiler assumes the list is doubly-linked, and the optimizer then removes the backwards link for any list where the feature is not used by the programmer. Trees would be slightly more difficult to represent. That is where the object-oriented features of the language could come in handy, with a TTree class defined in the system library, with several specific tree classes derived from that, such as TBinaryTree, for example. Inheritance and polymorphism could make these trees incredibly more powerful than the crude constructs a C or Pascal programmer can build with records and pointers.
The use of pointers in C to imitate dynamic arrays has already been discussed, and a solution proposed. The dynamic array simply becomes a feature of the language, and the need to tread the thin line of impersonation by pointers disappears. Other uses of pointers for dynamic memory allocation can also be hidden from the programmer. Instances of classes can be allocated on the heap without requiring the programmer to explicitly declare a pointer, as in Delphi, for example. The variable that holds the object is a reference, but the programmer can not manipulate its contents and its dereferencing is implicit.
The use of pointers by systems programmers to communicate with the hardware or with other programs is something that can not be easily handled without pointers. For these instances the language must give the programmer facilities for overlaying a data structure on a specific machine address. This facility would function much like the typed pointers of Pascal, except the programmer would not be able to allocate storage for these reference variables. This can be accomplished by allowing the programmer to declare a reference to a type and then to specify the address at which the data resides. The dereferencing would be done implicitly anywhere the variable is used, and the address manipulation would require a different syntax. Additionally the programmer would be able to determine the address of any data structure for passing to a hardware or operating system routine. While this approach would still leave the programmer with the ability to rummage around in system memory, it does not require him to do so for the trivial things. A hacker writing a virus could explicitly access various parts of memory and put whatever he wishes there, but the unsuspecting programmer would find it very difficult to do it by accident.
I would be amiss if I did not mention bounds checking in a discussion on safety and correctness. In C and C++ array bounds checking is not performed. Whether it was done purely out of that misguided concern for clock cycles or if it was done to allow the various abuses of the "array is a pointer" pun is irrelevant. It is very obvious, however, that this approach opens up a slew of possibilities for the programmer to make mistakes. The worst part is that a mistake in indexing an array will lead to corruption of data and may even crash the machine. The programmer should very simply not be allowed to address the 25th element of a 24-element array! The element does not exist! There is no reason for the programmer to want to address that element. Such a reference is clearly an error in logic and should be caught by the language. Array bounds checking is now easier than ever, with processors providing special instructions for this much-needed task. There simply can be no excuse for not doing bounds checking.
There are many places in a program where the programmer can make a mistake. Working with loops is one such place. With the C for() loop, due to its nature, one always knows the value of the loop control variable after the loop has ended -- it is obvious from the condition. In another language, where the statement is less obviously a special case of the WHILE loop, the value of the loop control variable after the loop has terminated is uncertain. This omission in the design of the language forces the programmer to resort to trickery of all sorts, from not using FOR loops at all, to using temporary variables and adding extra assignments and conditional tests to the loop body. In addition to the above problem, some languages do not even define where in the course of the loop the test is performed and where the incrementing is done, giving the programmer yet one more thing to worry about.
If that wasn't bad enough, the loop control variable behaves just like a regular variable inside the loop, and the programmer can easily change its value by accident. Again, in C, provided the loop condition is written appropriately, such a change may just terminate the loop. In the case of a shorter condition or in the case of another language, where the FOR loop just specifies the starting and ending values, such a change could send the program into an infinite loop. In the worst case, the program may continue to function with seemingly correct results.
The answer to these problems is an extension of the mechanism used in ALGOL 68 and Ada. The loop control variable would have to be a heretofore unused variable. It would not be declared like other variables. Its occurrence in the FOR statement would implicitly declare a "constant" of the type of the parameters to the loop (integers, characters, etc.) with the scope of the body of the loop itself. This way the compiler can easily enforce the restriction that the loop control variable be never assigned to or passed as a variable parameter to a function. This would also mean that the loop control variable would not be available outside the loop so the programmer would never have to guess what the value would be on loop termination: the maximum value specified in the FOR range or one greater? To avoid the use of explicit temporaries to obtain the value of the loop control variable on loop termination, the loop would become an expression. The value of the expression is the value of the loop control variable at the time of loop termination. To take all uncertainty out of the value question, the value of the loop control variable is set to be the garbage value before the loop is executed. This means that if the body is never executed, the variable has no value. A special test would have to be added to the language so that a variable can be checked for a garbage value without raising an exception or causing a run-time error. Perhaps a built-in boolean function with the name similar to ASSIGNED can be used for this purpose. Whenever the loop terminates, the value returned will be the value of the loop control variable before incrementing and testing the condition. This means that if the loop goes from 1 to 5, the value will be returned as 5 when the loop finally stops. This is the only sensible approach, since setting the value to be one greater than the largest value specified in the range creates problems if that largest value is indeed the largest integer that can be represented.
The construct suggested above would give the programmer peace of mind, assuring him that he can't change the value of the loop control variable by accident, and giving him an easy and natural way to obtain the value of the loop control variable at the time of loop termination.
Name hiding is an issue that has also been debated in the circles of computer science for quite some time. Many a programmer has fallen victim to this trap. Many languages, including C and Pascal, allow the programmer to declare variables that have the same name as those declared in an outer scope. For example, a program that has a global variable G can also have a local variable G in any of its functions. On the one hand this can be quite useful, since it doesn't require the programmer to invent new variable names in certain cases. However, it is very easy to forget the fact that declaring a local variable G makes the global variable G inaccessible. An attempt to use G in the local code, then, accesses the local one, even if the programmer intended to access the global one.
I have also stumbled across a problem in Pascal that is related to this. I can not say if the problem is in the language definition or in the way Borland implemented it (I have only seen the problem in the Borland implementations), but it is possible to declare a procedure or function that has the same name as a built-in procedure or function. The result is a very pernicious situation where the unsuspecting programmer can not access a very much needed built-in procedure. What's worse, is that half the time this happened to me, I had no clue what was going on and had to spend valuable time trying to determine just why the compiler was giving me strange error messages.
There are two solutions to this problem. Either name hiding should be forbidden, thus preventing the programmer from declaring a local variable if a global one with the same name exists, or if such a variable is declared, scope resolution would be required to identify exactly which one is needed. It is my opinion that name hiding is a serious problem that must be eliminated completely. Scope resolution only creates more confusion, introducing new possibilities for programmer error. There is no reason for the programmer to create local variables that have the names of global variables, nor is there any reason for him to create functions or procedures with names that are already in use. While it is true that if they are of different types it is possible for the compiler to easily distinguish between the two, but the confusion this would create in the mind of the programmer or the person reading the program is not worth the extra compiler code.
Correctness and safety of a program are often impacted by error-handling facilities of a language. Some languages provide no error-handling facilities whatsoever, forcing the programmer to write boolean functions or add extra parameters to procedures or insert extra code to check for error. Other languages provide elaborate exception handling mechanisms to let the programmer easily signal, detect and recover from errors. I do not believe there will be any argument when I say that exception handling is a necessary party of any good language. It is certainly possible to write perfectly good programs in a language that does not support exception handling at all, but writing these programs takes more time, they are usually longer, more difficult to understand and have more errors simply because there is more possibility for error on the part of the programmer. The value of exception-handling in a language is easily demonstrated from the following two code examples. On the left is Turbo Pascal, while on the right is Delphi, which has an exception mechanism. Assume that F is a file variable:
RESET(F);
IF IORESULT <> 0 THEN
IOError;
I := 0;
WHILE NOT EOF(F) DO BEGIN
INC(I)
READ(F, A[I]);
IF IORESULT <> 0 THEN
IOError;
END;
CLOSE(F);
IF IORESULT <> 0 THEN
IOError;
|
TRY
RESET(F);
I := 0;
WHILE NOT EOF(F) DO BEGIN
INC(I);
READ(F, A[I]);
END;
CLOSE(F);
EXCEPT
ON EInOutError DO
IOError;
END;
|
It is true that the Delphi code on the right is only one line shorter than the Turbo Pascal code on the left. It is, however, more straight-forward. First of all it declares right at the beginning that the block of code that is to follow is going to handle exceptions. Checks for the error condition are not interspersed in the code itself, but one single check is instead present at the bottom of the block. This structure lets the programmer write the necessary code without having to worry where he has to check for errors, and write the error-handling code once. It is the job of the language to transfer control to the proper code if an error occurs.
Another useful mechanism that goes hand-in-hand with exception handling is a so-called resource protection block. In Delphi such a block is declared with a TRY statement as well, but instead of the EXCEPT clause at the end there is a FINALLY clause. Should an exception be raised anywhere in the block, the statements in the FINALLY section of the code are executed before control is passed to outer blocks for exception handling. Sadly there is no way to combine resource protection blocks with exception handling blocks, except to nest them. A good way to approach this would be to allow a combination resource protection and exception handling block, allowing the programmer to both deal with an exception and to perform necessary clean-up tasks, such as close files.
Any discussion on safety and correctness would be incomplete without mentioning garbage collection. Memory deallocation has always been a major headache for programmers working in languages without garbage collection. In order to reclaim used memory extra code has to be written, including all types of complicated loops and logical conditions. Every programmer has had to deal with hunting down memory leaks, a task about as pleasurable as a root canal. Garbage collection has really not been much used in real-time applications and in languages designed for real-time programs because it slows things down considerably. With faster machines and a great number of good garbage collection schemes available, the excuse is no longer valid. The popularity of Java seems not to have been hampered by the fact that garbage collection is part of the language.
Exactly how garbage collection is to be implemented is not something that should be part of a language specification. The language should require that garbage collection be performed, and leave the details to the compiler writer. What should be defined is the behavior of the garbage collector. An important issue regarding this behavior is raised if one looks at systems programming and the need to maintain constant physical addresses for certain variables. This becomes particularly crucial when considering the garbage collection approach in which memory is divided in half and objects are copied from one half to the other on a regular basis, compacting the store and overwriting memory that's no longer used by other objects. To this end the language must allow the programmer to specify, at compile time, which variables will be untouched by the garbage collector. This can either be done by adding modifiers to variable declarations or by creating whole blocks of code where the garbage collector is simply turned off.
The last quality of a good programming language is the portability of code. This is a highly important issue when considering that not everyone uses IBM-compatible machines running the same operating system. A language must make the task of moving from one physical machine to another as painless as possible for the programmer. There are a number of issues that have to be dealt with in discussing this feature, from standard variable sizes to input/output routines to order of evaluation.
Standard variable size is perhaps one of the most important issues in portability across platforms, and yet it has not really been addressed in so-called portable languages. The great language C, despite being (supposedly) portable: "C is not tied to any particular hardware or system, however, and it is easy to write programs that will run without change on any machine that supports C" (20), fails miserably when it comes to variable size. For example, an integer is 16 bits on the DEC PDP-11, while it is 32 bits on the IBM 370 (21). Despite the suggestion by Kernighan and Ritchie that these differences in variable size aren't as much of a problem as might be thought (21[b]), they present a fatal flaw in the design of a portable language. True, if all one is dealing with is a program that interacts with the user and doesn't access any files, then the problem is minimal. If one has to access files, however, a serious problem arises. How is a C program running on the IBM 370 to read a binary file created by a C program that was running on the DEC PDP-11, when an integer occupies a different number of bits? Also, how is the program to handle the fact that different machines store bits in different order? On some machines the high-order byte is stored before the low-order byte in a word, and on others it's the opposite.
The only solution to these problems is to do what Java did, and that is to explicitly define the size of every variable and to enforce that size in all implementations, regardless of whether the machine supports that size natively or not. If on one machine an integer is 32 bits, on no other machine can it be anything but 32 bits, or it will be entirely impossible to move the program from one machine to another without changing the types of all variables.
A standard library of input/output routines is also essential to program portability. Learned computer scientists can ridicule Pascal's I/O all they want, but they can not argue with the fact that it is highly conducive to portability for the programmer to rely on a certain set of routines to be used for accessing the external machine, regardless of what that machine might be. If I want to read a file on an IBM mainframe or an Intel-based personal computer, I can call the same RESET and READ routines and not have to worry about anything else. The language must define the interface by which the program can communicate with the physical machine, and this interface must be present in its entirety in every implementation. The routines should be written in the language itself and tailored to the actual hardware they are being used for, but their behavior must be defined by the language and simply can not vary from implementation to implementation, or the programmer can forget about accessing real-world files.
The issue of order of evaluation is one that could fit in any of the categories I've discussed above, but I believe it is best discussed in the portability section, because portability refers not only to different physical machines, but also to different compilers on the same machine. Many languages, including C and C++, do not specify the order of evaluation in an expression. In most languages this isn't a problem, but in languages that allow side-effects in expressions it can wreak havoc. Consider these classical examples of the problem in C:
i = 5; foo(i, i++); |
i = 5; k = i - --i; |
Here the function foo() is called with two parameters. The question, however, is whether it receives the values 5 and 5 or the values 6 and 5 when it is executed. There is no way to answer the question without knowing which compiler the code will be compiled with. The same goes for the example on the right, which can assign either the value 1 or the value 0 to the variable k, depending on which compiler one uses.
There is plenty of precedent for undefined order of evaluation and undefined order of side-effects, both in computer science and in mathematics. In computer science it is often efficient to determine the order of evaluation on the fly, and nothing in mathematics suggests that it is wrong, for A + B does indeed equal B + A. There are larger issues in programming, however. The key point that has to be kept in mind is that "in no case can [programming errors] be allowed to give rise to machine or implementation dependent effects." (8[c]) It is very true that it is probably bad programming practice to rely on the order of evaluation or on the order of side effects, but if the language allows one to write these expressions, their value must not be allowed to change from compiler to compiler! Efficiency of compilation or calculation aside, it is, in my opinion, necessary for the language to define the order of evaluation in all cases, so the programmer has one less thing to worry about.
It is clear that none of the currently popular languages go far enough in terms of providing the programmer with an expressive and extensible environment. They certainly don't go far enough in making reading and maintaining code easier. They could do much more to make it easier to write correct and safe programs the first time. They certainly could use improvement in the area of portability. The way to solve all these problems is not by starting from scratch and developing a whole new language, for that would be reinventing the wheel. The solution lies in picking out the powerful and useful elements of the languages out there and re-packaging them with several new ideas into a different language, a language that places emphasis on expressiveness, so programmers don't have to find ways to go around the compiler or write extra code because the language designer didn't include some trivial features; a language that is easily extensible, so the programmer doesn't find himself using awkward notation to express basic concepts; a language that is easy to read and maintain, so the programmer doesn't have to keep a bottle of Tylenol next to the computer when reading the code; a language that makes it possible to write correct programs that behave in predictable ways on the first try; a language that makes it easy to move the program to another compiler or another hardware platform with few if any modifications to the source code.
These ideals have not been achieved, at least not in any one language, but they are possible without great hardship for either the language designer, the compiler writer or the programmer. Sure, a new language will mean re-learning the syntax and the features and changing the way we think, but it will not be because it will introduce some new and radical way of doing things, but because it will let the programmer think in the most natural way, something we have not been able to do with the languages available to us.
1.
This is not ML syntax, but the suggested syntax of the list
declaration in the imaginary ideal language.
1 1(b). Coar, David. "Pascal, Ada and Modula-2." Byte, Aug 1984: 215-232.
2. Kernighan, Brian W. and Ritchie, Dennis M. The C Programming Language. Englewood Cliffs: Prentice-Hall, 1978: 186.
3. Moody, R.P. "C in Education and Software Engineering." SIGCSE Bulletin, Sep 1991: 45-56.
4. Kernighan and Ritchie: 101.
5. Maurer, W.D. "More on the Programming Language C." Communications of the ACM, Sep 1995: 108-110.
6. Johnson, Mark and Munro, Allen. "Pascal's Design Flaws: Modula-2 Solutions and Pascal Patches." Byte, Mar 1984: 371-388.
7. Newman, Alexander, et al. Special Edition: Using Java. Indianapolis: Que, 1996: 17.
8 8(b) 8(c). Hoare, C.A.R. "Hints on Programming Language Design." Ed. Ellis Horowitz. Programming Languages: A Grand Tour. Rockville: Computer Science Press, 1985: 31-40.
9 9(b). Kernighan, Brian W. Why Pascal is Not My Favorite Programming Language. Murray Hill: Bell Laboratories, 1981.
10. Kernighan and Ritchie: 17.
11. Newman: 16.
12. Spector, David. "Ambiguities and Insecurities in Modula-2." Programming Languages: A Grand Tour. Rockville: Computer Science Press, 1983: 163-169.
13. Wirth, Niklaus. "On The Design of Programming Languages."
Programming Languages: A Grand Tour. Rockville: Computer Science Press, 1985: 23-30.14. Newman: 251.
15. Newman: 255.
16. Zahn, Charles T. "Flow Analysis and Code Improvement." Programming Language Implementation: An Introduction. Pleasantville: Pace University, 1988: 13.1 - 13.27.
17. Unicode Consortium. The Unicode Standard Version 2.0. Reading: Addison-Wesley, 1996.
18. Moody.
19 19(b). Kernighan and Ritchie: 89.
20. Kernighan and Ritchie: IX.
21 21(b). Kernighan and Ritchie: 181-182.
Abel, Peter. IBM PC Assembly Language and Programming. Englewood Cliffs: Prentice Hall, 1991.
Anderson, Jim and Fishman, Barry. "The Smalltalk Programming Language." Byte, May 1985: 160-165.
Borland Delphi for Windows: Component Writer's Guide. Scotts Valley: Borland International, 1995.
Duff, Charles B. "Designing an Efficient Language." Byte, Aug 1986: 211-224.
Ed. Ellis Horowitz. Programming Languages: A Grand Tour. Rockville: Computer Science Press, 1983, 1985.
Iverson, Kenneth E. "Notation as a Tool of Thought." ACM Turing Award Lectures: The First Twenty Years. Reading: Addison-Wesley, 1987.
Johnson, L.F. "C in the First Course Considered Harmful." Communications of the ACM, May 1995: 99-101.
Kastanas, Ilias. "In Friendly Disagreement." Communications of the ACM, Sep 1995: 110-111.
Sumner, Roger T. and Gleaves, R.E. "Modula-2 -- A Solution to Pascal's Problems." ACM SIGPLAN Notices, Sep 1982: 28-33.
Tanenbaum, Andrew S. Structured Computer Organization. Englewood Cliffs: Prentice-Hall, 1990.
Wichman, Brian A. "Is ADA too big? A Designer Answers the Critics." Communications of the ACM, Feb 1984: 98-103.
Wirth, Niklaus. "Design and Implementation of Modula." Software: Practice and Experience, Vol 7, 1997: 67-84.
Zahn, Charles T. A Control Statement For Natural Top-Down Structured Programming. Geneva: CERN, 1974.
I would like to thank the following people and organizations, without whom this project would never have been completed:
Dr. Mary Courtney, for challenging me to stop criticizing C++ and do something about it.
Professor Charles Zahn, for his assistance in obtaining research materials and for the time spent discussing some of the points with me.
Mrs. J. Botknecht at the Mortolla Library Interlibrary Loan for helping me get my hands on some of the more difficult to obtain materials.
Dr. Marie Bradshaw, without whom I would not be in the Honors Program.
Sr. Doretta Cornell for her encouragement in the Honors Program.
Dr. Allen Stix for taking on the task of being my advisor on this project.
The Oceanside Public Library and the Sciences and Technologies Library at the University of California at San Diego, where I was able to obtain some useful research materials.
Chris Ferony, who undertook the daunting task of proofreading this paper.
Go back to my programming page
Agree, disagree or otherwise make a comment about this page