CHAPTER 11 ----- PROCEDURE DIVISION

Probably the most important concept in the PROCEDURE DIVISION is to make the program structure match the underlying structure of the data that it is processing. This single guiding principle, when effectively applied, will cure most of the ills of coding the COBOL logic. Michael Jackson argued persuasively in 1975 that for many common tasks in business data processing, a thorough understanding of the input and output data can almost automatically lead to code [1]. Although the particulars change, the theme remains: before writing code, good programmers thoroughly understand the input, the output, and the intermediate data structures around which their programs are built.

This concept was thoroughly discussed by E.W. Dijkstra in 1976 [2] and applied to COBOL programming in 1981 by Barry Dwyer [3].

However, this has not always been the case. COBOL coding began in the early 1960's, 15 years before Jackson became a major proponent of this concept. Earlier constructs were modeled more after earlier assembler languages and other languages such as FORTRAN, and were more concerned with machine efficiency than with the data structures being processed. But even now, 14 years after some of these principles were introduced, many coding constructs from the 1960's are still being followed.

The most insidious example of this is the two-file matching logic. The primary control statement in early versions of FORTRAN was the three-way (arithmetic) IF statement. (The logical IF statement did not come until later.) This statement evaluated a numeric expression and took one of three branches, depending on whether the result was less than zero, zero, or greater than zero. [Incidentally, this is also the root of the concept that all file "keys" must be numbers, e.g. customer number, part number, etc.] This three-way comparison carried over into COBOL, and was used in the matching of "keys" of the two files being matched. "If the master-key is less than the transaction-key then ..., if the master-key is the same as the transaction-key then ..., if the master-key is greater than the transaction-key then ..." This logic continues to be "common knowledge" as "the way" to match two files even now.

However, what if you ask people using this logic how to match three files? How about four? Five? They give you a blank stare and think you must be crazy to even want to try it. But with the proper logic structure, these situations are no harder than the two-file situation. Let's look at the underlying data structure of the generalized n-file situation (a master file and several different transaction files, all sorted by the same key).

Each file is composed of records with keys, and each file (except, generally, the master file) may have multiple records with the same key. The problem is not one of "matching" the keys across the various files, but one of processing each key in order, regardless of which file(s) it comes from. This particular construct is sometimes called the "Balanced Line Algorithm". This logic, in pseudo-COBOL, looks like the following.

The Balanced Line Algorithm for file matching

        initialization.
        read one record from each file.
        PERFORM CHOOSE-NEXT-KEY.
        PERFORM PROCESS-ONE-KEY
            UNTIL NEXT-KEY = HIGH-VALUES.
        termination.

    read-file-1-record.
        READ file-1 INTO file-1-layout
            AT END MOVE HIGH-VALUES TO file-1-key.

    (same for other files)

    CHOOSE-NEXT-KEY.
        MOVE file-1-key TO NEXT-KEY.
        IF file-2-key < NEXT-KEY
            MOVE file-2-key TO NEXT-KEY.
        IF file-3-key < NEXT-KEY
            MOVE file-3-key TO NEXT-KEY.
        .
        .
        .

    PROCESS-ONE-KEY.
        PERFORM process-file-1
            UNTIL file-1-key NOT = NEXT-KEY.
        PERFORM process-file-2
            UNTIL file-2-key NOT = NEXT-KEY.
        .
        .
        .
        PERFORM CHOOSE-NEXT-KEY.

    process-file-1.
        .
        .
        .
        PERFORM read-file-1-record.

    (same for other files)

The key point of this code is that the number of IF statements controlling the logic is one less than the number of files, e.g. there are only 4 IF statements if we are processing 5 files. In fact, contrary to "common belief", this is not a file matching problem at all, but simply one of processing a series of keys in sequence. While the "greater than - equal to - less than" logic works only for two files, the "choose next key" logic shown in this example is far superior because it extends unchanged to any number of files.

This is especially true when you need to do other things in the program besides just match two files, for example, print a report of the contents of the new master file. Adding additional files makes the historical model break down quickly.
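As a minimal runnable sketch of the pseudo-code above, the following program applies the "choose next key" logic to one master file and two transaction files. It rests on several assumptions that are not in the original: three small in-memory tables stand in for the sorted files so that the example needs no external data, all data names and key values are hypothetical, and the "processing" is just a DISPLAY. The IF statements inside the read paragraphs only simulate AT END; the logic itself is still controlled by the two IF statements in CHOOSE-NEXT-KEY.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. BALLINE.
      * Sketch only: three in-memory tables stand in for the
      * sorted master file and two transaction files.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  MASTER-DATA             PIC X(9)  VALUE "100200400".
       01  MASTER-TABLE REDEFINES MASTER-DATA.
           05  MASTER-ENTRY        PIC X(3)  OCCURS 3 TIMES.
       01  TRANS-A-DATA            PIC X(12) VALUE "100100300400".
       01  TRANS-A-TABLE REDEFINES TRANS-A-DATA.
           05  TRANS-A-ENTRY       PIC X(3)  OCCURS 4 TIMES.
       01  TRANS-B-DATA            PIC X(6)  VALUE "200500".
       01  TRANS-B-TABLE REDEFINES TRANS-B-DATA.
           05  TRANS-B-ENTRY       PIC X(3)  OCCURS 2 TIMES.
       01  MASTER-IX               PIC 9(2)  VALUE 0.
       01  TRANS-A-IX              PIC 9(2)  VALUE 0.
       01  TRANS-B-IX              PIC 9(2)  VALUE 0.
       01  MASTER-KEY              PIC X(3).
       01  TRANS-A-KEY             PIC X(3).
       01  TRANS-B-KEY             PIC X(3).
       01  NEXT-KEY                PIC X(3).
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM READ-MASTER.
           PERFORM READ-TRANS-A.
           PERFORM READ-TRANS-B.
           PERFORM CHOOSE-NEXT-KEY.
           PERFORM PROCESS-ONE-KEY
               UNTIL NEXT-KEY = HIGH-VALUES.
           STOP RUN.
      * Simulated reads: HIGH-VALUES marks an exhausted file.
       READ-MASTER.
           ADD 1 TO MASTER-IX.
           IF MASTER-IX > 3
               MOVE HIGH-VALUES TO MASTER-KEY
           ELSE
               MOVE MASTER-ENTRY (MASTER-IX) TO MASTER-KEY.
       READ-TRANS-A.
           ADD 1 TO TRANS-A-IX.
           IF TRANS-A-IX > 4
               MOVE HIGH-VALUES TO TRANS-A-KEY
           ELSE
               MOVE TRANS-A-ENTRY (TRANS-A-IX) TO TRANS-A-KEY.
       READ-TRANS-B.
           ADD 1 TO TRANS-B-IX.
           IF TRANS-B-IX > 2
               MOVE HIGH-VALUES TO TRANS-B-KEY
           ELSE
               MOVE TRANS-B-ENTRY (TRANS-B-IX) TO TRANS-B-KEY.
      * One IF per file beyond the first selects the lowest key.
       CHOOSE-NEXT-KEY.
           MOVE MASTER-KEY TO NEXT-KEY.
           IF TRANS-A-KEY < NEXT-KEY
               MOVE TRANS-A-KEY TO NEXT-KEY.
           IF TRANS-B-KEY < NEXT-KEY
               MOVE TRANS-B-KEY TO NEXT-KEY.
       PROCESS-ONE-KEY.
           DISPLAY "KEY " NEXT-KEY.
           PERFORM PROCESS-MASTER
               UNTIL MASTER-KEY NOT = NEXT-KEY.
           PERFORM PROCESS-TRANS-A
               UNTIL TRANS-A-KEY NOT = NEXT-KEY.
           PERFORM PROCESS-TRANS-B
               UNTIL TRANS-B-KEY NOT = NEXT-KEY.
           PERFORM CHOOSE-NEXT-KEY.
       PROCESS-MASTER.
           DISPLAY "    MASTER RECORD".
           PERFORM READ-MASTER.
       PROCESS-TRANS-A.
           DISPLAY "    TRANSACTION A".
           PERFORM READ-TRANS-A.
       PROCESS-TRANS-B.
           DISPLAY "    TRANSACTION B".
           PERFORM READ-TRANS-B.
```

With this sample data the keys are visited in the order 100, 200, 300, 400, 500, including keys 300 and 500 that have no master record. Adding a fourth file would mean one more read paragraph, one more IF in CHOOSE-NEXT-KEY, and one more PERFORM in PROCESS-ONE-KEY.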

RULE 11 - Use the Balanced Line algorithm for file "matching" instead of the key matching algorithm.

This same historical context of comparing keys is the basis for the common method of performing "control breaks". This is generally done with an IF statement at the beginning of the main processing paragraph. If there are multiple levels of breaks, the rule is to test for a "break" in the major control field first, the intermediate control field next, and the minor control field last, of course remembering to include execution of the minor break and the intermediate break in the major break logic, and to include the minor break logic in the intermediate break logic.

Again, if we look at the underlying structure of the data being processed, a better way to accomplish this is apparent. A file is not, for example, simply composed of records for different divisions, stores, and departments. Rather, the file is an ordered list of divisions, each division is an ordered list of stores, each store is an ordered list of departments, and each department has multiple detail records within it. The file can be depicted logically as follows (the "[...]" is a repetition indication):

Company [Division [Store [Department [Details]]]]

The corresponding report might have the following logical structure:

Company [Division [Store [Department [Details] Department total] Store total] Division total] Company total

We can align these structures as follows:

Co [Div [St [Dept [Det]      ]      ]      ]
Co [Div [St [Dept [Det] total] total] total] total

From this structure, two things are apparent - (1) there is no structure clash between the two structures, and (2) rather than a "break", groupings are composed both of things which come before the details (e.g. printing the store name) and things which come after the details (e.g. printing store totals). Pseudo-COBOL code to handle this situation is below.

        .
        .
        PERFORM COMPANY-INITIALIZATION.
        PERFORM PROCESS-DIVISION
            UNTIL NO-MORE-DATA.
        PERFORM PRINT-COMPANY-TOTALS.
        .
        .
        .
    PROCESS-DIVISION.
        PERFORM DIVISION-INITIALIZATION.
        MOVE IN-DIVISION TO CURRENT-DIVISION.
        PERFORM PROCESS-STORE
            UNTIL NO-MORE-DATA
               OR IN-DIVISION NOT = CURRENT-DIVISION.
        PERFORM PRINT-DIVISION-TOTALS.
        .
        .
        .
    PROCESS-STORE.
        PERFORM STORE-INITIALIZATION.
        MOVE IN-STORE TO CURRENT-STORE.
        PERFORM PROCESS-DEPARTMENT
            UNTIL NO-MORE-DATA
               OR IN-DIVISION NOT = CURRENT-DIVISION
               OR IN-STORE NOT = CURRENT-STORE.
        PERFORM PRINT-STORE-TOTALS.
        .
        .
        .
    PROCESS-DEPARTMENT.
        PERFORM DEPARTMENT-INITIALIZATION.
        MOVE IN-DEPT TO CURRENT-DEPT.
        PERFORM PROCESS-DETAIL
            UNTIL NO-MORE-DATA
               OR IN-DIVISION NOT = CURRENT-DIVISION
               OR IN-STORE NOT = CURRENT-STORE
               OR IN-DEPT NOT = CURRENT-DEPT.
        PERFORM PRINT-DEPARTMENT-TOTALS.
        .
        .
        .
    PROCESS-DETAIL.
        .
        .
        PERFORM READ-INPUT-FILE.

Several advantages come from this revised structure. (1) Performing things at the beginning of a "group" (e.g. skipping to a new page) is just as simple as performing things at the end of a "group" (e.g. printing group totals). In fact, things like zeroing out totals are natural to do in the group initialization routine, rather than having to remember to do them in the global program initialization routine as well as after each total has been printed out. (2) "Rolling totals" upward is a fairly easy thing to do. The "last time" break at the end of the file no longer requires additional code. (3) The complexity introduced by testing for control breaks with IF statements has been eliminated. It is interesting to note that the "mainline" logic that accumulates grand totals is almost always coded using this logic structure, even when the sub-totals are done using a different structure.
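To make points (1) and (2) concrete, here is a minimal runnable sketch of the same nesting, cut down to two levels (division and store). The assumptions are all illustrative and not from the original: a five-entry in-memory table stands in for the sorted input file, the record layout and amounts are invented, and DISPLAY stands in for report printing. Each group zeroes its own total on entry and rolls it into the enclosing total on exit.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CTLBREAK.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Each 5-byte entry: division (1), store (1), amount (3).
       01  INPUT-DATA              PIC X(25)
               VALUE "AX010AX020AY005BX100BZ001".
       01  INPUT-TABLE REDEFINES INPUT-DATA.
           05  INPUT-ENTRY         OCCURS 5 TIMES.
               10  TBL-DIVISION    PIC X.
               10  TBL-STORE       PIC X.
               10  TBL-AMOUNT      PIC 9(3).
       01  INPUT-IX                PIC 9(2)  VALUE 0.
       01  INPUT-RECORD.
           05  IN-DIVISION         PIC X.
           05  IN-STORE            PIC X.
           05  IN-AMOUNT           PIC 9(3).
       01  NO-MORE-DATA-SWITCH     PIC X(3)  VALUE "NO".
           88  NO-MORE-DATA                  VALUE "YES".
       01  CURRENT-DIVISION        PIC X.
       01  CURRENT-STORE           PIC X.
       01  STORE-TOTAL             PIC 9(5).
       01  DIVISION-TOTAL          PIC 9(5).
       01  COMPANY-TOTAL           PIC 9(5).
       PROCEDURE DIVISION.
       PROCESS-COMPANY.
           MOVE 0 TO COMPANY-TOTAL.
           PERFORM READ-INPUT-FILE.
           PERFORM PROCESS-DIVISION
               UNTIL NO-MORE-DATA.
           DISPLAY "COMPANY TOTAL " COMPANY-TOTAL.
           STOP RUN.
       PROCESS-DIVISION.
      * Group initialization: zero this level's total.
           MOVE 0 TO DIVISION-TOTAL.
           MOVE IN-DIVISION TO CURRENT-DIVISION.
           PERFORM PROCESS-STORE
               UNTIL NO-MORE-DATA
                  OR IN-DIVISION NOT = CURRENT-DIVISION.
      * Group termination: print, then roll the total upward.
           DISPLAY "  DIVISION " CURRENT-DIVISION " " DIVISION-TOTAL.
           ADD DIVISION-TOTAL TO COMPANY-TOTAL.
       PROCESS-STORE.
           MOVE 0 TO STORE-TOTAL.
           MOVE IN-STORE TO CURRENT-STORE.
           PERFORM PROCESS-DETAIL
               UNTIL NO-MORE-DATA
                  OR IN-DIVISION NOT = CURRENT-DIVISION
                  OR IN-STORE NOT = CURRENT-STORE.
           DISPLAY "    STORE " CURRENT-STORE " " STORE-TOTAL.
           ADD STORE-TOTAL TO DIVISION-TOTAL.
       PROCESS-DETAIL.
           ADD IN-AMOUNT TO STORE-TOTAL.
           PERFORM READ-INPUT-FILE.
      * Simulated read: the switch is set when the table runs out.
       READ-INPUT-FILE.
           ADD 1 TO INPUT-IX.
           IF INPUT-IX > 5
               MOVE "YES" TO NO-MORE-DATA-SWITCH
           ELSE
               MOVE INPUT-ENTRY (INPUT-IX) TO INPUT-RECORD.
```

Note that the totals for the last store and the last division are printed by the same paragraphs as every other group; there is no separate "last time" logic at end of file.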

RULE 12 - Use the PERFORM UNTIL structure instead of the historical control break structure.

The previous example also illustrates another common coding structure - the so-called "triform program" [4]. This is a simple extrapolation of the fact that most programs process repeating groups of data. Therefore, it is almost obvious that the matching program logic to process this type of data is:

(1) Do whatever needs to be done at the beginning of a group.

(2) Process each element of the group in a loop structure.

(3) Do whatever needs to be done at the end of a group.

Therefore, we can expect most programs to have a main logic paragraph that looks like:

        PERFORM xxxxx-INITIALIZATION.
        PERFORM xxxxx-PROCESS
            UNTIL terminating-condition.
        PERFORM xxxxx-TERMINATION.

In fact, the multi-level control program illustrated earlier is just several of these structures nested within one another. There are, however, two points of contention about this structure. (1) If the xxxxx-INITIALIZATION or xxxxx-TERMINATION procedures are sufficiently small, it is sometimes suggested that they be coded "in-line", that is, simply coded in place of the PERFORM noted. (2) The words INITIALIZATION and TERMINATION may be objectionable to some. The author does not object to either of these variations. Apart from these two minor points, if most of our programs were structured this way, they could be more easily understood by others.

RULE 13 - Use the "triform" structure as the main control structure in your program.

The preceding examples have shown ways that the main logic flow of a program can be done without the usual IF statements that are often found in a program. However, there are other instances where IF statements are necessary. Unfortunately, there is also a lot of misconception about the use of the IF statement.

Back before the advent of structured code, when much code that was being produced was so-called "spaghetti code", programmers were still somewhat concerned about having their code be understandable. Since their primary control structure was the IF statement, there was a certain amount of effort made to control its use. In particular, people tried to avoid nested IF statements since they were difficult to read and understand (especially when coded without the indentation rules of today).

However, there are many instances of "nested" IF statements where what is being implemented is not so much a complex structure as a variation of the CASE structure of structured programming. For example:

        IF MARITAL-STATUS-IS-SINGLE
            PERFORM SINGLE-ROUTINE
        ELSE IF MARITAL-STATUS-IS-DIVORCED
            PERFORM DIVORCED-ROUTINE
        ELSE IF MARITAL-STATUS-IS-WIDOWED
            PERFORM WIDOWED-ROUTINE
        ELSE
            PERFORM BAD-MARITAL-STATUS-ROUTINE.

This is called a nested IF and, under modern indentation rules, is often coded as:

        IF MARITAL-STATUS-IS-SINGLE
            PERFORM SINGLE-ROUTINE
        ELSE
            IF MARITAL-STATUS-IS-DIVORCED
                PERFORM DIVORCED-ROUTINE
            ELSE
                IF MARITAL-STATUS-IS-WIDOWED
                    PERFORM WIDOWED-ROUTINE
                ELSE
                    PERFORM BAD-MARITAL-STATUS-ROUTINE.

This second example fails to show the fact that each branch of the IF is mutually exclusive and exhaustive, just as the structured CASE statement denotes. In fact, as the number of possible paths grows (for example, several different transaction codes), this latter structure gets out of control even if the indentation level is reduced.

The failure is to recognize that there is a difference between the typical "nested IF" statement and the use of multiple IF statements to implement a CASE structure. In the CASE structure, it is almost as if there were an ELSEIF verb, as the IF for each path always follows the ELSE for the path above it. When coded properly, there is no additional indentation needed, and the code is easy to understand. Some call these two forms "linear" and "non-linear" nesting. [In COBOL-85 the EVALUATE verb can be used to implement the CASE structure instead.]
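For readers with a COBOL-85 compiler, the marital-status example might be written with EVALUATE roughly as follows. This is a sketch under assumptions: the one-character status codes, the 88-level values, and the DISPLAY bodies of the routines are invented for illustration; only the condition names and routine names come from the example above.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CASEDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * The status codes below are assumed for illustration.
       01  MARITAL-STATUS          PIC X     VALUE "D".
           88  MARITAL-STATUS-IS-SINGLE      VALUE "S".
           88  MARITAL-STATUS-IS-DIVORCED    VALUE "D".
           88  MARITAL-STATUS-IS-WIDOWED     VALUE "W".
       PROCEDURE DIVISION.
       MAIN-LOGIC.
      * Exactly one WHEN branch is taken; WHEN OTHER catches
      * every value not listed.
           EVALUATE TRUE
               WHEN MARITAL-STATUS-IS-SINGLE
                   PERFORM SINGLE-ROUTINE
               WHEN MARITAL-STATUS-IS-DIVORCED
                   PERFORM DIVORCED-ROUTINE
               WHEN MARITAL-STATUS-IS-WIDOWED
                   PERFORM WIDOWED-ROUTINE
               WHEN OTHER
                   PERFORM BAD-MARITAL-STATUS-ROUTINE
           END-EVALUATE.
           STOP RUN.
       SINGLE-ROUTINE.
           DISPLAY "SINGLE".
       DIVORCED-ROUTINE.
           DISPLAY "DIVORCED".
       WIDOWED-ROUTINE.
           DISPLAY "WIDOWED".
       BAD-MARITAL-STATUS-ROUTINE.
           DISPLAY "BAD MARITAL STATUS".
```

Like the "linear" form, every branch sits at the same indentation level, which makes the mutually exclusive and exhaustive nature of the paths visible at a glance.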

RULE 14 - Do not indent when coding a "linear" nested IF to implement a CASE structure. Code the ELSE IF on the same line as if it were a single verb.

Another construct that often confuses the issue when using nested IF statements is the NEXT SENTENCE structure. It is often used only when absolutely necessary in order to get a series of IF and ELSE pairs to match up properly. However, it can have two other roles: (1) to force the developer to consider what really belongs on the other path of the IF statement, and (2) to show a consistency of approach to the maintenance programmer when he or she is examining the code. Especially with nested IF statements, always being able to depend on having an ELSE NEXT SENTENCE properly aligned with the IF statement, even if not strictly necessary, makes the code easier to read: the level of indentation taken by the various nested IF statements is always matched by "outdentation" as the matching ELSE statements are coded. This can have a very definite effect on reducing the difficulty in understanding nested IF statements.
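A small sketch may help show why the ELSE NEXT SENTENCE matters in period-terminated code. The account fields and routine names here are hypothetical and chosen only for illustration. Without the inner ELSE NEXT SENTENCE, the final ELSE would pair with the inner IF, and an active account with a zero balance would be archived.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. NEXTSENT.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical data: an active account with nothing due.
       01  ACCOUNT-STATUS          PIC X     VALUE "A".
           88  ACCOUNT-IS-ACTIVE             VALUE "A".
       01  BALANCE-DUE             PIC 9(5)  VALUE 0.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
      * Every IF has its own ELSE, so each ELSE lines up with
      * the IF it belongs to.
           IF ACCOUNT-IS-ACTIVE
               IF BALANCE-DUE > 0
                   PERFORM SEND-STATEMENT
               ELSE
                   NEXT SENTENCE
           ELSE
               PERFORM ARCHIVE-ACCOUNT.
           DISPLAY "DONE".
           STOP RUN.
       SEND-STATEMENT.
           DISPLAY "STATEMENT SENT".
       ARCHIVE-ACCOUNT.
           DISPLAY "ACCOUNT ARCHIVED".
```

With the values shown, neither routine is performed and control passes to the sentence following the period.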

RULE 15 - When coding nested IF statements, always code both the true and false paths, using the NEXT SENTENCE or ELSE NEXT SENTENCE construct as necessary.

There is one other coding construct where the IF verb is used incorrectly. When there is the possibility of a routine being performed zero times, historically some have used the following code construct:

        IF NOT condition THEN
            PERFORM routine UNTIL condition.

An example of this is:

        IF NOT END-OF-FILE THEN
            PERFORM MAIN-PROCESS UNTIL END-OF-FILE.

This construct is not only overly complex, but is also less efficient. A better construct is:

        PERFORM MAIN-PROCESS UNTIL END-OF-FILE.

In the first example, when the END-OF-FILE condition is true, the IF statement is evaluated once and the remainder of the sentence is skipped. Since the COBOL PERFORM verb is a "test before" construct, in the second example the condition is evaluated once and the PERFORM statement terminates without any iterations. Thus there is no difference in efficiency between the two in this case.

However, in the first example, when the END-OF-FILE condition is false, the IF statement is evaluated, the need to iterate is recognized, and the loop is performed the appropriate number of times, with an evaluation of the condition preceding each iteration. Thus when the loop is iterated "n" times, there are "n+1" tests of the loop control condition in addition to the single IF condition. In the second example, the initial IF condition is missing and the loop condition is simply evaluated "n+1" times. Thus the second example is more efficient for the machine, in addition to being more efficient for the programmer to write.

Nevertheless, the first example was often used historically, even in structured programming, because programmers distrusted and misunderstood the PERFORM statement (having previously used the GO TO construct). A proper understanding of the "test before" behavior of the PERFORM statement shows that the simpler construct is the better one.
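The "test before" behavior can be checked with a short sketch such as the one below. It assumes the end-of-file switch is already set before the loop, as it would be after the first read of an empty file; the counter is a hypothetical name added only to make the number of iterations visible. For contrast, the second PERFORM uses the COBOL-85 WITH TEST AFTER phrase, which is not part of the discussion above.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. TESTBFOR.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * The switch starts at "YES", as after an empty-file read.
       01  END-OF-FILE-SWITCH      PIC X(3)  VALUE "YES".
           88  END-OF-FILE                   VALUE "YES".
       01  ITERATION-COUNT         PIC 9(4)  VALUE 0.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
      * Default (test before): the condition is checked first,
      * so MAIN-PROCESS is performed zero times.
           PERFORM MAIN-PROCESS UNTIL END-OF-FILE.
           DISPLAY "TEST BEFORE ITERATIONS: " ITERATION-COUNT.
           MOVE 0 TO ITERATION-COUNT.
      * COBOL-85 contrast: with TEST AFTER the body runs once
      * before the condition is checked.
           PERFORM MAIN-PROCESS WITH TEST AFTER UNTIL END-OF-FILE.
           DISPLAY "TEST AFTER ITERATIONS:  " ITERATION-COUNT.
           STOP RUN.
       MAIN-PROCESS.
           ADD 1 TO ITERATION-COUNT.
```

The first count displayed is zero, which is why no guarding IF is needed around the default form of the loop.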

RULE 16 - Do not code an IF statement around a PERFORM loop when the loop's terminating condition already tests the same condition.

In all the above examples, all the PERFORM statements have been simple ones, i.e. no PERFORM ... THRU constructs have been shown. However, many people still continue to use the PERFORM ... THRU construct and to recommend it to others. In fact, the programming standards at some installations may require its use. This also has some historical basis.

Before structured programming, most programmers avoided the PERFORM statement altogether because it was not "efficient". When they converted to using structured code and the PERFORM statement came into vogue, there was still a lot of misunderstanding about how to code certain common logic constructs using structured code. For example, there was much concern about the need for the so-called "early exit". The following COBOL construct was fairly common:

        PERFORM MAIN-ROUTINE THRU MAIN-EXIT
            UNTIL END-OF-FILE.
        .
        .
    MAIN-ROUTINE.
        READ file AT END
            MOVE 'YES' TO END-OF-FILE-SWITCH.
        IF END-OF-FILE
            GO TO MAIN-EXIT.
        .
        .
        .
    MAIN-EXIT.
        EXIT.

The need to terminate the routine because some other condition was reached necessitated an "early exit" from the routine. This construct seemed to crop up a lot when a programmer who was used to non-structured coding converted to structured coding. Therefore it was recommended that the PERFORM ... THRU construct always be used since it allowed early exits both when needed in situations as above and when needed during program maintenance later on. However, as we have come to better understand how to use the structured coding constructs, it has also been shown that such "early exits" are not necessary. The above construct could have been coded:

        READ file AT END
            MOVE 'YES' TO END-OF-FILE-SWITCH.
        PERFORM MAIN-ROUTINE
            UNTIL END-OF-FILE.
        .
        .
    MAIN-ROUTINE.
        .
        .
        READ file AT END
            MOVE 'YES' TO END-OF-FILE-SWITCH.

It is usually easy to develop a modified routine for each such "early exit" example. Thus the PERFORM ... THRU construct is not really needed. Avoiding this construct gives several other benefits. (1) The programmer can never forget to code the THRU phrase, thus avoiding inadvertent fall-thru logic, which leads to difficult-to-debug situations. (2) The GO TO statement is not needed, so there is no temptation to use it in uncontrolled ways beyond the "early exit" situation. (3) The programmer needs to do less coding, making for shorter and more understandable programs. (4) There is no longer a temptation to code statements other than the EXIT verb in the "exit" paragraph.

One argument for the PERFORM ... THRU construct is that the EXIT paragraph is a positive delineation that the end of the routine has been reached. However, the beginning of a new paragraph is equally such a delineation of the end of the previous routine.

RULE 17 - Do not use the PERFORM ... THRU construct.

The original purpose of the FILE SECTION was to describe all the data that was contained in files, while the WORKING-STORAGE SECTION was to describe "work" variables that did not exist in the files. This was also an efficient way to describe the data when computer memory was limited. But with the increasing size of computers, there are many advantages to describing all data in the WORKING-STORAGE SECTION, and to using the FILE SECTION only to describe the actual files. (These are enumerated in the next section which examines COBOL textbooks.) Thus an FD entry might be:

        FD  file-name.
        01  record-name             PIC X(...).

The actual layout of the record would be contained in WORKING-STORAGE. Since we are then always referencing variables in WORKING-STORAGE, there should no longer be a need for any form of the READ or WRITE verb except READ ... INTO and WRITE ... FROM.
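The arrangement described above might look like the following minimal sketch. Several things are assumptions for illustration only: the file name CUSTOMER.DAT given as a literal in the ASSIGN clause (accepted by compilers such as GnuCOBOL, whereas a mainframe program would normally assign to a ddname), the 25-byte customer layout, and the sample values. The program writes two records FROM the WORKING-STORAGE layout and then reads them back INTO it.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. READINTO.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
      * Assumed file name; adjust for the local environment.
           SELECT CUSTOMER-FILE ASSIGN TO "CUSTOMER.DAT"
               ORGANIZATION IS SEQUENTIAL.
       DATA DIVISION.
       FILE SECTION.
      * The FD only gives the record a name and a length.
       FD  CUSTOMER-FILE.
       01  CUSTOMER-FILE-RECORD    PIC X(25).
       WORKING-STORAGE SECTION.
      * The real layout lives in WORKING-STORAGE.
       01  CUSTOMER-RECORD.
           05  CUSTOMER-NUMBER     PIC 9(5).
           05  CUSTOMER-NAME       PIC X(20).
       01  END-OF-FILE-SWITCH      PIC X(3)  VALUE "NO".
           88  END-OF-FILE                   VALUE "YES".
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           OPEN OUTPUT CUSTOMER-FILE.
           MOVE 10001 TO CUSTOMER-NUMBER.
           MOVE "FIRST CUSTOMER" TO CUSTOMER-NAME.
           WRITE CUSTOMER-FILE-RECORD FROM CUSTOMER-RECORD.
           MOVE 10002 TO CUSTOMER-NUMBER.
           MOVE "SECOND CUSTOMER" TO CUSTOMER-NAME.
           WRITE CUSTOMER-FILE-RECORD FROM CUSTOMER-RECORD.
           CLOSE CUSTOMER-FILE.
           OPEN INPUT CUSTOMER-FILE.
           PERFORM READ-CUSTOMER.
           PERFORM SHOW-CUSTOMER
               UNTIL END-OF-FILE.
           CLOSE CUSTOMER-FILE.
           STOP RUN.
       READ-CUSTOMER.
           READ CUSTOMER-FILE INTO CUSTOMER-RECORD
               AT END MOVE "YES" TO END-OF-FILE-SWITCH.
       SHOW-CUSTOMER.
           DISPLAY CUSTOMER-NUMBER " " CUSTOMER-NAME.
           PERFORM READ-CUSTOMER.
```

The PROCEDURE DIVISION never refers to a field inside the FD record; everything it touches is in WORKING-STORAGE.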

RULE 18 - Always use the READ ... INTO and WRITE ... FROM forms of these verbs and define all record definitions in the WORKING-STORAGE SECTION.

There are many advantages to using program "controls": (1) for verifying that the program is working properly, (2) as an indication of the volume of data that the program is processing, and (3) to verify that the proper files are being passed from one program to another. Some installations' standards specify that all programs must maintain control totals. This is generally a good standard.

However, there is sometimes a question of what the program should be controlling. Often the answer is simply to count the records being read and the records being written. This will often satisfy all three criteria mentioned above. In many simpler programs this is implemented by counting all the records read, counting all the records written, printing these two counts on a "control report", and noting that the two counts are equal.

With the use of proper structured coding constructs and the development of logically simpler programs, many programs will need little verification of working properly. Also, the author has observed many instances where the error indicated in the "control report" was not a problem of the program not working properly, but the counters not being incremented properly. If the counting can go wrong as often as the program, perhaps another approach is needed.

We need to ask ourselves, just why am I counting, and what should I be counting? A solution that may meet several needs is to count the records as they are processed, rather than count them as they are read. Look at the following two examples:

    example 1:

        PERFORM READ-FILE.
        PERFORM MAIN-ROUTINE
            UNTIL END-OF-FILE.
        .
        .
    READ-FILE.
        READ file AT END
            MOVE 'YES' TO END-OF-FILE-SWITCH.
        IF NOT END-OF-FILE
            ADD 1 TO RECORDS-READ.
        .
        .
    MAIN-ROUTINE.
        .
        .
        PERFORM READ-FILE.

    example 2:

        PERFORM READ-FILE.
        PERFORM MAIN-ROUTINE
            UNTIL END-OF-FILE.
        .
        .
    READ-FILE.
        READ file AT END
            MOVE 'YES' TO END-OF-FILE-SWITCH.
        .
        .
    MAIN-ROUTINE.
        ADD 1 TO RECORDS-PROCESSED.
        .
        .
        PERFORM READ-FILE.

Both of these examples will end up counting the same number of records. However, the second example eliminates the problem of forgetting the "IF NOT END-OF-FILE" when counting the reads. (The author has seen examples of initializing the read-counter to -1 or of subtracting 1 during the termination routine to compensate for the AT END.) Also, since most of the other counting of various paths taken, sub-totals, etc., is in the MAIN-ROUTINE, it may make more sense to "count the records" there as well.
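The claim that both approaches arrive at the same count can be checked with a small sketch that keeps both counters in one program. The assumptions are illustrative only: a three-entry in-memory table stands in for the input file, and the first IF in READ-FILE merely simulates the AT END condition. The second IF in READ-FILE is the guard that example 1 depends on; RECORDS-PROCESSED needs no guard at all.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. COUNTREC.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * A three-entry table stands in for the input file.
       01  INPUT-DATA              PIC X(6)  VALUE "AABBCC".
       01  INPUT-TABLE REDEFINES INPUT-DATA.
           05  INPUT-ENTRY         PIC X(2)  OCCURS 3 TIMES.
       01  INPUT-IX                PIC 9(2)  VALUE 0.
       01  INPUT-RECORD            PIC X(2).
       01  END-OF-FILE-SWITCH      PIC X(3)  VALUE "NO".
           88  END-OF-FILE                   VALUE "YES".
       01  RECORDS-READ            PIC 9(5)  VALUE 0.
       01  RECORDS-PROCESSED       PIC 9(5)  VALUE 0.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM READ-FILE.
           PERFORM MAIN-ROUTINE
               UNTIL END-OF-FILE.
           DISPLAY "RECORDS READ:      " RECORDS-READ.
           DISPLAY "RECORDS PROCESSED: " RECORDS-PROCESSED.
           STOP RUN.
       READ-FILE.
      * Simulated READ ... AT END.
           ADD 1 TO INPUT-IX.
           IF INPUT-IX > 3
               MOVE "YES" TO END-OF-FILE-SWITCH
           ELSE
               MOVE INPUT-ENTRY (INPUT-IX) TO INPUT-RECORD.
      * Example 1 style: the count must be guarded, or the
      * end-of-file "read" is counted as a record.
           IF NOT END-OF-FILE
               ADD 1 TO RECORDS-READ.
       MAIN-ROUTINE.
      * Example 2 style: no guard is needed here.
           ADD 1 TO RECORDS-PROCESSED.
           PERFORM READ-FILE.
```

Both counters finish at 3 for this data; removing the IF NOT END-OF-FILE guard would leave RECORDS-READ one too high.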

RULE 19 - Consider counting "records processed" instead of "records read". At least ask yourself, "why am I counting?" and "what am I counting?".

Finally, there are situations where a program could easily do more than one thing, or where the ability to do special things during testing to aid the testing process would be beneficial. This has often been done in the past by (1) getting the program to work for one situation, then copying it and creating a "near-clone" to handle the second similar situation, or (2) inserting extra code during testing, then removing it [hopefully!] before the final compile of the "production" program.

However, both of these situations could be handled by including the additional code and making it "conditional". This conditional code would be turned on through the use of a run-time parameter (either with a "PARM=" on the EXEC card, or through some other method). This would eliminate the following problems: (1) having to make a change to the program and also having to make the same change to the "near-clone" program, (2) forgetting to eliminate the "test" code before making the "production" version, and (3) having to re-insert the "test" code when needing to test revisions to the program at a later time. (The author has also used this technique to allow bypassing all the actual file modification code, thus allowing testing of update programs against the live production files without actually updating them). A small amount of foresight when designing a program can make a large amount of difference later on.
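As a hedged sketch of such a run-time switch, the program below assumes the IBM mainframe convention in which the PARM string from the EXEC statement reaches the main program as a halfword binary length followed by the text. The parameter value TEST, the program name, the switch name, and the DISPLAY messages are all hypothetical; other environments deliver run-time parameters differently, so the LINKAGE SECTION would need to be adapted there.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PARMDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  TEST-MODE-SWITCH        PIC X(3)  VALUE "NO".
           88  TEST-MODE                     VALUE "YES".
       LINKAGE SECTION.
      * Assumed z/OS layout: halfword length, then PARM text.
       01  PARM-AREA.
           05  PARM-LENGTH         PIC S9(4) COMP.
           05  PARM-TEXT           PIC X(100).
       PROCEDURE DIVISION USING PARM-AREA.
       MAIN-LOGIC.
      * Only look at the text when at least 4 characters came in.
           IF PARM-LENGTH > 3
               IF PARM-TEXT (1:4) = "TEST"
                   MOVE "YES" TO TEST-MODE-SWITCH.
           PERFORM UPDATE-MASTER.
           STOP RUN.
       UPDATE-MASTER.
      * The conditional code: the update is bypassed in test mode.
           IF TEST-MODE
               DISPLAY "TEST MODE - MASTER FILE UPDATE BYPASSED"
           ELSE
               DISPLAY "MASTER FILE WOULD BE UPDATED HERE".
```

Under these assumptions, a job step coded with PARM='TEST' on the EXEC statement would take the bypass path, and the same load module run without the PARM would perform the update, so no separate "test" copy of the program is needed.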

RULE 20 - Consider adding "PARM" overrides to allow for easy program testing, eliminate the need for "near-clones", etc.

CHAPTER 11 ENDNOTES

1 - Michael Jackson. Principles of Program Design.

2 - E.W. Dijkstra. A Discipline of Programming.

3 - Barry Dwyer. One More Time - How to Update a Master File.

4 - Dan W. Crockett. Triform Programs.
