Probably the most important concept in the PROCEDURE DIVISION
is to make the program structure match the underlying structure
of the data that it is processing. This single guiding principle,
when effectively applied, will cure most of the ills of coding
the COBOL logic. Michael Jackson argued persuasively in 1975
that for many common tasks in business data processing, a thorough
understanding of the input and output data can almost automatically
lead to code [1]. Although the particulars change,
the theme remains: before writing code, good programmers thoroughly
understand the input, the output, and the intermediate data structures
around which their programs are built.
This concept was thoroughly discussed by E.W. Dijkstra in
1976 [2] and applied to COBOL programming in
1981 by Barry Dwyer [3].
However, this has not always been the case. COBOL coding
began in the early 1960's, 15 years before Jackson became a major
proponent of this concept. Earlier constructs were modeled more
after earlier assembler languages and other languages such as
FORTRAN and were more concerned with machine efficiency than with
the data structures being processed. But even now, 14 years after
some of these principles were introduced, many coding constructs
from the 1960's are still being followed.
The most insidious example of this is the two-file matching
logic. The primary control statement in early versions of FORTRAN
was the three-way IF statement. (The logical IF statement did
not come until later.) The three-way IF evaluated a numeric expression
and took one of three branches - less than zero, zero, or greater
than zero. [Incidentally, this is also the root of the concept
that all file "keys" must be numbers, e.g. customer
number, part number, etc.] This three-way matching of variables
carried over into COBOL, and was used in the matching of "keys"
of the two files being matched. "If the master-key is less
than the transaction-key then ..., if the master-key is the same
as the transaction-key then ..., if the master-key is greater
than the transaction-key then ..." This logic continues to
be "common knowledge" as "the way" to match
two files even now.
However, what if you ask people using this logic how to match
three files? How about four? Five? They give you a blank stare
and think you must be crazy to even want to try it. But with
the proper logic structure, these situations are no harder than
the two-file situation. Let's look at the underlying data structure
of the generalized n-file situation (a master file and several
different transaction files, all sorted by the same key).
Each file is composed of records with keys, and each file
(except the master file generally) may have multiple records with
the same key. The problem is not one of "matching"
the keys across the various files, but one of processing each
key in order, regardless of which file(s) it comes from. This
particular construct is sometimes called the "Balanced Line
Algorithm". This logic, in pseudo-COBOL, looks like the
following.
The Balanced Line Algorithm for file matching
    initialization.
    read one record from each file.
    PERFORM CHOOSE-NEXT-KEY.
    PERFORM PROCESS-ONE-KEY
        UNTIL NEXT-KEY = HIGH-VALUES.
    termination.
read-file-1-record.
    READ file-1 INTO file-1-layout
        AT END MOVE HIGH-VALUES TO file-1-key.
(same for other files)
CHOOSE-NEXT-KEY.
    MOVE file-1-key TO NEXT-KEY.
    IF file-2-key < NEXT-KEY
        MOVE file-2-key TO NEXT-KEY.
    IF file-3-key < NEXT-KEY
        MOVE file-3-key TO NEXT-KEY.
    .
    .
    .
PROCESS-ONE-KEY.
    PERFORM process-file-1
        UNTIL file-1-key NOT = NEXT-KEY.
    PERFORM process-file-2
        UNTIL file-2-key NOT = NEXT-KEY.
    .
    .
    .
    PERFORM CHOOSE-NEXT-KEY.
process-file-1.
    .
    .
    .
    PERFORM read-file-1-record.
(same for other files)
The key point of this code is that the number of IF statements
controlling the logic is one less than the number of files, e.g.
there are only 4 IF statements if we are processing 5 files.
In fact, contrary to "common belief", this is not a
file matching problem at all, but a simple one of processing a
series of keys in sequence. While the "greater than - equal
to - less than" logic may work for only two files, the "choose
next key" logic shown in this example is far superior.
This is especially true when you need to do other things in
the program besides just match two files, for example, print a
report of the contents of the new master file. Adding additional
files makes the historical model break down quickly.
RULE 11 - Use the Balanced Line algorithm for file "matching"
instead of the key matching algorithm.
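The balanced line logic can be tried out without any input files. The following is only a sketch under stated assumptions: the three "files" are hypothetical in-memory key lists (FILE-n-DATA), each ending in a HIGH-VALUES entry that stands in for the AT END clause, and the program name, the sample keys, and the DISPLAY statements are illustrative rather than part of the original algorithm.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. BALLINE.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical in-memory stand-ins for three sorted files.
      * The final HIGH-VALUES entry plays the part of AT END.
       01  FILE-1-DATA.
           05  FILLER          PIC X(3) VALUE 'A01'.
           05  FILLER          PIC X(3) VALUE 'A03'.
           05  FILLER          PIC X(3) VALUE HIGH-VALUES.
       01  FILE-1-TABLE REDEFINES FILE-1-DATA.
           05  FILE-1-ENTRY    PIC X(3) OCCURS 3 TIMES.
       01  FILE-2-DATA.
           05  FILLER          PIC X(3) VALUE 'A01'.
           05  FILLER          PIC X(3) VALUE 'A01'.
           05  FILLER          PIC X(3) VALUE 'A02'.
           05  FILLER          PIC X(3) VALUE HIGH-VALUES.
       01  FILE-2-TABLE REDEFINES FILE-2-DATA.
           05  FILE-2-ENTRY    PIC X(3) OCCURS 4 TIMES.
       01  FILE-3-DATA.
           05  FILLER          PIC X(3) VALUE 'A03'.
           05  FILLER          PIC X(3) VALUE 'A04'.
           05  FILLER          PIC X(3) VALUE HIGH-VALUES.
       01  FILE-3-TABLE REDEFINES FILE-3-DATA.
           05  FILE-3-ENTRY    PIC X(3) OCCURS 3 TIMES.
       01  FILE-1-IX           PIC 9(2) VALUE 0.
       01  FILE-2-IX           PIC 9(2) VALUE 0.
       01  FILE-3-IX           PIC 9(2) VALUE 0.
       01  FILE-1-KEY          PIC X(3).
       01  FILE-2-KEY          PIC X(3).
       01  FILE-3-KEY          PIC X(3).
       01  NEXT-KEY            PIC X(3).
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM READ-FILE-1-RECORD.
           PERFORM READ-FILE-2-RECORD.
           PERFORM READ-FILE-3-RECORD.
           PERFORM CHOOSE-NEXT-KEY.
           PERFORM PROCESS-ONE-KEY
               UNTIL NEXT-KEY = HIGH-VALUES.
           STOP RUN.
       CHOOSE-NEXT-KEY.
      * Two IF statements for three files: the lowest key wins.
           MOVE FILE-1-KEY TO NEXT-KEY.
           IF FILE-2-KEY < NEXT-KEY
               MOVE FILE-2-KEY TO NEXT-KEY.
           IF FILE-3-KEY < NEXT-KEY
               MOVE FILE-3-KEY TO NEXT-KEY.
       PROCESS-ONE-KEY.
           PERFORM PROCESS-FILE-1
               UNTIL FILE-1-KEY NOT = NEXT-KEY.
           PERFORM PROCESS-FILE-2
               UNTIL FILE-2-KEY NOT = NEXT-KEY.
           PERFORM PROCESS-FILE-3
               UNTIL FILE-3-KEY NOT = NEXT-KEY.
           PERFORM CHOOSE-NEXT-KEY.
       PROCESS-FILE-1.
           DISPLAY NEXT-KEY ' FROM FILE 1'.
           PERFORM READ-FILE-1-RECORD.
       PROCESS-FILE-2.
           DISPLAY NEXT-KEY ' FROM FILE 2'.
           PERFORM READ-FILE-2-RECORD.
       PROCESS-FILE-3.
           DISPLAY NEXT-KEY ' FROM FILE 3'.
           PERFORM READ-FILE-3-RECORD.
       READ-FILE-1-RECORD.
           ADD 1 TO FILE-1-IX.
           MOVE FILE-1-ENTRY (FILE-1-IX) TO FILE-1-KEY.
       READ-FILE-2-RECORD.
           ADD 1 TO FILE-2-IX.
           MOVE FILE-2-ENTRY (FILE-2-IX) TO FILE-2-KEY.
       READ-FILE-3-RECORD.
           ADD 1 TO FILE-3-IX.
           MOVE FILE-3-ENTRY (FILE-3-IX) TO FILE-3-KEY.
```

With this sample data the keys are handled in the order A01, A02, A03, A04, and the duplicate A01 in the second list is processed twice before the next key is chosen. Adding a fourth list would add one more IF to CHOOSE-NEXT-KEY and one more PERFORM to PROCESS-ONE-KEY, with nothing else changing.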
This same historical context of comparing keys is the basis
for the common method of performing "control breaks".
This is generally done with an IF statement at the beginning of
the main processing paragraph. If there are multiple levels of
breaks, the rule is to test for a "break" in the major
control field first, the intermediate control field next, and
the minor control field last, of course remembering to include
execution of the minor break and the intermediate break in the
major break logic, and to include the minor break logic in the
intermediate break logic.
Again, if we look at the underlying structure of the data
being processed, a better way to accomplish this is apparent.
A file is not, for example, simply composed of records for different
divisions, stores, and departments. Rather, the file is an ordered
list of divisions, each division is an ordered list of stores,
each store is an ordered list of departments, and each department
has multiple detail records within it. The file can be depicted
logically as follows (the "[...]" is a repetition indication):
Company [Division [Store [Department [Details]]]]
The corresponding report might have the following logical
structure:
Company [Division [Store [Department [Details] Department total] Store total] Division total] Company total
We can align these structures as follows:
Co [Div [St [Dept [Det]      ]      ]      ]
Co [Div [St [Dept [Det] total] total] total] total
From this structure, two things are apparent - (1) there is
no structure clash between the two structures, and (2) rather
than a "break", groupings are composed both of things
which come before the details (e.g. printing the store name) and
things which come after the details (e.g. printing store totals).
Pseudo-COBOL code to handle this situation is below.
    .
    .
    PERFORM COMPANY-INITIALIZATION.
    PERFORM PROCESS-DIVISION
        UNTIL NO-MORE-DATA.
    PERFORM PRINT-COMPANY-TOTALS.
    .
    .
    .
PROCESS-DIVISION.
    PERFORM DIVISION-INITIALIZATION.
    MOVE IN-DIVISION TO CURRENT-DIVISION.
    PERFORM PROCESS-STORE
        UNTIL NO-MORE-DATA
        OR IN-DIVISION NOT = CURRENT-DIVISION.
    PERFORM PRINT-DIVISION-TOTALS.
    .
    .
    .
PROCESS-STORE.
    PERFORM STORE-INITIALIZATION.
    MOVE IN-STORE TO CURRENT-STORE.
    PERFORM PROCESS-DEPARTMENT
        UNTIL NO-MORE-DATA
        OR IN-DIVISION NOT = CURRENT-DIVISION
        OR IN-STORE NOT = CURRENT-STORE.
    PERFORM PRINT-STORE-TOTALS.
    .
    .
    .
PROCESS-DEPARTMENT.
    PERFORM DEPARTMENT-INITIALIZATION.
    MOVE IN-DEPT TO CURRENT-DEPT.
    PERFORM PROCESS-DETAIL
        UNTIL NO-MORE-DATA
        OR IN-DIVISION NOT = CURRENT-DIVISION
        OR IN-STORE NOT = CURRENT-STORE
        OR IN-DEPT NOT = CURRENT-DEPT.
    PERFORM PRINT-DEPARTMENT-TOTALS.
    .
    .
    .
PROCESS-DETAIL.
    .
    .
    PERFORM READ-INPUT-FILE.
Several advantages come from this revised structure. (1)
Performing things at the beginning of a "group" (e.g.
skipping to a new page) is just as simple as performing things
at the end of a "group" (e.g. printing group totals).
In fact, zeroing out totals is a natural thing to do in
the group initialization routine, rather than something to remember
both in the global program initialization routine and
after each total has been printed out. (2) "Rolling totals"
upward is a fairly easy thing to do. The "last time"
break at the end of the file no longer requires additional code.
(3) The complexity introduced by testing for control breaks with
IF statements has been eliminated. It is interesting to note
that the "mainline" logic that accumulates grand totals
is almost always coded using this logic structure, even when the
sub-totals are done using a different structure.
RULE 12 - Use the PERFORM UNTIL structure instead of the historical
control break structure.
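The nested PERFORM UNTIL structure can be exercised on a small scale. This is a sketch only, cut down to two levels (division and store) to stay short. The input is a hypothetical sorted table held in memory, with a two-character division, a two-character store, and a three-digit amount per record; the sample values, the READ-INPUT paragraph, and the total field names are assumptions made for illustration.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CTLBREAK.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical sorted input held in memory: division (2),
      * store (2), amount (3), so no input file is needed.
       01  INPUT-DATA.
           05  FILLER          PIC X(7) VALUE 'D1S1010'.
           05  FILLER          PIC X(7) VALUE 'D1S1020'.
           05  FILLER          PIC X(7) VALUE 'D1S2005'.
           05  FILLER          PIC X(7) VALUE 'D2S1100'.
       01  INPUT-TABLE REDEFINES INPUT-DATA.
           05  INPUT-ENTRY     PIC X(7) OCCURS 4 TIMES.
       01  INPUT-IX            PIC 9(2) VALUE 0.
       01  INPUT-RECORD.
           05  IN-DIVISION     PIC X(2).
           05  IN-STORE        PIC X(2).
           05  IN-AMOUNT       PIC 9(3).
       01  END-OF-DATA-SWITCH  PIC X(3) VALUE 'NO'.
           88  NO-MORE-DATA             VALUE 'YES'.
       01  CURRENT-DIVISION    PIC X(2).
       01  CURRENT-STORE       PIC X(2).
       01  STORE-TOTAL         PIC 9(5).
       01  DIVISION-TOTAL      PIC 9(5).
       01  COMPANY-TOTAL       PIC 9(5) VALUE 0.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM READ-INPUT.
           PERFORM PROCESS-DIVISION
               UNTIL NO-MORE-DATA.
           DISPLAY 'COMPANY TOTAL ' COMPANY-TOTAL.
           STOP RUN.
       PROCESS-DIVISION.
      * Group initialization: zero the total, remember the key.
           MOVE 0 TO DIVISION-TOTAL.
           MOVE IN-DIVISION TO CURRENT-DIVISION.
           PERFORM PROCESS-STORE
               UNTIL NO-MORE-DATA
                  OR IN-DIVISION NOT = CURRENT-DIVISION.
      * Group termination: print the total, then roll it upward.
           DISPLAY 'DIVISION ' CURRENT-DIVISION ' TOTAL '
               DIVISION-TOTAL.
           ADD DIVISION-TOTAL TO COMPANY-TOTAL.
       PROCESS-STORE.
           MOVE 0 TO STORE-TOTAL.
           MOVE IN-STORE TO CURRENT-STORE.
           PERFORM PROCESS-DETAIL
               UNTIL NO-MORE-DATA
                  OR IN-DIVISION NOT = CURRENT-DIVISION
                  OR IN-STORE NOT = CURRENT-STORE.
           DISPLAY 'STORE ' CURRENT-STORE ' TOTAL ' STORE-TOTAL.
           ADD STORE-TOTAL TO DIVISION-TOTAL.
       PROCESS-DETAIL.
           ADD IN-AMOUNT TO STORE-TOTAL.
           PERFORM READ-INPUT.
       READ-INPUT.
      * Running off the end of the table stands in for AT END.
           ADD 1 TO INPUT-IX.
           IF INPUT-IX > 4
               MOVE 'YES' TO END-OF-DATA-SWITCH
           ELSE
               MOVE INPUT-ENTRY (INPUT-IX) TO INPUT-RECORD.
```

For the sample data the store totals come to 30 and 5 in division D1 and 100 in division D2, giving division totals of 35 and 100 and a company total of 135. No IF statement tests for a break, and the last store and division totals are printed without any special end-of-file code.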
The previous example also illustrates another common coding structure - the so-called "triform program" [4]. This is a simple extrapolation of the fact that most programs
process repeating groups of data. Therefore, it is almost obvious
that the matching program logic to process this type of data is:
(1) Do whatever needs to be done at the beginning of a group.
(2) Process each element of the group in a loop structure.
(3) Do whatever needs to be done at the end of a group.
Therefore, we can expect most programs to have a main logic
paragraph that looks like:
    PERFORM xxxxx-INITIALIZATION.
    PERFORM xxxxx-PROCESS
        UNTIL terminating-condition.
    PERFORM xxxxx-TERMINATION.
In fact, the multi-level control program illustrated earlier
is just several of these structures nested within one another.
There are, however, two points of contention about this structure.
(1) If the xxxxx-INITIALIZATION or xxxxx-TERMINATION procedures
are sufficiently small, it is sometimes suggested that they be
coded "in-line", that is, simply coded in place
of the PERFORM noted. (2) The words INITIALIZATION and TERMINATION
may be objectionable to some. The author does not object to
in-line coding or to other names. Apart from these two minor
points, if most of our programs were structured this way,
they could be more easily understood by others.
RULE 13 - Use the "triform" structure as the main
control structure in your program.
The preceding examples have shown ways that the main logic
flow of a program can be done without the usual IF statements
that are often found in a program. However, there are other instances
where IF statements are necessary. Unfortunately, there is also
a lot of misconception about the use of the IF statement.
Back before the advent of structured code, when much code
that was being produced was so-called "spaghetti code",
programmers were still somewhat concerned about having their code
be understandable. Since their primary control structure was
the IF statement, there was a certain amount of effort made to
control its use. In particular, people tried to avoid nested
IF statements since they were difficult to read and understand
(especially when coded without the indentation rules of today).
However, there are many instances of "nested" IF
statements where what is being implemented is not so much of a
complex structure, but a variation of the CASE structure of structured
programming. For example:
    IF MARITAL-STATUS-IS-SINGLE
        PERFORM SINGLE-ROUTINE
    ELSE IF MARITAL-STATUS-IS-DIVORCED
        PERFORM DIVORCED-ROUTINE
    ELSE IF MARITAL-STATUS-IS-WIDOWED
        PERFORM WIDOWED-ROUTINE
    ELSE
        PERFORM BAD-MARITAL-STATUS-ROUTINE.
This is called a nested IF and, under modern indentation rules,
is often coded as:
    IF MARITAL-STATUS-IS-SINGLE
        PERFORM SINGLE-ROUTINE
    ELSE
        IF MARITAL-STATUS-IS-DIVORCED
            PERFORM DIVORCED-ROUTINE
        ELSE
            IF MARITAL-STATUS-IS-WIDOWED
                PERFORM WIDOWED-ROUTINE
            ELSE
                PERFORM BAD-MARITAL-STATUS-ROUTINE.
This second example fails to show the fact that each branch
of the IF is mutually exclusive and exhaustive, just as the structured
CASE statement denotes. In fact, as the number of possible paths
grows (for example, several different transaction codes), this
latter structure gets out of control even if the indentation level
is reduced.
The failure is to recognize that there is a difference between
the typical "nested IF" statement, and the use of multiple
IF statements to implement a CASE structure. In the CASE structure,
it is almost as if there were an ELSEIF verb, as the IF for each
path always follows the ELSE for the path above it. When coded
properly, there is no additional indentation needed, and the code
is easy to understand. Some call these two forms "linear"
and "non-linear" nesting. [In COBOL-85 the EVALUATE
verb can be used to implement the CASE structure instead.]
RULE 14 - Do not indent when coding a "linear" nested
IF to implement a CASE structure. Code the ELSE IF on the same
line as if it were a single verb.
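For compilers that support COBOL-85, the same CASE structure might be written with EVALUATE along the following lines. This is a minimal sketch: the one-character status codes S, D, and W and the DISPLAY-only routines are hypothetical and serve only to make the example complete.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CASEDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * The status codes S, D and W are hypothetical values.
       01  MARITAL-STATUS      PIC X VALUE 'D'.
           88  MARITAL-STATUS-IS-SINGLE    VALUE 'S'.
           88  MARITAL-STATUS-IS-DIVORCED  VALUE 'D'.
           88  MARITAL-STATUS-IS-WIDOWED   VALUE 'W'.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
      * Exactly one WHEN branch runs; WHEN OTHER catches the rest.
           EVALUATE TRUE
               WHEN MARITAL-STATUS-IS-SINGLE
                   PERFORM SINGLE-ROUTINE
               WHEN MARITAL-STATUS-IS-DIVORCED
                   PERFORM DIVORCED-ROUTINE
               WHEN MARITAL-STATUS-IS-WIDOWED
                   PERFORM WIDOWED-ROUTINE
               WHEN OTHER
                   PERFORM BAD-MARITAL-STATUS-ROUTINE
           END-EVALUATE.
           STOP RUN.
       SINGLE-ROUTINE.
           DISPLAY 'SINGLE'.
       DIVORCED-ROUTINE.
           DISPLAY 'DIVORCED'.
       WIDOWED-ROUTINE.
           DISPLAY 'WIDOWED'.
       BAD-MARITAL-STATUS-ROUTINE.
           DISPLAY 'BAD MARITAL STATUS'.
```

Like the linear ELSE IF form, every branch sits at the same indentation level, which makes the mutually exclusive and exhaustive nature of the paths visible at a glance.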
Another construct that often confuses the issue when using
nested IF statements is the NEXT SENTENCE structure. It is
often used only when absolutely necessary in order to get a series
of IF and ELSE pairs to match up properly. However, it can
have two other roles: (1) to force the developer to consider what
really belongs on the other path of the IF statement, and (2)
to show a consistency of approach to the maintenance programmer
when he or she is examining the code. Especially with nested
IF statements, being always able to depend on having an ELSE NEXT
SENTENCE properly aligned with the IF statement, even if not strictly
necessary, makes the code easier to read, i.e. the level of indentation
taken by the various nested IF statements is always matched by
"outdentation" as the matching ELSE statements are coded.
This can have a very definite effect on reducing the difficulty
in understanding nested IF statements.
RULE 15 - When coding nested IF statements, always code both
the true and false paths, using the NEXT SENTENCE or ELSE NEXT
SENTENCE construct as necessary.
There is one other coding construct where the IF verb is used
incorrectly. When there is the possibility of a routine being
performed zero times, some have historically used the following
code construct:
    IF NOT condition THEN
        PERFORM routine UNTIL condition.
An example of this is:
    IF NOT END-OF-FILE THEN
        PERFORM MAIN-PROCESS UNTIL END-OF-FILE.
This construct is not only overly complex, but is also less
efficient. A better construct is:
    PERFORM MAIN-PROCESS UNTIL END-OF-FILE.
In the first example, when the END-OF-FILE condition is true,
the IF statement is evaluated once and the remainder of the sentence
is skipped. Since the COBOL PERFORM verb is a "test before"
construct, in the second example the condition is evaluated once
and the PERFORM statement terminates without any iterations.
Thus there is no difference in efficiency between the two examples
when the condition is already true.
However, in the first example, when the END-OF-FILE condition
is false, the IF statement is evaluated, the need to iterate is
recognized, and the loop is performed the appropriate number of
times, with an evaluation of the condition preceding each iteration.
Thus when the loop is iterated "n" times, there are
"n+1" tests of the loop control condition in addition
to the single IF condition. In the second example, the initial
IF condition is missing and the loop condition is simply evaluated
"n+1" times. Thus the second example is more efficient
for the machine, in addition to being more efficient for the programmer
to write.
However, the first example was often used historically even
in structured programming because programmers had a distrust and
misunderstanding of the PERFORM statement (since they previously
used the GO TO construct). However, a proper understanding of
the "test first" in the PERFORM statement shows that
the simpler construct is the better one.
RULE 16 - Do not code an IF statement when the terminating
condition of a PERFORM loop will also include the condition.
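The "test before" behavior is easy to confirm. In this small sketch the end-of-file switch is deliberately initialized to 'YES' to simulate an empty file; the ITERATION-COUNT field and the program name are hypothetical and exist only to make the result visible.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. ZEROLOOP.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * The switch starts at 'YES' to simulate an empty file.
       01  END-OF-FILE-SWITCH  PIC X(3) VALUE 'YES'.
           88  END-OF-FILE              VALUE 'YES'.
       01  ITERATION-COUNT     PIC 9(3) VALUE 0.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
      * No guarding IF: the UNTIL test is made before the first
      * iteration, so MAIN-PROCESS is performed zero times here.
           PERFORM MAIN-PROCESS UNTIL END-OF-FILE.
           DISPLAY 'ITERATIONS: ' ITERATION-COUNT.
           STOP RUN.
       MAIN-PROCESS.
           ADD 1 TO ITERATION-COUNT.
           MOVE 'YES' TO END-OF-FILE-SWITCH.
```

The count displayed is zero. Changing the initial switch value to 'NO' gives a count of one, since MAIN-PROCESS then runs once and sets the switch itself.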
In all the above examples, all the PERFORM statements have
been simple ones, i.e. no PERFORM ... THRU constructs have been
shown. However, many people still continue to use the PERFORM
... THRU construct and to recommend it to others. In fact, the
programming standards at some installations may require its use.
This also has some historical basis.
Before structured programming, most programmers avoided the
PERFORM statement altogether because it was not "efficient".
When they converted to using structured code and the PERFORM
statement came into vogue, there was still a lot of misunderstanding
about how to code certain common logic constructs using structured
code. For example, there was much concern about the need for the
so-called "early exit". The following COBOL construct
was fairly common:
    PERFORM MAIN-ROUTINE THRU MAIN-EXIT
        UNTIL END-OF-FILE.
    .
    .
MAIN-ROUTINE.
    READ file AT END
        MOVE 'YES' TO END-OF-FILE-SWITCH.
    IF END-OF-FILE
        GO TO MAIN-EXIT.
    .
    .
    .
MAIN-EXIT.
    EXIT.
The need to terminate the routine because some other condition
was reached necessitated an "early exit" from the routine.
This construct seemed to crop up a lot when a programmer who
was used to non-structured coding converted to structured coding.
Therefore it was recommended that the PERFORM ... THRU construct
always be used since it allowed early exits both when needed in
situations as above and when needed during program maintenance
later on. However, as we have come to better understand how to
use the structured coding constructs, it has also been shown that
such "early exits" are not necessary. The above construct
could have been coded:
    READ file AT END
        MOVE 'YES' TO END-OF-FILE-SWITCH.
    PERFORM MAIN-ROUTINE
        UNTIL END-OF-FILE.
    .
    .
MAIN-ROUTINE.
    .
    .
    READ file AT END
        MOVE 'YES' TO END-OF-FILE-SWITCH.
It is usually easy to develop a modified routine for each
such "early exit" example. Thus the PERFORM ... THRU
construct is not really needed. The avoidance of this construct
gives several other benefits. (1) The programmer never forgets
to code the THRU construct, thus avoiding inadvertent fall-thru
logic which leads to difficult-to-debug situations. (2) The GO
TO statement is not needed and it is then not a temptation to
use in uncontrolled ways other than the "early exit"
situation. (3) The programmer needs to do less coding, thus making
for shorter and more understandable programs. (4) There is no
longer a temptation to code statements in the "exit"
paragraph other than the EXIT verb.
One argument for the PERFORM ... THRU construct is that the
EXIT paragraph is a positive delineation that the end of the routine
has been reached. However, the beginning of a new paragraph is
equally such a delineation of the end of the previous routine.
RULE 17 - Do not use the PERFORM ... THRU construct.
The original purpose of the FILE SECTION was to describe all
the data that was contained in files, while the WORKING-STORAGE
SECTION was to describe "work" variables that did not
exist in the files. This was also an efficient way to describe
the data when computer memory was limited. But with the increasing
size of computers, there are many advantages to describing all
data in the WORKING-STORAGE SECTION, and to only using the FILE
SECTION to describe the actual files. (These are enumerated in
the next section which examines COBOL textbooks.) Thus an FD
entry might be:
    FD  file-name.
    01  record-name PIC X(...).
The actual layout of the record would be contained in WORKING-STORAGE.
Since we are then always referencing variables in WORKING-STORAGE,
there should no longer be a need for any form of the READ or WRITE
verb except READ ... INTO and WRITE ... FROM.
RULE 18 - Always use the READ ... INTO and WRITE ... FROM forms
of these verbs and define all record definitions in the WORKING-STORAGE
SECTION.
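A complete, if tiny, program in this style might look like the sketch below. It writes one record and reads it back so that it needs no existing input. The file name CUSTFILE, the 25-byte record, and the customer fields are all hypothetical, and the way the ASSIGN name is mapped to a physical file depends on the compiler and operating system.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. RDINTO.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
      * CUSTFILE is a hypothetical name; its mapping to a real
      * file depends on the compiler and operating system.
           SELECT CUSTOMER-FILE ASSIGN TO 'CUSTFILE'
               ORGANIZATION IS SEQUENTIAL.
       DATA DIVISION.
       FILE SECTION.
      * The FD only states the record size, as Rule 18 suggests.
       FD  CUSTOMER-FILE.
       01  CUSTOMER-FILE-RECORD    PIC X(25).
       WORKING-STORAGE SECTION.
      * The actual layout lives in WORKING-STORAGE.
       01  CUSTOMER-LAYOUT.
           05  CUSTOMER-NUMBER     PIC 9(5).
           05  CUSTOMER-NAME       PIC X(20).
       01  END-OF-FILE-SWITCH      PIC X(3) VALUE 'NO'.
           88  END-OF-FILE                  VALUE 'YES'.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           OPEN OUTPUT CUSTOMER-FILE.
           MOVE 12345 TO CUSTOMER-NUMBER.
           MOVE 'FIRST CUSTOMER' TO CUSTOMER-NAME.
           WRITE CUSTOMER-FILE-RECORD FROM CUSTOMER-LAYOUT.
           CLOSE CUSTOMER-FILE.
           OPEN INPUT CUSTOMER-FILE.
           PERFORM READ-CUSTOMER.
           PERFORM SHOW-CUSTOMER
               UNTIL END-OF-FILE.
           CLOSE CUSTOMER-FILE.
           STOP RUN.
       READ-CUSTOMER.
           READ CUSTOMER-FILE INTO CUSTOMER-LAYOUT
               AT END MOVE 'YES' TO END-OF-FILE-SWITCH.
       SHOW-CUSTOMER.
           DISPLAY CUSTOMER-NUMBER ' ' CUSTOMER-NAME.
           PERFORM READ-CUSTOMER.
```

The PROCEDURE DIVISION never refers to CUSTOMER-FILE-RECORD except as the target of WRITE ... FROM; every field reference goes to the WORKING-STORAGE layout.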
There are many advantages to using program "controls":
(1) for verifying that the program is working properly, (2) as
an indication of the volume of data that the program is processing,
and (3) to verify that the proper files are being passed from
one program to another. Some installation standards specify
that all programs must maintain control totals. This is
generally a good standard.
However, there is sometimes a question of what the program
should be controlling. Often the answer is simply to count
the records being read and the records being written. This will
often satisfy all three criteria mentioned above. This is often
implemented in many simpler programs by counting all the records
being read, counting all the records being written, printing these
two counts on a "control report" and noting that these
two counts are equal.
With the use of proper structured coding constructs and the
development of logically simpler programs, many programs will
need little verification of working properly. Also, the author
has observed many instances where the error indicated in the "control
report" was not a problem of the program not working properly,
but the counters not being incremented properly. If the counting
can go wrong as often as the program, perhaps another approach
is needed.
We need to ask ourselves, just why am I counting, and what
should I be counting? A solution that may meet several needs
is to count the records as they are processed, rather than count
them as they are read. Look at the following two examples:
example 1:
    PERFORM READ-FILE.
    PERFORM MAIN-ROUTINE
        UNTIL END-OF-FILE.
    .
    .
READ-FILE.
    READ file AT END
        MOVE 'YES' TO END-OF-FILE-SWITCH.
    IF NOT END-OF-FILE
        ADD 1 TO RECORDS-READ.
    .
    .
MAIN-ROUTINE.
    .
    .
    PERFORM READ-FILE.
example 2:
    PERFORM READ-FILE.
    PERFORM MAIN-ROUTINE
        UNTIL END-OF-FILE.
    .
READ-FILE.
    READ file AT END
        MOVE 'YES' TO END-OF-FILE-SWITCH.
    .
    .
MAIN-ROUTINE.
    ADD 1 TO RECORDS-PROCESSED.
    .
    .
    PERFORM READ-FILE.
Both of these examples will end up counting the same number
of records. However, the second example eliminates the problem
of forgetting the "IF NOT END-OF-FILE" when counting
the reads. (The author has seen examples of initializing the
read-counter to -1 or of subtracting 1 during the termination
routine to compensate for the AT END.) Also, since most of the
other counting of various paths taken, sub-totals, etc., is in
the MAIN-ROUTINE, it may make more sense to "count the records"
there as well.
RULE 19 - Consider counting "records processed" instead
of "records read". At least ask yourself, "why
am I counting?" and "what am I counting?".
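To see that the two counts agree, both counters can be kept in one program. This is a sketch only: the three-record input is a hypothetical in-memory table rather than a real file, and running off the end of the table stands in for the AT END clause.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. COUNTREC.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Hypothetical three-record input held in memory.
       01  INPUT-DATA          PIC X(6) VALUE 'AABBCC'.
       01  INPUT-TABLE REDEFINES INPUT-DATA.
           05  INPUT-ENTRY     PIC X(2) OCCURS 3 TIMES.
       01  INPUT-IX            PIC 9(2) VALUE 0.
       01  INPUT-RECORD        PIC X(2).
       01  END-OF-FILE-SWITCH  PIC X(3) VALUE 'NO'.
           88  END-OF-FILE              VALUE 'YES'.
       01  RECORDS-READ        PIC 9(5) VALUE 0.
       01  RECORDS-PROCESSED   PIC 9(5) VALUE 0.
       PROCEDURE DIVISION.
       MAIN-LOGIC.
           PERFORM READ-FILE.
           PERFORM MAIN-ROUTINE
               UNTIL END-OF-FILE.
           DISPLAY 'RECORDS READ:      ' RECORDS-READ.
           DISPLAY 'RECORDS PROCESSED: ' RECORDS-PROCESSED.
           STOP RUN.
       READ-FILE.
           ADD 1 TO INPUT-IX.
           IF INPUT-IX > 3
               MOVE 'YES' TO END-OF-FILE-SWITCH
           ELSE
               MOVE INPUT-ENTRY (INPUT-IX) TO INPUT-RECORD.
      * Example 1 style: the guard keeps AT END from being counted.
           IF NOT END-OF-FILE
               ADD 1 TO RECORDS-READ.
       MAIN-ROUTINE.
      * Example 2 style: one count per record actually processed.
           ADD 1 TO RECORDS-PROCESSED.
           PERFORM READ-FILE.
```

Both counters finish at 3. Removing the IF NOT END-OF-FILE guard makes RECORDS-READ come out one too high, while RECORDS-PROCESSED needs no guard at all.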
Finally, there are situations where a program could easily
do more than one thing, or where the ability to do special things
during testing to aid the testing process would be beneficial.
This has often been done in the past by (1) getting the program
to work for one situation, then copying it and creating a "near-clone"
to handle the second similar situation, or (2) inserting extra
code during testing, then removing it [hopefully!] before the
final compile of the "production" program.
However, both of these situations could be handled by including
the additional code and making it "conditional". This
conditional code would be turned on through the use of a run-time
parameter (either with a "PARM=" on the EXEC card, or
through some other method). This would eliminate the following
problems: (1) having to make a change to the program and also
having to make the same change to the "near-clone" program,
(2) forgetting to eliminate the "test" code before making
the "production" version, and (3) having to re-insert
the "test" code when needing to test revisions to the
program at a later time. (The author has also used this technique
to allow bypassing all the actual file modification code, thus
allowing testing of update programs against the live production
files without actually updating them). A small amount of foresight
when designing a program can make a large amount of difference
later on.
RULE 20 - Consider adding "PARM" overrides to allow
for easy program testing, eliminating the need for "near-clones",
etc.
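One way such an override might be coded is sketched below. It assumes the z/OS convention in which the PARM text from the EXEC statement reaches the main program as a binary halfword length followed by the characters. The PARM value TEST, the RUN-MODE-SWITCH field, and the UPDATE-MASTER-FILE paragraph are hypothetical, and other environments pass run-time parameters differently.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PARMDEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * Production behavior is the default when no PARM is given.
       01  RUN-MODE-SWITCH     PIC X(4) VALUE 'PROD'.
           88  TEST-RUN                 VALUE 'TEST'.
       LINKAGE SECTION.
      * Layout assumed here: a binary halfword length followed by
      * the PARM text, as passed to a main program on z/OS.
       01  PARM-AREA.
           05  PARM-LENGTH     PIC S9(4) COMP.
           05  PARM-TEXT       PIC X(100).
       PROCEDURE DIVISION USING PARM-AREA.
       MAIN-LOGIC.
           IF PARM-LENGTH > 3
               MOVE PARM-TEXT (1:4) TO RUN-MODE-SWITCH.
      * The conditional code stays in the production program.
           IF TEST-RUN
               DISPLAY 'TEST RUN: FILE UPDATES BYPASSED'
           ELSE
               PERFORM UPDATE-MASTER-FILE.
           STOP RUN.
       UPDATE-MASTER-FILE.
           DISPLAY 'UPDATING MASTER FILE'.
```

Run with PARM='TEST' on the EXEC statement, the program would take the bypass path; with no PARM, or any other value, it would follow the normal production path.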
1 - Michael Jackson. Principles of Program Design.
2 - E.W. Dijkstra. A Discipline of Programming.
3 - Barry Dwyer. One More Time - How to Update a Master File.
4 - Dan W. Crockett. Triform Programs.