Introduction
Most
batch processes for business applications today are still implemented
in procedural languages. The typical approach is to process one record (or
one set of related records) at a time until all the input data is
exhausted. This record-oriented processing method has been with us
since the dawn of data processing. It has served us well for many
years and is still the only processing method used by many
programmers. However, it began to reveal its weaknesses with the arrival of
large-scale relational databases.
Business
data in a relational database is stored in tables. Data from these
tables is retrieved by SQL statements and stored in variables within
the program. Sometimes data from dozens of tables is required to
perform a single business operation. In that situation, assembling the
input data is itself a major effort, and the coding becomes more
tedious as the number of input tables increases. It is not unusual for a
programmer to spend more time assembling the input data than
implementing the business logic.
To
make matters worse, not all the data can be retrieved at the
beginning of the program, because some of the retrievals are
conditional on the outcomes of computations performed at a later
stage. Debugging also becomes more difficult as the
number of IF-THEN-ELSE statements grows.
Experienced
programmers have noticed that the number of variables for their input
data can be substantially reduced if they do more screening and checking
in the SQL SELECT statements before passing the data to the host
program. In practice this means retrieving input data with a more complex
SQL statement that joins several input tables. In this way SQL
plays a more important role, because part of the processing logic
is actually handled by its WHERE clause. It is true that such an
approach can cause performance problems when the programmer doesn't
have the knowledge to tune the SQL statement properly, or when the
existing index structure can't efficiently support the operations
performed by such a complex SELECT.
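As a hypothetical sketch using the LEDGER and ACCOUNT_TYPE tables defined in the example below, one such SELECT might screen and aggregate the expense rows before any procedural code ever sees them:

/* screen and aggregate in SQL instead of in the host program */
select a.account, a.department, sum(a.amount)
from account_type b, ledger a
where a.account = b.account and
      b.account_type = 'EXPENSE' and
      a.ledger_type = 'ACTUAL' and
      a.year = 2010
group by a.account, a.department;

Here the WHERE clause carries the screening logic that would otherwise be a chain of IF-THEN-ELSE tests in the host program.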
The work table approach
We
like the idea of letting SQL do more in the process. In fact, we
want to let SQL do everything. Our approach is to create one or more
work tables to house all the required data, including the data
retrieved from the database and the temporary data required to carry out the
computation. At the beginning of the process the work tables are
populated by a series of Insert and Update SQL statements that draw data
from the database. These are relatively simple SQL statements because
each of them only copies data from one or two database tables. The
operation is done entirely within the SQL environment; no procedural language
commands are involved. At the end of this step all the required data
is available in the work tables.
The
next step is to continue the computation on the work tables to produce
all the data for the output. This is also done with Update or Insert SQL
statements.
When all the output data
is available, additional Insert or Update statements are used to
update the database. If necessary, a procedural language can take over
at this stage to produce a report from the work table.
In
this approach we are not trying to reduce the number of SQL
statements. On the contrary, we will have more SQL statements in the
program. But each SQL statement is responsible for only a simple
task, and its result is visible in the work table. To debug such a
program you simply execute these statements one by one and inspect the
content of the work table at each step.
An example
The following example is for illustration purposes only.
Suppose
budgets for a given year have been determined manually for each
department and the amounts have to be distributed to all the expense
accounts within the same department proportional to last year's actual
expenses.
Assuming input budget data
for each department has been created in a table BUDGET_INPUT, a program
is required to insert into the LEDGER table the budget figures for each
expense account within these departments. The work table for this
program will be called BUDGET_WORK_FILE.
Tables and columns used in this process:
1. BUDGET_INPUT table - the table that contains the budget figure for each department.
BUDGET_INPUT (year, department, record_status, department_budget)
2. LEDGER table - the database table for the account balance by year, department, and ledger_type (Actual or Budget)
LEDGER (account, department, year, ledger_type, amount)
3. ACCOUNT_TYPE table
ACCOUNT_TYPE (account, account_type)
4. BUDGET_WORK_FILE table - Temporary work table created for this process
BUDGET_WORK_FILE (account, department, budget_amount, last_year_actual_amt, last_year_department_total, process_step)
**** The Program ****
/* Statement 1: To delete the old data in the work file */
delete from budget_work_file;
/* Statement 2: To copy to work table the actual expense from last year by department and account */
insert into budget_work_file
(account, department, last_year_actual_amt, process_step)
select a.account, a.department, sum(a.amount), 'LAST_YR_ACTUAL'
from account_type b, ledger a
where a.account = b.account and
b.account_type = 'EXPENSE' and
a.ledger_type = 'ACTUAL' and
a.year = 2011 - 1
group by a.account, a.department;
/* Remark: We use 2011 - 1 instead of 2010 because the year entry can be replaced by a bind variable later. */
/* Statement 3: To calculate last year's total expense by department */
update budget_work_file a
set last_year_department_total
= (select sum(last_year_actual_amt)
from budget_work_file b
where b.department = a.department and
b.process_step = 'LAST_YR_ACTUAL')
where a.process_step = 'LAST_YR_ACTUAL';
/* Statement 4: To distribute the budget figures to each department based on last year's expense */
insert into budget_work_file
(account, department, budget_amount, process_step)
select a.account, a.department,
b.department_budget * a.last_year_actual_amt / a.last_year_department_total,
'BUDGET_OUTPUT'
from budget_input b, budget_work_file a
where b.department = a.department and
a.process_step = 'LAST_YR_ACTUAL' and
b.record_status = 'APPROVED' and
b.year = 2011 and
a.last_year_department_total <> 0;
/* Remark: We assume that there is only one row in budget_input per department for the same year */
/* Statement 5: To copy the budget by account and department to the LEDGER table */
insert into LEDGER
(account, department, year, ledger_type, amount)
select account, department, 2011, 'BUDGET', budget_amount
from budget_work_file
where process_step = 'BUDGET_OUTPUT';
Please
note that statements 4 and 5 can be combined to insert the budget figures
into LEDGER directly from the two source tables in statement 4. But
keeping an audit trail of the output data in the work table is not a
bad thing.
The main difference
between this approach and the procedural approach is that each INSERT or
UPDATE operation generates or updates many rows in a single pass,
whereas a procedural program in general can only process one row at a time.
The
procedural language can still have a minor role in this approach. For
example, these SQL statements can be embedded in a procedural language
program so that the year entry (2011) is replaced by a bind variable
whose value comes from the input parameters.
We
have to emphasize that the example is constructed solely for
demonstration. It does not take into account all the details. For example,
there could be new accounts set up in 2011 that were not in the 2010
ledger.
The Control Column
You
may have already noticed that the column PROCESS_STEP plays an
important role in identifying the set of data to be processed. This is what
we call the Control Column. The control column should have a distinct
value for each Insert statement. It should also be reassigned a new
value in any important Update statement, to identify those rows that have
actually participated in the update.
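As a hypothetical variation on statement 3 above, the same UPDATE could also stamp the rows it touches with a new control value; later statements would then select on 'DEPT_TOTAL_DONE' instead of 'LAST_YR_ACTUAL':

/* Statement 3, revised: update the total and reassign the control column */
update budget_work_file a
set a.last_year_department_total
      = (select sum(b.last_year_actual_amt)
         from budget_work_file b
         where b.department = a.department and
               b.process_step = 'LAST_YR_ACTUAL'),
    a.process_step = 'DEPT_TOTAL_DONE'
where a.process_step = 'LAST_YR_ACTUAL';

Inspecting the control column afterwards tells you exactly which rows the update reached.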
The
work table should have sufficient columns to support all the
computations. Since it is used only for the process itself, we are
free to do anything necessary to achieve maximum efficiency,
including building suitable indexes to support our
operations. For example, statement 3 needs an index on the DEPARTMENT
column. We may even break the rules of normalization and keep
some redundant data if that makes the program run faster.
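For instance, a composite index (the name is our own) would let the correlated subquery in statement 3 find its department's rows directly:

/* index to support the correlated subquery in statement 3 */
create index budget_work_file_idx1
on budget_work_file (department, process_step);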
Some
programmers may prefer to delete all the rows of the work table at the
end of the process; others would rather keep the work table data as an audit trail until the next run.
Concurrent processing
Some
applications may have to allow two or more sessions of the same program
to run at the same time. In that case additional effort is required
to structure the work table. Normally the business requirement gives
us hints for avoiding potential contention. For example, if the
process updates the LEDGER with a batch of journal entries
identified by a batch number, then two different sessions can work
simultaneously as long as all the operations in the program are applied
only to work table data carrying the session's own batch number.
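As a sketch (the JOURNAL_WORK_FILE and JOURNAL_ENTRY tables here are hypothetical), every statement in such a program carries the session's batch number, so two sessions never touch each other's rows:

/* clear and load only this session's slice of the work table */
delete from journal_work_file
where batch_number = :batch_no;

insert into journal_work_file
(batch_number, account, amount, process_step)
select :batch_no, account, amount, 'JOURNAL_INPUT'
from journal_entry
where batch_number = :batch_no;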
If
you cannot find a key item such as the batch number to "partition"
your work table, there is always a way to generate a unique number (e.g.
a session counter) for your session to serve that purpose. How to
support concurrent runs of a batch process is an issue no matter
what approach you take, but in this approach the issue is more
noticeable.
In more recent
versions of the Oracle DBMS you can create global temporary tables whose contents are accessible only within your own session. These can be a good vehicle
for your work tables.
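For example, the work table from the budget example could be declared as a session-private temporary table (the column types here are our assumptions); ON COMMIT PRESERVE ROWS keeps the data for the life of the session rather than the transaction:

create global temporary table budget_work_file
( account                    varchar2(20),
  department                 varchar2(10),
  budget_amount              number,
  last_year_actual_amt       number,
  last_year_department_total number,
  process_step               varchar2(20)
) on commit preserve rows;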
For a
complicated batch process you may need more than one work table. Suppose
we are writing a payroll program that calculates the incomes, benefits,
deductions, taxes, and net pay for each employee. We may need a
"pointer" work table to identify all the employees included
in this run, and another work table to house the
related data for their incomes, benefits, and deductions. If salary
data such as hourly rates is not stored in a format suitable for our
SQL operations, we may need yet another work table to store the
hourly rates in exactly the format we want. The execution time to
populate such a temporary hourly-rate table is negligible in comparison
with the benefit it provides.
When to Commit
The
next issue we want to discuss is when to apply the database
commit. In the example above we update the database table LEDGER once, at
the end of the process. If any internal or external condition
causes the program to terminate before that step, we can always
restart from the beginning, because no permanent database table would have been updated. Abnormal termination of this process therefore cannot
cause problems for the database.
However,
if the program updates two permanent tables, then we
should try not to commit the database transaction between the two
update statements. Otherwise, if the system terminates abnormally
between these two steps, the integrity of the database will be violated.
If you can, place all the permanent-table update statements at
the end of the program with no Commit between them.
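In the budget example, if we also had to flag the processed BUDGET_INPUT rows (a hypothetical second permanent-table update; the 'POSTED' status is our own), both changes would sit together at the very end under a single commit:

/* both permanent-table changes at the end, one commit */
insert into LEDGER
(account, department, year, ledger_type, amount)
select account, department, 2011, 'BUDGET', budget_amount
from budget_work_file
where process_step = 'BUDGET_OUTPUT';

update budget_input
set record_status = 'POSTED'
where year = 2011 and record_status = 'APPROVED';

commit;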
We
understand that there are situations where we need to commit database
transactions between two database updates. In that case, sufficient data
should be captured in the work table to make these updates reversible.
When
to place your database commit is an issue in the other
approaches as well. But it is more important in the work table approach, because
each Update or Insert statement can change a lot of data in the
table. If possible, we should keep enough data to make the entire
process reversible.
Beyond batch processes
In
this article we use the term Batch Process casually. In fact, the work
table approach is not restricted to batch processes in the technical
sense. In my previous job, an on-line program that gathered data from seven
tables in an Asset Management System to produce transactions for mass
transfers of assets was successfully implemented with this work table
approach.
The strength
In
this approach the overhead of understanding the program is reduced to an
absolute minimum. Everything you need to know about a statement can
normally be displayed on the same page of your screen; there is no need
to go elsewhere in the program to cross-reference. The highly
independent nature of each SQL statement and its compact format not only
simplify the debugging process but also make program modification a
much easier task.
A
friend of mine said that the predicate logic in the WHERE clause of a SQL
statement is one step closer to human thinking than
If-Then-Else logic. I came to agree completely after implementing a large
batch program in a project costing system. I remember my user gave me
the following list of instructions:
1. Identify all the capital projects
2. Copy the total outstanding amounts in the transaction file for these projects by account, department,
and cost type (labor or material)
Remark: There is no account code in the transaction file for labor cost type.
3. For each project and department prorate the total labor cost by material accounts.
4. .......
I
was amazed to find that, after setting up the work table, I was
able to translate her instructions into SQL statements one by one.
The weakness
The
work table approach cannot handle any recursive formula or recursive
process directly. A recursive formula is normally implemented with iteration
commands in a procedural language. However, we can still use the work
table method inside the procedural loop if that is beneficial.
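As a sketch of that hybrid, assuming a hypothetical WORK_TABLE with ID, PARENT_ID, and PROCESS_STEP columns, a PL/SQL loop can repeat a set-based pass until no rows qualify, rolling a hierarchy up one level per iteration:

begin
  loop
    /* one set-based pass: complete the rows whose children are all done */
    update work_table w
    set w.process_step = 'DONE'
    where w.process_step = 'PENDING'
      and not exists
          (select 1 from work_table c
           where c.parent_id = w.id and
                 c.process_step = 'PENDING');
    exit when sql%rowcount = 0;  /* stop when a pass changes nothing */
  end loop;
end;
/

The loop itself is procedural, but each pass inside it still operates on a whole set of work table rows at once.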
We
have also noticed that the work table approach demands above-average
skill in SQL programming. Sometimes knowledge beyond core SQL
comes in handy for sophisticated problems; for example, we often use
functions like DECODE, TO_CHAR, TO_DATE, and ADD_MONTHS in Oracle SQL.
Programmers who are not well versed in SQL may not
be able to take full advantage of this approach.
Conclusion
What
we are talking about here is not a new approach. A lot of programs have been
written in a similar fashion, and sometimes the term Set Processing is used
for this kind of program. But I haven't seen any systematic discussion
of work tables. In fact, most Set Processing programs perform
direct Updates and Inserts on the permanent tables; without
work tables and control columns, the method suffers a severe
handicap.
I have been using the work
table approach for many years, and it has reduced my development
and maintenance costs substantially. I will not go back to the procedural
approach.