Legacy Data Conversions

With Simple Tools

 

 Matthew Rapaport

Sep. 2002

 

So you’ve bought Oracle to replace a current database that supports some strategic software at your company. You now face a legacy conversion challenge. Since your company is new to Oracle, you may not have Oracle expertise in-house. Naturally you turn to Oracle for advice, and they are happy to be of service. They will recommend a tool called Oracle Migration Workbench (OMWB). OMWB is a powerful (and expensive) package that will intelligently convert data structures, data, constraints, and even application logic from a source database application to Oracle. Often, however, the challenge is not to migrate the old application to Oracle, but rather to start up a new application and migrate only the data to the new database.

 

Several factors complicate data transfer in such projects:

 

  1. Data formats, types, or representations are different in the new database and application.
  2. Business rules are different in the new application.
  3. The new application logic is unrelated to the old. It addresses the same business need, but in a different way.
  4. The new application pushes out to tables or other decision structures what the old application contained in code.
  5. The data in the old application database is dirty.

 

Do not underestimate the impact of these factors on a migration. OMWB is only marginally useful in these conversions! Where it leaves off, much custom work must be done. The custom effort needed can consume as much as 90% of the project’s time. For the sake of the 10% of the process it can support, the cost of OMWB becomes prohibitive. The alternative, in such cases, is to plan and complete a data migration project of your own. Today there are generic tools available in the most common computing environments that can help get the job done in little time and without the costs associated with large software packages like OMWB.

 

Types of legacy conversions

 

Conversions of legacy application data to new Oracle databases typically fall into three types. The first is a mass data migration: a one-time conversion of thousands or millions of records, usually conducted at the close of some business period, over a weekend or another window during which changes to the source data cease, or come as close to ceasing as business conditions permit. Transience characterizes type one migrations: you migrate only once in production. If the company is of any size and the source database is large or has been in use for many years, there are likely to be many records to convert, so performance is typically an issue with type one migrations.

 

Second are partial conversions, in which portions of the older application remain in use for a time and are shut down gradually. For example, a new strategic application has a front end that runs in corporate branches while its back end, at corporate HQ, consolidates data. A business object (loan, order, etc.) passes from front end to back end upon reaching a particular status code. Management decides to mass convert (a type one migration) all data for the last year, and then start only the back end of the new application at corporate HQ. Each branch continues to enter data through the old application. Front-end data migrates automatically to the new back end when it crosses a preset status. As each branch converts to the new front end, the volume of data migrated slowly diminishes, and the need for conversion ends when the last branch changes software.

 

These type two migrations are also temporary, though they are not one-time events; they may run for months. Type one migrations can be tested repeatedly, so the range of exceptional data is known. Not so with type two migrations, which can face novel errors days, weeks, or months after they start. Another characteristic of type two migrations is that data must sometimes revert to the source database. Statuses can change in both directions. Sometimes an order, claim, loan, or payment that has progressed from the front end (old application) to the back end (new application) must be undone and readdressed in the old system before migrating forward again.

 

Third are what might be called permanent migrations. In this case the old application is not shut down; it passes data to the new application indefinitely. Such conversions are more akin to data warehouse applications than to straight data conversions, though massive quantities of data are often involved. Because of the permanence, both source and target data structures will evolve over time and must always stay coordinated. Active directories of both databases’ structures, embedded in a large software tool that manages data transfer through direct manipulation of the directory objects, make the most sense in these situations.

 

Project vs. application

 

There are smaller-scale software packages, such as DataJunction, designed to support data migrations. Such products migrate simple and small databases from sources to targets. They are cost-effective if the source-to-target map is simple (the two sides need not be identical) and the data conversions are simple transformations. While helpful for small migrations, they are not usually robust enough to handle large projects involving dozens or hundreds of tables and millions of records.

 

The approach taken here applies to large projects, where the simple “does it all” tool will not do at all. In its place, a managed project uses the native languages of the source and target databases, coupled with a powerful filter-building and process-control language, to construct a data bridge between the old and new databases. Integration of the two databases is supported by structure descriptions kept in a spreadsheet, which the intermediate language then uses to generate supporting data structures and migration code. The whole is managed like any other project, scaled to the size and complexity of the databases and applications involved.

 

Since Oracle is the target database, its native languages and utilities include Java, PL/SQL, SQL*Plus, and SQL*Loader. On the source side, the language used will vary with the database; it need only dump data to flat files of a configurable format while extracting (via some business or programming rules) the right data. The intermediate language can be any powerful scripting language with the following characteristics:

 

  1. Rapid development without compile and link cycles.
  2. Very strong string handling, data typing and structuring capabilities.
  3. Links (via ODBC) to Oracle (not always required).
  4. Rich control structures. Ability to map large-scale data formats (for example XML, X12) to logic structures.
  5. Strong ties to the underlying OS for file I/O, multi-threading, and other high- and low-level OS calls.

 

Perl, Tcl, and a few other languages satisfy all of these requirements. Of these, Perl seems to have the broadest support in the Unix and Windows environments. Java is a marvelous language for many kinds of applications associated with Oracle, but its performance in raw record reads and writes per second does not match that of PL/SQL, SQL*Plus, or SQL*Loader. Because migrations are so often performance driven, we recommend the faster data manipulation tools. Java could be used in place of the scripting languages, but its development time is greater and modification more cumbersome. Because a migration is a one-time or short-duration job, the high development productivity of a robust scripting language fits the application best.
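
As a minimal sketch of the kind of filter Perl makes easy (the file names, field layout, date formats, and status codes below are invented for illustration), the following script reads a pipe-delimited source extract, reformats dates, maps legacy status codes, and writes a load-ready file:

    #!/usr/bin/perl -w
    # filter_orders.pl -- hypothetical sketch: turn a pipe-delimited source
    # extract into a pipe-delimited file ready for SQL*Loader.
    use strict;

    open(IN,  "<", "orders_extract.dat") or die "open input: $!";
    open(OUT, ">", "orders_load.dat")    or die "open output: $!";

    my %status_map = ( O => 'OPEN', C => 'CLOSED', X => 'CANCELLED' );

    while (my $line = <IN>) {
        chomp $line;
        my ($order_no, $cust_id, $order_date, $status) = split /\|/, $line;

        # Source dates arrive as MM/DD/YY; the target expects YYYYMMDD.
        my ($mm, $dd, $yy) = split m{/}, $order_date;
        $order_date = sprintf "%04d%02d%02d",
                              ($yy < 50 ? 2000 + $yy : 1900 + $yy), $mm, $dd;

        # Map legacy status codes; flag anything unknown for human review.
        my $new_status = $status_map{$status};
        unless (defined $new_status) {
            warn "order $order_no: unknown status '$status'\n";
            $new_status = 'REVIEW';
        }

        print OUT join('|', $order_no, $cust_id, $order_date, $new_status), "\n";
    }
    close IN;
    close OUT;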

 

OMWB compared to the generic tools approach

 

Table 1 compares how OMWB and the generic tools support the various parts of a migration project.

 

We can see from Table 1 that the usefulness of OMWB depends on the migration requirements. OMWB delivers the most bang for the buck if:

 

  1. The source DB and application are readable by OMWB (plug-ins are available). Many of the most important databases are supported, including DB2, Access, and MySQL, and Oracle is always adding more. Still others are supported through Oracle partners who specialize in various sources.
  2. The source application is not too old, so there is little dirty data.
  3. The goal is to reproduce both the database and the application in the target environment.

 

The less the target DB and application are like the source, the less OMWB can do. Besides the tool itself, a company must often contract (from Oracle) the expertise needed to run it. OMWB justifies its cost when the target is mostly a reproduction, in Oracle, of the source. Its results are then dramatic, even if a little human work is required to smooth its raw output. If the target application and database are structurally and functionally different from the source, OMWB will not dramatically shorten the project, and its cost is not justified. Source-to-target mapping and transformation analysis can consume 50% of the total project time (but not resources)! OMWB cannot automate this work if the target is different from the source. At some point, manually extending OMWB-produced migration code takes more time than creating the migration code in-house. SQL*Loader, SQL*Plus, and PL/SQL not only perform well, they develop rapidly.

 

OMWB can support both type one and type two migrations, though type two requires more manual effort even where the database and application are duplicated. It can be advantageous to have the source application build the record collection (business object) to be migrated. Unless the source and target structures are copies of one another, OMWB cannot automatically deduce transaction boundary information for the target. For the sake of performance, such boundaries are often ignored in type one migrations (you have to account for them eventually, of course). New transaction associations cannot be ignored when records migrate in collections making up an order, claim, loan, assembly, or other business-specific object. Finally, OMWB has no way of accounting for the all-too-common type two demand that objects must sometimes revert to the source system! Reversion logic is a separate issue with or without OMWB. With OMWB, it’s a separate project. With generic tools, it is a small addition to the overall scope of the migration.

 

 

Addressing migration requirements with project management

 

Figure 1 shows the overall workflow in a migration. Thick arrows represent data flow. The workflow may be similar for both type one and type two migrations, though the flat files extracted from the source database for type two migrations will represent an object (e.g., an order), not just individual source tables. If the database to be moved is very large and performance demands are high, the SQL*Loader intermediate step (Fig. 1a) will help wring the maximum performance from the later SQL*Plus and PL/SQL step. In type two migrations, direct file I/O often replaces the SQL*Loader step, as in Fig. 1b. If the overall transaction rate is low enough, or data can trickle into the new system, then Perl can take direct responsibility for loading Oracle (Fig. 1c). Perl can control everything from the initial flat file to the final target load, and possibly also (depending on the source system) the initial data extraction.

 

Fig. 1. Overall workflow in a migration project.
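
As a sketch of the Fig. 1c path, assuming Perl’s DBI module with an Oracle (or ODBC) driver, and a hypothetical ORDERS table, connect string, and flat-file layout, Perl can read the load file and insert directly into Oracle:

    #!/usr/bin/perl -w
    # trickle_load.pl -- hypothetical sketch of the Fig. 1c path: Perl reads
    # the flat file and inserts directly into Oracle through DBI.
    use strict;
    use DBI;    # with DBD::Oracle, or DBD::ODBC and an ODBC data source

    my $dbh = DBI->connect('dbi:Oracle:NEWDB', 'migr_user', 'migr_pw',
                           { AutoCommit => 0, RaiseError => 1 });

    my $sth = $dbh->prepare(q{
        INSERT INTO orders (order_no, cust_id, order_date, status)
        VALUES (?, ?, TO_DATE(?, 'YYYYMMDD'), ?)
    });

    open(IN, "<", "orders_load.dat") or die "open input: $!";
    my $count = 0;
    while (my $line = <IN>) {
        chomp $line;
        $sth->execute(split /\|/, $line);
        $dbh->commit if ++$count % 1000 == 0;   # commit every 1000 rows
    }
    close IN;

    $dbh->commit;      # pick up the final partial batch
    $dbh->disconnect;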

 

Figure 2 shows the generation of DDL (Data Definition Language) code, SQL*Loader code, and SQL*Plus or PL/SQL import procedures in the migration project. Perl can automatically produce DDL from simple text descriptions of the source tables. Note that these are intermediate, temporary tables that look like the source objects. We use them because Oracle loads them very fast and because SQL*Plus cannot take data from flat files. Creating PL/SQL is more complex with intermediate tables, and easier when data comes from flat files organized around business objects, as they tend to be in type two migrations. PL/SQL is necessary where SQL*Plus cannot be used, usually because of transaction boundaries or transformation decisions dependent on previously loaded tables. These will typically be the last tables loaded in the final stages of a type one migration.

 

Fig. 2. Generating DDL and import code.
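
As an illustration of the generation step (the description format, file names, and type rules here are invented), a short Perl script can read a tab-delimited column description exported from the mapping spreadsheet and emit both the CREATE TABLE DDL for the intermediate table and a matching SQL*Loader control file:

    #!/usr/bin/perl -w
    # gen_ddl.pl -- hypothetical sketch: read a tab-delimited column
    # description (name, datatype, length per line) and emit CREATE TABLE
    # DDL plus a SQL*Loader control file for the intermediate table.
    use strict;

    my $table = shift or die "usage: gen_ddl.pl TABLE_NAME < description.txt\n";
    my (@ddl_cols, @ctl_cols);

    while (<STDIN>) {
        chomp;
        my ($name, $type, $len) = split /\t/;
        $type = uc $type;
        if ($type eq 'VARCHAR2' or $type eq 'CHAR') {
            push @ddl_cols, "  $name $type($len)";
            push @ctl_cols, "  $name CHAR($len)";
        } elsif ($type eq 'NUMBER') {
            push @ddl_cols, "  $name NUMBER($len)";
            push @ctl_cols, "  $name INTEGER EXTERNAL";
        } else {                          # pass dates, etc. through as-is
            push @ddl_cols, "  $name $type";
            push @ctl_cols, "  $name $type";
        }
    }

    open(DDL, ">", "$table.sql") or die "write DDL: $!";
    print DDL "CREATE TABLE $table (\n", join(",\n", @ddl_cols), "\n);\n";
    close DDL;

    open(CTL, ">", "$table.ctl") or die "write control file: $!";
    print CTL "LOAD DATA\nINFILE '$table.dat'\nINTO TABLE $table\n",
              "FIELDS TERMINATED BY '|'\n(\n", join(",\n", @ctl_cols), "\n)\n";
    close CTL;

Each intermediate table then needs only its description file; a change to the mapping spreadsheet regenerates both the DDL and the control file in seconds.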

 

In the migration jobs I’ve done, the PL/SQL procedures needed fall into between four and eight categories. The categories emerge from mapping data from source to target. It is necessary only to create a skeleton PL/SQL procedure for each category. The skeleton consists of blocks (BEGIN, EXCEPTION, END statements) with associated stubs for file I/O or cursors, and a standardized set of exception statements suitable for that block type. Perl can read in the skeletons and the data map and produce a compile-ready PL/SQL procedure. SQL*Plus scripts, where appropriate, are even easier. Perl’s pattern recognition, plus its ability to substitute values for variables directly in source text, makes it an ideal language for this task.
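
A hedged sketch of that substitution step (the placeholder tokens and map format are invented for illustration): Perl reads a skeleton containing tokens such as %TABLE% and %COLUMN_LIST%, replaces each token with its value from the data map, and writes a compile-ready procedure.

    #!/usr/bin/perl -w
    # gen_plsql.pl -- hypothetical sketch: expand a PL/SQL skeleton into a
    # compile-ready procedure by substituting tokens from a simple data map.
    use strict;

    die "usage: gen_plsql.pl skeleton map output\n" unless @ARGV == 3;
    my ($skeleton_file, $map_file, $out_file) = @ARGV;

    # The map is assumed to hold one "TOKEN=value" pair per line, e.g.
    #   TABLE=ORDERS
    #   COLUMN_LIST=ORDER_NO, CUST_ID, ORDER_DATE, STATUS
    my %map;
    open(MAP, "<", $map_file) or die "open map: $!";
    while (<MAP>) {
        chomp;
        my ($token, $value) = split /=/, $_, 2;
        $map{$token} = $value;
    }
    close MAP;

    open(SKEL, "<", $skeleton_file) or die "open skeleton: $!";
    open(OUT,  ">", $out_file)      or die "open output: $!";
    while (my $line = <SKEL>) {
        # Replace every %TOKEN% that appears in the map; leave others alone.
        $line =~ s/%(\w+)%/exists $map{$1} ? $map{$1} : "%$1%"/ge;
        print OUT $line;
    }
    close SKEL;
    close OUT;

SQL*Plus scripts can be generated the same way, from simpler skeletons.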

 

The figure does not show the data reversion requirement that is often part of type two migrations. Strictly speaking, it is usually not necessary to migrate data backwards at all. Instead, the process involves the following steps (a sketch follows the list):

 

  1. Deleting (or marking deleted) the reverting target record(s). Often, flagging the parent master record is enough.
  2. Signaling the record reversion to the source database application.
  3. Causing the source database application to alter the status of the correct records based on the signal from the target.
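
A minimal sketch of steps 1 and 2, assuming a hypothetical ORDERS master table, a MIGR_STATUS flag column, and a REVERSION_SIGNALS table from which the source application (or an export job) picks up the signal:

    #!/usr/bin/perl -w
    # revert_order.pl -- hypothetical sketch of steps 1 and 2: flag the
    # target master record as reverted and record a signal row carrying the
    # old (source) key for the source application to act on.
    use strict;
    use DBI;

    my $old_key = shift or die "usage: revert_order.pl OLD_ORDER_KEY\n";

    my $dbh = DBI->connect('dbi:Oracle:NEWDB', 'migr_user', 'migr_pw',
                           { AutoCommit => 0, RaiseError => 1 });

    # Step 1: flagging the parent master record is usually enough.
    $dbh->do(q{UPDATE orders SET migr_status = 'REVERTED'
               WHERE legacy_order_key = ?}, undef, $old_key);

    # Step 2: signal the reversion, preserving the old key so the source
    # application can reset the status of the correct records (step 3).
    $dbh->do(q{INSERT INTO reversion_signals (legacy_order_key, signalled_on)
               VALUES (?, SYSDATE)}, undef, $old_key);

    $dbh->commit;
    $dbh->disconnect;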

 

This usually isn’t hard to do, but the application design must make provision for retaining any source-to-target key alterations (new keys). The source database is informed using its old key, while the new key may or may not be reused when the records migrate forward again. Accounting for the key change history adds a little more complexity to the normal forward migration; it is best added to the project’s scope early. While not usually complicated, this is something OMWB cannot automate.

 

Type three migrations add a new set of needs stemming from their permanence. These migrations can be much like type one or type two migrations, depending on their specific requirements, but because they are permanent they bear the added burden of being dynamic, in the sense that the structures of both the source and target databases will evolve over time. That is what makes type three migrations more like data warehouse applications. The elaborate database of structures, code, and maps maintained by OMWB becomes obsolete the moment the target application begins to change. In a type three migration, both the target and the source application change over time, and the database that holds the connection information must control the transfer of data dynamically, not through regenerated static code. This is the proper province of a data warehouse tool, not a migration tool.

 

Summing up

 

Large-scale data migrations between unlike database structures need not be painful if approached with a proper respect for the complications caused by large volumes of older data. The various steps and scripts developed to support the effort need not be individually complex. Keeping each step in the process simple depends on:

 

  1. A thorough analysis. Someone must understand the data!
  2. Performing data transformations at the right step with the proper tool for the job.
  3. Keeping the code simple, suitable to a “one-shot” job.
  4. Addressing exceptions simply. Making it convenient for humans to intervene in the process.
  5. Providing enough test time to measure performance and prove the process.

 

Large tools will continue to evolve and better support a wider variety of conversion requirements. Yet some jobs will always remain that are better done step by step, so that dirty data and structural errors in the source system are not simply copied to the target. In these cases, a migration project can be completed successfully with a minimum of effort by choosing the right tools with which to build the bridge from the source to the target database. Beyond this, allowing enough time to properly understand the data, and enough time to test the resulting software, will guarantee a quality result.