ETL

Data warehouses quite often need to query remote data over database link. These types of queries may involve significant volume of data. When a data warehouse is loaded incrementally, then usually a fraction of the remote table is needed. So, it is natural to add a WHERE clause to the query. However, what shall we do if with a query, for example like this one:

SELECT *
 FROM clients@otherdb a
 WHERE "ClientNo" = '5562541333';

is getting a very disappointing plan like below:

------------------------------------------------------------
| Id  | Operation        | Name    | Rows  | Inst   |IN-OUT|
------------------------------------------------------------
|   0 | SELECT STATEMENT |         |    20 |        |      |
|*  1 |  FILTER          |         |    20 |        |      |
|   2 |   REMOTE         | CLIENTS |       | OTHER~ | R->S |
------------------------------------------------------------

The whole remote table CLIENTS is pulled to the local database and only then the result is filtered. Of course, this is far from being optimal. There might be many reasons why Oracle optimizer decides not to send the WHERE clause over database link, especially when the remote database is not Oracle. One of the most common reasons is mismatch of data types.

In my example I have this WHERE clause:

"ClientNo" = '5562541333'

If Oracle has to match data types to execute the comparison, then the conversion has to be done on either of the sides. If the conversion occurs on the left side, i.e. CONVERSION_FUNCTION(“ClientNo”), then it will have to pull the whole table into the local database. We can easily prevent this from happening by explicitly converting the right side of the comparison to the data type of the left side, for example:

SELECT *
FROM clients@otherdb a
WHERE "ClientNo" = CAST('5562541333' AS NVARCHAR2(10));

Let’s see the new execution plan after applying this change

------------------------------------------------------------
| Id  | Operation        | Name    | Rows  | Inst   |IN-OUT|
------------------------------------------------------------
|   0 | SELECT STATEMENT |         |    20 |        |      |
|   1 |  REMOTE          | CLIENTS |    20 | OTHER~ | R->S |
------------------------------------------------------------

This plan looks much better. The WHERE clause is sent to the database link and executed remotely. The new performance will be much better.

I played around using different ways of providing the filtering to the remote query run on MS SQL Server and found that the WHERE clause was not sent when I used:

join – unfortunately DRIVING_SITE hint was constantly ignored
subquery with IN operator, even if the PRECOMPUTE_SUBQUERY hint changed subquery result into literals
PL/SQL function

and it worked fine when when I used:

literal with explicit conversion
IN operator with literals (explicitly converted)
bind variable (explicitly converted)

I will be glad to hear your experiences about querying remote database over database link.

Job flows

Most of data warehouses include some sort of ETL jobs to refresh its contents. Typically those jobs can be divided into two categories: jobs that need to be executed in a specific order and jobs that may run in parallel independently from each other. There are also other constraints involved that make the job flows more complex.

Developing a job flow is an iterative process. First we add just a few jobs to the flow. We run it, then we add a few more jobs, and so on. The job flow keeps growing. When all the jobs are ready, the flow usually is far from being optimal. The tuning is needed. Unless you followed a good design from the very beginning, the flow will have to be restructured in order to optimize its execution time and load on the machine it runs at. Continue reading →

Data Warehousing Snippets

Just another Oracle technology blog

WHERE clause over database link

ETL job flows tuning

Job flows