ETL job flows tuning

Job flows

Most of  data warehouses include some sort of ETL jobs to refresh its contents. Typically those jobs can be divided into two categories: jobs that need to be executed in a specific order and jobs that may run in parallel independently from each other. There are also other constraints involved that make the job flows more complex.

Developing a job flow is an iterative process. First we add just a few jobs to the flow. We run it, then we add a few more jobs, and so on. The job flow keeps growing. When all the jobs are ready, the flow usually is far from being optimal. The tuning is needed.  Unless you followed a good design from the very beginning, the flow will have to be restructured in order to optimize its execution time and load on the machine it runs at.

Tune only the jobs that matter

If your job flow has jobs that execute in parallel, then only the jobs that make the critical path matter to the overall flow duration should be tuned. However if your flow is completely serial, then duration of each job matters. In the example below we have a flow with 2 parallel threads. Jobs C and D make the critical path, so tuning of the job C may give the most benefit.

Job flow gantt diagram

Job flow gantt diagram

However tuning of the job C might not be that easy as job D. The job C might have already been tuned and further tuning might be tougher. For example the simple change in the job C took 1 hr of the tuning effort resulting in 30% improvement. But you know that additional improvement of 10% will cost you 3 more hrs of the tuning effort. Thus, taking into account tuning effort is also necessary to maximize the “tuning gain/tuning effort” ratio. I am proposing the following approach:

  1. Collect execution duration statistics for all the the jobs in the flow
  2. Draw job durations and their dependencies in a form a Gantt diagram
  3. Find a critical path (sequence of jobs making the longest execution path)
  4. Review the longest running jobs and assess how difficult it is (how long it will take) to get them tuned
  5. Create a ranking by “tuning gain/tuning effort” ratio
  6. Tune the job with the best “tuning gain/tuning effort” ratio
  7. Rerun the flow, assess the results and recheck the critical path
  8. Repeat the above steps until you reach your tuning goal

Remove unnecessary dependencies

If the flow has all its jobs running in a sequence, then very likely some dependencies are redundant, ie. the flow could take shorter only if the unnecessary dependencies are removed.

Serial Jobs

Serial jobs

If we analyse carefully the logical dependencies between jobs, ie. logical dependency between jobs X and Y exists only when the job X requires results produced by the job X. We may realize that some jobs dependencies may be removed, because the jobs marked as dependent do not require anything from their direct predecessors. In our example jobs C, D, E, F depend only on the results of the job B. Those unnecessary dependencies can be removed.

Parallel jobs

Parallel jobs

In case you have large number of jobs that can run in parallel, consider limiting them to avoid system overload.

Typically your jobs will fall into one of the following categories:

  • jobs that depend on completion of one or more other jobs (jobs dependencies)
  • jobs that can run independently (free execution)
  • sometimes: jobs waiting for a specific time to start (time dependency)
  • sometimes: jobs waiting for a resource to be available (external event dependency)
  • sometimes: jobs using an exclusive access resource

When restructuring your job flow it is good to define only necessary dependencies and leave non-constrained jobs running in parallel. On the resource management side of your scheduling system set the maximum number of jobs that can run in parallel. Avoid limiting parallel jobs by using dependencies. This approach will allow your job flow to achieve best performance while fitting into logical (dependencies) and processing capacity (number of parallel jobs) constraints.

Set priorities for parallel jobs

Jobs, that can run independently, should be scheduled using the rule “the longer a job runs, the earlier it should start”. It rarely happens that a job always takes exactly same time to complete. However looking at historical timings it is possible to measure jobs average run time. The longest running jobs should start earlier, because when they are run in parallel, shorter jobs easier fit into remaining time slots. This is better explained at the picture below.

In our example we have 6 jobs:

  • 3 jobs running 1 minute
  • 2 jobs running 2 minutes
  • 1 job running 3 minutes

They are run in 2 parallel threads. We will consider 2 scenarios:

  • Scenario A: shorter job starts sooner than longer ones
  • Scenario B: longer job starts sooner than shorter ones
Job priorities

Job priorities

As you can see, the scenario A took 6 minutes – The Thread 2 was idle for last 2 minutes. While the scenario B was having nicely balanced load. Its both threads finished at the same time and the whole flow took only 5 minutes.

Avoid overloading of your machine

Another factor is capacity of your machine. If you run too many jobs at the same time, then you risk overloading it. You may run out of memory, you risk exhausting system’s I/O and your jobs will not run at full speed because the resources (I/O, memory, CPU) will be limited. It is even worse for the performance. It is a good practice to do some benchmarking to get knowledge where your system’s performance starts degrading  Having this knowledge you can plan your job load with a safe distance from the performance breakdown. You may also check with your capacity plan document about the percentage (in terms of CPU, memory, TEMP space) of the system you are allowed to use.

If you like or disagree with what I wrote, please leave a comment on this posting.

Leave a Reply

Your email address will not be published. Required fields are marked *