Start Where the Truth Is: the Director Log
Every DataStage job failure has a story in the Director log. Read it top-down and resist fixing the last red line first — the first fatal entry is usually the root cause; everything after is fallout.
- Fatal — what actually stopped the job. Start here.
- Warning — often benign (type coercion, nulls), but a flood of them hides the one that matters. Suppress known-safe warnings so the real ones stand out.
- Info — row counts and timings per stage, your bottleneck map.
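The triage rule above can be sketched in a few lines. This is a minimal illustration that assumes the log has already been exported and parsed into records; `LogEntry` and its fields are hypothetical names for this sketch, not a DataStage API or export format.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence


class LogEntry(NamedTuple):
    """Hypothetical parsed log record (not a DataStage structure)."""
    seq: int        # position in the log, oldest first
    severity: str   # "Info", "Warning", or "Fatal"
    message: str


def first_fatal(entries: Sequence[LogEntry]) -> Optional[LogEntry]:
    """Return the earliest Fatal entry, or None if the job logged none."""
    for entry in sorted(entries, key=lambda e: e.seq):
        if entry.severity == "Fatal":
            return entry
    return None


def warning_flood(entries: Sequence[LogEntry], top: int = 3):
    """Return the most frequent warning messages with their counts."""
    warnings = Counter(e.message for e in entries if e.severity == "Warning")
    return warnings.most_common(top)
```

Calling `first_fatal` on the parsed log points at the likely root cause, while `warning_flood` lists the repeated warnings that are candidates for suppression once they are confirmed safe.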
Find the Slow Stage
In a parallel job the slowest stage gates the whole pipeline. The per-stage rows/second in the log (or the performance analysis monitor) tells you where time goes. Common culprits:
- A database stage writing row by row: set the array size or bulk-load options instead of issuing single-row inserts.
- A Sort or Aggregator spilling to scratch disk — raise the stage's memory or reduce data earlier in the flow.
- A Transformer with heavy per-row functions — push logic into the source SQL where possible.
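As a rough sketch of reading the bottleneck map, the snippet below ranks stages by rows per second. It assumes the row counts and elapsed times have been copied out of the Info entries by hand; the stage names and the input shape are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_by_throughput(stats: Dict[str, Tuple[int, float]]) -> List[Tuple[str, float]]:
    """Rank stages from slowest to fastest by rows per second.

    `stats` maps a stage name to (rows processed, elapsed seconds).
    """
    rated = []
    for name, (rows, seconds) in stats.items():
        # A stage with no measurable elapsed time cannot be the bottleneck.
        rate = rows / seconds if seconds > 0 else float("inf")
        rated.append((name, rate))
    return sorted(rated, key=lambda pair: pair[1])


# Hypothetical figures for a three-stage job.
ranking = rank_by_throughput({
    "src_orders": (1_000_000, 50.0),
    "xfm_enrich": (1_000_000, 400.0),
    "tgt_warehouse": (1_000_000, 125.0),
})
slowest_stage, slowest_rate = ranking[0]
```

The first element of the ranking is the stage to investigate first, here the hypothetical `xfm_enrich`.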
Partitioning Skew: The Parallel-Job Killer
DataStage parallelism only helps if rows spread evenly across partitions. Hash-partition on a low-cardinality key (or one where a few values dominate) and one node gets most of the data while the others sit idle, so the job runs at close to single-node speed no matter how many nodes are configured. The fixes follow the same instinct as choosing a database distribution key:
- Hash-partition on a high-cardinality key.
- Keep the partitioning consistent across stages that join or aggregate, to avoid a silent repartition and re-sort.
- Use Round Robin only where order and grouping do not matter.
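The effect of key choice can be simulated outside DataStage. The sketch below is an illustration only: it uses SHA-256 as a stand-in hash (not the engine's actual partitioner), and the data, key names, and four-node layout are assumptions made up for the example.

```python
import hashlib
from typing import Iterable, List


def partition_counts(keys: Iterable[str], nodes: int) -> List[int]:
    """Count rows per partition when rows are hash-partitioned on `keys`."""
    counts = [0] * nodes
    for key in keys:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        counts[int.from_bytes(digest[:8], "big") % nodes] += 1
    return counts


def skew_ratio(counts: List[int]) -> float:
    """Largest partition divided by the mean partition size (1.0 is even)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return max(counts) / (total / len(counts))


# Hypothetical data: 10,000 rows, 90% of them sharing one country code.
country_keys = ["US"] * 9000 + ["CA", "MX", "BR", "DE"] * 250
customer_keys = [str(i) for i in range(10000)]

low = skew_ratio(partition_counts(country_keys, nodes=4))
high = skew_ratio(partition_counts(customer_keys, nodes=4))
print(f"country key skew: {low:.2f}, customer key skew: {high:.2f}")
```

A ratio near 1.0 means every node does a similar amount of work. A ratio near the node count means one node is doing almost everything, which is the single-node-speed case described above.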
Don't Let Rejects Disappear
A job that "succeeds" while silently dropping 5% of rows to an unmonitored reject link is worse than one that fails loudly. Always capture rejects and count them:
- Wire a reject link off database and Transformer stages.
- Land rejects in a table or file with the error code.
- Fail the job (or alert) if the reject count crosses a threshold.
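A threshold check of this kind could run as a post-job step. The following is a minimal sketch, assuming the read and reject counts are already available to the script; the function name, the exception, and the 1% default are illustrative choices, not DataStage settings.

```python
class RejectThresholdExceeded(Exception):
    """Raised when the share of rejected rows is above the allowed rate."""


def check_rejects(rows_read: int, rows_rejected: int,
                  max_reject_rate: float = 0.01) -> float:
    """Return the reject rate, raising if it exceeds `max_reject_rate`."""
    if rows_read < 0 or rows_rejected < 0:
        raise ValueError("row counts must be non-negative")
    # An empty input has nothing to reject, so treat its rate as zero.
    rate = rows_rejected / rows_read if rows_read else 0.0
    if rate > max_reject_rate:
        raise RejectThresholdExceeded(
            f"{rows_rejected} of {rows_read} rows rejected "
            f"({rate:.2%} > {max_reject_rate:.2%})"
        )
    return rate
```

Raising an exception lets the calling wrapper exit non-zero, so the scheduler sees a failure instead of a quiet success.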
The Database Underneath the ETL
Half of "slow DataStage job" tickets are really a slow source or target database — a missing index on the lookup table, a target table bloated and starved of vacuum. When the warehouse is PostgreSQL, PG Monitoring shows you the exact query the ETL stage is running, its plan, and whether it regressed — turning "the job is slow" into "this lookup lost its index last Tuesday."