Friday, July 5, 2013

PowerCenter Best Practices

Introduction

Below you’ll find some simple rules to consider when building PowerCenter Workflows. The list definitely doesn’t cover everything - and probably never will - but it may be extended over time with further Best Practices, examples and explanations, and some of the guidelines below may be reviewed and altered in the future. Feel free to read, ask and comment. Keep in mind that the author holds no responsibility whatsoever :)


General


  1. Limit the number of rows as early in the mapping as possible (SQ, aggregators, joiners, filters) - the more you limit at the beginning of the data flow, the fewer rows all downstream transformations have to process.
  2. For generating IDs use IDENTITY columns or DB triggers (see the SQL sketch after this list). There are a number of reasons:
    1. fewer PowerCenter objects reduce development time and maintenance effort
    2. the IDs are PowerCenter-independent, so any other app used to populate the target table will not cause any issues
    3. when migrating the code between environments there is no need to worry about the current sequence value
  3. If for some reason you need the ID on the PowerCenter side, use a Sequence Generator rather than a Stored Procedure call, for performance reasons - the Sequence Generator is really fast, as it simply hands out numbers from IS memory. A Stored Procedure call requires connecting to the DB engine, invoking the Stored Procedure and sending the number back to the IS. The communication overhead is huge and may heavily impact performance; it also depends on the connection speed and distance.
    If you’re afraid of losing the generated sequence number (which might happen when exporting/importing XML without the Retain Sequence Generator Value option), store the last ID in the DB, use it as a start value added to the sequence, and update it with the last ID generated this way.
  4. There should be no unused ports in any transformation (except the Source Qualifier) - the Source Qualifier should match the Source Definition 1-to-1, but any other transformation should carry no unused ports. Why fetch and process data through half of the data flow and use extra memory if you don’t need it?
  5. Avoid implicit conversion, as it is slower and might lead to errors - there are a number of issues with implicit conversion (e.g. when you link an SQ string port to a decimal port in the following Expression):
    1. it’s simply not visible - anyone checking the mapping will not notice that any datatype conversion takes place
    2. the conversion might not behave as expected (e.g. rounding, trailing/leading spaces or zeros, etc.)
    3. there may be unhandled conversion errors
Therefore use the appropriate conversion functions (e.g. TO_DECIMAL) explicitly - see the sketch after this list.

  6. Remove all the 'Error (Transformation error)' messages from default values in Expression transformations - you will avoid getting unwanted error messages in the logs. The default value is not checked at session initialization.
  7. Consider adding an Expression transformation right after each SQ and right before each Target - pass-through expressions do not cause additional computation, and I couldn’t count how many times I had to make a “little change” before or after the whole data flow in an existing mapping. Having this dummy Expression transformation helps a lot. It’s also very useful when relinking all ports after some changes (using Autolink by name).
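
The sketch below illustrates point 2: letting the DB assign the surrogate key, so no PowerCenter object is involved. Table and column names are made up, and IDENTITY is SQL Server syntax - other engines use sequences or auto-increment columns instead:

    CREATE TABLE customer_dim (
        customer_key  INT IDENTITY(1,1) PRIMARY KEY,  -- ID generated by the DB, not by PowerCenter
        customer_name VARCHAR(100),
        load_date     DATETIME
    );

And for point 5, a minimal example of explicit conversion - instead of linking a string port straight to a decimal port, convert it in an Expression output port (port names are hypothetical):

    -- output port AMOUNT_DEC (decimal 18,2): explicit, visible conversion
    TO_DECIMAL(LTRIM(RTRIM(AMOUNT_STR)), 2)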


Aggregators:


  1. Use as soon as possible - it limits the number of rows processed by any transformation downstream.
  2. Use 'Sorted Input' if possible, but do NOT sort just for aggregation - this one is a bit tricky. An Aggregator with the “Sorted Input” property works faster and uses a much smaller cache, but aggregating data does not require sorting. Therefore, when dealing with unsorted data, an Aggregator without Sorted Input (large cache) will perform better than a Sorter (large cache) plus an Aggregator with Sorted Input (small cache). If you do add the Sorter, the cache does not disappear - it’s simply created for the Sorter instead. Adding a Sorter is a good idea only if you need the data sorted anyway.


Expression:


  1. Use local variables for common computation (especially if complex) - if you need to perform the same computation for many output ports (e.g. a datatype conversion on one of the input ports), you can define a variable port doing it once (e.g. TO_DECIMAL(input_port)) and reuse that port in the expressions of the output ports - see the sketch after this list. This is especially important when the common logic is complex, because:
    1. the performance gain is even greater
    2. it’s easier to maintain - there is one place to check for errors and to implement changes
  2. Avoid local variables if there is no code reuse opportunity - if an output port does some computation that is not shared with other output ports, put it in the output port expression. Do not add a variable port: Input -> Variable -> Output makes the expression invisible from the Mapping Designer (i.e. without opening the Expression transformation window).
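
A minimal sketch of point 1 (all port names and the VAT rate are hypothetical; “port = expression” is just shorthand for each port’s expression in the Expression transformation):

    -- variable port v_AMOUNT (decimal): the common conversion, done once per row
    v_AMOUNT       = TO_DECIMAL(LTRIM(RTRIM(AMOUNT_STR)), 2)

    -- output ports reuse the variable instead of repeating the conversion
    o_NET_AMOUNT   = v_AMOUNT
    o_GROSS_AMOUNT = v_AMOUNT * 1.23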


Filter / Router:


  1. Any complex expressions should be computed in an Expression transformation before the Filter / Router - this is especially important for the Router: calculate once, then compare, instead of calculating many times. Remember that for every input row all group expressions are evaluated (a Router can produce many outputs for one input row). It is a good idea to do the same for the Filter, as you can for example use variable ports in an Expression, which is not possible in a Filter. See the sketch after this list.
  2. Keep in mind that the Router evaluates every group expression against each input row.
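
A sketch of this idea (all names are made up): classify the row once in an Expression, then keep the Router group conditions trivial:

    -- Expression transformation, output port o_REGION_TYPE (string), computed once per row
    IIF(SUBSTR(POSTAL_CODE, 1, 2) = '00', 'DOMESTIC', 'FOREIGN')

    -- Router group filter conditions then only compare the precomputed flag:
    --   group DOMESTIC: o_REGION_TYPE = 'DOMESTIC'
    --   group FOREIGN:  o_REGION_TYPE = 'FOREIGN'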


Floats:


  1. When to use? Only for scientific data (e.g. physics), where perfect accuracy is not that important. Keep in mind that floating point numbers are not accurate! E.g. 0.1 is stored as ~0.0999999999999999991, which is very close to 0.1, but not equal. You can end up with 0 kept as -0.00000001 (a real-life example), which may cause a lot of trouble when selecting all rows greater than or equal to zero - see the example below.
    For financial, quantitative - in fact almost any - calculations, use decimals.
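
A quick way to see the problem for yourself (a sketch - the exact output formatting depends on the DB engine):

    -- float: 0.1 has no exact binary representation
    SELECT CAST(0.1 AS FLOAT) * 3;            -- may return 0.30000000000000004
    -- decimal: exact
    SELECT CAST(0.1 AS DECIMAL(10,2)) * 3;    -- returns 0.30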


Lookup:


  1. Should have only the ports that are actually used - the more ports in the Lookup, the more data is fetched, and the more time and memory it takes to build the cache.
  2. Limit the number of rows fetched using a filter - use the Lookup Source Filter property to limit the amount of data fetched and kept in memory, as in the example below.
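
For example, if a lookup on a hypothetical CUSTOMERS table only ever needs active customers, a Lookup Source Filter like the condition below keeps the inactive rows out of the cache entirely (column names are made up; the filter is a WHERE condition without the WHERE keyword):

    CUSTOMERS.STATUS = 'ACTIVE' AND CUSTOMERS.DELETED_FLAG = 'N'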


Parameter files:


  1. Use one parameter file per workflow - there are many ways to use parameters and parameter files. You can define general parameters for all sessions in a workflow, or have each session use its own parameter file. What is the best way? There is no simple answer, and you might need a different setup for some purposes. In general, however, it’s good to have one parameter file per workflow (see the sketch after this list): you won’t affect other workflows when changing the parameters, and the file will be quite simple (in most cases).
  2. Consider auto-generating the parameter files from values stored in a DB - workflows can start (or end) with a generic session creating the parameter file. This allows easy maintenance of all parameter files and parameters, checks for naming conventions, etc.
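
A minimal sketch of a one-workflow parameter file (the folder, workflow, session, connection and parameter names are all hypothetical; the [Folder.WF:workflow] header is the standard PowerCenter section format):

    [MyFolder.WF:wf_load_customers]
    $$LoadDate=2013-07-05
    $$SourceSystem=CRM

    [MyFolder.WF:wf_load_customers.ST:s_m_load_customers]
    $DBConnection_Source=CRM_DEV
    $DBConnection_Target=DWH_DEV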


Sorters:


  1. Avoid if possible - do not sort the data unless it is really needed.
  2. Use as soon as possible - if the sorting is needed, sort data early in the mapping and try to take advantage of it using Sorted Input for as many transformations downstream as possible.
  3. Use as long as possible - try to arrange the mapping in a way that all transformations sharing the same sort order come one after the other. Avoid re-sorting separately for every transformation.
  4. Take advantage of DB sorting if a covering index is available - if a covering index is available in the DB, sort the data using the Source Qualifier and use Sorted Input for the transformations downstream.
  5. Don’t:
    1. Sort just for an Aggregator - although the Aggregator works faster and uses (almost) no cache with the “Sorted Input” property checked, it is not worth sorting data just for aggregation. This merely moves the need for cache from the Aggregator to the Sorter and in fact slows down the overall performance, as the aggregation algorithm doesn’t require sorting.
    2. Sort for a Joiner (especially with a Sorter on each input!) - do not use two Sorters for the Detail and Master groups of a Joiner transformation. In such a case both pipelines require caching, while for unsorted data only the Master input is cached.


Source Qualifier:


  1. Connect only the necessary ports from the SQ - the Source Qualifier determines what really needs to be fetched from the source. While it should match the Source Definition 1:1 with all ports linked (to avoid future mistakes and misunderstandings), it reads only the data for the ports linked to the next transformation. This can be observed when generating the SQL statement (Properties -> Sql Query -> Generate SQL) - only the used ports are listed in the SQL statement. This is especially important for wide sources, as it limits the amount of data transferred.
  2. Avoid SQL override statements - an override entered in the Sql Query property replaces any and all other settings (i.e. port order, sorting, filters). It is also not visible at first glance and reduces code transparency, so it should be avoided. If you need to:
    1. sort the data - use the "Number Of Sorted Ports" property with the proper port order
    2. filter the data - use the "Source Filter" property
    3. join multiple homogeneous sources - use the "User Defined Join" property
SQL override should be used only when some complex statement is needed (e.g. DB-specific functions or a subquery). A sketch of the generated SQL follows this list.
The above is also applicable (in general) to the Lookup transformation.

  3. All ports between the Source Definition and the SQ should be connected - as mentioned above, it is good to have the SQ match the Source Definition, as it reduces the number of future mistakes.
  4. All the datatypes should match the Source Definition - in general implicit datatype conversion should be avoided, as it may lead to errors and unexpected behavior.
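
To illustrate points 1 and 2: for a hypothetical CUSTOMERS source where only three ports are linked downstream, with a Source Filter and "Number Of Sorted Ports" set to 1, the generated SQL would look roughly like this (a sketch - the exact formatting differs):

    SELECT CUSTOMERS.CUSTOMER_ID,
           CUSTOMERS.CUSTOMER_NAME,
           CUSTOMERS.STATUS
    FROM   CUSTOMERS
    WHERE  CUSTOMERS.STATUS = 'ACTIVE'   -- from the Source Filter property
    ORDER BY CUSTOMERS.CUSTOMER_ID       -- from Number Of Sorted Ports = 1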


Update Strategy:


  1. Don’t use DD_REJECT - don’t process data you don’t need. Unless you really need the rejected rows in the .bad file (in which case use "Forward Rejected rows"), use a Filter transformation instead to drop the unneeded rows, as in the sketch below.
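
A sketch of the difference (port names hypothetical). Instead of carrying rows all the way to the Update Strategy just to reject them:

    -- Update Strategy expression rejecting rows late:
    IIF(ISNULL(CUSTOMER_ID), DD_REJECT, DD_INSERT)

drop them early with a Filter transformation condition, and keep the Update Strategy trivial:

    -- Filter condition:
    NOT ISNULL(CUSTOMER_ID)
    -- Update Strategy expression:
    DD_INSERT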


Sorting: DB or Sorter transformation?

When considering performance, a case-by-case analysis is needed - there is no “best-for-all” approach. Database and network performance, the load on the database and on the Integration Service, the database setup and, finally, the hardware configuration of both must all be taken into account. There are however some general rules that can be applied:

  1. Use the DB if a covering index is available - try to take advantage of existing DB indexes. It happens quite often that the source tables are indexed and all that needs to be done is setting the appropriate order of the ports in the Source Qualifier and the “Number Of Sorted Ports” property. If not:
  2. Check if it’s possible to create a covering index - sometimes the DB index is not available, but can be created. The advantage is big enough that it’s always worth checking (see the sketch after this list). Of course there are scenarios where you can’t afford an index (e.g. lots of inserts/updates on a table with rare selects). Investigate if it is possible - if not:
  3. Use Sorter transformation - If no index on source is available and data needs to be sorted, use a Sorter transformation. Keep in mind the remarks discussed above.
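
As an example of point 2, a covering index for a hypothetical source read that selects CUSTOMER_ID, STATUS and AMOUNT sorted by CUSTOMER_ID (names are made up; check with your DBA before adding indexes to a busy OLTP table):

    -- the index covers both the sort key and all selected columns,
    -- so the DB can serve the sorted read from the index alone
    CREATE INDEX ix_customers_covering
        ON CUSTOMERS (CUSTOMER_ID, STATUS, AMOUNT);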
