Monday, October 21, 2013

XML Analyzer for Informatica PowerCenter

It is my pleasure to present to you the

XML Analyzer for Informatica PowerCenter


This is a tool created to fulfill the following purposes:
  • Create Source-to-Target dependency analysis
  • Create Technical Solution Documentation
  • Perform automated code validation
All in an automated fashion and with (almost) a single click. The tool is now available in a few flavours:
  • Online tool - available without installation, anywhere you need it.
  • Server version - you can have your own instance available on your intranet, accessible to all developers within the company without having to send code over the Internet.
  • Offline executable - a simple executable that allows you to generate HTML Technical Solution Documentation in batches.
All of the above versions are general in their purpose. They cover sample checks and report some general issues if found. Feel free to contact me if you need a tailored solution covering company- or project-specific validation and reports. Use the features listed below as a starting point for any checks that would suit your business or technical needs.

Source-to-Target dependency analysis


Here's a description of how it all started and what the original goal for the tool was. Or go directly to this source-to-target analysis sample table where you can see a sample mapping analysis: all source columns with their corresponding target columns. You can easily check which target columns are influenced by a given source column to perform impact analysis, as well as check where the values come from for any given target column. You can click on column names to sort the table. Shift+click for multiple selections.

Technical Solution Documentation


The Source-to-Target analysis is performed for all mappings found within the Workflow XML file. This - along with the workflow, worklets (if any) and details of all the sessions - creates a technical report covering all implementation details. All available at a glance, with a table of contents, and it can be used to create documentation for a whole project.

Automated Code Validation


While creating the report, different checks are performed to ensure high development quality. This is very helpful in large-scale projects with many developers involved and allows all team members to follow the same practices. All developers can have their code quickly checked to make sure there are no typos in log file names, or any debugging-mode options left on, before having their code released to the production environment. Here is the list of checks currently performed by the tool. As said, it can be easily expanded to accommodate any additional quality tests - please contact me if you need a dedicated solution.
  • Comparing workflow log file with workflow name
  • Checking if a session is valid
  • Checking if a mapping is valid
  • Tracing level
  • Overridden session properties
  • Hardcoded connections
  • Comparing session log file with session name
  • Checking if session is not DISABLED
  • Expression containing comment (--)
  • Filter condition containing comment (--)
  • Router group expression containing comment (--)
  • Lookup SQL Override containing comment (--)
  • Source Qualifier SQL Override containing comment (--)
  • Connections with names containing 'tmp', 'temp' or 'test'
Check the main XML Analyzer page for an up-to-date feature list!

Friday, July 5, 2013

PowerCenter Best Practices

Introduction

Below you’ll find some simple rules one should consider when building PowerCenter Workflows. This list definitely doesn’t cover everything - and it probably never will. However, it may be further extended to include other Best Practices, examples and explanations. It’s also possible that some of the guidelines below may be reviewed and altered anytime in the future. Feel free to read, ask and comment. Keep in mind that the author holds no responsibility whatsoever :)


General


  1. Limit the number of rows as early in the mapping as possible (SQ, aggregators, joiners, filters) - the more you limit at the beginning of the data flow, the fewer rows will be processed by all downstream transformations.
  2. For generating IDs use IDENTITY columns or DB triggers (see the SQL sketch at the end of this section). There are a number of reasons:
    1. fewer PowerCenter objects reduce development time and maintenance effort
    2. IDs are PowerCenter-independent, so any other app used to populate the target table will not cause any issues
    3. while migrating the code between environments there is no need to worry about the sequence value
  3. If for some reason you need the ID on the PowerCenter side, use a Sequence Generator rather than a Stored Procedure call for performance reasons - the Sequence Generator is really fast as it simply gives you numbers in IS memory. A Stored Procedure call requires connecting to the DB engine, invoking the Stored Procedure, and sending the number back to the IS. The communication overhead is huge and may heavily impact performance. It is also a factor of the connection speed and distance.
    If you’re afraid of losing the generated sequence number (which might happen when exporting/importing XML without the Retain Sequence Generator Value option), store the last ID in the DB, use it as a start value added to the sequence, and update it with the last ID generated this way.
  4. There should be no unused ports in any transformation except the Source Qualifier - the Source Qualifier should match the Source Definition 1-to-1, but for any other transformation there should be no unused ports. Why fetch and process the data for half of the data flow and use extra memory if you don’t need it?
  5. Avoid implicit conversion, as it is slower and might lead to errors - there are a number of issues with implicit conversion (e.g. when you link an SQ string port to a decimal port in the following expression):
    1. it’s simply not visible. Anyone checking the mapping will not notice that there is any conversion of datatypes
    2. the conversion might not behave as expected (e.g. rounding, trailing/leading spaces or zeros, etc.)
    3. there may be unhandled conversion errors
Therefore use appropriate conversion functions (e.g. TO_DECIMAL) explicitly.

  6. Remove all the 'Error (Transformation error)' messages from default values in Expression transformations - you will avoid getting unwanted error messages in logs, and it will not be checked at session initialization.
  7. Consider adding an Expression transformation right after each SQ and right before each Target - pass-through expressions do not cause additional computation, and I couldn’t count how many times I had to make a “little change” before or after the whole data flow in an existing mapping. Having this dummy Expression transformation helps a lot. It’s also very useful when relinking all ports after some changes (using Autolink by name).
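To illustrate point 2, here is a minimal T-SQL sketch (the table and column names are made up for the example) of an IDENTITY-based surrogate key - the database assigns the ID on insert, so no PowerCenter object is needed to generate it:

-- Hypothetical target table: the DB generates the surrogate key on insert,
-- so the mapping needs no Sequence Generator or Stored Procedure call.
CREATE TABLE dbo.CUSTOMER_DIM (
    CUSTOMER_ID   INT IDENTITY(1,1) PRIMARY KEY,      -- assigned by the DB
    CUSTOMER_NAME VARCHAR(255) NOT NULL,
    LOAD_DATE     DATETIME     NOT NULL DEFAULT (GETDATE())
);

-- A PowerCenter session (or any other application) inserts only the business
-- columns; CUSTOMER_ID is populated automatically.
INSERT INTO dbo.CUSTOMER_DIM (CUSTOMER_NAME)
VALUES ('Sample customer');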


Aggregators:


  1. Use as soon as possible - it limits the number of rows processed by any transformation downstream.
  2. Use 'Sorted Input' if possible, but do NOT sort just for aggregation - this one is a bit tricky. An Aggregator with the “Sorted Input” property works faster and uses a small cache. But aggregating data does not require sorting. Therefore, if dealing with unsorted data, an Aggregator (without Sorted Input, large cache) will perform better than a Sorter (large cache) + Aggregator (with Sorted Input, small cache). If you do add the Sorter, the cache will not disappear - it’ll just be created for the Sorter. Adding a Sorter is a good idea only if you need to sort the data anyway.


Expression:


  1. Use local variables for common computation (especially if complex) - If you need to perform some computation for many output ports (e.g. datatype conversion on one of the input ports), you can define a variable port doing it once (e.g. TO_DECIMAL(input_port)) and use the port in expressions for the output ports. This way it will be done once. This is especially important, when the common logic is complex, because:
    1. the performance gain is even greater
    2. it’s easier to maintain - there is one place you need to check for any errors and implement any changes
  2. Avoid local variables if no code reuse opportunity is present - if an output port does some computation that is not shared with other output ports, use the output port expression. Do not add a variable port. Input -> Variable -> Output hides the expression from the Mapping Designer view (i.e. you can only see it after opening the Expression Transformation window).


Filter / Router:


  1. Any complex expressions should be done in an Expression transformation before the Filter / Router - this is especially important in the case of a Router: calculate once, then compare, instead of calculating many times. Remember that for every input row all group expressions are checked (a Router can have many outputs for one input row). It is a good idea to do the same for a Filter, as for example you can use variable ports in an Expression while it is not possible in a Filter.
  2. Keep in mind that a Router evaluates all group expressions against each input row


Floats:


  1. When to use? Only for physics, where accuracy is not that important. Keep in mind that floating point numbers are not accurate! E.g. 0.1 is stored as ~0.0999999999999999991, which is very close to 0.1, but not equal. You can end up having 0 kept as -0.00000001 (a real-life example). This may cause a lot of trouble when selecting all rows greater than or equal to zero - see the sketch below.
    For any financial, quantitative - in fact almost any - calculations use decimals.
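A quick way to see this for yourself is the T-SQL sketch below (any database with FLOAT and DECIMAL types will behave the same way):

-- FLOAT is a binary approximation, so 0.1 and 0.2 are not stored exactly.
DECLARE @a FLOAT = 0.1, @b FLOAT = 0.2;
SELECT CASE WHEN @a + @b = 0.3 THEN 'equal' ELSE 'not equal' END AS float_check;    -- 'not equal'

-- DECIMAL stores the values exactly, so the comparison behaves as expected.
DECLARE @c DECIMAL(10,2) = 0.1, @d DECIMAL(10,2) = 0.2;
SELECT CASE WHEN @c + @d = 0.3 THEN 'equal' ELSE 'not equal' END AS decimal_check;  -- 'equal'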


Lookup:


  1. Should have only the ports that are being used - the more ports in a Lookup, the more data is fetched and the more time and memory it takes to build the cache.
  2. Limit the number of rows fetched using a filter - use the Lookup Source Filter (see the sketch below). This will limit the amount of data to fetch and keep in memory.
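Conceptually - this is a hedged sketch, the exact statement issued by the Integration Service may differ - the Lookup Source Filter simply becomes a WHERE clause of the cache-building query, so rows that can never match are not fetched at all (table and column names below are made up):

-- Without a filter the cache would be built from the whole lookup table:
--   SELECT CUSTOMER_ID, CUSTOMER_NAME, ACTIVE_FLAG FROM dbo.CUSTOMER_DIM
-- With Lookup Source Filter set to ACTIVE_FLAG = 'Y' it effectively becomes:
SELECT CUSTOMER_ID, CUSTOMER_NAME
FROM   dbo.CUSTOMER_DIM
WHERE  ACTIVE_FLAG = 'Y';   -- only rows that can actually match are cached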


Parameter files:


  1. Use one parameter file per workflow - there are many ways to use parameters and parameter files. You can define general parameters for all sessions in a workflow. You can have each session use its own parameter file. What is the best way? There is no simple answer, and you might need a different setup for some purposes. In general, however, it’s good to have one parameter file per workflow: you won’t affect other workflows when changing the parameters, and the file will stay quite simple (in most cases).
  2. Consider auto-generating parameter files from values stored in a DB - workflows can start (or end) with a generic session creating the parameter file (see the SQL sketch below). This allows easy maintenance of all parameter files and parameters, running checks for naming conventions, etc.
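One possible implementation of point 2 - a sketch only, with a hypothetical control table and workflow name - is to keep the parameters in the database and let a generic session (or pre-session command) build the file content from a query like this:

-- Hypothetical control table holding one row per workflow parameter.
CREATE TABLE dbo.WF_PARAMETERS (
    WORKFLOW_NAME VARCHAR(100) NOT NULL,
    PARAM_NAME    VARCHAR(100) NOT NULL,   -- e.g. '$$BusinessDate'
    PARAM_VALUE   VARCHAR(400) NOT NULL
);

-- Produce the lines of the parameter file for one workflow; the calling script
-- only needs to add the [Folder.WF:workflow_name] section header and spool the
-- result to a .par file.
SELECT PARAM_NAME + '=' + PARAM_VALUE AS PARAM_LINE
FROM   dbo.WF_PARAMETERS
WHERE  WORKFLOW_NAME = 'wf_Load_Customers'
ORDER  BY PARAM_NAME;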


Sorters:


  1. Avoid if possible - do not sort the data unless it is really needed.
  2. Use as soon as possible - if the sorting is needed, sort data early in the mapping and try to take advantage of it using Sorted Input for as many transformations downstream as possible.
  3. Use as long as possible - Try to arrange the mapping in a way that all transformations sharing the sort type come one after the other. Avoid resorting separately for every transformation.
  4. Take advantage of DB sorting if a covering index is available - if a covering index is available in the DB, sort the data using the Source Qualifier and use Sorted Input for the transformations downstream.
  5. Don’t:
    1. Sort just for aggregator - Although Aggregator works faster and doesn’t use caching (almost) with “Sorted Input” property checked, it is not worth sorting data just for aggregation. This will just move the need for cache from Aggregator to Sorter and in fact will slow down the overall performance as the aggregation algorithm doesn’t require sorting.
    2. Sort for joiner (especially each sorter!) - Do not use two Sorters for Detail and Master groups of Joiner Transformation. In such case both pipes will require caching while for unsorted data just the Master input is cached.


Source Qualifier:


  1. Connect only the necessary ports from SQ - Source Qualifier determines what really needs to be fetched from source. While it should match Source Definition 1:1 with all ports linked (to avoid future mistakes and misunderstandings), it reads only the data for ports linked to the next transformation. This can be observed while generating SQL statement (Properties->Sql Query->Generate SQL) - only the used ports will be listed in SQL statement. This is especially important for wide sources, as this way the amount of data transferred can be limited.
  2. Avoid using SQL override statements - an SQL Query override supersedes any and all other settings (i.e. port order, sorting, filters). It is also not visible at first glance and reduces code transparency, so it should be avoided (see the SQL sketch at the end of this section). If you need to:
    1. sort the data - use "Number Of Sorted Ports" property with the proper ports order
    2. filter the data - use "Source Filter" property
    3. join multiple homogeneous sources - use "User Defined Join" property
An SQL override should be used only when complex statements are needed (e.g. the use of DB-specific functions or a subquery).
The above is also applicable (in general) to the Lookup transformation.

  3. All ports between the Source Definition and the SQ should be connected - as mentioned above, it is good to have the SQ match the Source Definition, as it reduces the number of future mistakes.
  4. All the datatypes should match the Source Definition - in general, implicit datatype conversion should be avoided as it may lead to errors and unexpected behavior.
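To illustrate point 2, here is a rough sketch (made-up table and port names; the statement the Integration Service actually generates may differ in detail) of the query you get when "User Defined Join", "Source Filter" and "Number Of Sorted Ports" are set - the same result as an override, but the intent stays visible in the individual properties:

-- Effective query for a Source Qualifier over ORDERS and CUSTOMERS with:
--   User Defined Join:      ORDERS.CUSTOMER_ID = CUSTOMERS.CUSTOMER_ID
--   Source Filter:          ORDERS.ORDER_DATE >= '2013-01-01'
--   Number Of Sorted Ports: 1  (the first connected port is ORDERS.ORDER_ID)
SELECT ORDERS.ORDER_ID, ORDERS.ORDER_DATE, CUSTOMERS.CUSTOMER_NAME
FROM   ORDERS, CUSTOMERS
WHERE  ORDERS.CUSTOMER_ID = CUSTOMERS.CUSTOMER_ID
  AND  ORDERS.ORDER_DATE >= '2013-01-01'
ORDER  BY ORDERS.ORDER_ID;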


Update Strategy:


  1. Don’t use DD_REJECT - don't process unneeded data. If you really need the rejected rows in the .bad file, use "Forward Rejected rows"; otherwise use a Filter transformation to drop unneeded rows of data.


Sorting: DB or Sorter transformation?

When considering performance, a case-by-case analysis is needed - there is no “best-for-all” approach. Database and network performance, Database and Integration Service load, the database setup and, finally, the hardware configuration of both the Database and the Integration Service all have to be taken into account. There are however some general rules that can be applied:

  1. Use DB if covering index is available - Try to take advantage of existing DB indexes. It happens quite often that the source tables are indexed and all that needs to be done is setting the appropriate order of the ports in Source Qualifier and the “Number Of Sorted Ports” property. If not:
  2. Check if it's possible to create a covering index - sometimes a DB index is not available, but can be created. The advantage is big enough that it’s always worth checking. Of course there are scenarios where you can’t afford an index (e.g. there are lots of inserts / updates on a table with rare selects). Investigate if it is possible - if not:
  3. Use Sorter transformation - If no index on source is available and data needs to be sorted, use a Sorter transformation. Keep in mind the remarks discussed above.
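For points 1 and 2, a covering index on SQL Server could look like the sketch below (table and column names are illustrative): the index both delivers the sort order and "covers" all selected columns, so the database can return sorted rows without an extra sort step:

-- Sort key first, remaining selected columns INCLUDEd so the index covers the query.
CREATE NONCLUSTERED INDEX IX_ORDERS_ORDERDATE_COVER
    ON dbo.ORDERS (ORDER_DATE, ORDER_ID)   -- matches the "Number Of Sorted Ports" setup
    INCLUDE (CUSTOMER_ID, AMOUNT);         -- non-key columns read by the mapping

-- The Source Qualifier then only needs ORDER BY ORDER_DATE, ORDER_ID and the DB
-- can satisfy it with an ordered index scan instead of a sort.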

Thursday, June 20, 2013

Mapping Source-to-Target analyzer

Below is the original post describing the early version and the main feature of the tool. For an up-to-date description please visit the XML Analyzer for Informatica Powercenter blog post.



Do these questions sound familiar?
  • "Hey, can you tell me where does this value come from?" 
  • "Do you know where does data for this column come from?"
  • "If I change values here, what columns will be affected?"
One great thing about PowerCenter is that it's very easy to trace column dependencies. You can quickly choose "Select Link Path" on the desired column and trace it forward, backward or both, e.g.:


Looks good. But what if you get more of those questions, like:
  • "Can you send me an email describing columns A, B, C, D, F?"
  • "Can you quickly send me some simple table with column dependancies?"
Select Link Path is extremly useful for development, but it is simply not possible to use it like 200 times if you need to create some document for large table. And what if there are more sources and targets?
One way is to use Metadata Manager. But what if it is not available? If you'd need something quick and handy, you can use:

Online Mapping XML Analyzer

It's a simple tool allowing you to create a Source-to-Target dependency description in (almost) three clicks:
  1. Open the explorer
  2. Choose the mapping XML file
  3. Click "Upload"
As a result, you'll get a table as shown below:
In this example you can see all source columns listed (with the column name, the name of the source and the source type) on the left, with the corresponding target columns on the right. In addition you can see the BusinessDate target column that is not being populated by any source column - it comes from the expression exp_Convert_Datatypes mentioned on the left.

This way you can see:
  • all source columns and the targets they affect (many-to-many relations supported),
  • all target columns with the corresponding sources (even if a column is not being populated by any source).
In addition, you can copy the whole table to Excel and do whatever you need to make it suit your needs (format, add filters, etc.):




Please keep in mind that it is a work-in-progress. Feel free to let me know if you face issues, but be advised that I cannot guarantee to fix them and I also hold no responsibility for using the tool.

I hope this will be helpful!

Monday, May 13, 2013

Stage Mapping Generator

The Problem

Recently, I've been given a sample flat file with a simple task: prepare a 1:1 mapping to load the file to a database. Sounds easy? Sounds familiar? If it does sound familiar, then you probably know it's not easy. As I had to create the appropriate table, the problem was to decide what datatype, length and precision I should choose for each column. What if there are 80 columns? What if there are 10,000 rows in the flat file? Finally, what if you've got 160 files?

Then you've got a lot of work. This led me to a solution I'd like to share:

Stage Mapping Generator

A simple tool that does the whole job. So here it is. All you need to do is put the flat files in one place:


Run the tool and import the automatically created mappings:


Great, job done! Here is a sample mapping created with SMG - with all port names and datatypes resolved automatically and all string ports trimmed:



Configuration: what can you do?

To make the most out of it, the tool comes with a configuration file. It's possible to use it without any advanced setup, but there are some tricks that make it more powerful. Here is the sample configuration. Most of the parameters are self-explanatory:

#Full folder path containing source files. All the files from the location below will be analyzed and there will be a mapping created for each file found.
SourceDir: c:\SMG\FlatFiles

#Source file code page. You can set the proper code page used in source files.
SrcFileCodePage: windows-1250

#Full folder path for target xml files. Output XML file (or files) will be created there.
TargetDir: c:\SMG\xml

#Create one output file for all mappings [0/1]. Depending on the choice made here, either one XML file with all the content is created (as depicted above) or one XML file per source file.
CombinedOutput: 0

#Column separator used in flat files:
ColumnSeparator: ,

#Thousand separator used in flat files:
ThousandSeparator: 

#Decimal separator used in flat files:
DecimalSeparator: .

#Import column names from the first row? [1/0]. If there is a header in each of the files, it will be used for the port names.
FirstRowHasNames: 0

#Number of rows used to determine datatype (0 for all). It's possible to limit the number of rows tested for each file. It can be used to limit the memory use and to increase performance.
DatatypeSampleCount: 0

#Trim incoming strings [0/1]. Enable this if LTRIM/RTRIM should be added to all string columns.
AddTrimming: 1

#Convert DATE fields to date/time format ($$DateFmt parameter and file header  required!). Using this will cause all columns with "Date" in their names to be converted to datetime.
ConvertDates: 1

#Repository description - this can be obtained by exporting a sample mapping from the repository and checking the XML file. This allows identifying the appropriate folder.
[Repository]
RepositoryVersion: 181.90
RepositoryDescription: <REPOSITORY NAME="DEV" VERSION="181" CODEPAGE="MS1250" DATABASETYPE="Microsoft SQL Server">
RepositoryFolderDescription: <FOLDER NAME="MYFOLDER" GROUP="" OWNER="" SHARED="SHARED" DESCRIPTION="" PERMISSIONS="rwx---r--" UUID="">

#Parameters and variables to be used in mapping:
#syntax: NAME:DATATYPE,DEFAULTVALUE,ISEXPRESSIONVARIABLE,ISPARAM,PRECISION,SCALE
#example: $$UserName:string,,NO,YES,50,0
#date example: $$Date:date/time,,NO,YES,29,9

[MappingVarParam]
$$BusinessDate: string,,NO,YES,12,0
$$DateFmt: string,MM/DD/YYYY,NO,YES,12,0

#Columns to be added to each source:
#Syntax: ColumnName: DATATYPE//PRECISION//SCALE//EXPRESSION
#Use // as separator
#Example: COL1: date/time//29//9//SUBSTR(Price,1,3)

#This section makes it possible to add any additional ports that should be added to each mapping.
[AdditionalColumns]
BusinessDate: date/time//29//9//TO_DATE($$BusinessDate,$$DateFmt)

Links to the tool:

Stage Mapping Generator DEMO
Stage Mapping Generator - full version

Final remarks

All targets created with the Stage Mapping Generator are MS SQL Server. However, you can easily change the type in the Target Designer if needed.

Feel free to contact me if you'd have any questions or comments!

Wednesday, May 8, 2013

[ODBC SQL Server Wire Protocol driver] Invalid cursor state.

Using a Stored Procedure as a source of data can be a little bit tricky. I've created a simple mapping with a Stored Procedure invoked in the SQL Query of a Source Qualifier:

I've created a workflow with 5 sessions - each invoking one of the stored procedures below:


------> SCENARIO 1 <------
CREATE PROCEDURE sp_test_mg
AS
BEGIN
    DECLARE @result TABLE (batchFile varchar(255), ClientName varchar(255))

    INSERT INTO @result
    SELECT TOP 10 Batchfile, ClientName FROM test_table

    SELECT * FROM @result
END
GO

------> SCENARIO 2 <------
CREATE PROCEDURE sp_test_mg_cur
AS
BEGIN
    DECLARE @result TABLE (batchFile varchar(255), ClientName varchar(255))
    DECLARE @batchFile varchar(255), @ClientName varchar(255)

    INSERT INTO @result
    EXEC sp_test_mg

    DECLARE test_cursor CURSOR STATIC FOR
    SELECT * FROM @result;

    OPEN test_cursor
    FETCH NEXT FROM test_cursor INTO @batchFile, @ClientName
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SELECT @batchFile, @ClientName
        FETCH NEXT FROM test_cursor INTO @batchFile, @ClientName
    END
    CLOSE test_cursor
    DEALLOCATE test_cursor
END
GO

------> SCENARIO 3 <------
CREATE PROCEDURE sp_test_mg_tab
AS
BEGIN
    SELECT TOP 10 Batchfile, ClientName FROM test_table
END
GO

------> SCENARIO 4 <------
CREATE PROCEDURE sp_test_mg_tab_var
AS
BEGIN
    DECLARE @result TABLE (batchFile varchar(255), ClientName varchar(255))

    INSERT INTO @result
    EXEC sp_test_mg

    SELECT * FROM @result
END
GO

------> SCENARIO 5 <------
CREATE PROCEDURE sp_test_mg_temp_tab
AS
BEGIN
    CREATE TABLE #result (batchFile varchar(255), ClientName varchar(255))

    INSERT INTO #result
    EXEC sp_test_mg

    SELECT * FROM #result
END
GO


Now all executions except for scenario number 3 fail with an error message: [ODBC SQL Server Wire Protocol driver] Invalid cursor state. What is the issue?

My first thought was that for some reason PowerCenter can't read from temporary structures - the 3rd scenario is the only one using a real table. But the reason - and the solution - is totally different: SET NOCOUNT ON!

Adding SET NOCOUNT ON to each stored proc makes everything work fine, with one little remark: in Scenario 2 (with the cursor) only one row is read - most likely because the loop returns each row as a separate result set and only the first one is picked up.
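For reference, here is scenario 1 revised with the fix applied (the other procedures change in the same way):

CREATE PROCEDURE sp_test_mg
AS
BEGIN
    -- Suppress the "(n row(s) affected)" messages; without this the driver treats
    -- them as additional results and reports "Invalid cursor state".
    SET NOCOUNT ON;

    DECLARE @result TABLE (batchFile varchar(255), ClientName varchar(255))

    INSERT INTO @result
    SELECT TOP 10 Batchfile, ClientName FROM test_table

    SELECT * FROM @result
END
GO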

Wednesday, April 10, 2013

XML writer: failed to convert data

A session succeeded, but I've found a rejected row:


Let's check the log:


Doesn't mean much to me. But there is a strange thing in the first picture: there is a row being rejected, yet there were no rows fetched from the source. It seems that if you don't have any input for an XML Generator transformation, it will try to create the XML nevertheless. And it will result in this strange error if one of the input fields is of datetime type: since there is no input at all, it tries to convert [0000-00-00] to a date.

Friday, March 29, 2013

Adding a port to an existing XML PARSER

How to add a port to an existing XML PARSER transformation? 


Usually they say:
- go to XML Editor
- create XSD
- edit the file in some external editor
- reimport the file creating a new XML Parser Transformation.

This works well, with just one little disadvantage: you lose all the links. It's OK if you've got five fields, but it can be a real headache if the XML structure is complex.

So lets rephrase the question:

How to add a port to an existing XML PARSER transformation without losing all the existing links?

You begin with the same steps:

  • go to XML Editor:
    • edit the XML PARSER transformation 
    • go to "Midstream XML Parser" tab
    • click the "XML Editor" button
This should bring up the XML editor:
  • create XSD file
    • choose View XML Metadata as XML/DTD/XSD (indicated above) to bring up the following screen
    • Clicking on the namespace (e.g. <noNamespace>) should bring up the file in an editor
  • edit the file in some external editor and add the required port, e.g.:
<xsd:attribute name="NewTestPort" type="xsd:string">
</xsd:attribute>

Now, go back to the mapping, right-click the existing XML PARSER, choose "Synchronize XML Definition" and point it to the new XSD:


Voila! Job done.

Great! But the port is still not there... Now what?

How to add a port after XML Definition has been synchronized?

First, let's take a look into XML Editor. The NewTestPort can be found in the Navigator (the left side), but it is not visible in XML View (the right side):

Right-clicking it in the Navigator window (the left side, where it is visible) shows the "Show XPath Navigator" command. It does bring up the XPath Navigator indeed:
The problem is that you can't drag & drop the port to the view on the right. You need to bring up the XPath Navigator from the right pane instead, focusing on the appropriate branch in the XML View:

This will bring up the same XPath Navigator window, but this time allowing you to drag & drop the newly added column. Now, after choosing "Apply Changes" from the "File" menu (or simply pressing Ctrl+S) and closing the XML Editor, the newly added NewTestPort should be visible in the transformation:

And - most importantly - no existing links are missing! Voila! Job done. This time it's really done :)