Thursday, 25 August 2016

Improving performance of Inserts and deletes in SQL server

Improving performance of Inserts and deletes in SQL server  - Microsoft Azure

I tried to delete some records from a table with 100 millions records and it took forever. One of the reason being there was no index on the column I used in the where clause of the delete statement and other reason being the table was so big. So deleting a big chunk of the data from such big tables can take forever since the data gets written to transaction logs and the table has to scan for all the records that match the where clause condition.

How to delete records quickly from huge tables in sql server on Azure? 

1) Use only the column that has index defined in the where clause condition. May be this will help find the records to be deleted quickly.
2) Delete records in small chunks using the loop below. This prevents transaction logs from growing rapidly.  @@rowcount always returns the number of records affected from delete.

DELETE TOP (100000) from dbo.mytable WHERE name='dwbi'
WHILE @@rowcount > 0
  DELETE TOP (100000) from dbo.mytable
    WHERE name='dwbi'
3)  Insert data into temp table with no index on it using the columns that need not be deleted , delete the original table, rename the temp table back to the original table and add back the index on the newly renamed temp table.

select * into tmp_mytable from mytable where name!='dwbi'
drop mytable
sp_rename 'tmp_mytable', 'mytable'

How to insert records quickly from huge tables in sql server on Azure? 

My records to insert records into a huge table improved a lot after I disabled the non clustered indexes and rebuild it after the insert. The steps are below:

Insert into [dbo].[mytable] select * from [dbo].[stg_table]



Friday, 8 July 2016

Performance tuning of Informatica Big Data Edition Mapping

Performance tuning of Informatica Big Data Edition Mapping

Below are the list of performance tuning steps that can be done in Informatica Big Data Edition:

1)  When using a look up transformation only when the lookup table is small. Lookup data is copied to each node and hence it is slow.

2) Use Joiners instead of lookup for large data sets.

3) Join large data sets before small datasets. Reduce the number of times the large datasets are joined in Informatica BDE.

4) Since Hadoop does not allow updates, you will have to rebuild the target table whenever the record is updated in a target table. Instead of rebuilding  the whole table, consider rebuilding only the impacted partitions.

5) Hive slower with any non string data type. It needs to create temp tables to do the conversion to and from the non string data type to string data type. Use non string data type only when required.

6) Use the data type precision close the actual data. Using higher precision slows down the performance of Informatica BDE.

7) Map only the ports that are required in the mapping transformation or loaded to target. Less number of ports means better performance and less data reads.


Workarounds for Mapping variables and parameters and sequence generators and sorters in BDE

Mapping variables and parameters and sequence generators and sorters in BDE

Since there are no mapping variables/parameters and sequence generators in BDE, you can use the following workarounds:

For mapping variables and parameters, you can use a control table or a files instead and read the control tables or files in the mapping and use the content of the table/files in your mapping. Create a look up on the control table to get values for all the parameters defined in the control table. You can update the control table if the parameter needs to be updated at the end of the run.

For Sequence generator, you can use UUID (Unique Identified) functions instead. These UUID functions are alphanumeric and if you need numeric only then use Java functions.

HSQL does sorting by default. i.e Hadoop does the sorting and so you do not need sorter unless you are using it with a aggregator that has a sorted input. In this case you need to add a sorter to validate the mapping.

Thursday, 7 July 2016

Useful Queries for troubleshooting amazon redshift


Here are some of my queries for troubleshooting in amazon redshift. I have collected this from different sources.


select a.userid, cast(u.usename as varchar(100)), a.query, a.label,, a.starttime, b.duration,
b.duration/1000000 as duration_sec, b.query as querytext
from stv_inflight a, stv_recents b, pg_user u
where = and a.userid = u.usesysid

select pid, trim(user_name), starttime, substring(query,1,20) from stv_recents where status='Running'


cancel <pid>

You can get pid from one of the queries above used to check running queries.


select * from STL_ALERT_EVENT_LOG
where query = 1011
order by event_time desc
limit 100;


select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim( as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a
join pg_class as pgc on pgc.oid =
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on
order by b.mbytes desc, a.db_id,;


analyze <tablename>;
analyze compression <tablename>;


select "column", type, encoding
from pg_table_def where tablename = 'biglist';


select * from stl_load_errors

select * from stl_load_commits

select query, trim(filename) as file, curtime as updated, *
from stl_load_commits
where query = pg_last_copy_id();


select d.query, substring(d.filename,14,20),
d.line_number as line,
substring(d.value,1,16) as value,
substring(le.err_reason,1,48) as err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
and d.query = pg_last_copy_id();


select owner as node, diskno, used, capacity
from stv_partitions
order by 1, 2, 3, 4;
select query, trim(querytxt) as sqlquery
from stl_query
order by query desc limit 5;


To resize the redshift cluster (node type and number of nodes always required):

aws  redshift modify-cluster --cluster-identifier <cluster name> --node-type dw2.8xlarge --number-of-nodes 3

To get filelist on S3:

aws s3 ls $BUCKET/  > ./filecount.out

To get status of cluster and other information of cluster in text format:

aws redshift describe-clusters --output text  


Thursday, 28 April 2016

How to run stored procedure in Informatica cloud

How to run stored procedure in Informatica cloud?

There are multiple ways to run stored procedure in Informatica cloud.

If you want to call a stored procedure by passing an input value and reading the output from the stored procedure and process it in a mapping then it is better to use SQL transformation in Informatica cloud mapping designer as shown in picture below:

If you want to read the output of a stored procedure as a source in a mapping then you can call the stored procedure a query in the source as shown in screenshot below. The exec sp_help stored procedure is called in the source query. The stored procedure should return some data for this method to work. You might have to define the output variables in the stored procedure.

If you want to just execute a stored procedure once before or after executing a mapping then you can call the stored procedure a pre or post processing command in the mapping configuration as shown in the screenshot below. SQLCMD of sql server is used to execute the stored procedure. The output of the stored procedure is written to an output file. -b flag is used to return error code and make the preprocessing command fail when the stored procedure fails. Use session property  Preprocessing command failure and set it to stop if you want failure to stop the processing of other jobs.


Tuesday, 29 March 2016

Importing Cobol copy book definition , and reading and troubleshooting cobol file issues in Informatica

Importing Cobol copy book definition in Informatica and troubleshooting cobol file issues

 To read a cobol file, first you need to create the source definition and then read the file the data from the cobol input file. Source definitions are created from cobol copy book definitions.  Cobol copy book has definition of all the fields present in the data file along with the level and occurence information.

How to import a cobol copy book in Informatica?

Create source definition using Source-> Importing From Cobol  file option to create the source definition from cobol copy book.

Insert the source definition into mapping. Instead of source qualifier it is going to create normalizer transformation.Read from the normalizer and write to any target.

Details on how to load data from cobol copy book is in this video that I found on youtube: 

Troubleshooting copy book import in informatica:

Reading from cobol file is not always straightforward. You will have to sometimes edit the copy book to get it working. Some of the messages I had while creating source definitions are below:

Error messages seen:

05 Error at line 176 : parse error

Error: FD FILE-ONE ignored - invalid or incomplete.

05 Error: Record not starting with 01 level at line 7

Resolution for copy book import issues In Informatica:

1) Add the headers and footers shown in blue below even if they are blank

        identification division.
        program-id. mead.
        environment division.
        select file-one assign to "fname".
       data division.
       file section.
       fd  FILE-ONE.

        01  HOSPITAL-REC.
           02  DEPT-ID                PIC 9(15).
           02  DEPT-NM                PIC X(25).
           02  DEPT-REC                OCCURS 2 TIMES.
               03  AREA-ID             PIC 9(5).
               03  AREA-NM             PIC X(25).
               03  FLOOR-REC       OCCURS 5 TIMES.
                   04  FLOOR-ID       PIC 9(15).                                
                   04  FLOOR-NM       PIC X(25).
      working-storage section.
       procedure division.
           stop run.

2)  All the fields should start from column 8. Create 8 spaces before each field.
3) Make sure the field names do not start with *. It has to have a level number at the beginning.
4) Level 01 has to be defined as the first record. If level 01 not there then it would not import the cbl file.
5) Special levels such as 88 , 66 might not get imported. This apparently is limitation of powercenter. I could not find a way around it. 

Resolution for cobol file issues In Informatica:

I had such a hard time reading a cobol file that was in EBCDIC format with comp-3 fields in it. I thought the issues are mostly in importing the cobol copy book but even more issues when reading the cobol file. It is supposed to be straight forward in Informatica powercenter. The list of issues I faced.

1) When reading from cobol mainfram file the comp-3 fields were not read properly.
2) The alignments of the fields were not correct. The values were moving to next columns and had a cascading effect.

To resolve the above two issues I asked the mainframe file to be sent in binary format. Apparently files ftp'ed in ascii format from mainframe causes all the above issue. The comp-3 fields are not transferred properly in Ascii mode of ftp. 

Now with the binary file and below properties shown in screen it was supposed to work correctly. It was not. It was all garbage because of a weird issue. The properties I set for code page in source properties in source designer was for some reason not getting reflected in the session properties for that source. I had to manually change in the session- mapping -> source- file properties -advanced tab to code page IBM EBCDIC US English IBM037 and Number of Bytes to skip between records to 0 and it finally worked.

 To summarize when reading cobol file:
a) Ask for the file to be ftp'ed in binary format from the mainframe.
2) Set the code page and number of bytes to skip appropriately as shown in the screenshot in the session properties.

This should solve most of your problems with comp-3 fields and alignments ,etc. 

If not try unchecking IBM comp-3 option in the source properties from source designer. Try changing the code page and number of bytes to skip,  etc and hopefully it works.    

If you are seeing dots at the end of some comp-3 fields then most likely you are writing numeric data into varchar field. If you write it to numeric fields in the database those dots should disappear. 

Another issue with Informatica power center is that it can read only fixed width files. if you have occurs section with depending on condition then most likely you have a variable length file and Informatica power center does not work correctly with variable length files. Ask the mainframe guys to convert it to fixed width files. 



Friday, 4 March 2016

Mapping variables and parameters in Informatica Mapping and session

Mapping variables and parameters in Informatica Mapping

1) Informatica Mapping Parameters:

You use mapping parameters to supply parameter values to informatica mappings. It could be values such as Batch_Key, Batch_Name etc.The value of the mapping parameters do not change for the entire execution time of a session. The mapping parameters can be defined in the mapping as shown in the screenshot below in the mapping designer.The same mapping parameters needs to be defined in the parameter file too.

For example, the parameter file would have:


In the mapping designer, you would define the $$BATCH_KEY in the mapping and parameters window. During execution the session will associate the value 301 where it finds $$BATCH_KEY in the mapping. It might also be necessary to define the same parameter in the worflow under the variables tab.

2) Informatica Mapping Variables:

Mapping variable represents a value that can change through the session. Mapping variables can also be defined under the mapping and parameters window of the mapping designer just like parameters (shown in screenshot above). It can also be initialized using the parameter file. However the values of a variable can be changed during execution. The value of the mapping variables can be changed in an expression transformation using setvariable functions. For example:

SETVARIABLE($$BATCH_KEY, 302)  would change the value of $$BATCH_KEY to 302 during execution.

SETVARIABLE function can be used in expression, filter, router, and update strategy.Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time you run the session.

Postsession and presession variable assignment in session can be used to pass the value of the mapping variable to the session/workflow variables.This allows using the values of the variable at workflow level as shown in screenshot below.   

3)Informatica Session Parameters:

Session parameters represent values that can change between session runs, such as database connections or source and target files. E.g.: $DBConnectionName, $InputFileName, etc. These paramters needs to defined in the parameter file.

4)Informatica Workflow variables:

All the parameters, variables that have to be used in the workflow needs to be defined in the workflow in the window below. This can be all the mapping parameters, session parameters, etc.


Monday, 22 February 2016

Pushdown Optimization in Informatica Power Center and Big data edition

Pushdown Optimization in Informatica Power Center and Big data edition

Pushdown optimization is a feature in Informatica which enables the Informatica mapping to be converted into a SQL code and executed on the database. It is useful when you are dealing with huge volumes of data from the source and instead of reading all the data to the Informatica server it might be better to do some of the transformation logic such as filtering, aggregation, joins on the database itself. Also, useful in cases when you have powerful data warehouse appliance such as Netezza or Teradata.


Friday, 12 February 2016

What is informatica cloud services?

What is Informatica cloud services ?

Informatica cloud service is an Informatica product that is web based and on the cloud solution for data integration. The mappings and any data integration tasks can be created on the web based tool and executed/monitored from the web based tool. A secure agent is installed on the local environment that allows reading/writing data  to/from the cloud. Connections can be created to database such as amazon redshift, sql server, oracle, netezza etc and network folders. This product allows ETL tasks to be created on the web without having a power center server location on your premise. Hence, providing a cheaper option for creating Informatica tasks.

The different components of the Informatica cloud service (ICS) tool sets are listed below:

Data synchronization and Data replication: 

As the name implies it allows database synchronization. Data is copied from the source to the targets along with application of any data filters. One table or multiple tables can be synchronized. ICS provides inbuilt connectors using which connections to the source and target tables can be created.

Mapping configuration:  

Allows a task to be created from a mappings. The parameters files, variables, post processing commands, sessions settings etc can be defined for the task. This is similar to the session in the informatica power center.

Power Center:  

This component allows Informatica power center workflows to be imported and executed from the Informatica cloud service. The source and target connections from the power center workflow can be mapped to the connections available on Informatica cloud service. When importing informatica power center workflows make sure the workflow is exported from the repository manager of power center. Otherwise it shows some error while importing the task.


This component allows to create informatica ETL mappings similar to Informatica power center. Not all the features of Informatica power center is available in Informatica cloud service. Features like SQL transformation, union, etc are not provided in ICS. If you need additional features then you might have to create a mapplet in power center and import that task as a mapplet from powercenter and add it to the ICS mapping. ICS provides transformations such as expression, joiner, filter, lookup, sorter, aggregator, mapplet, and normalizer.

Task Flows

Task flows allows to create a sequence of jobs to be executed. All the mapping configuration tasks created can be made to execute in a sequence. This is similar to the workflow in informatica power center.

Integration templates

This component allows Informatica mapping templates created using Microsoft visio to be imported and applied to mappings in ICS.

Activity Log/Monitor

This component allows to monitor all the executing tasks and as well see the completed tasks. It provides information on the number of source and target rows, session logs, etc.


ICS mapplets work similar to Informatica power center mapplets. Mostly this allows powercenter mapplets to be imported to ICS.


 This components allows connections to be created to flat files folders, databases, etc. Wide range of connectors are available.

Friday, 11 December 2015

SQL override in Source qualifier and look up overide in Informatica

SQL override in Source qualifier and look ups in Informatica

SQL override is commonly used by ETL developers in Source qualifiers and Lookup transformation. To override SQL in source qualifier you can generate the SQL from the properties tab of the source qualifier and do all the necessary modifications to the SQL. All you have to keep in mind is the columns in the select statement match the order of the ports in the source qualifier. The ports have to be mapped to the transformation next to the source qualifier. The diagram below shows a simple SQL override. Similarly SQLs can be override in the lookups too.

The advantages of SQL override in Informatica:

a) You can put a lot of logic in the SQL override. The SQL will be executed in the source database and the results will be pulled in by the Informatica powercenter. Instead of reading all the records from the table, by writing SQL override you can filter, aggregate, etc on the source database and pull in the result set.

b) Similarly in the Lookups instead of reading all the columns from a table, you can read only the required columns from the table by doing SQL override and removing all the unnecessary ports. Also you can aggregate , filter, etc on the records that needs to be read.

c) If you are comfortable with SQL you would prefer this path instead of implementing a lot of stuff that can be achieved with SQLs.