DWBI-TECH BLOGS (Pradeep Kannadiga): July 2016

Friday 8 July 2016

Performance tuning of Informatica Big Data Edition Mapping

Below are the list of performance tuning steps that can be done in Informatica Big Data Edition:

1) When using a look up transformation only when the lookup table is small. Lookup data is copied to each node and hence it is slow.

2) Use Joiners instead of lookup for large data sets.

3) Join large data sets before small datasets. Reduce the number of times the large datasets are joined in Informatica BDE.

4) Since Hadoop does not allow updates, you will have to rebuild the target table whenever the record is updated in a target table. Instead of rebuilding the whole table, consider rebuilding only the impacted partitions.

5) Hive slower with any non string data type. It needs to create temp tables to do the conversion to and from the non string data type to string data type. Use non string data type only when required.

6) Use the data type precision close the actual data. Using higher precision slows down the performance of Informatica BDE.

7) Map only the ports that are required in the mapping transformation or loaded to target. Less number of ports means better performance and less data reads.

Workarounds for Mapping variables and parameters and sequence generators and sorters in BDE

Mapping variables and parameters and sequence generators and sorters in BDE

Since there are no mapping variables/parameters and sequence generators in BDE, you can use the following workarounds:

For mapping variables and parameters, you can use a control table or a files instead and read the control tables or files in the mapping and use the content of the table/files in your mapping. Create a look up on the control table to get values for all the parameters defined in the control table. You can update the control table if the parameter needs to be updated at the end of the run.

For Sequence generator, you can use UUID (Unique Identified) functions instead. These UUID functions are alphanumeric and if you need numeric only then use Java functions.

HSQL does sorting by default. i.e Hadoop does the sorting and so you do not need sorter unless you are using it with a aggregator that has a sorted input. In this case you need to add a sorter to validate the mapping.

Thursday 7 July 2016

Useful Queries for troubleshooting amazon redshift

USEFUL QUERIES FOR TROUBLESHOOTING IN AMAZON REDSHIFT

Here are some of my queries for troubleshooting in amazon redshift. I have collected this from different sources.

TO CHECK LIST OF RUNNING QUERIES AND USERNAMES:

select a.userid, cast(u.usename as varchar(100)), a.query, a.label, a.pid, a.starttime, b.duration,
b.duration/1000000 as duration_sec, b.query as querytext
from stv_inflight a, stv_recents b, pg_user u
where a.pid = b.pid and a.userid = u.usesysid

select pid, trim(user_name), starttime, substring(query,1,20) from stv_recents where status='Running'

TO CANCEL A RUNNING QUERY:

cancel <pid>

You can get pid from one of the queries above used to check running queries.

TO LOOK FOR ALERTS:

select * from STL_ALERT_EVENT_LOG
where query = 1011
order by event_time desc
limit 100;

TO CHECK TABLE SIZE:

select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim(a.name) as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
order by b.mbytes desc, a.db_id, a.name;

TO CHECK FOR TABLE COMPRESSION:

analyze <tablename>;

analyze compression <tablename>;

TO ANALYZE ENCODING:

select "column", type, encoding
from pg_table_def where tablename = 'biglist';

TO CHECK LIST OF FILES COPIED:

select * from stl_load_errors

select * from stl_load_commits

select query, trim(filename) as file, curtime as updated, *
from stl_load_commits
where query = pg_last_copy_id();

TO CHECK LOAD ERRORS

select d.query, substring(d.filename,14,20),
d.line_number as line,
substring(d.value,1,16) as value,
substring(le.err_reason,1,48) as err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
and d.query = pg_last_copy_id();

TO CHECK FOR DISKSPACE USED IN REDSHIFT:

select owner as node, diskno, used, capacity
from stv_partitions
order by 1, 2, 3, 4;
select query, trim(querytxt) as sqlquery
from stl_query
order by query desc limit 5;

SOME IMPORTANT AWS COMMANDS:

To resize the redshift cluster (node type and number of nodes always required):

aws redshift modify-cluster --cluster-identifier <cluster name> --node-type dw2.8xlarge --number-of-nodes 3

To get filelist on S3:

aws s3 ls $BUCKET/ > ./filecount.out

To get status of cluster and other information of cluster in text format:

aws redshift describe-clusters --output text