DWBI-TECH BLOGS (Pradeep Kannadiga): August 2015

Monday 17 August 2015

Difference between native and hive mode in Informatica big data edition (BDE)


a) Native mode - In native mode, BDE works like normal PowerCenter. It can read from and write to traditional RDBMS databases, and it can also write to HDFS and Hive. It works like PowerCenter because the mapping logic executes on the Informatica server rather than on the Hadoop cluster: source data is read into the Informatica server, the transformations are applied, and the data is then loaded to the target.

This mode is stateful, i.e. you can keep track of data from previous records and use sequence generators, sorters, etc., just like in normal PowerCenter.
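To make "stateful" concrete, here is a minimal Python sketch of row-by-row processing on a single server. It is only an illustration of the idea, not Informatica code: the function name, the `amount` field and the output field names are made up for this example.

```python
from typing import Dict, Iterable, Iterator


def add_sequence_and_delta(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Process rows one at a time, carrying state from record to record."""
    next_id = 1             # plays the role of a sequence generator
    previous_amount = None  # plays the role of a stateful variable
    for row in rows:
        if previous_amount is None:
            delta = None  # the first record has no previous record
        else:
            delta = row["amount"] - previous_amount
        yield {**row, "seq_id": next_id, "delta_from_previous": delta}
        next_id += 1
        previous_amount = row["amount"]


# Hypothetical source rows, read in a fixed order by one process.
source_rows = [{"amount": 100}, {"amount": 130}, {"amount": 90}]
result = list(add_sequence_and_delta(source_rows))
```

Because every row passes through the same process in order, the counter and the "previous record" value are always available to the next row.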

b) Hive mode - In Hive mode, you can have similar sources and targets as in native mode, but the whole mapping logic is pushed down to Hive, i.e. to the Hadoop cluster. In this mode Informatica BDE converts the mapping logic into Hive SQL queries and executes them directly on the Hadoop cluster as Hive queries, which in turn run as MapReduce jobs.

This mode is not stateful, i.e. you cannot keep track of data in the previous records using stateful variables. Transformations such as sorters and sequence generators won't work fully or properly.
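One way to picture why state is lost: on a cluster the rows are split across independent tasks, and a counter kept inside one task is not visible to the others. The Python sketch below is a simplified illustration of that effect, not a description of how BDE or Hive behaves internally; the partition layout and names are assumptions.

```python
from typing import Dict, List


def number_rows(partition: List[Dict]) -> List[Dict]:
    """Assign sequence numbers using a counter local to this task."""
    return [{**row, "seq_id": i} for i, row in enumerate(partition, start=1)]


# Hypothetical split of one source across two independent tasks.
partitions = [
    [{"name": "a"}, {"name": "b"}],
    [{"name": "c"}, {"name": "d"}],
]

# Each task starts counting from 1 because no state is shared between them.
combined = [row for part in partitions for row in number_rows(part)]
seq_ids = [row["seq_id"] for row in combined]
has_duplicates = len(seq_ids) != len(set(seq_ids))
```

Here both tasks hand out 1 and 2, so the combined output contains repeated sequence values, which is the kind of problem a per-record counter runs into without shared state.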

The update strategy transformation will not work in Hive mode, simply because Hive does not allow updates. You can only insert records into a Hive database.

In this mode the data is read from the source into temporary Hive tables and transformed, and the result is also loaded into temporary Hive tables before being inserted into the final target, which can be an RDBMS database like Oracle or a Hive database. Hence the limitations of Hive also carry over to Hive mode in Informatica BDE.

However, if your volume of data is huge and you want to push all the processing to Hive, then Hive mode is the better option. There are workarounds to do type 2 kind of updates in Hive mode.
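The post does not spell out the workaround, so the sketch below shows only one common insert-only pattern as an assumption: instead of updating rows in place, rebuild the whole dimension as a new set of rows (closed-out old versions plus new versions) and write that out. The field names (`key`, `value`, `start_date`, `end_date`) and the high-date marker are hypothetical.

```python
from typing import Dict, List

HIGH_DATE = "9999-12-31"  # assumed marker meaning "this version is current"


def rebuild_type2(existing: List[Dict], incoming: List[Dict], load_date: str) -> List[Dict]:
    """Return a complete new copy of the dimension; no row is updated in place."""
    incoming_by_key = {row["key"]: row for row in incoming}
    keys_with_current_row = set()
    rebuilt = []
    for row in existing:
        is_current = row["end_date"] == HIGH_DATE
        new = incoming_by_key.get(row["key"])
        if is_current and new is not None and new["value"] != row["value"]:
            # Changed: write a closed-out copy of the old version and a new version.
            rebuilt.append({**row, "end_date": load_date})
            rebuilt.append({"key": row["key"], "value": new["value"],
                            "start_date": load_date, "end_date": HIGH_DATE})
            keys_with_current_row.add(row["key"])
        else:
            # Unchanged or already historical: carry the row over as it is.
            rebuilt.append(dict(row))
            if is_current:
                keys_with_current_row.add(row["key"])
    for key, new in incoming_by_key.items():
        if key not in keys_with_current_row:
            # No current version exists for this key: insert it as a new row.
            rebuilt.append({"key": key, "value": new["value"],
                            "start_date": load_date, "end_date": HIGH_DATE})
    return rebuilt


existing = [
    {"key": 1, "value": "Bangalore", "start_date": "2015-01-01", "end_date": HIGH_DATE},
    {"key": 2, "value": "Mysore", "start_date": "2015-01-01", "end_date": HIGH_DATE},
]
incoming = [
    {"key": 1, "value": "Chennai"},
    {"key": 2, "value": "Mysore"},
    {"key": 3, "value": "Pune"},
]
rebuilt = rebuild_type2(existing, incoming, load_date="2015-08-17")
```

In a Hive setting the same idea would be expressed as a query that writes the rebuilt rows to a new table or overwrites the target, so that only inserts are needed.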

Thursday 6 August 2015

What is Informatica big data edition (BDE) ?


Informatica big data edition (BDE) is a product from Informatica Corp that can be used as an ETL tool for working in a Hadoop environment along with traditional RDBMS tools. There are now a lot of ETL products in the market that make it easier to integrate with Hadoop, such as Talend and Pentaho, to name a few. Informatica is one of the leading ETL tool vendors, and its PowerCenter tool has been very well known as an ETL tool for many years. Traditionally this tool was used to extract, transform and load data into traditional databases such as Oracle, SQL Server and Netezza, to name a few.

With the advent of Hadoop for storing petabyte volumes of data, building ETL tools that can work with Hadoop became more important. It requires a lot of hand coding and knowledge to work directly with Hadoop and build MapReduce jobs. Hadoop tools such as Hive made this easier by letting you write SQL queries against a Hive database and have them converted to MapReduce jobs. Hence, a lot of companies started using Hive as a data warehouse tool, storing data in Hadoop just like in traditional databases and writing queries on Hive.

How do we now extract, transform and load the data in Hadoop? That's where Informatica BDE comes into the picture. It is a tool that you can use for ETL or ELT on Hadoop infrastructure. Informatica BDE can run in two modes: native mode and Hive mode.

In native mode it runs as normal PowerCenter, but in Hive mode you can push down the whole mapping logic to Hive and make it run on the Hadoop cluster, thereby using the parallelism provided by Hadoop. There are some limitations when running in Hive mode, but these come mostly from the limitations of Hive itself. For example, Hive does not allow updates in older versions.

With Informatica BDE you can do the following at a very high level :

a) Just like any ETL tool, you can extract, transform and load between traditional RDBMS or Hive/HDFS sources and targets.
b) Push the whole ETL logic to the Hadoop cluster and make use of the MapReduce framework. Basically it makes building Hadoop jobs easier.
c) It makes it easy to create connections to all the different sources and integrate data from those sources.
d) It makes it easier to ingest complex files such as JSON, XML, COBOL, Avro, Parquet, etc.

Informatica BDE uses the Informatica Developer interface to build mappings, create applications and deploy them. Anyone who has used IDQ before might be familiar with the Informatica Developer interface. Informatica BDE is available from Informatica version 9.6 onwards.