Friday, 18 September 2015

Creating a table in Parquet, Sequence, RCFILE and TextFile format and enabling compression in Hive

Creating a table in Parquet, Sequence, RCFILE and TextFile format in Hive.


My intention was to write an article of different file formats in Hive but happened to notice a article already posted. Check the link below for the difference in each file format in Hive.

http://www.inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore

Also check the article below:

http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v1/v1-0-1/Installing-and-Using-Impala/ciiu_file_formats.html?scroll=file_formats
 
We are using parquet these days apparently because of the compression options and performance with large tables in Hive. Parquet is optimized to work with large data sets and provide good performance when doing aggregation functions such as max or sum.

If you prefer to see the HDFS file in clear text format then you need to store the file in textfile format. This apparently takes more space than the binary formats supported by sequence and rcfile.


Below are examples for creating a table in Parquet ,Sequence, RCfile, TextFile format: Location is optional.

create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS parquet location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS sequencefile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS rcfile location 'hdfspathname';


create table IF NOT EXISTS mytable
(
id int,
age smallint,
name string,
joining_date timestamp,
location string,
roweffectivedate string, batchid int
) STORED AS textfile location 'hdfspathname';


The default storage format is apparently is a comma-delimited TEXTFILE.

Enabling compression in Hive or Impala table:


Apart from the storage format supported in Hive, the data can also be compressed using codecs such as LZO, Snappy, etc. To enable compression in Hive, we use set statements such as below. The below example is for Lzo compression. After the set statements you can use insert statements to insert data to a table and the data will be compressed as per the set statement.



set mapreduce.output.fileoutputformat.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

insert overwrite table tablename partition(dateid) select * from table2;

Check the below article for more information:



0

Monday, 14 September 2015

Why a career in Data warehousing ? Best jobs and salary in Data warehousing.

Why a career in Data warehousing?


Firstly, what is data warehousing? Data Warehousing (DW) is a term used in Information technology to refer to data repositories created to store historical data that is mostly used for reporting and analytical purposes. This could be an enterprise data warehouse that has data integrated from various sources, or it could be subject specific data repositories called the data marts.  This is just a small definition of data warehousing. 


These days data warehousing is gaining more popularity and attention thanks to Big data and other high profile web service technologies. Big data is a technology that allows huge volume of data to be processed at lower cost for analytical purpose. Big companies like Facebook, google, twitter, and yahoo have invested in these technologies and using them for processing the huge amount of data that they gather on daily basis. These days cloud services have become popular and are allowing huge data warehouses to be hosted on the cloud. Amazon web services, Microsoft azure to make a few are gaining popularity. With these kind of technologies making news and gaining popularity, data warehousing is becoming more exciting and a place where lot of IT jobs are getting created and bringing in talented IT professionals. 

Having worked in Data warehousing from last 10 years, I have seen increased interest in companies investing in data warehousing and in general increase in interest among ITfolks in data warehousing related technologies. I'm writing this article to introduce people to data warehousing and let them know about some of the jobs in data warehousing. 

Best jobs and salary in Data warehousing.

Enterprise Data Architect/ ETL Architect/ Solution Architect / Hadoop Architect - 
As the name implies these guys are responsible for architecture. This could be enterprise architecture which means they decide how all the pieces of the data warehouse work together, which products need to be used, interactions between different systems, etc. ETL architect could be more involved with data integration and creation of data warehouses. Hadoop architects are more concerned about the big data architecture in an organization. Architects also work with upper management and business teams to provide optimum data warehousing solutions. 

Estimated Salary: 80,000 to 140,000+

Team/Project/Account Managers -
As with any team , there will be managers in data warehousing teams too. These guys usually are expected to have some experience with data warehousing. Team managers are more involved in resourcing and budgets etc. Project managers are more involved in managing a projects till the end. Account managers are involved if the company is a vendor providing services to another organization.

Estimated Salary: 80,000 to 120,000+ 

Big data / Hadoop developers and Admins-
The big data developers are involved in developing Hadoop jobs which could be writing map reduce jobs, developing hive , pig , oozie scripts or working on complex apache spark code. The Hadoop admins are involved in managing the Hadoop infrastructure, fine tuning it for performance, user administrator, security, etc.

Estimated Salary: 60,000 to 100,000+ 

Data modellers -
In some of the organization the data modelling can be done by data architects or even ETL developers. Some organizations also have different people for doing data modelling. Data modelling refers to developing all the table structure and defining the relationships between them.

Estimated Salary: 60,000 to 100,000+ 

ETL/Data warehouse Developer -
ETL stands for Extract, transform, and load and ETL professionals create jobs to integrate data from multiple sources, transform and load it to relational or flat file targets. These days ETL developers also work ETL tools created for big data platforms. There are many big vendors in this area and Informatica, Pentaho are some of the big names.

Estimated Salary: 70,000 to 100,000+ 

Report Developer -
Report developers build various reports and dashboards for the business teams using tools developed by vendors such as Cognos, Business Objects, Microstrategy, etc. Report developers need to know SQL and proficiency with any of the report development tools.


Estimated Salary: 70,000 to 100,000+ 


Tester/ Quality Assurance -
Well every field in software development needs a tester to make sure the software is working as per requirement. Testers do all the testing, write test cases, etc. Data warehouse testing might need testers who know a bit of ETL tools, SQL, and some data warehousing concepts.

Estimate Salary - 60,000 - 100,000+

Database Administrators -
Database administrators are responsible for managing and administering the databases and servers that are used to store the data in data warehousing. There are many jobs or oracle , sql server, netezza, etc kind of databases.

Estimated Salary: 70,000 to 100,000+ 

Production and Infrastructure Support - 
Production support folks are responsible for managing and supporting all the jobs in production. Only these guys have access to production servers and make sure all the production jobs are running smoothly. They have to carry pagers and sometimes have to work overnight shifts.

Estimated Salary: 50,000 to 80,000+ 










0