With the advent of various technologies, data is growing exponentially. This fast-growing data, with its variety, huge volume, veracity, and value, is commonly categorized into three types: structured, semi-structured, and unstructured data. Structured data is data that fits neatly into databases; for instance, the transaction records of any online purchase you make can be stored in a database. Data that can only be partially stored in a database is referred to as semi-structured data; for instance, data in XML records can be stored partially in a database. Any other form of data that cannot be categorized as structured or semi-structured is referred to as unstructured data; for instance, data from social-networking websites or web logs, which cannot be analyzed or stored for processing in conventional databases. Unstructured data is what we generally mean by "Big Data", and the framework popularly used for processing it is Hadoop.
Apache Hadoop is an excellent framework for storing, processing, and analyzing large volumes of unstructured data, i.e., Big Data. Hadoop has become a buzzword because it deals with data that runs into terabytes, petabytes, and even zettabytes, through the various key components that make up the Hadoop ecosystem. It typically serves two purposes:
- Storing enormous amounts of data: This is achieved by partitioning the data among several nodes. The block size in the Hadoop Distributed File System is also much larger (64 or 128 MB) than in conventional file systems (typically 4 KB).
- Bringing computation to data: Traditionally, data is brought to the clients that compute on it. But the data stored in Hadoop is so large that it is more efficient to do the opposite: MapReduce jobs are shipped to, and run on, the nodes where the data is stored.
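The map-reduce model described above can be sketched in plain Python. This is only a conceptual simulation of the map, shuffle/sort, and reduce phases, not actual Hadoop code; the word-count task and all names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum the counts emitted for one word.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle/sort phase: group all intermediate pairs by key,
    # as Hadoop does between the map and reduce phases.
    pairs = sorted(p for line in lines for p in mapper(line))
    return dict(reducer(k, (c for _, c in grp))
                for k, grp in groupby(pairs, key=itemgetter(0)))

print(map_reduce(["big data big hadoop", "hadoop big"]))
```

In real Hadoop, each phase runs in parallel on the nodes holding the data blocks; here the three phases simply run in sequence in one process.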
The Hadoop ecosystem comprises many components, such as the MapReduce framework, HDFS (Hadoop Distributed File System), Hive, HBase, Pig, Flume, Sqoop, Oozie, ZooKeeper, Ambari, Avro, Mahout, HCatalog, Storm, Bigtop, Solr/Lucene, and Spark. Two of the major components among these are Hive and Pig.
- HIVE Hadoop
Hive originated at Facebook, on the data team led by Jeff Hammerbacher. The team realized that Facebook was receiving huge amounts of data on a daily basis and needed a mechanism that could store, mine, and help analyze that data. This idea of mining and analyzing huge amounts of data gave birth to Hive. It is Hive that has enabled Facebook to deal with tens of terabytes of data on a daily basis with ease.
About Hive: Hive provides a SQL-like interface to Hadoop; its SELECT, WHERE, GROUP BY, and ORDER BY clauses work much as they do in relational databases. Users give up some fine-grained control over query execution, relying instead on the Hive optimizer. Data stored in the HBase component of the Hadoop ecosystem can also be accessed through Hive. Hive is of great use to developers who are not well versed in the MapReduce framework: the data queries they write are transformed into MapReduce jobs in Hadoop.
Hive is considered a data-warehousing package constructed on top of Hadoop for analyzing huge amounts of data, developed mainly for users who are comfortable with SQL. The best thing about Hive is that it hides the complexity of Hadoop: users need not write MapReduce programs, so anyone unfamiliar with Java programming and the Hadoop APIs can also make the best use of Hive.
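A minimal HiveQL sketch of this idea follows; the `page_views` table and its columns are hypothetical names chosen for illustration, not part of any standard schema.

```sql
-- Hypothetical table of web-log records (all names are illustrative).
CREATE TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Count views per URL; Hive compiles this familiar SQL shape
-- into MapReduce jobs behind the scenes.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC;
```

The user writes only SQL-like statements; the map, shuffle, and reduce work is generated by Hive.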
Hence, Hive can be summarized in points as:
- A Data Warehouse Infrastructure
- The definer of a query language, HiveQL (similar to SQL)
- A provider of tools for easy extraction, transformation, and loading (ETL) of data
- A system that allows users to embed their own custom mappers and reducers
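Custom mappers and reducers are embedded in HiveQL through the TRANSFORM clause. In this sketch, `my_mapper.py` and the `page_views` table are hypothetical names used only for illustration.

```sql
-- Ship a user-supplied script to the cluster with the query.
ADD FILE my_mapper.py;

-- Stream each row through the script; the script reads tab-separated
-- rows on stdin and writes tab-separated rows on stdout.
SELECT TRANSFORM (user_id, url)
       USING 'python my_mapper.py'
       AS (user_id, category)
FROM page_views;
```

This lets arbitrary row-level logic live in an external script while the rest of the pipeline stays in HiveQL.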
Hive is popular for the following reasons:
- Hive provides users with strong, powerful statistical functions.
- Hive is SQL-like, so for any SQL developer the learning curve is almost negligible.
- Hive integrates with HBase, so data in HBase can be queried directly through Hive. Pig can also read HBase data, but only through its HBaseStorage() load function rather than direct integration.
- Hive Hadoop has gained popularity as it is supported by Hue.
- Hive has a broad user base, including CNET, Last.fm, Facebook, and Digg.
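For comparison, reading HBase data from Pig goes through the HBaseStorage load function mentioned above. A minimal sketch, in which the `users` table and `info:*` columns are hypothetical:

```pig
-- Read two columns from an HBase table into a Pig relation.
-- '-loadKey true' also emits the HBase row key as the first field.
raw = LOAD 'hbase://users'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'info:name info:age', '-loadKey true')
      AS (rowkey:bytearray, name:chararray, age:chararray);

DUMP raw;
```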
- PIG Hadoop
History: Pig was developed at Yahoo! in 2006 to provide an ad-hoc way of creating and executing MapReduce jobs on huge data sets. The main motive behind developing Pig was to cut down development time through its multi-query approach. Pig is a high-level data-flow system that offers a simple language platform, popularly known as Pig Latin, for manipulating data and queries. Pig is used by Yahoo! and other web companies to collect and process large data sets in the form of web crawls, click streams, and search logs. Pig also finds use in ad-hoc analysis and processing of information.
What makes Pig Hadoop popular?
- Pig follows a multi-query approach, cutting down on the number of times the data is scanned.
- Pig is easy to learn, read, and write if you are familiar with SQL.
- Pig provides users with a wide range of nested data types, such as maps, tuples, and bags, that are not present in MapReduce, along with major data operations such as ordering, filtering, and joins.
- The performance of Pig is on par with that of raw MapReduce.
- Pig has a wide user base: reportedly about 90% of Yahoo!'s MapReduce jobs and 80% of Twitter's are written in Pig, and companies such as Salesforce, LinkedIn, AOL, and Nokia also employ it.
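A small Pig Latin sketch ties these points together; the file path, schema, and relation names below are illustrative assumptions, not a real dataset.

```pig
-- Load a tab-separated click log (path and schema are illustrative).
clicks = LOAD 'clicks.tsv'
         AS (user:chararray, url:chararray, time:long);

-- FILTER, GROUP, and ORDER are built-in relational operators.
recent = FILTER clicks BY time > 1000000000L;
by_url = GROUP recent BY url;   -- yields a nested bag of rows per URL

-- 'group' is the grouping key; COUNT runs over each group's bag.
counts = FOREACH by_url GENERATE group AS url, COUNT(recent) AS n;
ranked = ORDER counts BY n DESC;

DUMP ranked;
```

Note the nested bag produced by GROUP, which is the kind of nested data type (bag of tuples) that raw MapReduce does not provide out of the box.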
Benefits of Pig and Hive: Pig and Hive have a similar goal: both are tools that ease the complexity of writing Java MapReduce programs. Instead of writing Java code to implement MapReduce, one can opt for Pig Latin or HiveQL to construct MapReduce jobs. The benefit of coding in Pig or Hive is far fewer lines of code, which reduces overall development and testing time. Pig Latin has many of the usual data-processing concepts that SQL has, such as filtering, selecting, grouping, and ordering, but its syntax is somewhat different from SQL.

The question most developers face is when to use Pig Latin and when to use HiveQL. Hive is commonly used at Facebook for analytical purposes, and Facebook promotes the Hive language. Yahoo!, which has one of the biggest Hadoop clusters in the world, is a big advocate of Pig Latin; its data engineers use Pig for data processing on their Hadoop clusters. If no standards are set at your organization, you may choose either. Data engineers, especially those with a procedural-language background, tend to have better control over data-flow (ETL) processes using Pig Latin, while data analysts with prior SQL experience find they can ramp up on Hadoop faster using Hive. If you really want to become a Hadoop expert, you should learn both Pig and Hive for the ultimate flexibility.
Differences between Pig and Hive: Depending on your purpose and type of data, you can choose between the Hive and Pig components based on the differences below:
1) Hive is used mainly by data analysts, whereas Pig is generally used by researchers and programmers.
2) Hive is used for completely structured data, whereas Pig is used for semi-structured data.
3) Hive has a declarative, SQL-like language (HiveQL), whereas Pig has a procedural data-flow language (Pig Latin).
4) Hive is mainly used for creating reports, whereas Pig is mainly used for programming.
5) Hive operates on the server side of a cluster, whereas Pig operates on the client side.
6) Both can be used for ETL, but Pig, with its procedural transformation pipeline, is generally the better fit as an ETL tool for big data.
7) Hive can start an optional Thrift-based server, so remote clients can submit queries directly to the Hive server for execution; Pig has no such feature.
8) Hive directly leverages SQL expertise and thus can be learned easily, whereas Pig Latin, although SQL-like, varies enough that it takes some time and effort to master.
9) Hive uses an exact variant of the SQL DDL: tables are defined beforehand and schema details are stored in a metastore database, whereas Pig has no dedicated metadata database; schemas and data types are defined in the script itself.
10) Hive has a provision for partitions, so you can process a subset of the data, for example by date or in alphabetical order, whereas Pig has no built-in notion of partitions, though a similar effect can be achieved through filters.
11) Pig supports Avro natively (through AvroStorage), whereas Hive historically did not.
12) Pig is easier to install than Hive, as it is driven entirely through shell interaction.
13) Pig can show users sample data for each step of a script through its ILLUSTRATE operator, a feature not available in Hive.
14) Hive has smart built-in features for accessing raw data, whereas with Pig Latin scripts raw-data access is not guaranteed to be as fast as with HiveQL.
15) With both Hive and Pig you can join, order, and sort data dynamically in an aggregated manner; however, Pig additionally provides the COGROUP operator, which can be used to perform outer joins.
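Point 10 above is worth a concrete sketch. In this hypothetical HiveQL example (the `logs` table and `dt` column are illustrative), each partition is stored in its own directory, so a query that filters on the partition column scans only that subset of the data.

```sql
-- Partition the table by date; each dt value gets its own directory.
CREATE TABLE logs (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (dt STRING);

-- This query reads only the '2014-01-15' partition,
-- not the whole table.
SELECT COUNT(*)
FROM logs
WHERE dt = '2014-01-15';
```

In Pig, the rough equivalent would be a FILTER on a date field, which still requires scanning the input unless the underlying storage layout happens to prune it.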
In the real world there is no battle between Hive and Pig: each has its own strengths and weaknesses when processing enormous amounts of data, and any conflict is just the initial ambiguity of deciding which tool suits the need. HiveQL suits the specific demands of analytics, while Pig supports large-scale data-pipeline operations. Pig was developed as an abstraction to avoid the complicated Java syntax required for MapReduce programming; HiveQL, on the other hand, is based on SQL, which makes it easier to learn for those who already know SQL. Pig's Avro support also makes serialization faster. When it really boils down to deciding between Pig and Hive, the suitability of each component for the given business logic must be considered, and the decision made on that basis.
Conclusion: Having understood the differences between Pig and Hive, we can say that both components help achieve the same goals: Pig takes a scripting approach, while Hive comes naturally to database developers. When it comes to access options, Hive is said to have more features than Pig. Both projects reportedly have about the same number of committers, and in the near future we are likely to see great advancements in both on the development front.
Asst. Prof (Dept. Of IT)