Apache Hive is an open-source data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis of large datasets stored in Hadoop files. It was originally developed by Facebook, and it provides:
- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase
- Query execution via MapReduce
It supports queries expressed in a language called HiveQL, an SQL-like language that Hive automatically translates into MapReduce jobs executed on Hadoop. In addition, HiveQL allows custom MapReduce scripts to be plugged into queries. Hive also handles data serialization/deserialization and increases flexibility in schema design through a system catalog called the Hive Metastore.
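As a quick illustration, here is a minimal HiveQL sketch (the `page_views` table and the script `my_script.py` are hypothetical names used only for this example) showing an ordinary SQL-like aggregation, plus a custom script plugged into a query via `TRANSFORM`:

```sql
-- Hypothetical table of web log entries (names are illustrative only)
CREATE TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time STRING
);

-- An ordinary SQL-like query; Hive compiles this GROUP BY
-- into a MapReduce job under the hood
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;

-- Plugging a custom MapReduce script into a query
-- (my_script.py is a hypothetical user-supplied script)
ADD FILE my_script.py;
SELECT TRANSFORM (user_id, url)
       USING 'python my_script.py'
       AS (user_id, url_category)
FROM page_views;
```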
According to the Apache Hive wiki, “Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs).”
Hive supports plain text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs), and RCFiles (Record Columnar Files, which store a table's columns in a columnar layout rather than row by row).
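The storage format is chosen per table with the `STORED AS` clause. A minimal sketch (the table and column names are again hypothetical):

```sql
-- Store this table as an RCFile; TEXTFILE or SEQUENCEFILE
-- could be substituted in the STORED AS clause
CREATE TABLE page_views_rc (
  user_id STRING,
  url     STRING
)
STORED AS RCFILE;
```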
There is also a SQL Developer/Toad-style tool named HiveDeveloper, developed by Stratapps Inc., which lets users visualize their data stored in Hadoop as table views and perform many other operations.
In my next blog post I will explain how to set up Hive on top of a Hadoop cluster. Before that, please check my previous blog post on how to set up Hadoop, so that you will be ready to configure Hive on top of it.