Apache Pig is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce. Pig is popular for performing query operations in Hadoop using the “Pig Latin” language, a layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications. Its popularity comes from its simple interface and its support for complex operations such as joins and filters. Pig has the following key properties:
- Ease of programming: Pig programs are easy to write and can accomplish the same large tasks as hand-written MapReduce programs.
- Optimization: The system optimizes the execution of Pig jobs automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility: Pig users can write their own user-defined functions (UDFs) for special-purpose processing in Java, Python, or JavaScript.
Objective
The objective of this tutorial is to set up Pig and run Pig scripts.
Prerequisites
The following is the prerequisite for setting up Pig and running Pig scripts.
- You should have the latest stable build of Hadoop up and running. To install Hadoop, please check my previous blog article on Hadoop Setup.
Setting up Pig
Procedure
1. Download a stable version of Pig from the Apache download mirrors. This tutorial uses pig-0.11.1; this release works with Hadoop 0.20.X, 1.X, 0.23.X and 2.X.
wget http://apache.mirrors.hoobly.com/pig/pig-0.11.1/pig-0.11.1.tar.gz
2. Copy the downloaded Pig archive into the /usr/local/pig directory, creating the directory first if it does not exist.
sudo mkdir -p /usr/local/pig
sudo cp pig-0.11.1.tar.gz /usr/local/pig/
3. Change to the /usr/local/pig directory with this command:
cd /usr/local/pig
4. Unpack the compressed Pig archive in the /usr/local/pig directory:
sudo tar xvzf pig-0.11.1.tar.gz
5. Set PIG_HOME in $HOME/.bashrc so that it is set every time you log in. Add the following lines to it:
export PIG_HOME=<path_to_pig_home_directory>
For example:
export PIG_HOME='/usr/local/pig/pig-0.11.1'
export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$JAVA_HOME/bin:$PATH
6. Set the JAVA_HOME environment variable to point to the Java installation directory, which Pig uses internally:
export JAVA_HOME=<Java_installation_directory>
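Once the variables are in place, it can help to confirm that they point at real files before starting Pig. The following is a minimal sketch, not part of Pig itself: check_pig_env is a hypothetical helper name, and it assumes the standard layout in which the launchers live at $PIG_HOME/bin/pig and $JAVA_HOME/bin/java.

```shell
#!/bin/sh
# check_pig_env is a hypothetical helper, not something Pig ships.
# It checks that PIG_HOME and JAVA_HOME are set and that each one
# contains the launcher the setup steps rely on.
check_pig_env() {
  if [ -z "${PIG_HOME:-}" ] || [ ! -x "$PIG_HOME/bin/pig" ]; then
    echo "PIG_HOME is unset or has no bin/pig launcher"
    return 1
  fi
  if [ -z "${JAVA_HOME:-}" ] || [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME is unset or has no bin/java"
    return 1
  fi
  echo "OK: pig at $PIG_HOME/bin/pig, java at $JAVA_HOME/bin/java"
}
```

After opening a new shell (or sourcing $HOME/.bashrc), running check_pig_env followed by pig -version should report the installed release if everything is wired up.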
Execution Modes
Pig has two modes of execution – local mode and MapReduce mode.
Local Mode
Local mode is usually used to verify and debug Pig queries and scripts on smaller datasets that a single machine can handle. It runs in a single JVM and accesses the local filesystem.
To run in local mode, run the following command:
$ pig -x local
grunt>
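Local mode can also be tried without the interactive shell. The sketch below writes a tiny data file and a three-line Pig Latin script to a temporary directory, then runs the script with pig -x local if pig is on the PATH. The file names, the sample records, and the relation names (people, adults) are all made up for illustration.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing else is touched.
workdir=$(mktemp -d)

# Hypothetical sample data: one "name,age" record per line.
cat > "$workdir/people.txt" <<'EOF'
alice,30
bob,15
carol,42
EOF

# A small Pig Latin script: load the file, keep adults, print them.
# The heredoc is unquoted so the shell fills in the absolute path.
cat > "$workdir/adults.pig" <<EOF
people = LOAD '$workdir/people.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER people BY age >= 18;
DUMP adults;
EOF

# Run in local mode only if pig is actually installed.
if command -v pig >/dev/null 2>&1; then
  pig -x local "$workdir/adults.pig"
else
  echo "pig not found on PATH; script written to $workdir/adults.pig"
fi
```

Because local mode reads from the local filesystem, the script can refer to the data file by an ordinary absolute path, with no HDFS upload needed.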
MapReduce Mode
This is the default mode. Pig translates the queries into MapReduce jobs, which requires access to a Hadoop cluster.
$ pig
2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main – Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main – Logging error messages to: /home/hduser/pig_1382985584762.log
2013-10-28 11:39:44,797 [main] INFO org.apache.pig.impl.util.Utils – Default bootup file /home/hduser/.pigbootup not found
2013-10-28 11:39:45,094 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://Hadoopmaster:54310
2013-10-28 11:39:45,592 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to map-reduce job tracker at: Hadoopmaster:54311
grunt>
You can see the log reports from Pig stating the filesystem and JobTracker it connected to. Grunt is an interactive shell for your Pig queries. You can run Pig programs in three ways: via a script, via Grunt, or by embedding the script in Java code. Running in the interactive shell is shown in the Problem section. To run a batch of Pig statements, it is recommended to place them in a single file with a .pig extension and execute them in batch mode; I will explain this in depth in coming posts.
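As a rough illustration of batch mode, a small wrapper can pick the execution mode and pass a .pig file to the pig launcher. This is only a sketch: run_pig_script and the PIG_BIN override are hypothetical names introduced here, while -x local and -x mapreduce are the two execution modes described above.

```shell
#!/bin/sh
# run_pig_script is a hypothetical wrapper, not part of Pig.
# Usage: run_pig_script <local|mapreduce> <file.pig>
# PIG_BIN is an assumed override so the launcher can be swapped out.
run_pig_script() {
  mode=$1
  script=$2
  case "$mode" in
    local|mapreduce) ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
  case "$script" in
    *.pig) ;;
    *) echo "expected a .pig file: $script" >&2; return 2 ;;
  esac
  [ -f "$script" ] || { echo "no such file: $script" >&2; return 2; }
  # -x selects the execution mode; passing a file runs it in batch mode.
  "${PIG_BIN:-pig}" -x "$mode" "$script"
}
```

For example, run_pig_script local myscript.pig would test a script on one machine, and run_pig_script mapreduce myscript.pig would submit the same file to the cluster.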