Apache Pig is a high-level procedural language platform developed to simplify querying large data sets in Apache Hadoop and MapReduce. Pig is popular for performing query operations in Hadoop using the “Pig Latin” language, a layer that enables SQL-like queries to be performed on distributed datasets within Hadoop applications. Thanks to its simple interface and its support for complex operations such as joins and filters, Pig has the following key properties:
- Ease of programming: Pig programs are easy to write, yet they can accomplish the same large tasks as hand-written MapReduce programs.
- Optimization: The system optimizes a Pig job’s execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility: Users can write their own user-defined functions (UDFs) for special-purpose processing, in languages such as Java, Python, and JavaScript.
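As a quick taste of what Pig Latin looks like, here is a small illustrative script (the file names, fields, and relation names are hypothetical) that loads two datasets, filters one, joins them, and stores the result:

```pig
-- Hypothetical comma-separated input files, for illustration only.
users  = LOAD 'users.txt'  USING PigStorage(',') AS (id:int, name:chararray, age:int);
orders = LOAD 'orders.txt' USING PigStorage(',') AS (order_id:int, user_id:int, amount:double);

-- FILTER keeps only the tuples matching a predicate.
adults = FILTER users BY age >= 18;

-- JOIN combines the two relations on a shared key.
joined = JOIN adults BY id, orders BY user_id;

-- STORE writes the result out (to HDFS when running on a cluster).
STORE joined INTO 'adult_orders' USING PigStorage(',');
```

Each statement defines a new relation from earlier ones, which is what makes the script read as a data flow rather than a single monolithic query.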
How it works:
Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System (HDFS). The language for the platform is called Pig Latin, which abstracts away the Java MapReduce idiom into a form similar to SQL. Pig Latin is a data flow language: you write a flow that describes how your data will be transformed. Since Pig Latin scripts can form graphs, it is possible to build complex data flows involving multiple inputs, transforms, and outputs. Users can extend Pig Latin by writing their own user-defined functions in Java, Python, Ruby, or other scripting languages.
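To illustrate extensibility, below is a minimal sketch of a Python UDF for Pig. When Pig runs such a file (via Jython), the `outputSchema` decorator normally comes from Pig's bundled `pig_util` module; a stand-in is defined here so the sketch is self-contained, and the function name `to_upper` is just an example.

```python
# Stand-in for Pig's pig_util.outputSchema decorator, so this file
# can also run outside of Pig; under Pig/Jython you would instead use:
#   from pig_util import outputSchema
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

@outputSchema("upper_name:chararray")
def to_upper(name):
    """Return the input string upper-cased; NULL (None) passes through."""
    if name is None:
        return None
    return name.upper()
```

In a Pig Latin script you would register and call it along these lines (names assumed): `REGISTER 'udfs.py' USING jython AS myfuncs;` and then `FOREACH users GENERATE myfuncs.to_upper(name);`.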
We can run Pig in two modes:
Local Mode
Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets that a single machine can handle. Pig runs in a single JVM and accesses the local filesystem.
MapReduce Mode
This is the default mode. Pig translates the queries into MapReduce jobs, which requires access to a Hadoop cluster.
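The two modes correspond to the `-x` (exectype) flag of the `pig` command; a sketch, assuming a script named `script.pig`:

```shell
# Local mode: single JVM, local filesystem -- good for debugging small data.
pig -x local script.pig

# MapReduce mode (the default): compiles the script into MapReduce jobs
# and submits them to the configured Hadoop cluster.
pig -x mapreduce script.pig
```

Running `pig` with no script argument starts the interactive Grunt shell in the chosen mode.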
We will discuss more about Pig, including setting up Pig with Hadoop and running Pig Latin scripts in local and MapReduce modes, in upcoming posts.