What Is Pig?
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
Pig is an Apache open source project. This means users are free to download it as source or binary, use it for themselves, contribute to it, and—under the terms of the Apache License—use it in their products and change it as they see fit.
Pig in Hadoop
According to Google:
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to that of SQL for RDBMS systems.
What are Pig and Pig Latin?
Pig was initially developed at Yahoo! to allow people using Apache Hadoop to focus more on analyzing large data sets and spend less time having to write mapper and reducer programs. Like actual pigs, who eat almost anything, the Pig programming language is designed to handle any kind of data—hence the name!
Pig is made up of two components: the first is the language itself, which is called PigLatin (yes, people naming various Hadoop projects do tend to have a sense of humor associated with their naming conventions), and the second is a runtime environment where PigLatin programs are executed. Think of the relationship between a Java Virtual Machine (JVM) and a Java application. In this section, we’ll just refer to the whole entity as Pig.
Source: IBM
What is the difference between pig and SQL?
Pig Latin is a procedural counterpart to SQL. Pig certainly has similarities to SQL, but it has more differences. SQL is a query language that lets users ask questions in query form: it describes the answer that is wanted, but not how to compute it. If a user wants to perform multiple operations on tables, they must write multiple queries and store intermediate results in temporary tables. SQL supports subqueries, but many SQL users find subqueries confusing and difficult to form properly, because using them creates an inside-out design where the first step in the data pipeline is the innermost query. Pig is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.
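As a sketch of this difference, consider finding the most-visited pages by joining users to page views. The file names and field names below are hypothetical; where SQL would nest subqueries inside-out, Pig Latin writes each step of the pipeline top to bottom:

```pig
-- hypothetical inputs: a users file and a page-views file
users   = load 'users' as (name, age);
pages   = load 'pages' as (user, url);
-- each step names the previous step directly; no temporary tables
joined  = join users by name, pages by user;
grouped = group joined by url;
counted = foreach grouped generate group as url, COUNT(joined) as clicks;
sorted  = order counted by clicks desc;
top5    = limit sorted 5;
store top5 into 'top5sites';
```

Each intermediate alias (joined, grouped, ...) plays the role a temporary table or subquery would play in SQL, but the pipeline reads in execution order.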
What are the main differences between Hive, Pig, and SQL?
HIVE:
- Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and analysis of large datasets stored in Hadoop-compatible file systems.
- Hive provides a mechanism to query the data using an SQL-like language called HiveQL (HQL).
- Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop.
- No support for UPDATE or DELETE.
- No support for inserting single rows.
- Limited number of built-in functions.
- Not all standard SQL is supported.
PIG:
- An abstraction over the complexity of MapReduce programming, the Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets.
- Its compiler translates Pig Latin into sequences of MapReduce programs.
Hive scores over Pig in partitions, server support, web interface, and JDBC/ODBC support.
Some differences:
1) Hive is best for structured data; Pig is best for semi-structured data.
2) Hive is used mainly for reporting; Pig is used mainly for programming.
3) Hive uses a declarative, SQL-like language; Pig uses a procedural language.
4) Hive supports partitions; Pig does not.
5) Hive can start an optional Thrift-based server; Pig cannot.
6) Hive defines tables beforehand (schema) and stores the schema information in a database; Pig has no dedicated metadata database.
7) Hive does not support Avro, but Pig does.
8) Pig also supports the COGROUP feature for performing outer joins, which Hive does not. Both Hive and Pig can join, order, and sort dynamically.
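A brief sketch of COGROUP (the file and field names below are hypothetical): cogroup gathers the records from each input into separate bags per key, which is the building block for outer-join-style processing:

```pig
-- two hypothetical inputs keyed by stock symbol
daily  = load 'daily' as (exchanges, stocks);
weekly = load 'week' as (exchanges, stocks);
-- for every stocks value, collect a bag of matching daily records
-- and a bag of matching weekly records; a key present in only one
-- input yields an empty bag for the other, as in an outer join
grouped = cogroup daily by stocks, weekly by stocks;
dump grouped;
```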
How does Pig differ from MapReduce?
In MapReduce, the group-by operation is performed on the reducer side, while filtering and projection can be implemented in the map phase. Pig Latin provides standard operations of this kind, such as order by, filter, and group by. A Pig script makes the data flow explicit, so it can be analyzed and errors can be found earlier. Pig Latin is also much cheaper to write and maintain than the equivalent Java code for MapReduce.
What is Pig useful for?
Pig is useful in three categories:
1) ETL data pipelines
2) Research on raw data
3) Iterative processing
The most common use case for Pig is the data pipeline. For example, web-based companies collect weblogs, and before storing the data in a warehouse they perform operations on it such as cleaning and aggregation, i.e., transformations on the data.
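A minimal sketch of such a weblog pipeline (the file names, field names, and the myudfs.Normalize UDF are all hypothetical):

```pig
-- hypothetical UDF jar providing myudfs.Normalize
register myudfs.jar;
-- load raw weblogs, clean them, and aggregate before warehousing
raw     = load 'weblogs' as (user, url, time);
-- cleaning: drop obviously bad records
valid   = filter raw by user is not null;
-- transformation: normalize URLs with the hypothetical UDF
cleaned = foreach valid generate user, myudfs.Normalize(url) as url, time;
-- aggregation: hits per URL
grouped = group cleaned by url;
hits    = foreach grouped generate group as url, COUNT(cleaned) as hits;
store hits into 'warehouse/url_hits';
```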
What are the scalar data types in pig?
The scalar data types are:
- int: 4 bytes
- long: 8 bytes
- float: 4 bytes
- double: 8 bytes
- chararray
- bytearray
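These types can be declared explicitly in a load schema; a short sketch with hypothetical file and field names:

```pig
-- declaring scalar types in the schema; untyped fields
-- would otherwise default to bytearray
trades = load 'daily' as (symbol:chararray, price:double,
                          volume:long, exchange_id:int);
```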
What are the complex data types in pig?
map:
A map in Pig is a mapping from chararray keys to data elements, where each element can be of any Pig data type, including a complex type.
Example map: ['city'#'hyd', 'pin'#500086]
In this example, city and pin are the keys, mapping to the values hyd and 500086.
tuple:
A tuple has a fixed length and is an ordered collection of fields; each field can hold any data type.
Example: (hyd, 500086), a tuple containing two fields.
bag:
A bag is an unordered collection of tuples. Bag constants are constructed using braces, with the tuples in the bag separated by commas. For example: {('hyd', 500086), ('chennai', 510071), ('bombay', 500185)}
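A sketch combining the three complex types in a single schema (the file and field names are hypothetical):

```pig
-- a schema using tuple, bag, and map types
people = load 'people' as (name:chararray,
                           address:tuple(city:chararray, pin:int),
                           phones:bag{t:(number:chararray)},
                           extras:map[]);
-- tuple fields are reached with dot notation,
-- map values with the # operator
cities = foreach people generate name, address.city, extras#'email';
```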
What are the relational operators in Pig Latin?
They are:
a) foreach
b) order by
c) filter
d) group
e) distinct
f) join
g) limit
Why should we use ‘filters’ in pig scripts?
Filters are similar to the WHERE clause in SQL. A filter contains a predicate: if the predicate evaluates to true for a given record, the record is passed down the pipeline; otherwise, it is not. Predicates can contain comparison operators such as ==, >=, <=, and !=; of these, == and != can also be applied to maps and tuples.
A = load 'inputs' as (name, address);
B = filter A by name matches 'CM.*';
Why should we use ‘group’ keyword in pig scripts?
The group statement collects together records that share the same key. In SQL, the group by clause creates a group that must feed directly into one or more aggregate functions. In Pig Latin there is no such direct connection between group and aggregate functions.
input2 = load 'daily' as (exchanges, stocks);
grpds = group input2 by stocks;
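To illustrate that aggregation is a separate step, a grouped relation can later be fed to an aggregate inside a foreach (a sketch reusing the hypothetical daily/stocks fields):

```pig
input2 = load 'daily' as (exchanges, stocks);
grpds  = group input2 by stocks;
-- the aggregate is applied in a separate foreach step,
-- not attached to the group statement itself
cnt    = foreach grpds generate group, COUNT(input2);
dump cnt;
```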
Why should we use the ‘order by’ keyword in pig scripts?
The order statement sorts your data, producing a total order of your output data. The syntax of order is similar to that of group: you indicate a key or set of keys by which you wish to order your data.
input2 = load 'daily' as (exchanges, stocks);
srtd = order input2 by exchanges;
Why should we use ‘distinct’ keyword in pig scripts?
The distinct statement is very simple. It removes duplicate records. It works only on entire records, not on individual fields:
input2 = load 'daily' as (exchanges, stocks);
uniq = distinct input2;
Is it possible to join multiple fields in pig scripts?
Yes. The join statement selects records from one input and joins them with records from another input. This is done by indicating keys for each input; when those keys are equal, the two rows are joined.
input2 = load 'daily' as (exchanges, stocks);
input3 = load 'week' as (exchanges, stocks);
grpds = join input2 by stocks, input3 by stocks;
We can also join on multiple keys, for example:
input2 = load 'daily' as (exchanges, stocks);
input3 = load 'week' as (exchanges, stocks);
grpds = join input2 by (exchanges, stocks), input3 by (exchanges, stocks);
Is it possible to display a limited number of results?
Yes. Sometimes you want to see only a limited number of results; the ‘limit’ statement allows you to do this:
input2 = load 'daily' as (exchanges, stocks);
first10 = limit input2 10;