Data transformation in Pig uses the following major operators:
  • Sorting
  • Grouping
  • Joining
  • Projecting
  • Filtering
foreach : Applies an operation to every record of a relation. For example, if a relation has 60 records, 'foreach' processes each of the 60 records.

Syntax : 
alias = FOREACH alias GENERATE expression [, expression ...];
alias = FOREACH alias { nested operations; GENERATE expression [, expression ...]; };
myData = foreach categories generate *; -- Selects all the columns from categories and generates a new relation, myData

alias : Name of the resulting relation; a relation is an outer bag.

Example 1 : Projection using foreach (HandsOn) : 
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(','); 
myData = foreach categories generate *;
dump myData;
(1,2,Football)
(2,2,Soccer)
myDataSelected = foreach categories generate $0,$1; --Selecting only first 2 columns
dump myDataSelected;
(1,2)
(2,2)
(3,2)

Example 2 : Projection using schema (HandsOn)
categories2 = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int, subId:int, catName:chararray); 
selectedCat = foreach categories2 generate subId,catName; --Selecting only two columns, using column name
DUMP selectedCat;
(2,Football)
(2,Soccer)
(2,Baseball & Softball)
subtract = foreach categories2 generate id-subId; --This is just to show that you can use expressions as well
dump subtract;
(-1)
(0)
(1)

So you can refer to columns in a relation by:
  • Position: $0 is the first column, $1 the second, and so on (use positions when no schema is defined)
  • Name, as defined by the schema: id, subId, etc. (case sensitive)
  • The * wildcard, which selects all the columns
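The three styles are interchangeable when a schema is defined. A minimal sketch using the categories2 relation from Example 2 (the relation names byPosition, byName, and allCols are illustrative):

```
byPosition = foreach categories2 generate $0, $2;      -- positional references
byName     = foreach categories2 generate id, catName; -- schema names
allCols    = foreach categories2 generate *;           -- every column
```

All three produce the same columns here; positional references are the only option when the LOAD statement gave no schema.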
Example 3 : Another way of selecting columns, using two dots .. (HandsOn)
selectedCat3 = foreach categories2 generate id..catName; --Select all the columns between id and catName
DUMP selectedCat3;
selectedCat4 = foreach categories2 generate subId..; --Select subId and all the columns that come after it
DUMP selectedCat4;
selectedCat5 = foreach categories2 generate ..catName; --Select catName and all the columns that come before it (inclusive)
DUMP selectedCat5;

Example 4 : Loading Complex Datatypes (HandsOn)

Step 1 : Download complex data
Step 2 : Below is the schema for the data.
member = tuple(member_id:int
		, member_email:chararray
		, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray)
		, course_list:bag{course:tuple(course_name:chararray)}
		, technology:map[chararray]
	    )

Step 3 : Upload data in hdfs first at /user/cloudera/Training/pig/complexData.txt
Step 4 : Now write Pig script using above schema to load the data and create a relation out of this.

hadoopexamMember = LOAD '/user/cloudera/Training/pig/complexData.txt' using PigStorage('|') 
AS (member_id:int , member_email:chararray , name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray) , course_list:bag{course: tuple(course_name:chararray)} , technology:map[chararray] ) ;
DESCRIBE hadoopexamMember ;
DUMP hadoopexamMember ;

Step 5 : Now find all the courses subscribed to by each member, along with the member's primary programming skill. (The idea: by analyzing which courses members with a given programming skill are interested in, we can recommend the same courses to other members with that skill.)
primaryProgramming = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, course_list, technology#'programming1' ;
DESCRIBE primaryProgramming ;
DUMP primaryProgramming ;

Step 6 : Now see some flatten results.
getFirstCourseSubscribed = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, FLATTEN(course_list), technology#'programming1' ;
DESCRIBE getFirstCourseSubscribed ;
DUMP getFirstCourseSubscribed ;

Step 7 : Get number of courses subscribed by each user
memberCourseCount= FOREACH hadoopexamMember GENERATE  member_email, COUNT(course_list);
dump memberCourseCount;

Note : We will see flatten operator in coming modules.

Example 5 : Loading compressed files (HandsOn)

Step 1 : Download compressed file.

Step 2 : Upload in hdfs using Hue

Step 3 : Load compressed data using Pig
hadoopexamCompress = Load '/user/cloudera/Training/pig/complexData.txt.gz'  using PigStorage('|')   AS (member_id:int
, member_email:chararray , name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray) , course_list:bag{course: tuple(course_name:chararray)} , technology:map[chararray] ) ;
DESCRIBE hadoopexamCompress ;
DUMP hadoopexamCompress ;

Example 6 : Store relation as compressed files (HandsOn)
STORE hadoopexamMember INTO '/user/cloudera/Training/pig/output/complexData.txt.bz2' using PigStorage('|') ;

Now check in HDFS, using Hue, that the following directory has been created:
/user/cloudera/Training/pig/output/complexData.txt.bz2

Note : Bag does not guarantee that their tuples will always be in order. 
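If a deterministic order inside a bag is needed, it can be imposed with ORDER inside a nested FOREACH. A minimal sketch, assuming a hypothetical relation sales with fields (store:chararray, amount:int) and a hypothetical input file sales.txt:

```
sales   = LOAD 'sales.txt' USING PigStorage(',') AS (store:chararray, amount:int);
grouped = GROUP sales BY store;
ordered = FOREACH grouped {
    sortedBag = ORDER sales BY amount DESC; -- sorts the tuples inside each group's bag
    GENERATE group, sortedBag;
};
```

Without such an explicit ORDER, no ordering of a bag's tuples should be relied on.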

Example 7 : Wrong and right ways of solving a problem. (HandsOn)
Step 1 : Download file

Step 2 : Upload in hdfs using Hue UI at /user/cloudera/HEPig/handson11/module11Score.txt

Step 3 : Now write the following Pig script (wrong approach)
memberScore = load '/user/cloudera/HEPig/handson11/module11Score.txt' USING PigStorage('|') as (email:chararray, spend1:int, spend2:int);
DESCRIBE memberScore;
memberScore: {email: chararray,spend1: int,spend2: int}
groupedScore = group memberScore by email; -- produces bag memberScore containing all the records for a given value of email
DESCRIBE groupedScore;
groupedScore: {group: chararray,memberScore: {(email: chararray,spend1: int,spend2: int)}}
-- The statement below will throw an error: memberScore is a bag that can hold multiple tuples per group, so memberScore.spend1 and memberScore.spend2 are bags, and Pig does not allow bag + bag arithmetic
totalScore = foreach groupedScore generate SUM(memberScore.spend1 + memberScore.spend2); 

Correct steps : compute the row-level sum first, then group and SUM it.
memberExpense1 = foreach memberScore generate email, spend1 + spend2 as rowLevelExpense;
describe memberExpense1 ;
memberExpense1: {email: chararray,rowLevelExpense: int}
groupedData = group memberExpense1 by email;
describe groupedData ;
groupedData: {group: chararray,memberExpense1: {(email: chararray,rowLevelExpense: int)}}
totalExpense = foreach groupedData generate SUM(memberExpense1.rowLevelExpense);
describe totalExpense ;
totalExpense: {long}
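Alternatively, since SUM operates on a single bag column, each column's bag can be summed separately and the two results added; long + long is allowed, unlike bag + bag. A sketch (the relation name totalScore2 is illustrative, and the result matches the row-level approach when there are no null values):

```
totalScore2 = foreach groupedScore generate group, SUM(memberScore.spend1) + SUM(memberScore.spend2);
```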

Example 8 : Nested FOREACH statement to solve the same problem. (HandsOn)
memberScore = load '/user/cloudera/HEPig/handson11/module11Score.txt' USING PigStorage('|') as (email:chararray, spend1:int, spend2:int);
DESCRIBE memberScore;
memberScore: {email: chararray,spend1: int,spend2: int}
groupedScore = group memberScore by email; -- produces bag memberScore containing all the records for a given value of email
DESCRIBE groupedScore;
groupedScore: {group: chararray,memberScore: {(email: chararray,spend1: int,spend2: int)}}
result = foreach groupedScore {
  individualScore = foreach memberScore generate (spend1 + spend2); --iterates only over the records of the memberScore bag
  generate group, SUM(individualScore);
};
dump result;
describe result;
result: {group: chararray,long}



Pig Filter Function
I have a bunch of strings that have various prefixes, including "unknown:". I'd really like to filter out all the strings starting with "unknown:" in my Pig script, but it doesn't appear to work.

simpleFilter = FILTER records BY NOT(mystr MATCHES '^unknown');

I've tried a few other permutations of the regex, but it appears that MATCHES just doesn't work well with NOT. Am I missing something?

It's because the MATCHES operator behaves exactly like Java's String#matches, i.e. it tries to match the entire string and not just part of it (the prefix in your case). Just update your regular expression to match the entire string with your specified prefix, like so:

simpleFilter = FILTER records BY NOT(mystr MATCHES '^unknown.*');
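In other words, MATCHES succeeds only when the regex covers the whole string, so the '.*' tail is what makes the prefix test work. A sketch, reusing the records relation and mystr field from the question:

```
-- 'unknown:' alone matches only the exact string "unknown:";
-- 'unknown:.*' matches any string that starts with "unknown:"
keepKnown = FILTER records BY NOT (mystr MATCHES 'unknown:.*');
```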

-----------
I have a Pig job in which I need to filter the data by finding a word in it.

Here is the snippet

A = LOAD '/home/user/filename' USING PigStorage(',');
B = FOREACH A GENERATE $27,$38;
C = FILTER B BY ( $1 ==  '*Word*');
STORE C INTO '/home/user/out1' USING PigStorage();
The error is in the third line, while computing C. I have also tried using

       C = FILTER B BY $1 MATCHES '*WORD*'  
Also

      C = FILTER B BY $1 MATCHES '\\w+WORD\\w+'  
Can you please correct and help me out.

Thanks

MATCHES uses regular expressions. You should do ... MATCHES '.*WORD.*' instead.
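For a case-insensitive search, Java's inline regex flag (?i) can be added. A sketch based on the snippet above (the relation name D is illustrative):

```
C = FILTER B BY ($1 MATCHES '.*WORD.*');     -- whole-string regex, case sensitive
D = FILTER B BY ($1 MATCHES '(?i).*word.*'); -- (?i) makes the match case-insensitive
```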
------------------------------------------
------------------------------------------
Does Pig support an IN clause?

filtered = FILTER bba BY reason not in ('a','b','c','d');
Or should I split it up into multiple ORs?

Thanks!

No, Pig doesn't support an IN clause. I had a similar situation, though you can use the AND operator and the FILTER keyword as a workaround, like:

A= LOAD 'source.txt' AS (user:chararray, age:chararray);

B= FILTER A BY NOT ($1 matches 'tapan') AND NOT ($1 matches 'superman');

However, if the number of values to filter on is huge, you can instead create a relation that contains all these keywords and do a join, keeping only the rows where a match occurs. Hope this helps.

You can get by using AND/OR/NOT. (Note: the IN operator was added in Pig 0.12, so newer versions do support it.)
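The join workaround for a NOT IN can be sketched as follows, assuming a hypothetical file keywords.txt holding one excluded value per line, and the bba relation from the question:

```
keywords = LOAD 'keywords.txt' AS (reason:chararray);
joined   = JOIN bba BY reason LEFT OUTER, keywords BY reason;
filtered = FILTER joined BY keywords::reason IS NULL; -- keep rows whose reason is not in the keyword list
```

For a plain IN, an inner JOIN alone suffices: only rows whose reason appears in keywords survive the join.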

===================


I have a input file like this:

481295b2-30c7-4191-8c14-4e513c7e7577,1362974399,56973118825,56950298471,true
67912962-dd84-46fa-84ef-a2fba12c2423,1362974399,56950556676,56982431507,false
cc68e779-4798-405b-8596-c34dfb9b66da,1362974399,56999223677,56998032823,true
37a1cc9b-8846-4cba-91dd-19e85edbab00,1362974399,56954667454,56981867544,false
4116c384-3693-4909-a8cc-19090d418aa5,1362974399,56986027804,56978169216,true
I only need the lines where the last field is "true". So I use the following Pig Latin:

records = LOAD 'test/test.csv' USING PigStorage(',');
A = FILTER records BY $4 'true';
DUMP A;
The problem is the second command, I always get the error:

2013-08-07 16:48:11,505 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 25>  mismatched input ''true'' expecting SEMI_COLON
Why? I also tried "$4 == 'true'" but it still doesn't work. Could anyone tell me how to do this simple thing?


How about:

A = FILTER records BY $4 == 'true' ;
Also, if you know how many fields the data will have beforehand, you should give it a schema. Something like:

records = LOAD 'test/test.csv' USING PigStorage(',') 
          AS (val1: chararray, val2: int, val3: long, val4: long, bool: chararray);
Or whatever names/types fit your needs.
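With a schema in place, the filter can use a column name instead of a positional reference. A sketch (note the third and fourth values in the sample data exceed the int range, so long is the safer type for them):

```
records = LOAD 'test/test.csv' USING PigStorage(',')
          AS (val1:chararray, val2:int, val3:long, val4:long, bool:chararray);
A = FILTER records BY bool == 'true';
DUMP A;
```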

http://stackoverflow.com/questions/10314282/filtering-in-pig
http://stackoverflow.com/questions/29064422/pig-filter-out-rows-with-improper-number-of-columns

http://stackoverflow.com/questions/16569912/apache-pig-easier-way-to-filter-by-a-bunch-of-values-from-the-same-field
http://stackoverflow.com/questions/22196120/filtering-inside-a-foreach-block-by-a-condition-value-calculated-inside-the-same
http://stackoverflow.com/questions/31675626/pig-udf-on-filter
http://stackoverflow.com/questions/10349618/too-many-filter-matching-in-pig
http://stackoverflow.com/questions/13165337/filtering-null-values-with-pig
http://stackoverflow.com/questions/18213790/conditional-filter-in-group-by-in-pig
http://stackoverflow.com/questions/9716673/using-filter-after-foreach-in-pig-latin-failed
http://stackoverflow.com/questions/10974621/pig-latin-relaxed-equals-with-null
http://stackoverflow.com/questions/21546320/pig-udf-not-being-able-to-filter-words
http://stackoverflow.com/questions/17764761/comparing-datetime-in-pig
http://stackoverflow.com/questions/24637281/how-to-filter-nan-in-pig



Attachments:
  • complexData.txt (1k), Training4Exam Info, Aug 5, 2016, 12:08 PM
  • complexData.txt.gz (0k), Training4Exam Info, Aug 5, 2016, 12:56 PM
  • module11Score.txt (0k), Training4Exam Info, Aug 5, 2016, 1:51 PM