Introduction : As discussed previously, Pig Latin is a data flow language.
Relation : Each processing step, once completed, generates a new data set, which we call a relation.

Example : 
myData = load 'myData.txt';
  • Here 'myData' is a new relation, generated by the load processing step.
Remember : 
  • Keywords of Pig Latin are case insensitive (e.g. load data and LOAD data are the same), but alias/relation names are case sensitive.
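To make the case-sensitivity rule concrete, here is a small sketch (the file name is hypothetical):

```
-- These two statements are equivalent: keywords may be written in any case.
a = load 'data.txt';
b = LOAD 'data.txt';
-- Alias names are NOT interchangeable: 'a' and 'A' would be two
-- different relations, and referencing an alias with the wrong
-- case fails because that alias was never defined.
DUMP a;    -- works
-- DUMP A; -- error: 'A' is an undefined alias
```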

Load Statement : Use this to specify the input data for your script. (Usually this should be your first step.)
  • By default, load looks for your data on HDFS in a tab-delimited file using the default load function PigStorage.

LOAD 'file/directory path' [USING function] [AS schema];
  • Path : If you specify a directory name, all the files in the directory are loaded.
  • USING : A keyword that tells Pig which function to use to load the data. It is optional; if you omit the USING clause, the 'PigStorage' function is used.
  • function : You can use a built-in function or your own custom function to load the data.
  • Schema : The loader produces the data of the type specified by the schema. If the data does not conform to the schema, depending on the loader, either a null value or an error is generated.
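To see the schema bullet in action, here is a small sketch (the file name and contents are hypothetical): with PigStorage, a field that cannot be cast to the declared type becomes null.

```
-- Suppose nums.txt contains the single line:  abc,2
nums = LOAD 'nums.txt' USING PigStorage(',') AS (x:int, y:int);
DUMP nums;
-- 'abc' cannot be converted to an int, so PigStorage substitutes
-- a null for that field and the tuple comes out as:
-- (,2)
```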
Example of loading CSV file (HandsOn).

  • Step 1 : Download the file (as shown in the video)
  • Step 2 : Upload this file to HDFS
  • Step 3 : Write a Pig script as below.
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(','); 
DESCRIBE categories;
Schema for categories unknown.
DUMP categories; -- Avoid DUMP on large relations; it prints the entire data set to the console
(3,2,Baseball & Softball)
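When you only want a quick look at a large relation, LIMIT it before dumping; a small sketch using the relation above:

```
-- Keep only the first 5 tuples so DUMP does not flood the console
-- with the whole data set:
fewCategories = LIMIT categories 5;
DUMP fewCategories;
```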

Example of loading CSV file and defining schema(HandsOn).

categoriesWithSchema = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int,subId:int,categoryName:chararray); 
DESCRIBE categoriesWithSchema;
categoriesWithSchema: {id: int,subId: int,categoryName: chararray}

Example with Tab separated file(HandsOn)
  • Step 1 : Download tab separated file
  • Step 2 : Upload this file in HDFS
  • Step 3 : Now write Pig Script to load this file.
categoriesWithSchemaTab = LOAD '/user/cloudera/Training/pig/catTab.txt' AS (id:int,subId:int,categoryName:chararray); 
DESCRIBE categoriesWithSchemaTab;
categoriesWithSchemaTab: {id: int,subId: int,categoryName: chararray}
DUMP categoriesWithSchemaTab;
ILLUSTRATE categoriesWithSchemaTab;
| categoriesWithSchemaTab     | id:int    | subId:int    | categoryName:chararray    | 
|                             | 51        | 8            | NHL                       | 

Loading Data from HBase 
divs = LOAD 'hbase://myData' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2');
  • HBaseStorage is referenced by its full package name and takes a space-separated list of column-family:column mappings (here 'cf:col1 cf:col2' is a hypothetical mapping); the table is addressed with the 'hbase://' prefix.
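A slightly fuller sketch (the table, column family, and column names are hypothetical): the optional '-loadKey true' flag makes HBaseStorage emit the row key as the first field, which you can then name in the schema.

```
rows = LOAD 'hbase://myData'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
           'cf:symbol cf:dividend', '-loadKey true')
       AS (rowkey:bytearray, symbol:chararray, dividend:float);
DESCRIBE rows;
```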

Store Statement : It saves a relation's data to the file system.

Syntax :
STORE alias INTO 'directory' [USING function];
  • alias : The name of the relation (which holds our calculated data) to be stored in the file system. Note that the alias is not quoted.
  • 'directory' : The name of the storage directory, in quotes. If the directory already exists, the STORE operation will fail.
      • The output data files, named part-nnnnn, are written to this directory.
  • If the USING clause is omitted, the default store function PigStorage is used.
  • PigStorage is the default store function and does not need to be specified (simply omit the USING clause).
  • You can write your own store function if your data is in a format that cannot be processed by the built-in functions.
Example of saving a relation to HDFS (HandsOn)
STORE categoriesWithSchemaTab INTO '/user/cloudera/Training/pig/output/Tab/he1'; -- Save as tab separated data
STORE categoriesWithSchemaTab INTO '/user/cloudera/Training/pig/output/csv/he1' USING PigStorage(','); -- Save as csv data

Now verify the data in the HDFS directories as below.
hdfs dfs -cat /user/cloudera/Training/pig/output/Tab/he1/part*
hdfs dfs -cat /user/cloudera/Training/pig/output/csv/he1/part*

Training4Exam Info,
Aug 3, 2016, 10:52 AM