Monday, January 19, 2015

DISTINCT in Pig Latin

Distinct:-

Use the DISTINCT operator to remove duplicate tuples in a relation. (to eliminate duplicates, Pig must first sort the data).

Ex:- DATA set.
F1, F2, F3
1,    2,   3
1,    2,   3
3,    2,    3
4,     ,    3

Step 1:- Load the data from HDFS to the relation DIST.
DIST = LOAD 'hdfs://localhost:8020/user/Sample.csv' USING PigStorage(',') as (F1:int, F2:int,F3:int);
Note:- The Key word 'PigStorage' is case sensitive 'P' and 'S' should be capital and rest are small letters.

Step2:- Remove the duplicate tuples from the relation DIST.
RM_DUPLIC = DISTINCT DIST;

Step3:- DUMP the data to grunt shell.
DUMP RM_DUPLIC;

Out put:-1,2,3
3,2,3
4,,3

What if we want to DISTINCT on a subset of fields.?
What is subset?
In the above example subset means for example (F2,F3) where as set is (F1,F2,F3). So how we can perform distinct on subset (F2,F3) of fields.

Can we get the expected output with the below script. Lets see
RM_DUPLIC = DISTINCT F,F3;  DUMP RM_DUPLIC;

OutPut:
-

Its failed .......
ERROR:- 2015-01-19 21:25:26,817 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 26, column 25>  mismatched input '.' expecting SEMI_COLON


That means one cannot use DISTINCT on a subset of fields. To do this, use FOREACH with FLATTEN and a nested block. Lets see with an example.
Lets take the same data set as above.

Ex:- DATA set.
F1, F2, F3
1,    2,   3
1,    2,   3
3,    2,    3
4,     ,    3


Expected result as we are using distinct on F2 and F3 :-
F1,F2,F3
3,   2,  3 
4,     ,  3
Run the below script and see the out put.
Athletes = LOAD 'sample.csv' USING PigStorage(',') as (F1:int, F2:int, F3:int);
field_list_group = GROUP Athletes BY (F2,F3);
A_unique =
    FOREACH field_list_group {
         LMT = LIMIT Athletes.F1 1;
GENERATE FLATTEN (group) as (F2,F3), flatten(LMT) as F1;
    };  DUMP A_unique;
 

OUT PUT:-
(2,3,3)
(,3,4)


Sunday, January 18, 2015

UNION with and with out "ONSCHEMA" in Pig Latin

 UNION in Pig:-

Some one who is familiar with A relational database management system (RDBMS), by now you might have understood that it is used to merge two data sets. Its right, but there is a difference, what is  that we will see with example.
data set 1:-                                                                                      data set 2












Lets combine the two data sets(here is the difference) you might be thinking how we can do a UNION when the 2 data sets has different structures, if its RDBMS (a clear no) the UNION will never work. But in PIG the story is different that is beauty of PIG LATIN (this why it is said that pig can work with unstructured data).
Let's execute the below PIG Latin script and see the result.

Step 1:-  Load data from first data set.
A = LOAD 'hdfs://localhost:8020/user/healthcare_Sample_dataset2.csv' using PigStorage(',') AS (PatientID: int, Name: chararray, DOB: chararray, PhoneNumber: chararray, EmailAddress: chararray, SSN: chararray, Gender: chararray, Disease: chararray, weight: float); 

Step 2:- Limit the data to 10 rows of the first data set.
D = LIMIT A 10;


Step 3:- Load the data from data set 2.
B = LOAD 'hdfs://localhost:8020/user/healthcare_Sample_dataset1.csv' using PigStorage(',') AS (PatientID: int, Name: chararray);


Step 4:- Limit the data to 10 rows from data set 2 as well.

E = LIMIT B 10;

Step 5:- Merge the 2 data sets with UNION command in Pig Latin.

C = UNION D, E;

Finally dump the data to see the result.

DUMP C;
 
This shows that to "UNION" data sets they need not to have same structure for Pig Latin.

we can merge more than 2 data sets in one single relation.
Ex:- 
UN = UNION A,B,C.

What is the difference between UNION with ONSCHEMA and which out it.
In the above example we already see the out put with out "ONSCHEMA". Let see by using it with UNION.
Follow the above mentioned steps as-it-is till  step 4, there is small change in step 5. change step 5 to

Step 5:- Merge the 2 data sets with UNION command with "ONSCHEMA" in Pig Latin.
C = UNION ONSCHEMA D, E;  
Lets see the result.
The above result shows that data set 1 has add null column to match with data set 2 columns. This is what the difference is with and with out ONSCHEMA.

Hadoop Pig Latin

Star Expressions:- 

Its is similar to SQL * operator.  to select and display all the columns in the table (with data) 
 In the above script we are loading a file into the relation Athletes which has total 8 columns separated my comma delimiter.
The statement A = FOREACH Athletes GENERATE * ; DUMP A is equal to
SELECT * FROM TABLE;

To display data in the pig grunt shell one should use DUMP command;
For the above example it will be DUMP A;

There are some restrictions in the usage of Star expression Lets see what are they.
1) For GROUP/COGROUP, you can't include a star expression in a GROUP BY column.

2) For ORDER BY, if you have project-star as ORDER BY column, you can’t have any other ORDER BY column in that statement.

How to use ORDER BY  Clause in PIG Latin.
Lets see the above script result.
we can see NULL values are returned last (as we used descending order).

How to do multiple columns order by in Pig Latin.
Ex:- 
Step 1:-  Load the data from HDFS to the relation Athletes
Athletes = LOAD 'hdfs://localhost:8020/user/OlympicAthletes.csv' USING PigStorage(',') as (athlete:chararray, country:chararray, year:int, sport:chararray, gold:int, silver:int, bronze:int, total:int);
STEP 2:- Perform descending order by on country, year on the relation 'Athletes'.
var = ORDER  Athletes BY country, year DESC ;
Step 3:- dump the result of step, so that we can see the end result on grunt shell.
DUMP var;