Monday, January 19, 2015

DISTINCT in Pig Latin

Distinct:-

Use the DISTINCT operator to remove duplicate tuples in a relation. (to eliminate duplicates, Pig must first sort the data).

Ex:- DATA set.
F1, F2, F3
1,    2,   3
1,    2,   3
3,    2,    3
4,     ,    3

Step 1:- Load the data from HDFS to the relation DIST.
DIST = LOAD 'hdfs://localhost:8020/user/Sample.csv' USING PigStorage(',') as (F1:int, F2:int,F3:int);
Note:- The Key word 'PigStorage' is case sensitive 'P' and 'S' should be capital and rest are small letters.

Step2:- Remove the duplicate tuples from the relation DIST.
RM_DUPLIC = DISTINCT DIST;

Step3:- DUMP the data to grunt shell.
DUMP RM_DUPLIC;

Out put:-1,2,3
3,2,3
4,,3

What if we want to DISTINCT on a subset of fields.?
What is subset?
In the above example subset means for example (F2,F3) where as set is (F1,F2,F3). So how we can perform distinct on subset (F2,F3) of fields.

Can we get the expected output with the below script. Lets see
RM_DUPLIC = DISTINCT F,F3;  DUMP RM_DUPLIC;

OutPut:
-

Its failed .......
ERROR:- 2015-01-19 21:25:26,817 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 26, column 25>  mismatched input '.' expecting SEMI_COLON


That means one cannot use DISTINCT on a subset of fields. To do this, use FOREACH with FLATTEN and a nested block. Lets see with an example.
Lets take the same data set as above.

Ex:- DATA set.
F1, F2, F3
1,    2,   3
1,    2,   3
3,    2,    3
4,     ,    3


Expected result as we are using distinct on F2 and F3 :-
F1,F2,F3
3,   2,  3 
4,     ,  3
Run the below script and see the out put.
Athletes = LOAD 'sample.csv' USING PigStorage(',') as (F1:int, F2:int, F3:int);
field_list_group = GROUP Athletes BY (F2,F3);
A_unique =
    FOREACH field_list_group {
         LMT = LIMIT Athletes.F1 1;
GENERATE FLATTEN (group) as (F2,F3), flatten(LMT) as F1;
    };  DUMP A_unique;
 

OUT PUT:-
(2,3,3)
(,3,4)