Sunday, January 18, 2015

UNION with and without "ONSCHEMA" in Pig Latin

 UNION in Pig:-

If you are familiar with a relational database management system (RDBMS), you probably already know that UNION is used to merge two data sets. That is right, but Pig handles it a little differently, as the example below shows.
Data set 1 and data set 2:- [screenshots of the two sample data sets]

Let's combine the two data sets. Here is the difference: you might wonder how a UNION can work when the two data sets have different structures. In an RDBMS the answer is a clear no; a UNION there requires both inputs to have the same number of columns. In Pig the story is different, and that is the beauty of Pig Latin (this is one reason Pig is said to work with loosely structured data).
Let's execute the Pig Latin script below and see the result.

Step 1:-  Load data from first data set.
A = LOAD 'hdfs://localhost:8020/user/healthcare_Sample_dataset2.csv' using PigStorage(',') AS (PatientID: int, Name: chararray, DOB: chararray, PhoneNumber: chararray, EmailAddress: chararray, SSN: chararray, Gender: chararray, Disease: chararray, weight: float); 

Step 2:- Limit the data to 10 rows of the first data set.
D = LIMIT A 10;


Step 3:- Load the data from data set 2.
B = LOAD 'hdfs://localhost:8020/user/healthcare_Sample_dataset1.csv' using PigStorage(',') AS (PatientID: int, Name: chararray);


Step 4:- Limit the data to 10 rows from data set 2 as well.

E = LIMIT B 10;

Step 5:- Merge the 2 data sets with UNION command in Pig Latin.

C = UNION D, E;

Finally dump the data to see the result.

DUMP C;
 
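This shows that in Pig Latin, data sets do not need to have the same structure to be combined with UNION. The result simply holds the nine-field rows from D and the two-field rows from E as they are.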
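
One way to see what the plain UNION did to the structure is to ask Pig for the schemas with DESCRIBE. The sketch below assumes the relations D, E and C from the steps above are still defined in the same Grunt session or script. The comments describe what Pig is expected to report; they are not captured output.

```pig
-- Schemas of the two inputs, as declared in the LOAD statements.
DESCRIBE D;   -- nine named fields (PatientID ... weight)
DESCRIBE E;   -- two named fields (PatientID, Name)

-- Schema of the plain UNION result.
-- Because D and E have different numbers of fields, Pig cannot derive
-- a schema for C and reports it as unknown.
DESCRIBE C;
```

With an unknown schema, later statements can reach the fields of C only by position ($0, $1, ...), not by name.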

We can also merge more than two data sets into a single relation.
Ex:- 
UN = UNION A, B, C;

What is the difference between UNION with ONSCHEMA and without it?
In the example above we already saw the output without "ONSCHEMA". Let's see what happens when we use it with UNION.
Follow the steps above as they are up to step 4; there is a small change in step 5. Change step 5 to:

Step 5:- Merge the 2 data sets with UNION command with "ONSCHEMA" in Pig Latin.
C = UNION ONSCHEMA D, E;  
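Let's see the result.
The result shows that the rows coming from E, which has only two columns, are padded with null values in the seven columns that exist only in D, so every row now has the same nine columns. This is the difference between UNION with and without ONSCHEMA.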
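
The same schema check can be repeated for the ONSCHEMA version. This is a minimal sketch, assuming D and E are defined as in steps 1 to 4. The relation names C2 and FromE are made up for illustration, and the filter relies on the assumption that every row of the nine-column data set has a Disease value.

```pig
-- UNION ONSCHEMA matches fields by name instead of by position.
C2 = UNION ONSCHEMA D, E;

-- C2 has a known schema containing every field name found in D or E.
DESCRIBE C2;

-- Rows that came from E have no Disease field, so it is null for them.
-- (Assumption: rows from D always carry a Disease value.)
FromE = FILTER C2 BY Disease IS NULL;
DUMP FromE;
```

Because the result keeps its field names, C2 can be filtered by name as shown. Note that ONSCHEMA needs a declared schema on every input, so it fails if one of the relations was loaded without one (for example, a LOAD with no AS clause).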
