Wednesday, January 27, 2016

Setting up Pig 0.15 - Hadoop @ Desk (Single Node Cluster)

Hope you've already set up your Hadoop single-node cluster @ your desk.

In this tutorial, we will set up and test Pig 0.15.0. (Before you start, snapshot your VM if you haven't already.)

Note: Change the paths to match your environment (e.g. I'm using '/media/SYSTEM'; replace it with your own).

Steps below:

1. Start your VM (or host, if you've installed Hadoop directly on the host)

2. Download Pig 0.15 and move it to the same dedicated partition as Hadoop, for easier management

$ su hduser
$ cd
$ wget http://www.eu.apache.org/dist/pig/latest/pig-0.15.0.tar.gz
$ tar -xvf pig-0.15.0.tar.gz
$ sudo mkdir -p /media/SYSTEM/hadoop/pig
$ sudo mv pig-0.15.0 /media/SYSTEM/hadoop/pig/pig-0.15.0
$ sudo chown -R hduser /media/SYSTEM/hadoop/pig

3. Update the .bashrc file with Pig-specific configuration

$ vi .bashrc
#To avoid 'Found interface jline.Terminal, but class was expected'
#export HADOOP_USER_CLASSPATH_FIRST=false
#PIG VARIABLES START
export PIG_INSTALL=/media/SYSTEM/hadoop/pig/pig-0.15.0
export PATH=${PATH}:${PIG_INSTALL}/bin
#PIG VARIABLES END



NB: Remember to include the 'HADOOP_USER_CLASSPATH_FIRST' environment variable (shown commented out above); otherwise Pig can run into compatibility issues with Java libraries such as jline.
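As a rough sketch of what the PATH lines in .bashrc do, the snippet below appends Pig's bin directory only once and then checks that it is really there. The helper name `path_contains` is made up for this example, and the default path is the one used in this tutorial, so adjust it to your environment.

```shell
#!/bin/sh
# Sketch only: same idea as the .bashrc lines, plus a sanity check.
# The default below is this tutorial's path; replace it with your own.
PIG_INSTALL="${PIG_INSTALL:-/media/SYSTEM/hadoop/pig/pig-0.15.0}"

# path_contains is a hypothetical helper: succeeds if $1 is an entry in PATH.
path_contains() {
    case ":${PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Append Pig's bin directory only if it is missing, so that sourcing
# the file again does not keep growing PATH.
if ! path_contains "${PIG_INSTALL}/bin"; then
    PATH="${PATH}:${PIG_INSTALL}/bin"
fi
export PIG_INSTALL PATH

if path_contains "${PIG_INSTALL}/bin"; then
    echo "Pig bin directory is on PATH"
fi
```

Once Pig is actually installed at that location, opening a new shell and running `pig -version` should be a quick way to confirm the setup.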


4. Edit the configuration files for Pig

Add an empty '.pigbootup' file in your home directory (Pig looks for this file at startup and runs any statements it contains)

By default Pig writes its log files to the directory you launch it from. Point the log file at a separate location, for better management.

$ touch ~/.pigbootup
$ mkdir /media/SYSTEM/hadoop/pig/pig-0.15.0/logs
$ vi /media/SYSTEM/hadoop/pig/pig-0.15.0/conf/pig.properties
pig.logfile=/media/SYSTEM/hadoop/pig/pig-0.15.0/logs/
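If you'd rather not open an editor, the same pig.logfile change can be sketched as a small script. The variable names `PIG_CONF_DIR` and `PIG_LOG_DIR` and the function `set_logfile` are made up for this example (they are not Pig settings), and they default to a scratch directory so you can try it safely before pointing it at your real conf folder.

```shell
#!/bin/sh
# Sketch only: set pig.logfile in pig.properties without an editor.
# PIG_CONF_DIR and PIG_LOG_DIR are illustrative names, not Pig settings;
# by default they point into a scratch directory.
SCRATCH="$(mktemp -d)"
PIG_CONF_DIR="${PIG_CONF_DIR:-$SCRATCH/conf}"
PIG_LOG_DIR="${PIG_LOG_DIR:-$SCRATCH/logs}"
PROPS="$PIG_CONF_DIR/pig.properties"

mkdir -p "$PIG_CONF_DIR" "$PIG_LOG_DIR"
touch "$PROPS"

set_logfile() {
    # Drop any existing active pig.logfile line, then append the new one.
    # grep exits non-zero when nothing is left to print, hence "|| true".
    grep -v '^pig\.logfile=' "$PROPS" > "$PROPS.tmp" || true
    mv "$PROPS.tmp" "$PROPS"
    printf 'pig.logfile=%s/\n' "$PIG_LOG_DIR" >> "$PROPS"
}

set_logfile
set_logfile   # running it twice still leaves exactly one entry
grep '^pig\.logfile=' "$PROPS"
```

To apply it for real, you would set PIG_CONF_DIR and PIG_LOG_DIR to the conf and logs folders from this step before running the script.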

5. Reboot

6. Start Hadoop

$ start-all.sh

7. Test Pig (the famous `Word Count` example, in MapReduce/Hadoop mode)

$ su hduser
$ cd
$ cat > words.txt
this is a test file contains words
(press Ctrl+D on a new line to end input)
$ hdfs dfs -copyFromLocal words.txt words.txt
$ pig
grunt> A = load './words.txt';
grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
grunt> dump D;
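To cross-check what Pig prints, you can count the same words locally with standard shell tools. This is only a sketch: the function name `word_counts` is made up for this example, and it splits on whitespace only, which matches TOKENIZE for this simple sample but not necessarily for text containing commas, quotes or parentheses.

```shell
#!/bin/sh
# Sketch only: count words locally to compare against the Pig result.
# Uses the same sample text as above, written to a scratch file.
WORDS_FILE="$(mktemp)"
printf 'this is a test file contains words\n' > "$WORDS_FILE"

# word_counts is a hypothetical helper: it puts one word per line,
# sorts them, and counts duplicates. Output is "count word" per line,
# the same (count, word) order as relation D.
word_counts() {
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | sort | uniq -c | awk '{print $1, $2}'
}

word_counts "$WORDS_FILE"
```

For the sample file, each of the seven words should come out with a count of 1, which is what you should also see in the tuples that `dump D` prints.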

8. Stop Hadoop, shut down, and snapshot your VM

$ stop-all.sh

$ sudo shutdown now
