Saturday, January 30, 2016

Use Ubuntu Packages in Debian (& Vice-Versa) with apt-get

Debian is a great operating system and well worth a try. It is lightweight, has a strong community behind it, and is aimed at more advanced users. If you migrate from Ubuntu to Debian, for better control or any other reason, you may miss some good features or programs that are distributed through the Ubuntu repositories.

I recently decided to move from Ubuntu 14.04 LTS to Debian Jessie. While setting up KVM with nested virtualization, I found that the command ‘kvm-ok’ belongs to the Ubuntu main repository and does not seem to be available in Debian. Below are the steps I followed to add the Ubuntu repository to Debian and install the necessary Ubuntu packages.

1. From an Ubuntu system, get the URL of the repository that contains the desired package.

From the Ubuntu system, open ‘/etc/apt/sources.list’ and note the required repository URL. Here we take the main Ubuntu repository URL, which is where the ‘cpu-checker’ package lives; ‘kvm-ok’ ships with ‘cpu-checker’.

Note: We are using ‘trusty main’, as Trusty is based on the same Debian release (Jessie) into which we are going to import the packages, so we retain package compatibility.

To check which Debian version an Ubuntu release is based on, see this link.
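If you prefer the command line to opening the file, the two commands below (a quick sketch; they assume the package lists on the Ubuntu box are up to date) show the active repository lines and which repository ‘cpu-checker’ would be installed from:

$ grep -E '^deb ' /etc/apt/sources.list
$ apt-cache policy cpu-checker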


2. Add the repository URL to the Debian sources list and update.

$ echo 'deb http://in.archive.ubuntu.com/ubuntu/ trusty-updates main restricted' | sudo tee -a /etc/apt/sources.list

$ sudo apt-get update

You will get errors, as the Ubuntu-specific public keys are not present on the Debian system. Looking at the error, you can find the public keys you need to import. For example, I got the error below after the update operation.


“W: GPG error: http://in.archive.ubuntu.com trusty-updates InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 40976EAF437D05B5 NO_PUBKEY 3B4FE6ACC0B21F32”

So I had to run the commands below:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 40976EAF437D05B5

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 3B4FE6ACC0B21F32

 

3. Update Debian again, and install the required package using apt-get.

sudo apt-get update

sudo apt-get install cpu-checker

sudo kvm-ok

INFO: /dev/kvm exists
KVM acceleration can be used

 


Wednesday, January 27, 2016

Setting up SQOOP 1.4 - Hadoop @ Desk (Single Node Cluster)

Hope you've set up your Hadoop Single Node Cluster @ Your Desk.

In this tutorial, we will set up Sqoop 1.4.6. (Before you start, snapshot your VM, if not already done.)
For testing, we will use Sqoop to import an RDBMS table from MySQL into Hadoop Hive.

Note: MySQL installation and setting up dummy data has been discussed in the Appendix section.

Note: You need to change paths to match your environment (i.e. in my case I'm using '/media/SYSTEM'; you have to replace it with yours).

Steps below:

1. Start your VM (Or Host, if you've installed Hadoop directly on Host)

2. Get Sqoop 1.4.6 and move it to our dedicated partition (the same one used for Hadoop) for better management


$ su hduser
Password: 

$ cd
$ wget http://www.eu.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
 
$ sudo tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
$ sudo mkdir -p /media/SYSTEM/hadoop/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
$ sudo mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha /media/SYSTEM/hadoop/sqoop/
$ sudo chown -R hduser /media/SYSTEM/hadoop/sqoop/

3. Update .bashrc file, to have 'Sqoop' specific configuration

$ vi .bashrc

#SQOOP VARIABLES START
export SQOOP_HOME=/media/SYSTEM/hadoop/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
export PATH=$PATH:$SQOOP_HOME/bin
#SQOOP VARIABLES END

Now close the terminal and open a new one (so that the new environment variables take effect)
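Alternatively, reload the file in the current shell and confirm the variable is set; the echoed path should match the directory created in step 2:

$ source ~/.bashrc
$ echo $SQOOP_HOME
/media/SYSTEM/hadoop/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha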


4. Editing configuration files for SQOOP

$ su hduser
$ cd $SQOOP_HOME/conf
$ sudo cp sqoop-env-template.sh sqoop-env.sh
$ vi sqoop-env.sh

export HADOOP_COMMON_HOME=/media/SYSTEM/hadoop/hadoop-2.7.0
export HADOOP_MAPRED_HOME=/media/SYSTEM/hadoop/hadoop-2.7.0
export HIVE_HOME=/media/SYSTEM/hadoop/hive/apache-hive-1.2.1

5. Setup MySQL drivers, to be used by SQOOP (for importing MySQL tables to Hive)


$ cd $SQOOP_HOME/lib
$ sudo wget http://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.38.tar.gz

$ sudo tar -zxvf mysql-connector-java-5.1.38.tar.gz
$ sudo cp mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar ./ 
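To make sure the driver JAR is where Sqoop expects it, and that the Sqoop binary itself is on the PATH, an optional sanity check is:

$ ls $SQOOP_HOME/lib/mysql-connector-java-5.1.38-bin.jar
$ sqoop version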
6. Reboot


7. Start hadoop

$ su hduser
$ cd
$ start-dfs.sh
$ start-yarn.sh

8. Copy Sqoop-specific JARs to HDFS

So that every datanode can access the libraries for Sqoop processing.
NB: Without this, SQOOP will not work properly!


$ hdfs dfs -mkdir -p $SQOOP_HOME/lib
$ hdfs dfs -copyFromLocal $SQOOP_HOME/lib/* $SQOOP_HOME/lib/
$ hdfs dfs -copyFromLocal $SQOOP_HOME/sqoop-1.4.6.jar $SQOOP_HOME/
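You can verify the copy went through by listing the HDFS side; the listing should mirror the local $SQOOP_HOME/lib contents:

$ hdfs dfs -ls $SQOOP_HOME/lib | head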
9. Start MySQL

$ mysql -u mysqluser -p

10. Now Import a Table from MySQL to Hive using SQOOP


$ sqoop import --bindir ./ --connect jdbc:mysql://localhost:3306/scooptest --username mysqluser --password pass1 --table employee --hive-import --hive-overwrite

Note: See Appendix section, regarding MySQL installation and table setup, for this test.
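If the import fails with a connection or driver error, a handy way to isolate the problem is to test the JDBC connection on its own (same connection options as above; the output should list the employee table):

$ sqoop list-tables --connect jdbc:mysql://localhost:3306/scooptest --username mysqluser --password pass1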

Now open the 'hive' prompt and confirm that your data has been populated into the Hive table.


$ hive
hive> select * from employee where id >= 2;


11. Stop Hadoop, Shutdown and Snapshot your VM


$ stop-all.sh
$ sudo shutdown now

Appendix:

MySQL installation and setting up some tables for the SQOOP Test.

MySQL User: 'mysqluser' (to be used for the Sqoop import)
Database: scooptest
Table: employee


$ sudo apt-get install mysql-server
$ mysql -u root -p
mysql> create database scooptest;
mysql> grant all on scooptest.* to 'mysqluser' identified by 'pass1';
mysql> use scooptest;
mysql> create table employee(id int primary key, name text);
mysql> insert into employee values (1, 'smith');
mysql> insert into employee values (2, 'john');
mysql> insert into employee values (3, 'henry');
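A quick check that the test rows are in place before running the Sqoop import (the result should show the three rows inserted above):

mysql> select * from employee;
+----+-------+
| id | name  |
+----+-------+
|  1 | smith |
|  2 | john  |
|  3 | henry |
+----+-------+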

Setting up Pig 0.15 - Hadoop @ Desk (Single Node Cluster)

Hope you've set up your Hadoop Single Node Cluster @ Your Desk.

In this tutorial, we will set up and test Pig 0.15.0. (Before you start, snapshot your VM, if not already done)

Note: You need to change paths to match your environment (i.e. in my case I'm using '/media/SYSTEM'; you have to replace it with yours).

Steps below:

1. Start your VM (Or Host, if you've installed Hadoop directly on Host)

2. Get Pig 0.15 and move it to our dedicated partition (the same one used for Hadoop) for better management

$ su hduser
$ cd
$ wget http://www.eu.apache.org/dist/pig/latest/pig-0.15.0.tar.gz
$ tar -xvf pig-0.15.0.tar.gz
$ sudo mkdir -p /media/SYSTEM/hadoop/pig
$ sudo mv pig-0.15.0 /media/SYSTEM/hadoop/pig/pig-0.15.0
$ sudo chown -R hduser /media/SYSTEM/hadoop/pig

3. Update .bashrc file, to have 'Pig' specific configuration

$ vi .bashrc

#To avoid 'Found interface jline.Terminal, but class was expected'
#export HADOOP_USER_CLASSPATH_FIRST=false
#PIG VARIABLES START
export PIG_INSTALL=/media/SYSTEM/hadoop/pig/pig-0.15.0
export PATH=${PATH}:${PIG_INSTALL}/bin
#PIG VARIABLES END



NB: Please note the 'HADOOP_USER_CLASSPATH_FIRST' environment variable; without it set appropriately, Pig will have compatibility issues with Java libraries (the jline error mentioned in the comment above).
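Once the new variables are in effect (reopen the terminal, or run 'source ~/.bashrc'), a quick sanity check before going any further is to print the Pig version; it should report Apache Pig version 0.15.0:

$ pig -version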


4. Editing configuration files for Pig

Add an empty '.pigbootup' file (Pig expects this file and uses it to auto-populate its values).

By default, Pig will write its logs to the root partition. Move the log files to a separate location for better management.

$ touch ~/.pigbootup
$ mkdir /media/SYSTEM/hadoop/pig/pig-0.15.0/logs
$ vi /media/SYSTEM/hadoop/pig/pig-0.15.0/conf/pig.properties

pig.logfile=/media/SYSTEM/hadoop/pig/pig-0.15.0/logs/

5. Reboot

6. Start hadoop

$ start-all.sh

7. Testing Pig (The famous `Word Count` Example - In MapReduce/Hadoop Mode)

$ su hduser
$ cd
$ cat > words.txt
this is a test file contains words
$ hdfs dfs -copyFromLocal words.txt words.txt
$ pig

grunt> A = load './words.txt';
grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
grunt> dump D
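Since every word in the one-line input file occurs exactly once, the dump should end with tuples along these lines (the order may vary between runs):

(1,this)
(1,is)
(1,a)
(1,test)
(1,file)
(1,contains)
(1,words)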



8. Stop Hadoop, Shutdown and Snapshot your VM

$ stop-all.sh

$ sudo shutdown now

Setting up Hive 1.2.1 - Hadoop @ Desk (Single Node Cluster)

 

Hope you've set up your Hadoop Single Node Cluster @ Your Desk.

In this tutorial, we will set up and test Hive 1.2.1. (Before you start, snapshot your VM, if not already done)

Note: You need to change paths to match your environment (i.e. in my case I'm using '/media/SYSTEM'; you have to replace it with yours).

Steps below: (I've compiled the steps from here and here)

1. Start your VM (Or Host, if you've installed Hadoop directly on Host)

2. Get Hive 1.2.1 and move it to our dedicated partition (the same one used for Hadoop) for better management

$ su hduser
$ cd
$ wget http://archive.apache.org/dist/hive/stable/apache-hive-1.2.1-bin.tar.gz

$ tar -xzvf apache-hive-1.2.1-bin.tar.gz
$ mkdir -p /media/SYSTEM/hadoop/hive/
$ mv apache-hive-1.2.1-bin/ /media/SYSTEM/hadoop/hive/apache-hive-1.2.1

3. Update .bashrc file, to have 'hive' specific configuration

$ vi ~/.bashrc

#HIVE VARIABLES START
HIVE_HOME=/media/SYSTEM/hadoop/hive/apache-hive-1.2.1
export HIVE_HOME
export PATH=$PATH:$HIVE_HOME/bin
#HIVE VARIABLES END

4. Update hive-config.sh to set the Hadoop home directory

Append the export line to the file:

$ vi /media/SYSTEM/hadoop/hive/apache-hive-1.2.1/bin/hive-config.sh

export HADOOP_HOME=/media/SYSTEM/hadoop/hadoop-2.7.0

5. Start hadoop, if not already done
 

$ start-all.sh


6. Create Hive specific directories

$ hadoop fs -mkdir /tmp && hadoop fs -mkdir -p /user/hive/warehouse && hadoop fs -chmod g+w /tmp && hadoop fs -chmod g+w /user/hive/warehouse
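You can confirm the directories and permissions with the listing below; /user/hive/warehouse should show group write permission (drwxrwxr-x or similar):

$ hadoop fs -ls /user/hive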

 

Hive 1.2.1 ships a newer jline library than the one bundled with Hadoop 2.7, so copy it into Hadoop's YARN lib directory to avoid a jline version clash when starting the Hive shell:

$ cp /media/SYSTEM/hadoop/hive/apache-hive-1.2.1/lib/jline-2.12.jar /media/SYSTEM/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/


7. Stop Hadoop and Reboot

$ stop-all.sh
$ sudo reboot

8. Start Hadoop and then Hive

$ su hduser
$ cd
$ start-all.sh
$ hive

9. Test Hive (at the hive prompt, create a Hive table, do some inserts and a select)

hive> !clear;
hive> create table employee(name string, id int);
hive> insert into employee values('george',1);
hive> insert into employee values('mathew',2);
hive> select name from employee where id = 2;
hive> quit;

Note: you can actually see MapReduce jobs being created on the fly while executing these commands.
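Given the two inserts above, the select should come back with a single row; with the job-progress lines trimmed, the output looks roughly like this:

hive> select name from employee where id = 2;
OK
mathew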



10. Stop Hadoop, Shutdown and Snapshot your VM

$ stop-all.sh

$ sudo shutdown now
