Monday, February 29, 2016

Setting up Hive2 Server On Hadoop 2.7+ (Multi-Node-Cluster On Ubuntu 14.04 LXD Containers)

In this article we build a Hive2 server on a Hadoop 2.7 cluster. We have a dedicated node (HDAppsNode-1) for Hive (and other apps) within the cluster, highlighted in the deployment diagram below, which shows our cluster model in Azure. We will keep the Hive metastore in a separate MySQL instance running on a separate host (HDMetaNode-1) to have a production-grade system, rather than keeping it in the default embedded database. This article assumes you've already configured Hadoop 2.0+ on your cluster. The steps we followed to create the cluster can be found here; they build a Single Node Cluster. We then cloned the single node into multiple nodes (7 nodes, as seen below) and updated the Hadoop configuration files to transform it into a multi-node cluster. This blog helped us do the same. The updated Hadoop configuration files for the below model (Multi-Node Cluster) have been shared here for your reference.

image

Let's get started.

1. Create Hive2 Meta Store in MySql running on HDMetaNode-1.

sudo apt-get install mysql-server

<Log in to MySQL using the default user: root>


CREATE DATABASE hivemetastore;
USE hivemetastore;
CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';
GRANT all on *.* to 'hive'@'HDAppsNode-1' identified by 'hive';
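Before moving on, it's worth confirming that the Hive node can actually reach this metastore over the network (this assumes MySQL is already configured to accept remote connections; see the bind-address note in the Oozie article below). A quick check from HDAppsNode-1:

mysql -h HDMetaNode-1 -u hive -p

If the grants are in place, this should drop you at a mysql> prompt.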

2. Get Hive2.

We are keeping the Hive binaries under (/media/SYSTEM/hadoop/hive/apache-hive-2.0.0).

cd /media/SYSTEM/hadoop/hive

wget http://mirror.cc.columbia.edu/pub/software/apache/hive/stable-2/apache-hive-2.0.0-bin.tar.gz
tar -xvf apache-hive-2.0.0-bin.tar.gz

mv apache-hive-2.0.0-bin apache-hive-2.0.0
cd apache-hive-2.0.0

mv conf/hive-default.xml.template conf/hive-site.xml

Edit 'hive-site.xml' to configure the MySQL metastore and the Hadoop-related settings. Please change them as per your environment. (/media/SYSTEM/hadoop/tmp) is our Hadoop TMP directory on the local filesystem.

Apart from that, we had to replace all occurrences of the below to make Hive2 work with our cluster:

${system:java.io.tmpdir}/ with /media/SYSTEM/hadoop/tmp/hive/

/${system:user.name} with /
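If you'd rather script these bulk replacements than do them by hand in an editor, a sed sketch like the one below should work (back up hive-site.xml first and review the result afterwards, since the file is large):

cp conf/hive-site.xml conf/hive-site.xml.bak
sed -i 's|${system:java.io.tmpdir}/|/media/SYSTEM/hadoop/tmp/hive/|g' conf/hive-site.xml
sed -i 's|/${system:user.name}|/|g' conf/hive-site.xml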

 

<configuration>

  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/media/SYSTEM/hadoop/tmp/hive/${system:user.name}</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/media/SYSTEM/hadoop/tmp/hive/${hive.session.id}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://HDMetaNode-1/hivemetastore?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>

3. Update Hadoop Config

core-site.xml (Add the below tags)

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hduser.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hduser.groups</name>
  <value>*</value>
</property>
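If HDFS and YARN are already running when you add these proxyuser entries, the daemons have to pick up the change. A restart works; alternatively (a sketch, run as your Hadoop admin user), the running daemons can usually be told to re-read the proxyuser settings:

hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration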

 

4. Setup Hive Server

Update ~/.bashrc and ~/.profile to contain the Hive2 path

#HIVE VARIABLES START

HIVE_HOME=/media/SYSTEM/hadoop/hive/apache-hive-2.0.0

export HIVE_HOME

export PATH=$PATH:$HIVE_HOME/bin

#HIVE VARIABLES END

Refresh the environment

source ~/.bashrc

Set up and create the metastore schema in MySQL. You may need to download the MySQL Connector/J JAR file into Hive's lib folder first; see the sketch below.
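For example, the connector can be fetched and dropped into Hive's lib directory like this (the version below is simply the one used elsewhere in this article; any recent 5.1.x release should do):

wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.31.tar.gz
tar -zxf mysql-connector-java-5.1.31.tar.gz
cp mysql-connector-java-5.1.31/mysql-connector-java-5.1.31-bin.jar lib/

With the connector in place, initialize the schema: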

bin/schematool -dbType mysql -initSchema

Start Hive2 Server

hiveserver2
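Once HiveServer2 is up, you can verify it from another terminal with Beeline, which ships with Hive (a minimal check, assuming the default HiveServer2 port 10000):

beeline -u jdbc:hive2://HDAppsNode-1:10000 -n hive

A successful connection confirms that the server, the metastore and the MySQL grants are all wired up correctly.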

Sunday, February 28, 2016

Setting up Oozie 4.1.0 On Hadoop 2.7+ (Multi-Node-Cluster On Ubuntu 14.04 LXD Containers)

In this article we build Oozie 4.1.0 on a Hadoop 2.7 cluster. We have not selected the latest Oozie 4.2.0, which still has build issues with Hadoop 2.0+. We have a dedicated node (HDAppsNode-1) for Oozie (and other apps) within the cluster, highlighted in the deployment diagram below, which shows our cluster model in Azure. We will keep the Oozie metadata in a separate MySQL instance running on a separate host (HDMetaNode-1) to have a production-grade system, rather than keeping it in the default Derby database. This article assumes you've already configured Hadoop 2.0+ on your cluster. The steps we followed to create the cluster can be found here; they build a Single Node Cluster. We then cloned the single node into multiple nodes (7 nodes, as seen below) and updated the Hadoop configuration files to transform it into a multi-node cluster. This blog helped us do the same. The updated Hadoop configuration files for the below model (Multi-Node Cluster) have been shared here for your reference.

image

Let's get started. We've referenced the following blogs to prepare Oozie under Hadoop 2.0 (Link1, Link2, Link3). Also, to make Oozie work, you have to start your Job History Server along with YARN. I've configured the Job History Server on the same node as YARN (HDResNode-1), and it is started along with YARN using the command (mr-jobhistory-daemon.sh start historyserver).

1. First, the Codehaus Maven repository referenced in the Oozie build file has been moved to another mirror. We need to override the Maven settings to point to the new location.

Edit or create (/home/hduser/.m2/settings.xml) and add the below.

<settings>
  <profiles>
    <profile>
      <id>OozieProfile</id>
      <repositories>
        <repository>
          <id>Codehaus repository</id>
          <name>codehaus-mule-repo</name>
          <url>https://repository-master.mulesoft.org/nexus/content/groups/public/</url>
          <layout>default</layout>
        </repository>
      </repositories>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>OozieProfile</activeProfile>
  </activeProfiles>
</settings>

2. Create a metastore for Oozie in MySQL running on HDMetaNode-1.

sudo apt-get install mysql-server

<Log in to MySQL using the default user: root>


create database oozie;

grant all privileges on oozie.* to 'oozie'@'HDAppsNode-1' identified by 'oozie';
grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';

Edit (/etc/mysql/my.cnf) to enable MySQL to accept connections from hosts other than localhost:

bind-address  = HDMetaNode-1
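MySQL needs to be restarted for the bind-address change to take effect; on Ubuntu 14.04 that would be something like:

sudo service mysql restart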

3. Get Oozie 4.1.0 and build it on HDAppsNode-1.

We are keeping the Oozie binaries under (/media/SYSTEM/hadoop/oozie-4.1.0).

cd /media/SYSTEM/hadoop/

wget http://archive.apache.org/dist/oozie/4.1.0/oozie-4.1.0.tar.gz
tar -xvf oozie-4.1.0.tar.gz
cd oozie-4.1.0

Update the pom.xml to change the default Hadoop version to 2.3.0. The reason we're not changing it to Hadoop 2.6.0 here is that 2.3.0-oozie-4.1.0.jar is the latest available jar file. Luckily it works with the higher versions of the 2.x series.

vim pom.xml


--Search for
<hadoop.version>1.1.1</hadoop.version>
--Replace it with
<hadoop.version>2.3.0</hadoop.version>

Continue with the Oozie build…

sudo apt-get install maven
bin/mkdistro.sh -DskipTests -P hadoop-2 -DjavaVersion=1.7 -DtargetJavaVersion=1.7

cd ..
mv /media/SYSTEM/hadoop/oozie-4.1.0 /media/SYSTEM/hadoop/oozie-4.1.0-build
cp -R /media/SYSTEM/hadoop/oozie-4.1.0-build/distro/target/oozie-4.1.0-distro/oozie-4.1.0 /media/SYSTEM/hadoop/oozie-4.1.0

4. Prepare Oozie Libraries.

Update both the ~/.profile and ~/.bashrc files to contain the Oozie path. Append the below.

#OOZIE VARIABLES START
export PATH=$PATH:/media/SYSTEM/hadoop/oozie-4.1.0/bin
#OOZIE VARIABLES END

Reload the environment.


source ~/.bashrc

Prepare Oozie


cd /media/SYSTEM/hadoop/oozie-4.1.0
mkdir libext

cp /media/SYSTEM/hadoop/oozie-4.1.0-build/hadooplibs/target/oozie-4.1.0-hadooplibs.tar.gz .

tar -xvf oozie-4.1.0-hadooplibs.tar.gz

cp oozie-4.1.0/hadooplibs/hadooplib-2.3.0.oozie-4.1.0/* libext/

cd libext

wget http://dev.sencha.com/deploy/ext-2.2.zip
#If you downloaded the ExtJS 2.2 archive from an alternative mirror (e.g. OpenLogic) instead, rename it so Oozie can find it:
#mv openlogic-extjs-2.2-all-src-1.zip ext-2.2.zip

rm -fr /media/SYSTEM/hadoop/oozie-4.1.0/oozie-4.1.0-hadooplibs.tar.gz

5. Update Hadoop and Oozie Config files

core-site.xml (Add the below tags)

  <!-- OOZIE -->
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>

oozie-site.xml (Add/update the below tags). Please note the MySQL and Hadoop directory configurations; change them as per your environment.

    <property>
        <name>oozie.service.JPAService.jdbc.driver</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.url</name>
        <value>jdbc:mysql://HDMetaNode-1:3306/oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.username</name>
        <value>oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>oozie</value>
    </property>
    <property>
        <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
        <value>*=/media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop</value>
    </property>
    <property>
        <name>oozie.service.WorkflowAppService.system.libpath</name>
        <value>hdfs:///user/${user.name}/share/lib</value>
    </property>

6. Add the Oozie user and set up the Oozie server

Add Oozie user

sudo adduser oozie --ingroup hadoop

sudo chown -R oozie:hadoop /media/SYSTEM/hadoop/oozie-4.1.0

sudo chmod -R a+rwx /media/SYSTEM/hadoop/oozie-4.1.0

su oozie

cd /media/SYSTEM/hadoop/oozie-4.1.0

Setup MySql connector

wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.31.tar.gz
tar -zxf mysql-connector-java-5.1.31.tar.gz
cp mysql-connector-java-5.1.31/mysql-connector-java-5.1.31-bin.jar /media/SYSTEM/hadoop/oozie-4.1.0/libext

Setup Logs

mkdir logs
sudo chmod -R a+rwx /media/SYSTEM/hadoop/oozie-4.1.0/logs

Setup and Create Meta Tables in MySql

bin/oozie-setup.sh db create -run

Setup Oozie WebApplication

sudo apt-get install zip
bin/oozie-setup.sh prepare-war

Setup Oozie Share Library in HDFS (Change the name node URL, as per your environment)


bin/oozie-setup.sh sharelib create -fs hdfs://HDNameNode-1:8020

Start Oozie and Test the status

bin/oozied.sh start
bin/oozie admin -oozie http://localhost:11000/oozie -status
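If everything is wired up correctly, the status command should report the system mode, typically something like:

System mode : NORMAL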

 

7. Prepare the Oozie samples and run a sample through Oozie

Our Name Node is running on HDNameNode-1:8020 and our Resource Manager (YARN) on HDResNode-1:8032, hence we have to update the configuration of the samples as below. Change the host and port as per your environment.


tar -zxvf oozie-examples.tar.gz
find examples/ -name "job.properties" -exec sed -i "s/localhost:8020/HDNameNode-1:8020/g" '{}' \;
find examples/ -name "job.properties" -exec sed -i "s/localhost:8021/HDResNode-1:8032/g" '{}' \;

Put the samples to HDFS
 
hdfs dfs -mkdir /user/oozie/examples

hdfs dfs -put examples/* /user/oozie/examples/

Run a sample by submitting a Job


oozie job -oozie http://HDAppsNode-1:11000/oozie -config examples/apps/map-reduce/job.properties -run

Check the status of the job


Now open a web browser and access http://HDAppsNode-1:11000/oozie; you will see the submitted job.
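You can also query a specific job from the CLI; the job ID below is just a placeholder for the ID printed when you submitted the job:

oozie job -oozie http://HDAppsNode-1:11000/oozie -info <job-id>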

Tuesday, February 16, 2016

RDP Over SSH or RDP with HTTP-Proxy (Or Any Protocol over SSH)

This tutorial addresses the below scenarios:

1. An alternative to X11 forwarding (X11 forwarding has performance issues)

2. RDP to a remote host through a corporate HTTP proxy server

3. Circumventing native RDP's inability to use a corporate proxy

4. Getting the remote desktop of a public server, through corporate proxy/firewall

Typically, if you're inside your corporate network, your network will be protected by a proxy server and firewall. But suppose you would like to access the remote desktop of your Linux machine residing on the internet or in a public cloud (e.g. Azure). By default, all TCP connections to the outside will be channelled through the proxy server, and firewall rules will be applied.

We can use RDP through an existing SSH tunnel with a technique called SSH port forwarding. In our case, TCP packets sent to a port on our local machine (e.g. 5000) will be routed to a desired port (e.g. 3389, the RDP port) on the remote host through the encrypted SSH connection. The proxy/firewall only sees the SSH connection; it won't see the RDP connection, as it is hidden inside the encrypted SSH session.
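In its general form (a sketch; the port numbers and host names are placeholders you substitute with your own), the forwarding looks like this:

ssh -L <local-port>:<target-host>:<target-port> <user>@<ssh-server>

So forwarding local port 5000 to the RDP port of the SSH server itself would be 'ssh -L 5000:localhost:3389 user@remote-server'; the proxy-aware variant for our scenario is shown further below.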

Please note that this is not restricted to RDP alone; you can redirect any port with SSH. Be it VNC, FTP, HTTP or anything else, you can use SSH to forward any protocol of your choice. One obvious advantage is that you get the high security and strong encryption of SSH as the base for your channels.

image

Let's implement this on Windows/Linux client machines in a corporate network (with a proxy server) that want to connect to a remote Ubuntu server on the internet.

*) Install putty

*) Provide Proxy settings  (You can check your wpad URL, to view your proxy settings)

image

*) Enable tunneling and forward local port 5000, to RDP port 3389 on server

image

*) Connect the putty session as usual to the remote server port 22

image

*) Open mstsc and connect to server using ‘localhost:5000’

image

The RDP session will now be routed through SSH, and the server will respond with an RDP login screen.

image

Linux Client Machines:

** If you're using a Linux client like Ubuntu, open SSH from your command-line shell. The command will look like this:

ssh -L 5000:localhost:3389 -p 22  -X username@23.110.008.200  -o "ProxyCommand=nc -X connect -x xyz.proxy.org:8080 %h %p"

Here 'localhost' refers to the remote machine, because the forwarding target is resolved from the SSH server's side.

Then open an RDP client like Remmina or rdesktop and connect to localhost:5000.
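For example, with rdesktop (assuming it is installed), that is simply:

rdesktop localhost:5000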

 

Read more here.

Hadoop Multi Node Cluster On Linux Containers - Backed By ZFS RAID File System - Inside Azure VM

We've succeeded in building a Multi-Node Hadoop Cluster using only LXD Linux containers.

i.e. all Name/Data Nodes have been built inside four LXD containers (one Name Node, three Data Nodes) hosted inside a single Azure Ubuntu VM. Here are the major points of the implementation.

For unparalleled scalability, I've chosen a standard Azure Ubuntu VM (8-core CPU, 14 GB RAM, 4x250 GB data disks) with a ZFS-RAID file system. That is, to increase the hardware capability I can scale up my Azure VM hardware at any time. To scale up the storage, I can dynamically add Azure storage disks on demand and then add them to the ZFS pool, which dynamically provides the additional storage to all running containers. Cool huh? :)

Also, with the ZFS file system you can clone or snapshot your file system at any moment. You can also hot-add additional storage at any time, and it becomes immediately visible to the running OS. ZFS also offers RAID-0 (which I've chosen) and RAID-Z provisioning.
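For instance, after attaching a new Azure data disk (say it shows up as sdg; the device name here is only an illustration), it can be added to the existing pool on the fly:

sudo zpool add ZFS-LXD-Pool sdg
sudo zpool list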

1. Provisioned the Ubuntu VM inside Azure. Installed LXDE, RDP packages to get the remote desktop on my Local Machine

2. Installed ZFS Tools and created a new ZFS Pool containing the 4x250GB disks (1TB)

3. Mounted the ZFS pool so that the LXD directories reside inside ZFS

4. Installed LXD and updated the networking settings (changed the LAN subnet IPs)

5. Pulled the Ubuntu-Trusty-AMD64 image from the LXD repository

6. Created the first LXD container (HDNameNode-1) and configured a Hadoop Single Node Cluster in it

7. Snapshotted the ZFS file system, to keep my Hadoop-Single-Node-Cluster for later retrieval if required

8. Updated the HDNameNode-1 container to have the multi-node configurations (which all nodes in a cluster should have). (I've used tutorials 1 & 2.) Tutorial 2 is for Hadoop 1.0+, so I had to rely on Tutorial 1 to get the settings for a Hadoop 2.0+ multi-node configuration using YARN.

9. Cloned the HDNameNode-1 container using the LXD utility to get three more containers which act as Data Nodes (HDDataNode-1, HDDataNode-2, HDDataNode-3)

10. Updated the IP and network settings for the Name/Data nodes as desired

11. Updated the HDNameNode-1 container to have the name-node-specific multi-node configurations

12. Updated all DataNode containers to have the data-node-specific multi-node configurations

13. Restarted all containers and started Hadoop on the NameNode, which in turn powered up the Data Nodes.

14. Now I'm running a 4-Node Hadoop Cluster.

15. I can clone any Data Node at any time to add more data nodes if desired in future

16. Snapshotted my ZFS pool for disaster recovery.



A few commands are listed below for your reference:

---Add ZFS package


add-apt-repository ppa:zfs-native/stable
apt-get update
apt-get install ubuntu-zfs


---Create ZFS Pool and Mount for LXD directories
sudo zpool create -f -m none ZFS-LXD-Pool sdc sdd sde sdf
sudo zfs create -p -o mountpoint=/var/lib/lxd    ZFS-LXD-Pool/LXD/var/lib
sudo zfs create -p -o mountpoint=/var/log/lxd    ZFS-LXD-Pool/LXD/var/log
sudo zfs create -p -o mountpoint=/usr/lib/lxd    ZFS-LXD-Pool/LXD/usr/lib
---Install LXD and update network settings for containers
add-apt-repository ppa:ubuntu-lxc/lxd-stable
apt-get update
sudo apt-get install lxd
#To update the subnet to 192.168.1.0 and DHCP Leases
sudo vi /etc/default/lxc-net
sudo service lxc-net restart
--Add Ubuntu Trusty AMD64 Container Image and create first container, to setup SingleNodeCluster
lxc remote add images images.linuxcontainers.org
lxc image copy images:/ubuntu/trusty/amd64 local: --alias=Trusty64
lxc launch Trusty64 HDNameNode-1
lxc stop HDNameNode-1
zfs snapshot  ZFS-LXD-Pool/LXD/var/lib@Lxd-Base-Install-With-Trusty64
zfs list -t snapshot

***Configure Hadoop Single Node Cluster On first container and Snapshot it

**change ip to 192.168.1.2


lxc start HDNameNode-1
lxc exec HDNameNode-1 /bin/bash
lxc stop HDNameNode-1
zfs snapshot  ZFS-LXD-Pool/LXD/var/lib@Hadoop-Single-Node-Cluster
zfs list -t snapshot
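If you ever need to go back to the single-node snapshot, a rollback sketch would be (stop the containers first; note that rolling back past newer snapshots requires the -r flag, which discards them):

sudo zfs rollback ZFS-LXD-Pool/LXD/var/lib@Hadoop-Single-Node-Cluster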

***Update Name Node configuration to have Multi-Node-Cluster-Settings

-----Clone Name Node to 3 Data Nodes

lxc copy HDNameNode-1 HDDataNode-1
lxc copy HDNameNode-1 HDDataNode-2
lxc copy HDNameNode-1 HDDataNode-3
#Change IPs to 192.168.1.3,4 and 5

***Update all Data Node configurations to have Multi-Node-Cluster-Settings

---Snapshot the Multi Node Cluster

Setting up SPARK 1.6 - Hadoop @ Desk

Hope you've setup Hadoop @ Desk.

In this tutorial, we will setup SPARK 1.6. (Before you start, snapshot your VM, if not already done).
SPARK can run in standalone as well as in distributed mode. But to fully leverage its power, we are going to set it up on top of our core Hadoop installation. In this case SPARK will run in distributed mode over HDFS, and this is called 'SPARK over YARN'.

Note: You need to change paths as per your environment (i.e in my case I'm using '/media/SYSTEM', you've to replace it with yours)

Steps below:

1. Start your VM (Or Host, if you've installed Hadoop directly on Host)

2. Configure `HADOOP_CONF_DIR` environment variable to point to your Core Hadoop Configuration
(if not already done)


$ su hduser
$ cd
$ sudo leafpad ~/.bashrc
export HADOOP_CONF_DIR=/media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop

3. Install Scala.
** This step is not required any more with new Spark builds


$ wget http://downloads.typesafe.com/scala/2.11.7/scala-2.11.7.tgz

$ sudo tar -xzf scala-2.11.7.tgz
$ sudo mkdir -p /media/SYSTEM/hadoop/scala/
$ sudo chown hduser /media/SYSTEM/hadoop/scala/
$ sudo mv scala-2.11.7 /media/SYSTEM/hadoop/scala/
$ sudo leafpad ~/.bashrc
#SCALA VARIABLES START
export SCALA_HOME=/media/SYSTEM/hadoop/scala/scala-2.11.7
export PATH=$PATH:$SCALA_HOME/bin
#SCALA VARIABLES END

 
4. Download and Setup SPARK 1.6

We've to download a compatible version of SPARK, which is built for the specific Hadoop version that we've. In this case we've a Hadoop 2.7 setup. So I've chosen latest SPARK, which is built for the given Hadoop version. Please see the below diagram, which shows how to pick the right version of SPARK for your Hadoop installation.

URL here.

sparkdownload

Once you choose your version, you will get a specific download URL, as seen in the above figure. That URL has been used to download the SPARK distribution below.

$ wget http://www.eu.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
$ sudo tar -xzf spark-1.6.0-bin-hadoop2.6.tgz
$ sudo mkdir -p /media/SYSTEM/hadoop/spark/
$ sudo chown hduser /media/SYSTEM/hadoop/spark/
$ sudo mv spark-1.6.0-bin-hadoop2.6 /media/SYSTEM/hadoop/spark/
$ sudo mv /media/SYSTEM/hadoop/spark/spark-1.6.0-bin-hadoop2.6 /media/SYSTEM/hadoop/spark/spark-1.6.0
$ vi ~/.bashrc
#SPARK VARIABLES START
export SPARK_HOME=/media/SYSTEM/hadoop/spark/spark-1.6.0
export PATH=$PATH:$SPARK_HOME/bin
#SPARK VARIABLES END

5. Now we've some configuration edits for SPARK (specific for Single Node Cluster)

The below configuration specifies that we're running a Single Node Cluster (localhost only). It also sets some defaults.


$ cd /media/SYSTEM/hadoop/spark/spark-1.6.0/conf
$ sudo cp spark-env.sh.template spark-env.sh
$ sudo leafpad spark-env.sh
export SPARK_MASTER_IP=localhost
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=800m
export SPARK_WORKER_INSTANCES=1

With the below we accept the default settings for the 'slaves' file and create a separate log directory for SPARK logs. We then update the Log4j settings so that logs are written to a log file in the specified log directory.

$ sudo cp slaves.template slaves
$ sudo leafpad slaves
$ sudo mkdir -p /media/SYSTEM/hadoop/spark/logs
$ sudo chown hduser /media/SYSTEM/hadoop/spark/logs
$ sudo cp log4j.properties.template log4j.properties
$ sudo leafpad log4j.properties
log4j.rootLogger=INFO, FILE
log4j.rootCategory=INFO, FILE
log4j.logger.org.eclipse.jetty=WARN
log4j.appender.FILE=org.apache.log4j.FileAppender
log4j.appender.FILE.File=/media/SYSTEM/hadoop/spark/logs/SparkOut.log
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

 

6. Test SPARK with YARN (distributed mode)

Start Hadoop and then SPARK under YARN.

$ start-dfs.sh 
//namenode UI should be up at http://localhost:50070/dfshealth.html#tab-overview
$ start-yarn.sh
//Yarn Cluster manager UI should be up at http://localhost:8088/cluster
$ spark-shell --master yarn
//this will start spark-shell under yarn cluster

Now Scala prompt will appear. Run some simple commands to run them on Hadoop.

scala> sc.parallelize( 2 to 200).count
//should return res0: Long = 199
scala> exit

You can also run one of the bundled example programs to exercise SPARK:

$ run-example SparkPi 5


The UI corresponding to the SPARK shell can be found with the below URL

http://localhost:4040/jobs/

7. Famous Word Count Program

I've used the below example to count the word 'is' in an input file (newwords.txt) that resides in HDFS.


scala> val input = sc.textFile("newwords.txt")
scala> val splitedLines = input.flatMap(line => line.split(" ")).filter(x => x.equals("is"))
scala> System.out.println(splitedLines.count())

Note: I've copied the input file to HDFS using


hdfs dfs -copyFromLocal newwords.txt
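With no explicit destination, the file should end up in the current user's HDFS home directory (assuming you run as hduser, that is /user/hduser), which is also where the relative path passed to sc.textFile resolves. If you prefer to be explicit:

hdfs dfs -copyFromLocal newwords.txt /user/hduser/newwords.txt
hdfs dfs -ls /user/hduser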

8. Snapshot your VM

 

 

Saturday, January 30, 2016

Use Ubuntu Packages in Debian (& Vice-Versa) with apt-get

Debian is a great operating system worth a try. It's lightweight, has a great backing community, and is aimed at more advanced users. If you want to migrate from Ubuntu to Debian for better control or whatever reason, you may miss some good features or programs that are distributed through the Ubuntu repositories.

I recently made the choice to move from Ubuntu 14.04 LTS to Debian Jessie LTS. When setting up KVM with nested virtualization, I found that the command 'kvm-ok' belongs to the original Ubuntu main repository and doesn't seem to be available in Debian. Below are the steps I followed to add the Ubuntu repository in Debian and install the necessary Ubuntu packages.

1. From an Ubuntu System, get the repository URL, in which the desired package belongs.

From the Ubuntu system, open '/etc/apt/sources.list' and get the required repository URL. Here we're taking the main Ubuntu repository URL to which the 'cpu-checker' package belongs; kvm-ok comes with 'cpu-checker'.

Note: Please see that we are using 'trusty main', as Trusty corresponds to Debian Jessie, the release into which we are importing the packages, so we retain package compatibility there.

To check, which Debian version on which the ubuntu has based, see this link.

image

2. Add the repository URL to the target Debian source List and update.

$ echo 'deb http://in.archive.ubuntu.com/ubuntu/ trusty-updates main restricted' | sudo tee -a /etc/apt/sources.list

$ sudo apt-get update

You will get errors, as the Ubuntu-specific public keys are not present on the Debian system. Looking at the error, you can find the public keys you need to import. For example, I got the below error after the update operation.

image

“W: GPG error: http://in.archive.ubuntu.com trusty-updates InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 40976EAF437D05B5 NO_PUBKEY 3B4FE6ACC0B21F32”

So I had to run the below commands:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 40976EAF437D05B5

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 3B4FE6ACC0B21F32

 

3. Update Debian again, and install the required package using apt-get.

sudo apt-get update

sudo apt-get install cpu-checker

sudo kvm-ok

INFO: /dev/kvm exists
KVM acceleration can be used

 

image

Wednesday, January 27, 2016

Setting up SQOOP 1.4 - Hadoop @ Desk (Single Node Cluster)

Hope you've setup your Hadoop Single Node Cluster @ Your Desk.

In this tutorial, we will setup SQOOP 1.4.6. (Before you start, snapshot your VM, if not already done).
For testing, we will use SQOOP to import an RDBMS table from MySQL into Hadoop Hive.

Note: MySQL installation and setting up dummy data has been discussed in the Appendix section.

Note: You need to change paths as per your environment (i.e in my case I'm using '/media/SYSTEM', you've to replace it with yours)

Steps below:

1. Start your VM (Or Host, if you've installed Hadoop directly on Host)

2. Get Sqoop 1.4.6 and move to our dedicated partition (as that of Hadoop) for better management


$ su hduser
Password: 

$ cd
$ wget http://www.eu.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
 
$ sudo tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
$ sudo mkdir -p /media/SYSTEM/hadoop/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
$ sudo mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha /media/SYSTEM/hadoop/sqoop/
$ sudo chown hduser /media/SYSTEM/hadoop/sqoop/

3. Update .bashrc file, to have 'Sqoop' specific configuration

$ vi .bashrc

#SQOOP VARIABLES START
export SQOOP_HOME=/media/SYSTEM/hadoop/sqoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
export PATH=$PATH:$SQOOP_HOME/bin
#SQOOP VARIABLES END

Now close the terminal and reopen a new one (for the new environment variables to take effect).
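Alternatively, reload the environment in the current shell, as done elsewhere in this series:

source ~/.bashrc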


4. Editing configuration files for SQOOP

$ su hduser
$ cd $SQOOP_HOME/conf
$ sudo cp sqoop-env-template.sh sqoop-env.sh
$ vi sqoop-env.sh

export HADOOP_COMMON_HOME=/media/SYSTEM/hadoop/hadoop-2.7.0
export HADOOP_MAPRED_HOME=/media/SYSTEM/hadoop/hadoop-2.7.0
export HIVE_HOME=/media/SYSTEM/hadoop/hive/apache-hive-1.2.1

5. Setup MySQL drivers, to be used by SQOOP (for importing MySQL tables to Hive)


$ cd $SQOOP_HOME/lib
$ sudo wget http://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.38.tar.gz

$ sudo tar -zxvf mysql-connector-java-5.1.38.tar.gz
$ sudo cp mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar ./ 
6. Reboot


7. Start hadoop

$ su hduser
$ cd
$ start-dfs.sh
$ start-yarn.sh

8. Copy the Sqoop-specific JARs to HDFS

So that every datanode can access the libraries for Sqoop processing.
NB: Without this, SQOOP will not work properly!


$ hdfs dfs -mkdir -p $SQOOP_HOME/lib
$ hdfs dfs -copyFromLocal $SQOOP_HOME/lib/* $SQOOP_HOME/lib/
$ hdfs dfs -copyFromLocal $SQOOP_HOME/sqoop-1.4.6.jar $SQOOP_HOME/
9. Start MySQL 
$ mysql -u mysqluser -p

10. Now Import a Table from MySQL to Hive using SQOOP


$ sqoop import --bindir ./ --connect jdbc:mysql://localhost:3306/scooptest --username mysqluser --password pass1 --table employee --hive-import --hive-overwrite

Note: See Appendix section, regarding MySQL installation and table setup, for this test.

Now take the 'hive' prompt, and see your data has been populated inside Hive tables


$ hive
hive> select * from employee where id >= 2;


11. Stop Hadoop, Shutdown and Snapshot your VM


$ stop-all.sh
$ sudo shutdown now

Appendix:

MySQL installation and setting up some tables for the SQOOP Test.

MySQL User: 'mysqluser' (To be used for SQOOP import)
Database: scooptest
Table: employee


$ sudo apt-get install mysql-server
$ mysql -u root -p
mysql> create database scooptest;
mysql> grant all on scooptest.* to 'mysqluser' identified by 'pass1';
mysql> use scooptest;
mysql> create table employee(id int primary key, name text);
mysql> insert into employee values (1, 'smith');
mysql> insert into employee values (2, 'john');
mysql> insert into employee values (3, 'henry');

Setting up Pig 0.15 - Hadoop @ Desk (Single Node Cluster)

Hope you've setup your Hadoop Single Node Cluster @ Your Desk.

In this tutorial, we will setup and test Pig 0.15.0. (Before you start, snapshot your VM, if not already done)

Note: You need to change paths as per your environment (i.e in my case I'm using '/media/SYSTEM', you've to replace it with yours)

Steps below:

1. Start your VM (Or Host, if you've installed Hadoop directly on Host)

2. Get Pig 0.15 and move to our dedicated partition (as that of Hadoop) for better management

$ su hduser
$ cd
$ wget http://www.eu.apache.org/dist/pig/latest/pig-0.15.0.tar.gz
$ tar -xvf pig-0.15.0.tar.gz
$ sudo mkdir -p /media/SYSTEM/hadoop/pig
$ sudo mv pig-0.15.0 /media/SYSTEM/hadoop/pig/pig-0.15.0
$ sudo chown -R hduser /media/SYSTEM/hadoop/pig

3. Update .bashrc file, to have 'Pig' specific configuration

$ vi .bashrc

#To avoid 'Found interface jline.Terminal, but class was expected'
#export HADOOP_USER_CLASSPATH_FIRST=false
#PIG VARIABLES START
export PIG_INSTALL=/media/SYSTEM/hadoop/pig/pig-0.15.0
export PATH=${PATH}:${PIG_INSTALL}/bin
#PIG VARIABLES END



NB: Note the 'HADOOP_USER_CLASSPATH_FIRST' environment variable in the comments above; it is the setting to adjust if Pig hits compatibility (jline) issues with Hadoop's bundled Java libraries.


4. Editing configuration files for Pig

Add a 'pigbootup' file with empty content (Pig expects this file to auto populate its values)

By default Pig will write logs to the root partition. Move the logs file to a separate location, for better management.

$ touch ~/.pigbootup
$ mkdir /media/SYSTEM/hadoop/pig/pig-0.15.0/logs
$ vi /media/SYSTEM/hadoop/pig/pig-0.15.0/conf/pig.properties

pig.logfile=/media/SYSTEM/hadoop/pig/pig-0.15.0/logs/

5. Reboot

6. Start hadoop

$ start-all.sh

7. Testing Pig (The famous `Word Count` Example - In MapReduce/Hadoop Mode)

$ su hduser
$ cd
$ cat > words.txt
this is a test file contains words
(press Ctrl+D to finish the file)
$ hdfs dfs -copyFromLocal words.txt words.txt
$ pig
grunt> A = load './words.txt';
grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;
grunt> dump D



8. Stop Hadoop, shut down and snapshot your VM

$ stop-all.sh

$ sudo shutdown now

Setting up Hive 1.2.1 - Hadoop @ Desk (Single Node Cluster)

 

Hope you've setup your Hadoop Single Node Cluster @ Your Desk.

In this tutorial, we will setup and test Hive 1.2.1. (Before you start, snapshot your VM, if not already done)

Note: You need to change paths as per your environment (i.e in my case I'm using '/media/SYSTEM', you've to replace it with yours)

Steps below (I've compiled the steps from here and here):

1. Start your VM (Or Host, if you've installed Hadoop directly on Host)

2. Get Hive 1.2.1 and move to our dedicated partition (as that of Hadoop) for better management

$ su hduser
$ cd
$ wget http://archive.apache.org/dist/hive/stable/apache-hive-1.2.1-bin.tar.gz

$ tar -xzvf apache-hive-1.2.1-bin.tar.gz
$ mkdir -p /media/SYSTEM/hadoop/hive/
$ mv apache-hive-1.2.1-bin /media/SYSTEM/hadoop/hive/apache-hive-1.2.1

3. Update .bashrc file, to have 'hive' specific configuration

$ vi ~/.bashrc

#HIVE VARIABLES START
HIVE_HOME=/media/SYSTEM/hadoop/hive/apache-hive-1.2.1
export HIVE_HOME
export PATH=$PATH:$HIVE_HOME/bin
#HIVE VARIABLES END

4. Update hive-config.sh, to have Hadoop Home directory

Append the export line, to the file

$ vi /media/SYSTEM/hadoop/hive/apache-hive-1.2.1/bin/hive-config.sh

export HADOOP_HOME=/media/SYSTEM/hadoop/hadoop-2.7.0

5. Start hadoop, if not already done
 

$ start-all.sh


6. Create Hive specific directories

$ hadoop fs -mkdir /tmp && hadoop fs -mkdir -p /user/hive/warehouse && hadoop fs -chmod g+w /tmp && hadoop fs -chmod g+w /user/hive/warehouse

 

Copy the newer jline JAR into Hadoop's YARN lib directory; this avoids the 'Found interface jline.Terminal, but class was expected' clash between Hive's jline and the older one bundled with Hadoop.

$ cp /media/SYSTEM/hadoop/hive/apache-hive-1.2.1/lib/jline-2.12.jar /media/SYSTEM/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/


7. Stop Hadoop and Reboot

$ stop-all.sh
$ sudo reboot

8. Start Hadoop and then Hive

$ su hduser
$ cd
$ start-all.sh
$ hive

9. Test Hive (On the hive prompt Create a Hive Table, Do some insert and Select)

hive> !clear;
hive> create table employee(name string, id int);
hive> insert into employee values('george',1);
hive> insert into employee values('mathew',2);
hive> select name from employee where id = 2;
hive> quit;

Note: you can actually see Map Reduce jobs are being created on the fly while executing these commands



10. Stop Hadoop, shut down and snapshot your VM

$ stop-all.sh

$ sudo shutdown now
