Tuesday, February 16, 2016

Setting up SPARK 1.6 - Hadoop @ Desk

Hope you've already set up Hadoop @ Desk.

In this tutorial, we will set up SPARK 1.6. (Before you start, snapshot your VM if you haven't already done so.)
SPARK can run in standalone as well as distributed mode, but to fully leverage its power we are going to set it up on top of our core Hadoop installation. In this case SPARK will run in distributed mode over HDFS and YARN, a setup known as 'SPARK over YARN'.

Note: You need to change the paths to suit your environment (e.g. in my case I'm using '/media/SYSTEM'; you have to replace it with yours)

Steps below:

1. Start your VM (Or Host, if you've installed Hadoop directly on Host)

2. Configure the `HADOOP_CONF_DIR` environment variable to point to your core Hadoop configuration directory
(if not already done)


$ su hduser
$ cd
$ sudo leafpad ~/.bashrc
export HADOOP_CONF_DIR=/media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop
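
To make the change take effect in the current shell and confirm the path, you can reload .bashrc and echo the variable:

$ source ~/.bashrc
$ echo $HADOOP_CONF_DIR
//should print /media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop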

3. Install Scala.
** This step is no longer required with newer SPARK builds


$ wget http://downloads.typesafe.com/scala/2.11.7/scala-2.11.7.tgz

$ sudo tar -xzf scala-2.11.7.tgz
$ sudo mkdir -p /media/SYSTEM/hadoop/scala/
$ sudo chown hduser /media/SYSTEM/hadoop/scala/
$ sudo mv scala-2.11.7 /media/SYSTEM/hadoop/scala/
$ sudo leafpad ~/.bashrc
#SCALA VARIABLES START
export SCALA_HOME=/media/SYSTEM/hadoop/scala/scala-2.11.7
export PATH=$PATH:$SCALA_HOME/bin
#SCALA VARIABLES END
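
After reloading .bashrc, a quick check confirms the Scala installation is on the PATH:

$ source ~/.bashrc
$ scala -version
//should report: Scala code runner version 2.11.7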

 
4. Download and Setup SPARK 1.6

We have to download a SPARK package that is pre-built for the Hadoop version we are running. In this case we have a Hadoop 2.7 setup, so I've chosen the latest SPARK release pre-built for a compatible Hadoop version (2.6 and later). Please see the figure below, which shows how to pick the right version of SPARK for your Hadoop installation.

The SPARK download page is at http://spark.apache.org/downloads.html.

(Figure: the 'sparkdownload' screenshot of the SPARK download page, showing the release, package type and download link selections)

Once you choose your version, you will get a specific download URL, as seen in the above figure. That URL has been used to download the SPARK distribution below.

$ wget http://www.eu.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
$ sudo tar -xzf spark-1.6.0-bin-hadoop2.6.tgz
$ sudo mkdir -p /media/SYSTEM/hadoop/spark/
$ sudo chown hduser /media/SYSTEM/hadoop/spark/
$ sudo mv spark-1.6.0-bin-hadoop2.6 /media/SYSTEM/hadoop/spark/
$ sudo mv /media/SYSTEM/hadoop/spark/spark-1.6.0-bin-hadoop2.6 /media/SYSTEM/hadoop/spark/spark-1.6.0
$ vi ~/.bashrc
#SPARK VARIABLES START
export SPARK_HOME=/media/SYSTEM/hadoop/spark/spark-1.6.0
export PATH=$PATH:$SPARK_HOME/bin
#SPARK VARIABLES END
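
After reloading .bashrc, you can confirm the SPARK binaries are on the PATH before touching any configuration:

$ source ~/.bashrc
$ spark-submit --version
//should print the SPARK welcome banner for version 1.6.0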

5. Now we have a few configuration edits for SPARK (specific to a Single Node Cluster)

The configuration below specifies that we're running a Single Node Cluster (localhost only), together with some sensible defaults for the worker.


$ cd /media/SYSTEM/hadoop/spark/spark-1.6.0/conf
$ sudo cp spark-env.sh.template spark-env.sh
$ sudo leafpad spark-env.sh
export SPARK_MASTER_IP=localhost
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=800m
export SPARK_WORKER_INSTANCES=1
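
The SPARK_WORKER_* values above apply when SPARK manages its own workers (standalone mode); when running over YARN, similar per-job limits can also be passed on the command line. A minimal example using the same memory and core values:

$ spark-shell --master yarn --executor-memory 800m --executor-cores 1
//requests YARN executors with 800 MB of memory and one core for this session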

With the steps below we accept the default settings for the 'slaves' file and create a separate log directory for SPARK logs. Then we update the Log4j settings so that logs are written to a log file in that directory.

$ sudo cp slaves.template slaves
$ sudo leafpad slaves
$ sudo mkdir -p /media/SYSTEM/hadoop/spark/logs
$ sudo chown hduser /media/SYSTEM/hadoop/spark/logs
$ sudo cp log4j.properties.template log4j.properties
$ sudo leafpad log4j.properties
log4j.rootLogger=INFO, FILE
log4j.rootCategory=INFO, FILE
log4j.logger.org.eclipse.jetty=WARN
log4j.appender.FILE=org.apache.log4j.FileAppender
log4j.appender.FILE.File=/media/SYSTEM/hadoop/spark/logs/SparkOut.log
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
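
Once SPARK is running (in the next step), you can confirm that its output goes to the new log file rather than the console by tailing it in a second terminal:

$ tail -f /media/SYSTEM/hadoop/spark/logs/SparkOut.log
//new log lines should appear here while spark-shell is active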

 

6. Test SPARK with YARN (distributed mode)

Start Hadoop and then SPARK under YARN.

$ start-dfs.sh 
//namenode UI should be up at http://localhost:50070/dfshealth.html#tab-overview
$ start-yarn.sh
//Yarn Cluster manager UI should be up at http://localhost:8088/cluster
$ spark-shell --master yarn
//this will start spark-shell under yarn cluster

The Scala prompt will now appear. Run a few simple commands to verify that they execute on the Hadoop cluster.

scala> sc.parallelize( 2 to 200).count
//should return res0: Long = 199
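//a couple of optional extra checks (standard SparkContext calls) before exiting:
scala> sc.version
//should return a String equal to 1.6.0
scala> sc.master
//shows which cluster manager the shell is attached to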
scala> exit






You can also run one of the bundled example programs to exercise SPARK

$ run-example SparkPi 5
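
The example estimates the value of Pi; amid the log output you should see a result line similar to the one below (the exact digits vary from run to run).

Pi is roughly 3.14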


The web UI for the running SPARK shell is available at the URL below

http://localhost:4040/jobs/

7. Famous Word Count Program

I've used the example below to count occurrences of the word 'is' in an input file (newwords.txt) that resides in the user's HDFS home directory.


scala> val input = sc.textFile("newwords.txt")
scala> val splitedLines = input.flatMap(line => line.split(" ")).filter(x => x.equals("is"))
scala> System.out.println(splitedLines.count())

Note: I've copied the input file to HDFS using


hdfs dfs -copyFromLocal newwords.txt
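
For a full word count (every distinct word with its frequency) rather than a single-word filter, a minimal sketch along the same lines can be run in the same spark-shell session, still assuming newwords.txt sits in the HDFS home directory:

scala> val input = sc.textFile("newwords.txt")
scala> val counts = input.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(10).foreach(println)
//prints up to 10 (word, count) pairs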

8. Snapshot your VM

 

 
