I hope you've already set up Hadoop @ Desk.
In this tutorial, we will set up SPARK 1.6. (Before you start, snapshot your VM if you haven't already done so.) SPARK can run in standalone as well as in distributed mode, but to fully leverage its power, we are going to set it up on top of our core Hadoop installation. In this case SPARK will run in distributed mode over HDFS, a setup called 'SPARK over YARN'.
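To make the distinction concrete, here is a minimal sketch of how the same shell is launched in the two modes (only the YARN form is used later in this tutorial; the --master values are standard spark-shell options):

$ spark-shell --master local[2]    //local mode: driver and executors run in a single JVM on this machine
$ spark-shell --master yarn    //distributed mode: the driver asks YARN for containers and executors run across the cluster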
Note: You need to change paths as per your environment (e.g., in my case I'm using '/media/SYSTEM'; you have to replace it with yours).
Steps below:
1. Start your VM (Or Host, if you've installed Hadoop directly on Host)
2. Configure `HADOOP_CONF_DIR` environment variable to point to your Core Hadoop Configuration
(if not already done)
$ su hduser
$ cd
$ sudo leafpad ~/.bashrc

export HADOOP_CONF_DIR=/media/SYSTEM/hadoop/hadoop-2.7.0/etc/hadoop
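As a quick sanity check (my addition, not part of the original steps), confirm the variable points at a directory that actually contains the core Hadoop config files:

$ source ~/.bashrc
$ echo $HADOOP_CONF_DIR
$ ls $HADOOP_CONF_DIR/core-site.xml $HADOOP_CONF_DIR/yarn-site.xml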
3. Install Scala
(** This step is no longer required with newer SPARK builds, which bundle the Scala libraries they need.)
$ wget http://downloads.typesafe.com/scala/2.11.7/scala-2.11.7.tgz
$ sudo tar -xzf scala-2.11.7.tgz
$ sudo mkdir -p /media/SYSTEM/hadoop/scala/
$ sudo chown hduser /media/SYSTEM/hadoop/scala/
$ sudo mv scala-2.11.7 /media/SYSTEM/hadoop/scala/
$ sudo leafpad ~/.bashrc
#SCALA VARIABLES START
export SCALA_HOME=/media/SYSTEM/hadoop/scala/scala-2.11.7
export PATH=$PATH:$SCALA_HOME/bin
#SCALA VARIABLES END
4. Download and install SPARK

We have to download a SPARK build that matches the Hadoop version we have. In this case we have a Hadoop 2.7 setup, so I've chosen the latest SPARK release with the package type built for that Hadoop line ('Pre-built for Hadoop 2.6 and later'). Please see the figure below, which shows how to pick the right version of SPARK for your Hadoop installation.
(Figure: the SPARK download page, showing the release, package type, and download URL selections.)
Once you choose your version, you will get a specific download URL, as seen in the above figure. That URL has been used to download the SPARK distribution below.
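If you're not sure which Hadoop line you're on, you can check it before choosing a package (a quick check I'm adding here, not part of the original steps):

$ hadoop version    //should report 2.7.0 for the setup used in this tutorial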
$ wget http://www.eu.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
$ sudo tar -xzf spark-1.6.0-bin-hadoop2.6.tgz
$ sudo mkdir -p /media/SYSTEM/hadoop/spark/
$ sudo chown hduser /media/SYSTEM/hadoop/spark/
$ sudo mv spark-1.6.0-bin-hadoop2.6 /media/SYSTEM/hadoop/spark/
$ sudo mv /media/SYSTEM/hadoop/spark/spark-1.6.0-bin-hadoop2.6 /media/SYSTEM/hadoop/spark/spark-1.6.0
$ vi ~/.bashrc

#SPARK VARIABLES START
export SPARK_HOME=/media/SYSTEM/hadoop/spark/spark-1.6.0
export PATH=$PATH:$SPARK_HOME/bin
#SPARK VARIABLES END
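To make sure the new variables are picked up, reload your profile and verify (these checks are my addition; spark-submit --version simply prints the SPARK build information):

$ source ~/.bashrc
$ echo $SPARK_HOME
$ spark-submit --version    //should report version 1.6.0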
5. Configure SPARK

The configuration below specifies that we're running a single-node cluster (localhost only) with a single worker, along with some modest defaults for worker cores and memory.
$ cd /media/SYSTEM/hadoop/spark/spark-1.6.0/conf
$ sudo cp spark-env.sh.template spark-env.sh
$ sudo leafpad spark-env.sh

export SPARK_MASTER_IP=localhost
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=800m
export SPARK_WORKER_INSTANCES=1

Next we accept the default settings for the 'slaves' file, create a separate log directory for SPARK logs, and then update the Log4J settings so that logs are written to a log file in that directory.

$ sudo cp slaves.template slaves
$ sudo leafpad slaves
$ sudo mkdir -p /media/SYSTEM/hadoop/spark/logs
$ sudo chown hduser /media/SYSTEM/hadoop/spark/logs
$ sudo cp log4j.properties.template log4j.properties
$ sudo leafpad log4j.properties

log4j.rootLogger=INFO, FILE
log4j.rootCategory=INFO, FILE
log4j.logger.org.eclipse.jetty=WARN
log4j.appender.FILE=org.apache.log4j.FileAppender
log4j.appender.FILE.File=/media/SYSTEM/hadoop/spark/logs/SparkOut.log
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
6. Test SPARK with YARN (distributed mode)
Start Hadoop and then SPARK under YARN.
$ start-dfs.sh    //namenode UI should be up at http://localhost:50070/dfshealth.html#tab-overview
$ start-yarn.sh    //YARN cluster manager UI should be up at http://localhost:8088/cluster
$ spark-shell --master yarn    //this will start spark-shell under the YARN cluster

Now the Scala prompt will appear. Run some simple commands to execute them on Hadoop:
scala> sc.parallelize(2 to 200).count    //should return res0: Long = 199
scala> exit
You can also run one of SPARK's bundled example programs:
$ run-example SparkPi 5
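The same example can also be launched explicitly through spark-submit against YARN. A minimal sketch, assuming the examples jar shipped with the 1.6.0 binary distribution sits under $SPARK_HOME/lib (adjust the wildcard/path if your layout differs):

$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client $SPARK_HOME/lib/spark-examples*.jar 5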
The UI corresponding to the SPARK shell can be found at the URL below:
http://localhost:4040/jobs/
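Since we pointed Log4J at a file in step 5, you can also follow the driver logs there while the shell or an example is running (a quick check using the log path configured earlier):

$ tail -f /media/SYSTEM/hadoop/spark/logs/SparkOut.log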
7. Famous Word Count Program
First copy the input file (newwords.txt) from the local filesystem into HDFS; with no destination given, it lands in the HDFS home directory of the current user (/user/hduser here):

$ hdfs dfs -copyFromLocal newwords.txt

I've used the example below to count occurrences of the word 'is' in that file:

scala> val input = sc.textFile("newwords.txt")
scala> val splitedLines = input.flatMap(line => line.split(" ")).filter(x => x.equals("is"))
scala> System.out.println(splitedLines.count())
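To extend this into the classic full word count (every word, not just 'is'), a minimal sketch along the same lines would be the following; the reduceByKey step is the standard RDD idiom, and the variable and output names here are my own:

scala> val words = sc.textFile("newwords.txt").flatMap(line => line.split(" "))
scala> val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(10).foreach(println)    //print a few (word, count) pairs
scala> counts.saveAsTextFile("newwords-counts")    //optionally write the results back to HDFS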
8. Snapshot your VM