Sunday, December 20, 2015

Prepare Yourself - Hadoop @ Desk (Single Node Cluster)

The whole idea of setting up Hadoop on your own desktop or laptop seems a bit crazy, especially if your system has an entry-level configuration. This is one of the primary reasons why many stay away from Big Data development. Many think they need servers, or at least multiple machines (a ready-made lab environment), to set up Hadoop, so they never consider playing with it at home.

This was my conviction too, until I got a working Hadoop 2.0 running on my desktop (with low specs: a dual-core processor and 4 GB of RAM). So I thought sharing my experience might help others who have the same doubts. In this blog, I will lay down the prerequisites and environment choices needed for a usable Hadoop installation on your desktop or laptop.
Based on my experience, the choices below are extremely important for success. Put another way, these are the same bottlenecks which keep many from running Hadoop on a typical desktop or laptop.

A. Prefer a hands-on installation of Hadoop over Quick Start VMs

Many choose the easy way out to experience Hadoop: there are Quick Start VMs available from Cloudera and Hortonworks. I agree that you can jump straight into using the system if you have a decent configuration, but this approach has the following cons:
i.  The prebuilt VMs pack a huge set of modules which you will never use, yet they eat up your precious system resources and clog your machine down. In practice you will only use a few of those modules.
ii.  You never learn the bare bones of your Hadoop system, which is essential for understanding your Map-Reduce programs and troubleshooting low-level issues.
In essence, install the Hadoop system on your own and keep only the modules you actually want.

e.g. I’ve installed Hadoop from scratch, starting from the Hadoop core.
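If you are curious what that looks like in practice, here is a rough sketch of a bare-bones single-node install. The version number, download URL and JDK path below are only examples; pick whatever stable 2.x release and JDK your distro provides.

  # Download and unpack a Hadoop 2.x release (2.7.1 is only an example version)
  $ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
  $ tar -xzf hadoop-2.7.1.tar.gz && cd hadoop-2.7.1

  # Point Hadoop at your JDK (path depends on your distro)
  $ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

  # Minimal single-node settings, in etc/hadoop/core-site.xml and hdfs-site.xml:
  #   fs.defaultFS    = hdfs://localhost:9000
  #   dfs.replication = 1

  # Format the NameNode and start HDFS only -- nothing you don't need
  $ bin/hdfs namenode -format
  $ sbin/start-dfs.sh
  $ bin/hdfs dfs -ls /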

B. Prefer a lightweight Desktop Host OS

Once you are ready to handle the Hadoop installation yourself, the next hurdle is choosing the desktop OS that best suits the need. Since Hadoop needs good system resources, the key is to choose a Host OS which is as light on system resources as possible, while still offering a decent GUI desktop and the features you need for daily use.

e.g. I’ve chosen Lubuntu 14.04 LTS as my Host OS, which after boot-up consumes only about 160 MB of RAM with a working desktop.
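You can verify how light your own host really is right after boot; the numbers will of course differ from machine to machine:

  # Rough idea of the desktop's idle footprint
  $ free -m
  # Which processes are holding the memory
  $ ps aux --sort=-%mem | head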


C. Prefer Virtualization Technology over a bare installation on the Host system

Many get lost during a Hadoop installation because they mess too much with their primary desktop OS. So the rule of thumb is: never pollute your primary desktop OS (the Host OS). Keep it only for daily tasks like browsing, document editing and other personal stuff. Move the serious work into Linux Containers or Virtual Machines (the Guest OS).
e.g. I’ve set up the whole Hadoop system inside a Virtual Machine and also inside Linux Containers.
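As a rough illustration with LXD (the container name and package below are just placeholders), everything Hadoop needs, including the JDK, lives inside the container and never touches the host:

  # A dedicated, throwaway container for Hadoop experiments
  $ lxc launch ubuntu:trusty hadoop-sandbox

  # Install Java inside the container -- the host stays clean
  $ lxc exec hadoop-sandbox -- apt-get update
  $ lxc exec hadoop-sandbox -- apt-get install -y openjdk-7-jdk

  # Throw the whole thing away later, leaving no trace on the host
  $ lxc delete --force hadoop-sandbox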


D. Prefer a superior Virtualization Technology

It's also important to choose the right virtualization suite, as virtualization itself adds some performance overhead, so pick the one with the least overhead. Linux Containers are the top-notch choice for this purpose; both LXC and LXD are a good fit for setting up the cluster. If you go with Virtual Machines, Type 1 hypervisors perform better than Type 2 ones. Many choose VMware Player or VirtualBox, which are essentially Type 2 and significantly slower. In such cases choose a Type 1 hypervisor like KVM instead (though the debate about its Type 1 status is still going on). If Docker or containers work for your use case, they are the best option.
One more way to maximize performance is to para-virtualize the guest OS's virtual devices rather than fully virtualizing them.

Another good thing about using containers/virtualization instead of the host machine is that you can easily save the state of the Container/Guest OS (snapshots) at any point in time. Since the Hadoop installation steps are pretty long and error-prone, take a snapshot of your Container/VM once in a while. When something goes wrong, restore the Container/VM to the most recent working state, so you won't lose days of precious hard work.
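The snapshot workflow is just a couple of commands. Here is a sketch with LXD on the container side and virsh on the VM side (the container, VM and snapshot names are placeholders):

  # LXD container: checkpoint after each successful installation step
  $ lxc snapshot hadoop-master after-ssh-setup
  $ lxc restore hadoop-master after-ssh-setup      # roll back when something breaks

  # KVM/libvirt VM with a qcow2 disk: internal snapshots via virsh
  $ virsh snapshot-create-as hadoop-vm after-ssh-setup
  $ virsh snapshot-revert hadoop-vm after-ssh-setup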

e.g. I’ve chosen LXD/LXC Containers with ZFS to set up my cluster.
 
If your choice is Virtual Machines, I recommend Qemu-KVM (Kernel-based Virtual Machine) under Linux, using internal snapshots. I’ve para-virtualized the display, disk, network and memory access. I’ve detailed both concepts in this blog.
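For reference, this is roughly how such a guest can be defined with virt-install (the name, sizes, ISO and image paths are only examples, and flags can vary slightly between virt-install versions). Virtio covers the disk, network and memory balloon, SPICE/QXL handles the display, and the qcow2 disk format is what makes internal snapshots possible:

  $ virt-install \
      --name hadoop-node1 \
      --ram 2048 --vcpus 2 \
      --disk path=/var/lib/libvirt/images/hadoop-node1.qcow2,size=20,format=qcow2,bus=virtio \
      --network network=default,model=virtio \
      --memballoon virtio \
      --graphics spice --video qxl \
      --cdrom ~/isos/lubuntu-14.04-desktop-amd64.iso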


E. Prefer a lightweight Guest OS

This step is similar to choosing your Host OS: your Container/VM (Guest OS) should also be as light as possible, so that more resources are left for Hadoop. If you are using Linux Containers, choose an Ubuntu LTS image. For Virtual Machines, you can choose Ubuntu Core or Ubuntu Server if you're comfortable with the CLI; Ubuntu Core is about as light as it gets and will give you a better Hadoop experience than the others. You can also choose an OS with a GUI desktop, but pick one that is very lightweight.

e.g. For my LXD containers, I’ve chosen the Trusty container image. For clusters using Virtual Machines, I’ve chosen Lubuntu 14.04 LTS as my Guest OS as well.
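Pulling the stock Trusty image and checking how little runs inside it looks roughly like this (the container name is a placeholder):

  # Start a container from the official Ubuntu 14.04 (Trusty) image
  $ lxc launch ubuntu:trusty hadoop-master
  $ lxc list

  # Only a handful of processes run in a fresh container, so almost
  # all of your RAM stays available for Hadoop
  $ lxc exec hadoop-master -- ps aux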


We will see the actual Hadoop 2.0+ setup in Lubuntu (Host/Guest) in the next blog. Thank you!
