Saturday, January 14, 2017

Install Hadoop, Hive and Spark in Cluster

Install Hadoop in cluster

On all Nodes

Update /etc/hosts
192.168.1.10 master-node-01
192.168.1.11 core-node-01
192.168.1.12 core-node-02

Uncomment in /etc/ssh/sshd_config
PasswordAuthentication yes
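
sshd only re-reads its configuration on restart; on CentOS/RHEL 7 (assumed here, since the guide uses yum) restart it with:

$ sudo systemctl restart sshd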

Configure passwordless SSH on master-node-01 only.

$ ssh-keygen
$ ssh-copy-id -i .ssh/id_rsa.pub hadoop@core-node-01
$ ssh-copy-id -i .ssh/id_rsa.pub hadoop@core-node-02
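
A quick check that key-based login now works without a password prompt:

$ ssh hadoop@core-node-01 hostname
core-node-01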

Install JDK

$ cd /tmp
$ wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http://www.oracle.com/; oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u151-b12/e758a0de34e24606bca991d704f6dcbf/jdk-8u161-linux-x64.rpm
$ sudo yum localinstall jdk-8u161-linux-x64.rpm
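
Verify the installation:

$ java -version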

You can also install the JDK on all nodes by looping over them from master-node-01:

$ cat ~/hosts.lst
master-node-01
core-node-01
core-node-02

$ for each in `cat ~/hosts.lst`;do echo $each; scp jdk-8u161-linux-x64.rpm $each:/tmp; done
$ for each in `cat ~/hosts.lst`;do echo $each; ssh -t -q $each "sudo su - -c 'sudo yum -y localinstall /tmp/jdk-8u161-linux-x64.rpm'";done

$ wget http://download.oracle.com/otn-pub/java/jce/8/jce_policy-8.zip
$ unzip -j jce_policy-8.zip
$ sudo cp /tmp/*policy.jar /usr/java/jdk1.8.0_161/jre/lib/security/

$ for each in `cat ~/hosts.lst`;do echo $each; scp /tmp/*policy.jar $each:/tmp; done
$ for each in `cat ~/hosts.lst`;do echo $each; ssh -t -q $each "sudo su - -c 'sudo cp /tmp/*policy.jar /usr/java/jdk1.8.0_161/jre/lib/security/'";done

$ ls -l /usr/java
total 4
lrwxrwxrwx 1 root root   16 Jul 22  2014 default -> /usr/java/latest
drwxr-xr-x 9 root root 4096 Mar  6 19:20 jdk1.8.0_161
lrwxrwxrwx 1 root root   22 Mar  6 19:20 latest -> /usr/java/jdk1.8.0_161

Install Hadoop

$ cd /tmp
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.6/hadoop-2.7.6.tar.gz
$ cd /opt
$ sudo tar zxvf /tmp/hadoop-2.7.6.tar.gz
$ sudo ln -s hadoop-2.7.6 hadoop

Edit .bashrc

export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/opt/hadoop
PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$HADOOP_HOME/bin
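
Reload the profile and confirm the Hadoop binaries are on the PATH:

$ source ~/.bashrc
$ hadoop version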

Edit hadoop-env.sh
$ cd $HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/java/latest

Edit slaves

core-node-01
core-node-02

Edit core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master-node-01</value>  <!-- hostname or an HA-enabled logical URI -->
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
</property>
</configuration>
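
When no port is given, fs.defaultFS uses the HDFS default of 8020. A quick check that the setting is picked up:

$ hdfs getconf -confKey fs.defaultFS
hdfs://master-node-01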

hdfs-site.xml on the NameNode:
The NameNode stores its metadata and edit logs in the directory set by dfs.namenode.name.dir.

<configuration>
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data1/dfs/nn</value>
</property>
</configuration>

$ sudo mkdir -p /data1/dfs/nn
$ sudo chown hdfs:hadoop /data1/dfs/nn
$ sudo chmod 700 /data1/dfs/nn
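
The ownership above assumes a dedicated hdfs account in a hadoop group, which the steps so far never create. If your nodes do not have them yet, a minimal sketch (adjust if you run all daemons as a single hadoop user):

$ sudo groupadd hadoop
$ sudo useradd -g hadoop hdfs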

hdfs-site.xml on the DataNodes:
Configure the disks on each DataNode in a JBOD configuration, one dfs.datanode.data.dir entry per disk. The DataNodes store HDFS blocks in these directories.

<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data1/dfs/dn,file:///data2/dfs/dn</value>
</property>
</configuration>

$ sudo mkdir -p /data1/dfs/dn /data2/dfs/dn
$ sudo chown hdfs:hadoop /data1/dfs/dn /data2/dfs/dn
$ sudo chmod 700 /data1/dfs/dn /data2/dfs/dn

Edit mapred-site.xml
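
mapred-site.xml does not ship with Hadoop 2.7.x; create it from the bundled template first (in $HADOOP_HOME/etc/hadoop):

$ cp mapred-site.xml.template mapred-site.xml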

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Edit yarn-site.xml

<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master-node-01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>${yarn.resourcemanager.hostname}:8031</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>${yarn.resourcemanager.hostname}:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
</configuration>

On master-node-01, format the NameNode.

$ sudo -u hdfs hdfs namenode -format

Start Services

$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
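
To confirm that both DataNodes and NodeManagers registered, for example:

$ sudo -u hdfs hdfs dfsadmin -report
$ yarn node -list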

Create the /tmp directory in HDFS.
$ sudo -u hdfs hdfs dfs -mkdir /tmp
$ sudo -u hdfs hdfs dfs -chmod -R 1777 /tmp

Check daemons on Master
$ jps
NameNode
SecondaryNameNode
ResourceManager

Check daemons on Slaves
$ jps
DataNode
NodeManager

HDFS NameNode              : http://master-node-01:50070/
HDFS DataNode              : http://core-node-01:50075/
YARN ResourceManager       : http://master-node-01:8088/
YARN NodeManager           : http://core-node-01:8042/
MapReduce JobHistoryServer : http://master-node-01:19888/
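
Note that start-dfs.sh and start-yarn.sh do not start the MapReduce JobHistory server; to get the UI on port 19888, start it separately from $HADOOP_HOME:

$ sbin/mr-jobhistory-daemon.sh start historyserver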

Install MySQL and JDBC Connector

On NameNode, install MySQL as per instructions at http://anandbitra.blogspot.com/2014/08/installing-mysql-server-on-centos-7.html

mysql> create database metastore DEFAULT CHARACTER SET utf8;
mysql> grant all on metastore.* TO 'hive'@'%' IDENTIFIED BY 'passwd';
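
On a default MySQL/MariaDB install the 'hive'@'%' grant may not match connections from localhost, which is how hive-site.xml below connects; an explicit localhost grant and a quick connectivity check (both assume the 'passwd' password above):

mysql> grant all on metastore.* TO 'hive'@'localhost' IDENTIFIED BY 'passwd';
$ mysql -u hive -ppasswd -e 'show databases;'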

Install the JDBC connector
Download JDBC Driver for MySQL from http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz
$ tar zxvf  mysql-connector-java-5.1.35.tar.gz
$ sudo cp  mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar /usr/share/java/
$ sudo ln -s /usr/share/java/mysql-connector-java-5.1.35-bin.jar /usr/share/java/mysql-connector-java.jar

Install Hive in cluster

$ cd /tmp
$ wget http://apache.cs.utah.edu/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz
$ cd /opt
$ sudo tar zxvf /tmp/apache-hive-2.3.2-bin.tar.gz
$ sudo ln -s apache-hive-2.3.2-bin hive

Edit .bashrc

export HIVE_HOME=/opt/hive
PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin
export CLASSPATH=/opt/hadoop/lib/*:/opt/hive/lib/*

Edit hive-env.sh

$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
export JAVA_HOME=/usr/java/latest
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

Edit hive-site.xml

<configuration>
   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
      <description>metadata is stored in a MySQL server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>MySQL JDBC driver class</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
      <description>user name for connecting to mysql server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>passwd</value>
      <description>password for connecting to mysql server</description>
   </property>
</configuration>
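
Before the first run, two extra steps are usually needed with Hive 2.x: put the MySQL driver on Hive's classpath and initialize the metastore schema. A minimal sketch using the paths above:

$ sudo cp /usr/share/java/mysql-connector-java.jar /opt/hive/lib/
$ $HIVE_HOME/bin/schematool -dbType mysql -initSchema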

$ hive
hive> use default;
hive> show tables;

Install Spark in cluster

$ cd /tmp
$ wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
$ cd /opt
$ sudo tar zxvf /tmp/spark-2.2.0-bin-hadoop2.7.tgz
$ sudo ln -s spark-2.2.0-bin-hadoop2.7 spark

Edit .bashrc

export SPARK_HOME=/opt/spark
PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin

Edit spark-env.sh

$ cd $SPARK_HOME/conf
$ cp spark-env.sh.template spark-env.sh
export JAVA_HOME=/usr/java/latest
export SPARK_WORKER_CORES=4
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

Add Slaves

$ cd $SPARK_HOME/conf
$ cp slaves.template slaves

Edit slaves and list the worker nodes:

core-node-01
core-node-02

Start Spark Services

$ $SPARK_HOME/sbin/start-all.sh
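
To verify that the standalone cluster accepts work, submit the bundled SparkPi example against the master (the examples jar name assumes the 2.2.0 / Scala 2.11 build):

$ $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://master-node-01:7077 \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 10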

Check daemons on Master
$ jps
Master

Check daemons on Slaves
$ jps
Worker

Spark Master UI     : http://master-node-01:8080/
Spark Worker UI     : http://core-node-01:8081/
Spark HistoryServer : http://master-node-01:18080/ (requires starting the history server separately)

Configure the Hive execution engine to use Spark:
hive> set hive.execution.engine=spark;
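
To make this the default rather than a per-session setting, the same property can go into hive-site.xml (note that Hive on Spark needs further configuration, such as spark.master and the Spark libraries on Hive's classpath, which is not covered here):

<property>
   <name>hive.execution.engine</name>
   <value>spark</value>
</property>

For pyspark below to reach the same MySQL metastore, hive-site.xml is typically copied or symlinked into Spark's conf directory:

$ sudo ln -s /opt/hive/conf/hive-site.xml /opt/spark/conf/hive-site.xml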

$ pyspark

In Spark 2.x, pyspark starts with a SparkSession available as 'spark' and a SparkContext as 'sc'; the HiveContext used below is deprecated but still works.
>>> from pyspark.context import SparkContext
>>> from pyspark.sql import HiveContext
>>> sqlContext = HiveContext(sc)
>>> sqlContext.sql("use default")
DataFrame[result: string]
>>> sqlContext.sql("show tables").show()