
Saturday, May 24, 2014

Hadoop DistCp



DistCp (distributed copy) is a tool for copying large amounts of data within a cluster or between clusters; it runs as a MapReduce job, so the copy itself is parallelized across the cluster.

Run the DistCp command for an inter-cluster copy:
$ hadoop distcp hdfs://nn1:8020/basePath hdfs://nn2:8020/basePath
$ hadoop distcp hdfs://nn1:8020/basePath1 hdfs://nn1:8020/basePath2 hdfs://nn2:8020/basePath
$ hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/basePath
Where the file srclist contains hdfs://nn1:8020/basePath1 and hdfs://nn1:8020/basePath2; the -f option tells DistCp to read its source paths from that file.

This will expand the namespace under /basePath on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.
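Because the copy is partitioned among map tasks, the standard -m option caps how many maps (and therefore simultaneous copiers) the job uses; for example, to limit a copy to 20 maps:

$ hadoop distcp -m 20 hdfs://nn1:8020/basePath hdfs://nn2:8020/basePath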

Run the DistCp command on the destination cluster only when copying between different versions of Hadoop, reading from the source cluster over the read-only, version-independent HFTP protocol:

$ hadoop distcp hftp://cdh3-namenode:50070/ hdfs://cdh4-namenode/

For a specific path such as /hbase:
$ hadoop distcp hftp://cdh3-namenode:50070/basePath hdfs://cdh4-namenode/basePath

Where cdh3-namenode is the source NameNode hostname as defined by the config fs.default.name, 50070 is the NameNode HTTP port as defined by the config dfs.http.address, and cdh4-namenode is the destination NameNode as defined by the config fs.defaultFS. You can also use the destination nameservice ID. basePath in both URIs is the directory to copy.
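For repeated copies, DistCp's standard -update and -overwrite options control what happens when files already exist at the destination: -update copies only files that are missing or that differ from the destination, while -overwrite unconditionally rewrites them. For example:

$ hadoop distcp -update hftp://cdh3-namenode:50070/basePath hdfs://cdh4-namenode/basePath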


Saturday, May 17, 2014

Running a MapReduce Job



WordCount is a simple application that counts the number of occurrences of each word in an input set.

1.       Create the input directory in HDFS.
# useradd cloudera
$ sudo su hdfs
$ hadoop fs -mkdir /user/cloudera
$ hadoop fs -chown cloudera /user/cloudera
$ exit
$ sudo su - cloudera
$ pwd
/home/cloudera
$ hadoop fs -mkdir /user/cloudera/wordcount /user/cloudera/wordcount/input

2.       Create sample text files and copy the files into HDFS under the input directory.
$ echo "Hello World Bye World" > file0
$ echo "Hello Hadoop Goodbye Hadoop" > file1
$ hadoop fs -put file* /user/cloudera/wordcount/input
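Before running the job, you can confirm the files landed in HDFS:
$ hadoop fs -ls /user/cloudera/wordcount/input
$ hadoop fs -cat /user/cloudera/wordcount/input/file0
Hello World Bye World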

3.       Create a Java program.
$ vi WordCount.java

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class); // combiner pre-aggregates map output locally
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

4.       Compile WordCount.java.
$ mkdir wordcount_classes
$ javac -cp /opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/client-0.20/* -d wordcount_classes WordCount.java

5.       Create a JAR.
$ jar -cvf wordcount.jar -C wordcount_classes/ .
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/(in = 0) (out= 0)(stored 0%)
adding: org/myorg/WordCount$Map.class(in = 1938) (out= 798)(deflated 58%)
adding: org/myorg/WordCount$Reduce.class(in = 1611) (out= 649)(deflated 59%)
adding: org/myorg/WordCount.class(in = 1546) (out= 749)(deflated 51%)

6.       Run the application.
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
14/02/22 19:36:52 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/02/22 19:36:53 INFO mapred.FileInputFormat: Total input paths to process : 2
14/02/22 19:36:58 INFO mapred.JobClient: Running job: job_201402221622_0001
14/02/22 19:37:00 INFO mapred.JobClient:  map 0% reduce 0%
14/02/22 19:39:07 INFO mapred.JobClient:  map 33% reduce 0%
14/02/22 19:39:31 INFO mapred.JobClient:  map 67% reduce 0%
14/02/22 19:39:32 INFO mapred.JobClient:  map 100% reduce 0%
14/02/22 19:39:43 INFO mapred.JobClient:  map 100% reduce 100%
14/02/22 19:39:50 INFO mapred.JobClient: Job complete: job_201402221622_0001
14/02/22 19:39:50 INFO mapred.JobClient: Counters: 33
14/02/22 19:39:51 INFO mapred.JobClient:   File System Counters
14/02/22 19:39:51 INFO mapred.JobClient:     FILE: Number of bytes read=79
14/02/22 19:39:51 INFO mapred.JobClient:     FILE: Number of bytes written=651887
14/02/22 19:39:51 INFO mapred.JobClient:     FILE: Number of read operations=0
14/02/22 19:39:51 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/02/22 19:39:51 INFO mapred.JobClient:     FILE: Number of write operations=0
14/02/22 19:39:51 INFO mapred.JobClient:     HDFS: Number of bytes read=413
14/02/22 19:39:51 INFO mapred.JobClient:     HDFS: Number of bytes written=41
14/02/22 19:39:51 INFO mapred.JobClient:     HDFS: Number of read operations=7
14/02/22 19:39:51 INFO mapred.JobClient:     HDFS: Number of large read operations=0
14/02/22 19:39:51 INFO mapred.JobClient:     HDFS: Number of write operations=2
14/02/22 19:39:51 INFO mapred.JobClient:   Job Counters
14/02/22 19:39:51 INFO mapred.JobClient:     Launched map tasks=3
14/02/22 19:39:51 INFO mapred.JobClient:     Launched reduce tasks=1
14/02/22 19:39:51 INFO mapred.JobClient:     Data-local map tasks=3
14/02/22 19:39:51 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=210815
14/02/22 19:39:51 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=10176
14/02/22 19:39:51 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/02/22 19:39:51 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/02/22 19:39:51 INFO mapred.JobClient:   Map-Reduce Framework
14/02/22 19:39:51 INFO mapred.JobClient:     Map input records=2
14/02/22 19:39:51 INFO mapred.JobClient:     Map output records=8
14/02/22 19:39:51 INFO mapred.JobClient:     Map output bytes=82
14/02/22 19:39:51 INFO mapred.JobClient:     Input split bytes=360
14/02/22 19:39:51 INFO mapred.JobClient:     Combine input records=8
14/02/22 19:39:51 INFO mapred.JobClient:     Combine output records=6
14/02/22 19:39:51 INFO mapred.JobClient:     Reduce input groups=5
14/02/22 19:39:51 INFO mapred.JobClient:     Reduce shuffle bytes=117
14/02/22 19:39:51 INFO mapred.JobClient:     Reduce input records=6
14/02/22 19:39:51 INFO mapred.JobClient:     Reduce output records=5
14/02/22 19:39:51 INFO mapred.JobClient:     Spilled Records=12
14/02/22 19:39:51 INFO mapred.JobClient:     CPU time spent (ms)=2630
14/02/22 19:39:51 INFO mapred.JobClient:     Physical memory (bytes) snapshot=566894592
14/02/22 19:39:51 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2479079424
14/02/22 19:39:51 INFO mapred.JobClient:     Total committed heap usage (bytes)=280698880
14/02/22 19:39:51 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
14/02/22 19:39:51 INFO mapred.JobClient:     BYTES_READ=50

7.       View the results of the running job in the Cloudera Manager Admin Console by selecting Activities > mapreduce1 Jobs.

8.       Examine the output.
$ hadoop fs -cat /user/cloudera/wordcount/output/part-00000
Bye     1
Goodbye 1
Hadoop  2
Hello   2
World   2

9.       Remove the output directory so that you can run the sample again.   
$ hadoop fs -rm -r /user/cloudera/wordcount/output
Moved: 'hdfs://myhost2.example.com:8020/user/cloudera/wordcount/output' to trash at: hdfs://myhost2.example.com:8020/user/hdfs/.Trash/Current
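Because HDFS trash is enabled, the directory is moved to trash rather than deleted outright; to discard it immediately, the standard -skipTrash flag bypasses the trash:

$ hadoop fs -rm -r -skipTrash /user/cloudera/wordcount/output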

Saturday, May 10, 2014

Cloudera Distribution of Hadoop CDH4 Installation



Creating Databases for the Cloudera Manager Services

Create and configure MySQL databases for the Cloudera Management Services (Activity Monitor, Service Monitor, Host Monitor, and Report Manager), the Hive Metastore and Cloudera Navigator (optional). The databases must be configured to support UTF-8 character set encoding.

For performance reasons, it’s generally a good idea to have the Hive Metastore Server on the database server.

By default, Cloudera Manager uses Derby for Oozie’s database and SQLite for Hue’s database. You can configure MySQL after Cloudera Manager is installed.

1.       Connect to MySQL on myhost1 as the root user.
$ mysql -u root -p
Enter password:

2.       Create a database for the Activity Monitor.
mysql> create database amon DEFAULT CHARACTER SET utf8;
mysql> grant all on amon.* TO 'amon'@'%' IDENTIFIED BY 'passwd';

3.       Create a database for the Service Monitor.
mysql> create database smon DEFAULT CHARACTER SET utf8;
mysql> grant all on smon.* TO 'smon'@'%' IDENTIFIED BY 'passwd';

4.       Create a database for the Host Monitor.
mysql> create database hmon DEFAULT CHARACTER SET utf8;
mysql> grant all on hmon.* TO 'hmon'@'%' IDENTIFIED BY 'passwd';

5.       Create a database for the Report Manager.
mysql> create database rman DEFAULT CHARACTER SET utf8;
mysql> grant all on rman.* TO 'rman'@'%' IDENTIFIED BY 'passwd';

6.       Create a database for Cloudera Navigator (optional).
mysql> create database nav DEFAULT CHARACTER SET utf8;
mysql> grant all on nav.* TO 'nav'@'%' IDENTIFIED BY 'passwd';

7.       Create a database for the Hive metastore. Create a separate metastore for each Hive service, if you have more than one.
mysql> create database hive DEFAULT CHARACTER SET utf8;
mysql> grant all on hive.* TO 'hive'@'%' IDENTIFIED BY 'passwd';

8.       Create a database for Hue.
mysql> create database hue DEFAULT CHARACTER SET utf8;
mysql> grant all on hue.* to 'hue'@'%' IDENTIFIED BY 'passwd';

9.       Create a database for Oozie.
mysql> create database oozie DEFAULT CHARACTER SET utf8;
mysql> grant all on oozie.* to 'oozie'@'%' IDENTIFIED BY 'passwd';
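Before moving on, it is worth verifying the databases and grants from the same mysql session, for example:
mysql> SHOW DATABASES;
mysql> SHOW GRANTS FOR 'amon'@'%';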


Installing the MySQL JDBC Connector

Install the JDBC connector on the Cloudera Manager Server host, as well as on the hosts to which you assign the Activity Monitor, Service Monitor, Host Monitor, Report Manager, Hive Metastore, and Cloudera Navigator roles. In this case, all of them are on the same host.

Download JDBC Driver for MySQL from http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz
# tar zxvf  mysql-connector-java-5.1.35.tar.gz
# cp  mysql-connector-java-5.1.35/mysql-connector-java-5.1.35-bin.jar /usr/share/java/
# ln -s /usr/share/java/mysql-connector-java-5.1.35-bin.jar /usr/share/java/mysql-connector-java.jar
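A quick check that the symlink resolves to the connector JAR:
# ls -l /usr/share/java/mysql-connector-java.jar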

Add the Cloudera Manager Repository

# cd /etc/yum.repos.d
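For example, on RHEL/CentOS 6 the repo file can be fetched directly into this directory (the URL below is the one Cloudera published for CM4 at the time; adjust it for your distribution and CM version):
# wget http://archive.cloudera.com/cm4/redhat/6/x86_64/cm/cloudera-manager.repo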

Installing the Cloudera Manager Server

Install the Cloudera Manager Server either on the machine where the database is installed, or on a machine that has access to the database. This machine need not be a host in the cluster that you want to manage with Cloudera Manager. The Cloudera Manager Server does not require CDH4 to be installed on the same machine.

Note :

1.       Cloudera Manager requires Hadoop to be installed on all hosts, but Hadoop must not be configured and must not be running.

2.       The Activity Monitor in Cloudera Manager 4.0 requires the hue-plugins package to be installed on the JobTracker host, regardless of whether you are using Hue. If you are using Hue, the hue-plugins package must be installed on all hosts.

# yum -y install cloudera-manager-daemons cloudera-manager-server

Configuring the Database for the Cloudera Manager Server

As we are not using the embedded database, remove /etc/cloudera-scm-server/db.mgmt.properties.

Enable the Cloudera Manager Server to connect to the external database by running the scm_prepare_database.sh script on the Cloudera Manager Server host. The last three arguments are the SCM database name, user, and password.

# /usr/share/cmf/schema/scm_prepare_database.sh mysql -h localhost -u abitra -p  --scm-host localhost scm scm scm

Verifying that we can write to /etc/cloudera-scm-server
Creating SCM configuration file in /etc/cloudera-scm-server
Executing:  /usr/java/jdk1.6.0_31/bin/java -cp /usr/share/java/mysql-connector-java.jar:/usr/share/cmf/schema/../lib/* com.cloudera.enterprise.dbutil.DbCommandExecutor /etc/cloudera-scm-server/db.properties com.cloudera.cmf.db.
[                          main] DbCommandExecutor              INFO  Successfully connected to database.
All done, your SCM database is configured correctly!

Retrieving the Database Host, User Name, or Password

# cat /etc/cloudera-scm-server/db.properties
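The file looks roughly like the following (the key names shown are those of a typical CM4 install and the values are illustrative; trust the actual file over this sketch):

# illustrative contents only
com.cloudera.cmf.db.type=mysql
com.cloudera.cmf.db.host=localhost
com.cloudera.cmf.db.name=scm
com.cloudera.cmf.db.user=scm
com.cloudera.cmf.db.password=scm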

Start the Cloudera Manager Server

# service cloudera-scm-server start
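If the server fails to start or the Admin Console is unreachable, check the server log (default location for package installs):
# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log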

Configuring Services

Log in to the Cloudera Manager Admin Console at http://myhost1.example.com:7180/

The default credentials are Username: admin, Password: admin.

On the Welcome to Cloudera Manager page, click Continue.

If you upload a new license, restart Cloudera Manager to allow it to take effect. Existing clusters or services will be unaffected by the restart of Cloudera Manager.

# service cloudera-scm-server restart
Stopping cloudera-scm-server:                              [  OK  ]
Starting cloudera-scm-server:                              [  OK  ]

After the Cloudera Manager Server restarts, log in again.

Cloudera Manager then lets you choose the packages for the services below. Click Continue.

·         Apache Hadoop (Common, HDFS, MapReduce, YARN)
·         Apache HBase
·         Apache ZooKeeper
·         Apache Oozie
·         Apache Hive
·         Hue (Apache licensed)
·         Apache Flume
·         Cloudera Impala (Apache licensed)
·         Apache Sqoop
·         Cloudera Search (Apache licensed)

Click Continue to proceed with the installation.

Specify the hosts for your CDH cluster installation, including the Cloudera Manager Server host, and then click Search to find the cluster hosts.

myhost1, myhost2, myhost3, myhost4, myhost5

Cloudera Manager identifies the hosts that can be configured for CDH.
Select the hosts where you want to install CDH and click Continue.

On Cluster Installation page, select repository type you want to use for the installation. Choose Use Parcels. Under More Options, you can add repository for previous versions.
Select the specific releases of Impala and Solr to install on your hosts. If you do not want to install those products, choose None.
Select the specific release of Cloudera Manager Agent you want to install on your hosts.
Cloudera Manager and Cloudera's Distribution of Hadoop (CDH) consist of a set of services. These services interact with one another and use databases to complete tasks.

Provide SSH Login credentials for root.

The Cloudera Manager daemons, the Cloudera Manager Agent, and the JDK are installed on the previously selected hosts. Click Continue.

The selected parcels are installed on the previously selected hosts. Click Continue.

Choose the cluster services you want to install. You can choose one of the standard combinations: Core Hadoop, Real-Time Delivery (previously known as HBase Services), Real-Time Query (which includes HDFS, Hive, and Impala), or All Services; these combinations take into account the dependencies between the Hadoop services. Alternatively, you can choose Custom Services and select the services individually.
Note:
Some services depend on others; for example, HBase requires HDFS and ZooKeeper.
The Cloudera Management Services are Cloudera Manager processes that run to support the monitoring and management features in Cloudera Manager. Cloudera Navigator is a system that supports enforcement of compliance with company policies for data stored in a Hadoop Distributed File System (HDFS) deployment.

After selecting the services, customize the role assignments for each host in the cluster, then click Inspect Role Assignments.

On the Database Setup page, enter the information for the Service Monitor, Activity Monitor, Host Monitor, Report Manager, and Hive metastore databases. Click Test Connection to confirm that Cloudera Manager can communicate with the databases. This transaction takes two heartbeats to complete (about 30 seconds with the default heartbeat interval). If the test succeeds in all cases, click Continue; otherwise check and correct the information you have provided for the databases and then try the test again.

On Review Configuration Changes page, confirm the settings entered for file system paths, such as the NameNode Data Directory and the DataNode Data Directory. Supply the name of the mail server (it can be localhost), the mail server user, and the mail recipients.

The wizard starts the services on your cluster.

When all of the services are started, click Continue.

Start the Cloudera Manager Agent

# service cloudera-scm-agent start
Starting cloudera-scm-agent:                               [  OK  ]
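If a host fails to heartbeat to the server after the agent starts, the agent log is the first place to look:
# tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log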

Change the Default Administrator Password

Change the default administrator password as soon as you begin using Cloudera Manager.

1.       From the Administration tab, select Users.
2.       Click the Change Password button next to the admin account.
3.       Enter a new password twice and then click Update.

Specifying the Racks for Hosts

Cloudera Manager includes internal rack awareness scripts, but you must specify the racks where the hosts in your cluster are located. If your cluster contains more than 10 hosts, Cloudera recommends that you specify the rack for each host. HDFS and MapReduce will automatically use the racks you specify.

1.       Click the Hosts tab.
2.       Select the host(s) for a particular rack.
3.       From Actions for Selected tab, click Assign Rack.
4.       Enter new rack name such as /rack1 and then click Confirm.

After assigning racks, restart affected services.
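You can confirm from the command line that HDFS picked up the rack assignments; fsck prints the rack of each block replica when asked:

$ sudo -u hdfs hadoop fsck / -racks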

Checking Host Heartbeats

By default, every Agent must heartbeat successfully every 15 seconds.

1.       Click the Hosts tab.
2.       See a list of all the hosts along with the value of Last Heartbeat.

Configure MySQL JDBC driver for Hive

# ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar

Enabling Oozie Web Console

1.       Download ext-2.2.zip from http://extjs.com/deploy/ext-2.2.zip
2.       Extract the contents of the file to /usr/lib/oozie/libext on the Oozie server.
3.       On Oozie Service page, select Configuration > View and Edit.
4.       Check Enable Oozie Server Web Console.
5.       Click on Save Changes.
6.       Restart the Oozie Service.
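Once restarted, the web console is served by the Oozie server on port 11000. The Oozie CLI also offers a quick health check (the hostname below is this cluster's Oozie host, for illustration):

$ oozie admin -oozie http://myhost1.example.com:11000/oozie -status
System mode: NORMAL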

Configuring MySQL for Oozie

By default, Cloudera Manager uses Derby for the Oozie database.

1.       On Oozie Service page, select Configuration > View and Edit.
2.       In the Category Pane, expand Oozie Server (Default) and click Database.
3.       Specify the settings for Oozie Server Database Type, Oozie Server Database Name, Oozie Server Database Host, Oozie Server Database User, and Oozie Server Database Password.
4.       Create symlink for MySQL connector.
# ln -s /usr/share/java/mysql-connector-java.jar /var/lib/oozie/mysql-connector-java.jar
5.       Start the Oozie Service.

Configuring MySQL for Hue

By default, Cloudera Manager uses SQLite for the Hue database.

1.       From The Hue service instance page, click Actions > Stop. Confirm you want to stop the service by clicking Stop.
2.       Click Configuration > View and Edit. In the Category Pane, expand Service-Wide and click Database.
3.       Specify the settings for Hue Database Type, Hue Database Hostname, Hue Database Port, Hue Database Username, Hue Database Password, and Hue Database Name.
4.       Restart the Hue service.