Saturday, 21 November 2015

First Hadoop job..

What is Hadoop?
Apache Hadoop is an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.
Numerous Apache Software Foundation projects make up the services an enterprise needs to deploy, integrate and work with Hadoop. Each project delivers a specific function, and each has its own community of developers and its own release cycle.

What is HDFS?
The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, reliable data storage designed to span large clusters of commodity servers.

How to install Hadoop?
The Hadoop distribution used in this example is the Hortonworks Sandbox with HDP 2.3. It can be run with Oracle VM VirtualBox or VMware. Your computer should have at least 6 GB of memory to run it smoothly. Once you start the virtual machine, it takes about 2-3 minutes to be ready for use.

Starting with Hadoop
Hadoop can be accessed in your browser at the address 127.0.0.1:8888. From this page, open the Secure Shell client at 127.0.0.1:4200 and log in with the credentials below.


Initial login credentials
Login: root
Password: hadoop


The first step is to create a WCclasses directory to hold the compiled class files:
mkdir WCclasses

Then write the three Java programs, WordMapper.java, SumReducer.java and WordCount.java, in the vi editor as follows:

WordMapper.java
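Here is a minimal sketch of the mapper (your exact code may differ; this version simply splits each line on whitespace):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input value is one line of text; split it into words
        // and emit (word, 1) for every word found.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }
}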





SumReducer.java
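A minimal sketch of the reducer, which simply sums the counts emitted for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All counts for one word arrive together; add them up
        // and emit (word, totalCount).
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}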


WordCount.java
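A minimal sketch of the driver class that wires the mapper and reducer together. Note this version assumes the input and output HDFS paths are passed as command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input directory, args[1] = HDFS output directory
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}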



Compilation step

The commands to compile the Java files are:
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordCount.java
This compiles all three Java files and places the compiled class files in the WCclasses directory.
Next, package the class files into a JAR with the following command:
jar -cvf WordCount.jar -C WCclasses/ .
Once the JAR file is created, create the input directory in HDFS using the commands below.
hdfs dfs -mkdir /user/ru1
hdfs dfs -ls /user/ru1
hdfs dfs -mkdir /user/ru1/wc-inp
hdfs dfs -ls /user/ru1/wc-inp
Loading input files through Hue
Access Hue at 127.0.0.1:8000 and drag and drop your input files into the /user/ru1/wc-inp directory.
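If you prefer the command line, the same can be done with hdfs dfs -put (the file name here is just an example):

hdfs dfs -put input.txt /user/ru1/wc-inp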
Execution step
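With the input files in place, submit the job with the hadoop jar command. Assuming a driver like the sketch above, which takes the input and output paths as arguments (the output directory name wc-out is my choice; it must not exist before the job runs):

hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out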

You can track the status of all your jobs in Hue at 127.0.0.1:8000.


Program output
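Each reducer writes its results to a part-r-NNNNN file in the output directory. Assuming the wc-out directory from the command above, you can inspect the word counts with:

hdfs dfs -ls /user/ru1/wc-out
hdfs dfs -cat /user/ru1/wc-out/part-r-00000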


Hope this helps you get familiar with the Hadoop ecosystem.