Saturday, 21 November 2015

First Hadoop job..

What is Hadoop?
Apache Hadoop is an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.
Numerous Apache Software Foundation projects make up the services an enterprise needs to deploy, integrate and work with Hadoop. Each project delivers a specific function, and each has its own community of developers and its own release cycle.

What is HDFS?
The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, reliable data storage designed to span large clusters of commodity servers.

How to install Hadoop?
The Hadoop distribution used in this example is the Hortonworks Sandbox with HDP 2.3. It can be run with Oracle VM VirtualBox or VMware. Your computer should have at least 6 GB of memory to run it smoothly. Once you start the virtual machine, it takes about 2-3 minutes to be ready for use.

Starting with Hadoop
Hadoop can be accessed in your browser at the address 127.0.0.1:8888. From this page, open the Secure Shell client at 127.0.0.1:4200 and log in with the credentials below.


Initial login credentials
Login: root
Password: hadoop


The first step is to create a WCclasses directory to hold the compiled class files:
mkdir WCclasses

Then write the three Java programs, WordMapper.java, SumReducer.java and WordCount.java, in the vi editor as follows:

WordMapper.java
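Here is a minimal sketch of the mapper (your exact code may differ; this version simply splits each line on whitespace):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input value is one line of text; split it into words
        // and emit (word, 1) for every word found.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }
}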





SumReducer.java
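A minimal sketch of the reducer, which simply sums the counts emitted for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All counts for one word arrive together; add them up
        // and emit (word, totalCount).
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}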


WordCount.java
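A minimal sketch of the driver class that wires the mapper and reducer together. Note this version assumes the input and output HDFS paths are passed as command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input directory, args[1] = HDFS output directory
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}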



Compilation step

The commands to compile the Java files are:
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordCount.java
This compiles all three Java files and places the compiled class files in the WCclasses directory.
Next, package the class files into a JAR with the following command:
jar -cvf WordCount.jar -C WCclasses/ .
Once the JAR file is created, create the input directory in HDFS using the commands below.
hdfs dfs -mkdir /user/ru1
hdfs dfs -ls /user/ru1
hdfs dfs -mkdir /user/ru1/wc-inp
hdfs dfs -ls /user/ru1/wc-inp
Loading input files through Hue
Access Hue at 127.0.0.1:8000 and drag and drop your input files into the /user/ru1/wc-inp directory.
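If you prefer the command line, the same can be done with hdfs dfs -put (the file name here is just an example):

hdfs dfs -put input.txt /user/ru1/wc-inp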
Execution step
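With the input files in place, submit the job with the hadoop jar command. Assuming a driver like the sketch above, which takes the input and output paths as arguments (the output directory name wc-out is my choice; it must not exist before the job runs):

hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out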

You can track the status of all your jobs in Hue at 127.0.0.1:8000.


Program output
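Each reducer writes its results to a part-r-NNNNN file in the output directory. Assuming the wc-out directory from the command above, you can inspect the word counts with:

hdfs dfs -ls /user/ru1/wc-out
hdfs dfs -cat /user/ru1/wc-out/part-r-00000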


Hope this helps you get familiar with the Hadoop ecosystem.