Sunday, 20 December 2015

Get started with HIVE...

What is HIVE?

Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hadoop is a framework for handling large datasets in a distributed computing environment.

Getting Started
We will be using the same baseball statistics data as before, namely the files Batting.csv and Master.csv.
Data Source
Accessing Hue
You can access HUE by entering the address 127.0.0.1:8000 in your browser.
Login ID: hue
Password: 1111
Uploading Data
Data is uploaded into the /user/hue directory of the HDFS file system. The steps to upload the files into this directory are available in my previous blog posts.
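If you prefer the sandbox shell to the HUE file browser, an equivalent upload would look roughly like this (assuming the CSV files are in the shell's current directory):

hdfs dfs -put Batting.csv Master.csv /user/hue/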
(Screenshot: the Beeswax query editor in HUE)
Writing Queries
Click on the Query Editor to go to the query page.
Queries
The first step is to create the temp_batting table with the following query:

create table temp_batting (col_value STRING);
Creating Table Query


You can see the tables by clicking the 'Tables' option.


Table Temp_batting


Loading CSV file

LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;


data loading into temp_batting


To verify the load, display the first 100 rows:

SELECT * FROM temp_batting LIMIT 100;

temp_batting data display


Creating the batting table

create table batting (player_id STRING, year INT, runs INT);

(Screenshot: batting table columns)


Now we extract the data we want from temp_batting and copy it into batting. We do this with a regexp_extract pattern, building a multi-line query.
The first line overwrites the (currently empty) batting table. The next three lines extract the player_id, year and runs fields from the temp_batting table.
insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;
Once the query is executed, we can check the job status by entering 127.0.0.1:8088 in the web browser; you will see a status page like the one below.
Grouping Data


Which player scored the most runs in each year?
SELECT a.year, a.player_id, a.runs
FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);

(Screenshot: maximum runs per year)

Thank you.




Sunday, 6 December 2015

Hadoop with Pig...

Introduction

Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, is a simple query algebra that lets you express data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing.
Pig Latin queries execute in a distributed fashion on a cluster. The current implementation compiles Pig Latin programs into MapReduce jobs and executes them on a Hadoop cluster.

Problem statement
We are going to read in a baseball statistics file. We are going to compute the highest runs by a player for each year. This file has all the statistics from 1871–2011 and it contains over 90,000 rows. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.
We start VirtualBox to get Hadoop running, and then we see a GUI screen like the one below.

Upload the .csv files using the Upload button in the top-right corner.

Now we can open the Pig console by clicking the Pig icon in the title bar; this leads us to the console where we can write, edit and save Pig scripts.
Steps:
1. We need to load the data first; for that we use a LOAD statement.
2. To filter out the first row (the header) of the data we add a FILTER statement.
3. Now we name the fields. We use a FOREACH statement to iterate over the batting data object (the Pig helper can provide a template if required). The FOREACH statement iterates through the batting data object, and GENERATE pulls out the selected fields and assigns them names. The new data object we create is named runs.
4. We use a GROUP statement to group the elements in runs by the year field.
5. We use a FOREACH statement to find the maximum runs for each year.
6. We join the maximum runs back with the runs data object so we can pick up the player who scored them.
7. The result is a dataset with "Year, PlayerID and Max Run".
8. At the end we DUMP the data to the output. (A sketch of the full script follows this list.)
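Putting these steps together, a sketch of the script looks like the following (the file name Batting.csv, the /user/hue path, and the column positions, player id in column 1, year in column 2 and runs in column 9, are assumptions based on the Batting.csv file used in the Hive post):

batting = LOAD '/user/hue/Batting.csv' USING PigStorage(',');                      -- step 1: load the raw CSV
raw_runs = FILTER batting BY $1 > 0;                                               -- step 2: drop the header row
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;           -- step 3: name the fields
grp_data = GROUP runs BY (year);                                                   -- step 4: group by year
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;     -- step 5: maximum runs per year
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);              -- step 6: join back to find the player
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;  -- step 7: Year, PlayerID, Max Run
DUMP join_data;                                                                    -- step 8: write the result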


Save and execute-


We can see job status-

Results and logs in query history-

We can see the results with the fields we named in the code: "Year", "Player_id" and "Max_run".


We should always check the log to see if the script was executed correctly.


Saturday, 21 November 2015

first hadoop job..

What is Hadoop?
Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.
Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate and work with Hadoop.  Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles.

What is HDFS?
Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.

How to install Hadoop?
The Hadoop distribution used in this example is the Hortonworks Sandbox with HDP 2.3. It can be run using Oracle VM VirtualBox or VMware. Your computer should have 6 GB of memory to run it smoothly. Once you start the virtual machine, it takes about 2-3 minutes to be ready for use.

Starting with Hadoop
Hadoop can be accessed in your browser at the address 127.0.0.1:8888. Once at this page, you need to access the Secure Shell client at 127.0.0.1:4200. The screenshot below will give you an idea if you are on the right path.


Initial login credentials
Login: root
Password: hadoop


The first step is to create the WCclasses directory with the following command:

mkdir WCclasses

Then, you can write the Java programs WordMapper.java, SumReducer.java and WordCount.java in the vi editor, as follows:

WordMapper.java
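A minimal version of the mapper, assuming the standard org.apache.hadoop.mapreduce API (the code in the screenshot may differ in detail), looks like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line on whitespace and emit (word, 1) for every word
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }
}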





SumReducer.java
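A matching reducer sketch, again assuming the mapreduce API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up the counts emitted by the mappers for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}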


WordCount.java
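And a sketch of the driver class that wires the mapper and reducer together (input and output paths are taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // args[0] = input directory in HDFS, args[1] = output directory (must not exist yet)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}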



Compile and execution step

The commands to compile the Java files are:
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar: -d WCclasses WordCount.java
This will compile all 3 java files and keep the compiled class files in the directory WCclasses.
Now you can create a JAR file (WordCount.jar) from the compiled classes in the WCclasses directory:
jar -cvf WordCount.jar -C WCclasses/ .
Once the jar file is created, you need to create the input directory in the HDFS file system using the below commands.
hdfs dfs -mkdir /user/ru1
hdfs dfs -ls /user/ru1
hdfs dfs -mkdir /user/ru1/wc-inp
hdfs dfs -ls /user/ru1/wc-inp
Loading input files into HUE
Just access HUE at the address 127.0.0.1:8000 and drag and drop your input text files into the wc-inp directory.
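If you would rather stay in the shell, the same thing can be done with a put command (alice.txt here is just a placeholder for your own input file):

hdfs dfs -put alice.txt /user/ru1/wc-inp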
Execution step
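A typical way to submit the job, assuming the class and directory names used above (the output directory wc-out is an assumption and must not already exist):

hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out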

You can track the status of all your jobs in the HUE Job Browser at 127.0.0.1:8000.


Program output


Hope this helps you get familiar with the Hadoop ecosystem.

Friday, 2 October 2015

SAS problems

Chapter 15 : Creating Customized Reports

Although you can customize the output produced by PROC PRINT, there are times when you need a bit more control over the appearance of your report. PROC REPORT was developed to fit this need. Not only can you control the appearance of every column of your report, you can produce summary reports as well as detail listings.

Code: ch15_6.sas

/*6.Using the SAS data set Blood Pressure, produce a report showing Gender, Age, SBP,
and DBP. Order the report in Gender and Age order as shown here:
Subject's in Gender and Age Order
Systolic Diastolic
Blood Blood
Gender Age Pressure Pressure
F 28 120 70
48 138 88
68 110 62
78 132 76
M 23 144 90
35 142 82
45 150 96
55 130 80*/

title "Subject's in Gender and Age Order";
proc report data=a15022.bloodpressure nowd;
column Gender Age SBP DBP;
define Gender / order width=6;
define Age / order width=5;
define SBP / display "Systolic Blood Pressure" width=8;
define DBP / display "Diastolic Blood Pressure" width=9;
run;

Learning:
The program defines how each column is displayed, such as Age and the two blood pressure variables, and produces a report that is easy to read and understand.

Output:

Subject's in Gender and Age Order

                     Systolic    Diastolic
                        Blood        Blood
Gender    Age        Pressure     Pressure
F          28             120           70
           48             138           88
           68             110           62
           78             132           76
M          23             144           90
           35             142           82
           45             150           96
           55             130           80

Code: ch15_8.sas

/*
Using the data set Blood, produce a report like the one here. The numbers in the
table are the average WBC and RBC counts for each combination of blood type and
gender.

Average Blood Counts by Gender

        ------Gender------    ------Gender------
Blood    Female      Male      Female      Male
Type        WBC       WBC         RBC       RBC
------------------------------------------------
A         7,218     7,051        5.47      5.46
AB        7,421     6,893        5.43      5.69
B         6,716     6,991        5.52      5.42
O         7,050     6,930        5.53      5.48
*/

/*blood dataset*/

data a15022.blood;
   infile '/folders/myfolders/programs/blood.txt' truncover;
   length Gender $ 6 BloodType $ 2 AgeGroup $ 5;
   input Subject
         Gender
         BloodType
         AgeGroup
         WBC
         RBC
         Chol;
   label Gender    = "Gender"
         BloodType = "Blood Type"
         AgeGroup  = "Age Group"
         Chol      = "Cholesterol";
run;



title "Average Blood Counts by Gender";

proc report data=a15022.blood nowd headline;

column BloodType Gender,WBC Gender,RBC;

define BloodType / group 'Blood Type' width=5;

define Gender / across width=8 '-Gender-';

define WBC / analysis mean format=comma8.;

define RBC / analysis mean format=8.2;

run;

Learning:
The above code groups the data by blood type and breaks the average WBC and RBC counts down by gender, using Gender as an ACROSS variable.
Output:
Average Blood Counts by Gender

            -------Gender-------    -------Gender-------
Blood        Female        Male      Female        Male
Type            WBC         WBC         RBC         RBC
A             7,218       7,051        5.47        5.46
AB            7,421       6,893        5.43        5.69
B             6,716       6,991        5.52        5.42
O             7,050       6,930        5.53        5.48

Chapter 16: Summarizing your data

You may have thought of PROC MEANS (or PROC SUMMARY) primarily as a way to
generate summary reports, reporting the sums and means of your numeric variables.
However, these procedures are much more versatile and can be used to create summary
data sets that can then be analyzed with more DATA or PROC steps.
All the examples in this chapter use PROC MEANS rather than PROC SUMMARY, even when all you want is an output data set. The reason for this is that using PROC MEANS with the NOPRINT option is identical to using PROC SUMMARY.

Code: Ch16_4.sas

/*Chapter 16*/

/*Repeat Problem 3 (CLASS statement only), except group small and medium school
sizes together. Do this by writing a new format for SchoolSize (values are S, M, and
L). Do not use any DATA steps.*/

/*College Dataset*/

proc format library=a15022;
   value $yesno 'Y','1' = 'Yes'
                'N','0' = 'No'
                ' '     = 'Not Given';
   value $size 'S' = 'Small'
               'M' = 'Medium'
               'L' = 'Large'
               ' ' = 'Missing';
   value $gender 'F' = 'Female'
                 'M' = 'Male'
                 ' ' = 'Not Given';
run;



data a15022.college;
   length StudentID $ 5 Gender SchoolSize $ 1;
   do i = 1 to 100;
      StudentID = put(round(ranuni(123456)*10000),z5.);
      if ranuni(0) lt .4 then Gender = 'M';
      else Gender = 'F';
      if ranuni(0) lt .3 then SchoolSize = 'S';
      else if ranuni(0) lt .7 then SchoolSize = 'M';
      else SchoolSize = 'L';
      if ranuni(0) lt .2 then Scholarship = 'Y';
      else Scholarship = 'N';
      GPA = round(rannor(0)*.5 + 3.5,.01);
      if GPA gt 4 then GPA = 4;
      ClassRank = int(ranuni(0)*60 + 41);
      if ranuni(0) lt .1 then call missing(ClassRank);
      if ranuni(0) lt .05 then call missing(SchoolSize);
      if ranuni(0) lt .05 then call missing(GPA);
      output;
   end;
   format Gender $gender1.
          SchoolSize $size.
          Scholarship $yesno.;
   drop i;
run;



/*Program*/

proc format;
   value $groupsize 'S','M' = 'Small and Medium'
                    'L'     = 'Large';
run;

title "Statistics on the College Data Set";
title2 "Broken down by School Size";

proc means data=a15022.college
           n mean median min max maxdec=2;
   class SchoolSize;
   var ClassRank GPA;
   format SchoolSize $groupsize.;
run;

Learning:
Use of PROC MEANS with a CLASS statement and a user-defined format to generate a more detailed report and to find specific statistics such as the minimum, maximum and median.

Output:
Statistics on the College Data Set
Broken down by School Size

The MEANS Procedure

SchoolSize          N Obs   Variable     N      Mean    Median   Minimum   Maximum
Large                  18   ClassRank   15     72.80     71.00     45.00     98.00
                            GPA         16      3.61      3.61      2.98      4.00
Small and Medium       78   ClassRank   69     71.83     73.00     41.00    100.00
                            GPA         75      3.47      3.62      2.29      4.00


Code: Ch16_5.sas



/*Using the SAS data set College, report the mean GPA for the following categories of
ClassRank: 0–50 = bottom half, 51–74 = 3rd quartile, and 75 to 100 = top
quarter. Do this by creating an appropriate format. Do not use a DATA step.*/

proc format;
   value rank 0-50   = 'Bottom Half'
              51-74  = 'Third Quartile'
              75-100 = 'Top Quarter';
run;

title "Statistics on the College Data Set";
title2 "Broken down by School Size";

proc means data=a15022.college
           n mean maxdec=2;
   class ClassRank;
   var GPA;
   format ClassRank rank.;
run;

Learning:
In the above program we divide students into groups (bottom half, third quartile, top quarter) according to their class rank, using a user-defined format rather than a DATA step.
Output:
Statistics on the College Data Set
Broken down by School Size

The MEANS Procedure

Analysis Variable : GPA

ClassRank         N Obs     N    Mean
Bottom Half          15    14    3.78
Third Quartile       34    34    3.52
Top Quarter          39    35    3.42

Chapter 17: Counting Frequencies
PROC FREQ can be used to count frequencies of both character and numeric variables, in one-way, two-way, and three-way tables. In addition, you can use PROC FREQ to create output data sets containing counts and percentages. Finally, if you are statistically inclined, you can use this procedure to compute various statistics such as chi-square, odds ratio, and relative risk.

Code: Ch17_1.sas

/*Chapter 17*/

/*Using the SAS data set Blood, generate one-way frequencies for the variables
Gender, BloodType, and AgeGroup. Use the appropriate options to omit the
cumulative statistics and percentages.*/

title "One-way Frequencies from BLOOD Data Set";

proc freq data=a15022.blood;
   tables Gender BloodType AgeGroup / nocum nopercent;
run;

Learning:
The above program shows the use of PROC FREQ and the TABLES statement to count and display one-way frequencies.

Output:

One-way Frequencies from BLOOD Data Set

The FREQ Procedure

Gender       Frequency
Female             440
Male               560

Blood Type   Frequency
A                  412
AB                  44
B                   96
O                  448

Age Group    Frequency
Old                598
Young              402

Code: Ch17_2.sas





/*5. Program 17-2*/

/*
Using the SAS data set BloodPressure, generate frequencies for the variable Age.
Use a user-defined format to group ages into three categories: 40 and younger, 41 to
60, and 61 and older. Use the appropriate options to omit the cumulative statistics
and percentages.
*/

proc format;
   value agegrp low-40  = '40 and lower'
                41-60   = '41 to 60'
                61-high = '61 and higher';
run;

title "Using a Format to Regroup Values";

proc freq data=a15022.bloodpressure;
   tables age / nocum nopercent;
   format age agegrp.;
run;

Learning: The above program shows frequencies by age group, using a user-defined format together with the TABLES statement.

Output:

Using a Format to Regroup Values

The FREQ Procedure

Age              Frequency
40 and lower             3
41 to 60                 3
61 and higher            2

Please click on below link to find all the programs…