Sunday, 20 December 2015

Get started with HIVE...

What is HIVE?

Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hadoop is a framework for handling large datasets in a distributed computing environment.

Getting Started
We will be using the same baseball statistics data as before, namely the files Batting.csv and Master.csv.
Data Source
Accessing Hue
You can access HUE by entering the address 127.0.0.1:8000 in your browser.
Login ID: hue
Password: 1111
Uploading Data
Data is uploaded into the /user/hue directory of the HDFS file system. The steps to upload the files into this directory are available in my previous blog posts.
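If you prefer the sandbox shell to the HUE file browser, an equivalent upload would look roughly like this (assuming the CSV files are in the shell's current directory):

hdfs dfs -put Batting.csv Master.csv /user/hue/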
(Screenshot: the Beeswax query editor in HUE)
Writing Queries
Click on the Query Editor to go to the query page.
Queries
The first step is to create the temp_batting table with the following query:

create table temp_batting (col_value STRING);
Creating Table Query


You can see the tables by clicking the 'Tables' option.


Table Temp_batting


Loading CSV file

LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;


data loading into temp_batting


To verify the load, display the first 100 rows:

SELECT * FROM temp_batting LIMIT 100;

temp_batting data display


Creating the batting table

create table batting (player_id STRING, year INT, runs INT);

(Screenshot: batting table columns)


Now we extract the data we want from temp_batting and copy it into batting. We do this with a regexp_extract pattern, building a multi-line query.
The first line overwrites the (currently empty) batting table. The next three lines extract the player_id, year and runs fields from the temp_batting table.
insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;
Once the query is executed, we can check the job status by entering 127.0.0.1:8088 in the web browser; you will see a status page like the one below.
Grouping Data


Which player scored the most runs in each year?
SELECT a.year, a.player_id, a.runs
FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);

(Screenshot: maximum runs per year)

Thank you.




Sunday, 6 December 2015

Hadoop with Pig...

Introduction

Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, is a simple query algebra that lets you express data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing.
Pig Latin queries execute in a distributed fashion on a cluster. The current implementation compiles Pig Latin programs into MapReduce jobs and executes them on a Hadoop cluster.

Problem statement
We are going to read in a baseball statistics file. We are going to compute the highest runs by a player for each year. This file has all the statistics from 1871–2011 and it contains over 90,000 rows. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.
We start VirtualBox to get Hadoop running, and then we see a GUI screen like the one below.

Upload the .csv files using the Upload button in the top-right corner.

Now we can open the Pig console by clicking the Pig icon in the title bar; this leads us to the console where we can write, edit and save Pig scripts.
Steps:
1. We need to load the data first; for that we use a LOAD statement.
2. To filter out the first row (the header) of the data we add a FILTER statement.
3. Now we name the fields. We use a FOREACH statement to iterate over the batting data object (the Pig helper can provide a template if required). The FOREACH statement iterates through the batting data object, and GENERATE pulls out the selected fields and assigns them names. The new data object we create is named runs.
4. We use a GROUP statement to group the elements in runs by the year field.
5. We use a FOREACH statement to find the maximum runs for each year.
6. We join the maximum runs back with the runs data object so we can pick up the player who scored them.
7. The result is a dataset with "Year, PlayerID and Max Run".
8. At the end we DUMP the data to the output. (A sketch of the full script follows this list.)
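Putting these steps together, a sketch of the script looks like the following (the file name Batting.csv, the /user/hue path, and the column positions, player id in column 1, year in column 2 and runs in column 9, are assumptions based on the Batting.csv file used in the Hive post):

batting = LOAD '/user/hue/Batting.csv' USING PigStorage(',');                      -- step 1: load the raw CSV
raw_runs = FILTER batting BY $1 > 0;                                               -- step 2: drop the header row
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;           -- step 3: name the fields
grp_data = GROUP runs BY (year);                                                   -- step 4: group by year
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;     -- step 5: maximum runs per year
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);              -- step 6: join back to find the player
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;  -- step 7: Year, PlayerID, Max Run
DUMP join_data;                                                                    -- step 8: write the result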


Save and execute-


We can see job status-

Results and logs in query history-

We can see the results with the fields we named in the code: "Year", "Player_id" and "Max_run".


We should always check the log to see if the script was executed correctly.


Saturday, 21 November 2015

first hadoop job..

What is Hadoop?
Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.
Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate and work with Hadoop.  Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles.

What is HDFS?
Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.

How to install Hadoop?
The Hadoop distribution used in this example is the Hortonworks Sandbox with HDP 2.3. It can be run using Oracle VM VirtualBox or VMware. Your computer should have 6 GB of memory to run it smoothly. Once you start the virtual machine, it takes about 2-3 minutes to be ready for use.

Starting with Hadoop
Hadoop can be accessed in your browser at the address 127.0.0.1:8888. Once at this page, you need to access the Secure Shell client at 127.0.0.1:4200. The screenshot below will give you an idea if you are on the right path.


Initial login credentials
Login: root
Password: hadoop


The first step is to create the WCclasses directory with the following command:

mkdir WCclasses

Then, you can write the Java programs WordMapper.java, SumReducer.java and WordCount.java in the vi editor, as follows:

WordMapper.java
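A minimal version of the mapper, assuming the standard org.apache.hadoop.mapreduce API (the code in the screenshot may differ in detail), looks like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line on whitespace and emit (word, 1) for every word
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }
}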





SumReducer.java
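A matching reducer sketch, again assuming the mapreduce API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up the counts emitted by the mappers for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}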


WordCount.java
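And a sketch of the driver class that wires the mapper and reducer together (input and output paths are taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // args[0] = input directory in HDFS, args[1] = output directory (must not exist yet)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}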



Compile and execution step

The commands to compile the Java files are:
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar: -d WCclasses WordCount.java
This will compile all 3 java files and keep the compiled class files in the directory WCclasses.
Now you can create a JAR file (WordCount.jar) from the compiled classes in the WCclasses directory:
jar -cvf WordCount.jar -C WCclasses/ .
Once the jar file is created, you need to create the input directory in the HDFS file system using the below commands.
hdfs dfs -mkdir /user/ru1
hdfs dfs -ls /user/ru1
hdfs dfs -mkdir /user/ru1/wc-inp
hdfs dfs -ls /user/ru1/wc-inp
Loading input files into HUE
Just access HUE at the address 127.0.0.1:8000 and drag and drop your input text files into the wc-inp directory.
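If you would rather stay in the shell, the same thing can be done with a put command (alice.txt here is just a placeholder for your own input file):

hdfs dfs -put alice.txt /user/ru1/wc-inp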
Execution step
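A typical way to submit the job, assuming the class and directory names used above (the output directory wc-out is an assumption and must not already exist):

hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out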

You can track the status of all your jobs in the HUE Job Browser at 127.0.0.1:8000.


Program output


Hope this helps you get familiar with the Hadoop ecosystem.

Friday, 2 October 2015

SAS problems

Chapter 15 : Creating Customized Reports

Although you can customize the output produced by PROC PRINT, there are times when you need a bit more control over the appearance of your report. PROC REPORT was developed to fit this need. Not only can you control the appearance of every column of your report, you can produce summary reports as well as detail listings.

Code: ch15_6.sas

/*6.Using the SAS data set Blood Pressure, produce a report showing Gender, Age, SBP,
and DBP. Order the report in Gender and Age order as shown here:
Subject's in Gender and Age Order
Systolic Diastolic
Blood Blood
Gender Age Pressure Pressure
F 28 120 70
48 138 88
68 110 62
78 132 76
M 23 144 90
35 142 82
45 150 96
55 130 80*/

title "Subject's in Gender and Age Order";
proc report data=a15022.bloodpressure nowd;
column Gender Age SBP DBP;
define Gender / order width=6;
define Age / order width=5;
define SBP / display "Systolic Blood Pressure" width=8;
define DBP / display "Diastolic Blood Pressure" width=9;
run;

Learning:
The program defines how each column is displayed, such as Age and the two blood pressure variables, and produces a report that is easy to read and understand.

Output:

Subject's in Gender and Age Order

                     Systolic    Diastolic
                        Blood        Blood
Gender    Age        Pressure     Pressure
F          28             120           70
           48             138           88
           68             110           62
           78             132           76
M          23             144           90
           35             142           82
           45             150           96
           55             130           80

Code: ch15_8.sas

/*
Using the data set Blood, produce a report like the one here. The numbers in the
table are the average WBC and RBC counts for each combination of blood type and
gender.

Average Blood Counts by Gender

        ------Gender------    ------Gender------
Blood    Female      Male      Female      Male
Type        WBC       WBC         RBC       RBC
------------------------------------------------
A         7,218     7,051        5.47      5.46
AB        7,421     6,893        5.43      5.69
B         6,716     6,991        5.52      5.42
O         7,050     6,930        5.53      5.48
*/

/*blood dataset*/

data a15022.blood;
   infile '/folders/myfolders/programs/blood.txt' truncover;
   length Gender $ 6 BloodType $ 2 AgeGroup $ 5;
   input Subject
         Gender
         BloodType
         AgeGroup
         WBC
         RBC
         Chol;
   label Gender    = "Gender"
         BloodType = "Blood Type"
         AgeGroup  = "Age Group"
         Chol      = "Cholesterol";
run;



title "Average Blood Counts by Gender";

proc report data=a15022.blood nowd headline;

column BloodType Gender,WBC Gender,RBC;

define BloodType / group 'Blood Type' width=5;

define Gender / across width=8 '-Gender-';

define WBC / analysis mean format=comma8.;

define RBC / analysis mean format=8.2;

run;

Learning:
The above code groups the data by blood type and breaks the average WBC and RBC counts down by gender, using Gender as an ACROSS variable.
Output:
Average Blood Counts by Gender

            -------Gender-------    -------Gender-------
Blood        Female        Male      Female        Male
Type            WBC         WBC         RBC         RBC
A             7,218       7,051        5.47        5.46
AB            7,421       6,893        5.43        5.69
B             6,716       6,991        5.52        5.42
O             7,050       6,930        5.53        5.48

Chapter 16: Summarizing your data

You may have thought of PROC MEANS (or PROC SUMMARY) primarily as a way to
generate summary reports, reporting the sums and means of your numeric variables.
However, these procedures are much more versatile and can be used to create summary
data sets that can then be analyzed with more DATA or PROC steps.
All the examples in this chapter use PROC MEANS rather than PROC SUMMARY, even when all you want is an output data set. The reason for this is that using PROC MEANS with the NOPRINT option is identical to using PROC SUMMARY.

Code: Ch16_4.sas

/*Chapter 16*/

/*Repeat Problem 3 (CLASS statement only), except group small and medium school
sizes together. Do this by writing a new format for SchoolSize (values are S, M, and
L). Do not use any DATA steps.*/

/*College Dataset*/

proc format library=a15022;
   value $yesno 'Y','1' = 'Yes'
                'N','0' = 'No'
                ' '     = 'Not Given';
   value $size 'S' = 'Small'
               'M' = 'Medium'
               'L' = 'Large'
               ' ' = 'Missing';
   value $gender 'F' = 'Female'
                 'M' = 'Male'
                 ' ' = 'Not Given';
run;



data a15022.college;
   length StudentID $ 5 Gender SchoolSize $ 1;
   do i = 1 to 100;
      StudentID = put(round(ranuni(123456)*10000),z5.);
      if ranuni(0) lt .4 then Gender = 'M';
      else Gender = 'F';
      if ranuni(0) lt .3 then SchoolSize = 'S';
      else if ranuni(0) lt .7 then SchoolSize = 'M';
      else SchoolSize = 'L';
      if ranuni(0) lt .2 then Scholarship = 'Y';
      else Scholarship = 'N';
      GPA = round(rannor(0)*.5 + 3.5,.01);
      if GPA gt 4 then GPA = 4;
      ClassRank = int(ranuni(0)*60 + 41);
      if ranuni(0) lt .1 then call missing(ClassRank);
      if ranuni(0) lt .05 then call missing(SchoolSize);
      if ranuni(0) lt .05 then call missing(GPA);
      output;
   end;
   format Gender $gender1.
          SchoolSize $size.
          Scholarship $yesno.;
   drop i;
run;



/*Program*/

proc format;
   value $groupsize 'S','M' = 'Small and Medium'
                    'L'     = 'Large';
run;

title "Statistics on the College Data Set";
title2 "Broken down by School Size";

proc means data=a15022.college
           n mean median min max maxdec=2;
   class SchoolSize;
   var ClassRank GPA;
   format SchoolSize $groupsize.;
run;

Learning:
Use of PROC MEANS with a CLASS statement and a user-defined format to generate a more detailed report and to find specific statistics such as the minimum, maximum and median.

Output:
Statistics on the College Data Set
Broken down by School Size

The MEANS Procedure

SchoolSize          N Obs   Variable     N      Mean    Median   Minimum   Maximum
Large                  18   ClassRank   15     72.80     71.00     45.00     98.00
                            GPA         16      3.61      3.61      2.98      4.00
Small and Medium       78   ClassRank   69     71.83     73.00     41.00    100.00
                            GPA         75      3.47      3.62      2.29      4.00


Code: Ch16_5.sas



/*Using the SAS data set College, report the mean GPA for the following categories of
ClassRank: 0–50 = bottom half, 51–74 = 3rd quartile, and 75 to 100 = top
quarter. Do this by creating an appropriate format. Do not use a DATA step.*/

proc format;
   value rank 0-50   = 'Bottom Half'
              51-74  = 'Third Quartile'
              75-100 = 'Top Quarter';
run;

title "Statistics on the College Data Set";
title2 "Broken down by School Size";

proc means data=a15022.college
           n mean maxdec=2;
   class ClassRank;
   var GPA;
   format ClassRank rank.;
run;

Learning:
In the above program we divide students into groups (bottom half, third quartile, top quarter) according to their class rank, using a user-defined format rather than a DATA step.
Output:
Statistics on the College Data Set
Broken down by School Size

The MEANS Procedure

Analysis Variable : GPA

ClassRank         N Obs     N    Mean
Bottom Half          15    14    3.78
Third Quartile       34    34    3.52
Top Quarter          39    35    3.42

Chapter 17: Counting Frequencies
PROC FREQ can be used to count frequencies of both character and numeric variables, in one-way, two-way, and three-way tables. In addition, you can use PROC FREQ to create output data sets containing counts and percentages. Finally, if you are statistically inclined, you can use this procedure to compute various statistics such as chi-square, odds ratio, and relative risk.

Code: Ch17_1.sas

/*Chapter 17*/

/*Using the SAS data set Blood, generate one-way frequencies for the variables
Gender, BloodType, and AgeGroup. Use the appropriate options to omit the
cumulative statistics and percentages.*/

title "One-way Frequencies from BLOOD Data Set";

proc freq data=a15022.blood;
   tables Gender BloodType AgeGroup / nocum nopercent;
run;

Learning:
The above program shows the use of PROC FREQ and the TABLES statement to count and display one-way frequencies.

Output:

One-way Frequencies from BLOOD Data Set

The FREQ Procedure

Gender       Frequency
Female             440
Male               560

Blood Type   Frequency
A                  412
AB                  44
B                   96
O                  448

Age Group    Frequency
Old                598
Young              402

Code: Ch17_2.sas





/*5. Program 17-2*/

/*
Using the SAS data set BloodPressure, generate frequencies for the variable Age.
Use a user-defined format to group ages into three categories: 40 and younger, 41 to
60, and 61 and older. Use the appropriate options to omit the cumulative statistics
and percentages.
*/

proc format;
   value agegrp low-40  = '40 and lower'
                41-60   = '41 to 60'
                61-high = '61 and higher';
run;

title "Using a Format to Regroup Values";

proc freq data=a15022.bloodpressure;
   tables age / nocum nopercent;
   format age agegrp.;
run;

Learning: The above program shows frequencies by age group, using a user-defined format together with the TABLES statement.

Output:

Using a Format to Regroup Values

The FREQ Procedure

Age              Frequency
40 and lower             3
41 to 60                 3
61 and higher            2

Please click on below link to find all the programs…