Sunday, 20 December 2015

Get started with Hive...

What is Hive?

Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hadoop is a framework for handling large datasets in a distributed computing environment.

Getting Started
We will be using the same data that we used in the Pig tutorial below, namely the files Batting.csv and Master.csv.
Data Source
Accessing Hue
You can access Hue by entering the address 127.0.0.1:8000 in your browser.
Login ID: Hue
Password: 1111
Uploading Data
Data is uploaded into the /user/hue directory of the HDFS file system. The steps to upload the files into this directory are covered in my previous blog posts.
[Screenshot: Beeswax query editor]
Writing Queries
Click on the query editor to go to the query page.
Queries
The first step is to create a temp_batting table with the following query:

create table temp_batting (col_value STRING);
Creating Table Query


You can see the tables by clicking the 'Table' option.


Table Temp_batting


Loading the CSV file

LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;


data loading into temp_batting


We can check that the data loaded correctly by displaying the first 100 rows:

SELECT * FROM temp_batting LIMIT 100;

temp_batting data display


Creating the batting table

create table batting (player_id STRING, year INT, runs INT);

[Screenshot: batting table columns]


Now we extract the data we want from temp_batting and copy it into batting. We do this with a regexp pattern, building a multi-line query. The first line overwrites the batting table; the next three lines extract the player_id, year and runs fields from temp_batting.

insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) runs
from temp_batting;
Once the query is executed, we can check the job status by entering the address 127.0.0.1:8088 in the web browser; it shows a status page like the one below.
Grouping Data


Which player scored the most runs in each year?
SELECT a.year, a.player_id, a.runs FROM batting a JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b ON (a.year = b.year AND a.runs = b.runs);

[Screenshot: maximum runs per year]

Thank you.




Sunday, 6 December 2015

Hadoop with Pig...

Introduction

Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, is a simple query algebra that lets you express data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing.
Pig Latin queries execute in a distributed fashion on a cluster. The current implementation compiles Pig Latin programs into MapReduce jobs and executes them on a Hadoop cluster.

Problem statement
We are going to read in a baseball statistics file and compute the highest number of runs scored by a player in each year. The file contains all the statistics from 1871 to 2011, over 90,000 rows. Once we have the highest runs, we will extend the script to translate the player id field into the players' first and last names (sketched at the end of this post).
We start VirtualBox to get Hadoop running; then we will see a GUI screen like the one below.

Upload the .csv files using the upload button in the right corner.

Now we can open the Pig console by clicking the Pig icon in the title bar; there we can write, edit and save Pig scripts.
Steps (a sketch of the complete script follows the list):
1. We need to load the data first. For that we use the LOAD statement.
2. To filter out the first row (the header) of the data we add a FILTER statement.
3. Now we name the fields. We use a FOREACH statement to iterate over the batting data object; the Pig helper can provide a template if required. The FOREACH statement iterates through the batting data object, and GENERATE pulls out the selected fields and assigns them names. The new data object we create is named runs.
4. We use a GROUP statement to group the elements in runs by the year field.
5. We use a FOREACH statement to find the maximum runs for each year.
6. We join the maximum runs back with the runs data object so we can pick up the player who scored them.
7. The result will be a dataset with "Year, PlayerID and Max Run".
8. At the end we DUMP the data to the output.
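
Below is a minimal sketch of the complete script that these steps describe. It assumes Batting.csv has been uploaded to your HDFS home directory and that, as in the Lahman dataset, playerID is column $0, yearID is column $1 and runs is column $8; adjust the path and the column positions if your file differs.

batting = LOAD 'Batting.csv' USING PigStorage(',');                             -- step 1: load the raw CSV
raw_runs = FILTER batting BY $1 > 0;                                            -- step 2: drop the header row (its year is not numeric)
runs = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;        -- step 3: name the fields we need
grp_data = GROUP runs BY (year);                                                -- step 4: group by year
max_runs = FOREACH grp_data GENERATE group AS grp, MAX(runs.runs) AS max_runs;  -- step 5: maximum runs per year
join_max_run = JOIN max_runs BY ($0, max_runs), runs BY (year, runs);           -- step 6: join back to find the player
join_data = FOREACH join_max_run GENERATE $0 AS year, $2 AS playerID, $1 AS runs;  -- step 7: Year, PlayerID, Max Run
DUMP join_data;                                                                 -- step 8: write the result to the output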


Save and execute the script:


We can see the job status:

Results and logs are available in the query history:

We can see the results with the columns we specified in the code: "Year", "Player_id" and "Max_run".


We should always check the log to see if the script was executed correctly.
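
As mentioned in the problem statement, the script can be extended to translate the player id into the players' first and last names using Master.csv. Here is a hedged sketch of that extension, assuming Master.csv has been uploaded next to Batting.csv and that nameFirst and nameLast are columns $13 and $14 of that file; check your copy of the data before relying on these positions.

names = LOAD 'Master.csv' USING PigStorage(',');                    -- player master data (assumed layout)
name_fields = FOREACH names GENERATE $0 AS playerID, $13 AS nameFirst, $14 AS nameLast;
named_runs = JOIN join_data BY playerID, name_fields BY playerID;   -- attach names to the yearly leaders
result = FOREACH named_runs GENERATE join_data::year AS year, name_fields::nameFirst AS first_name, name_fields::nameLast AS last_name, join_data::runs AS runs;
DUMP result;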