Sunday, 9 August 2015

Text analysis with Python

What is Python?

Python is an interpreted, object-oriented, high-level, dynamically typed, general-purpose programming language.
It is easy to learn, yet powerful and versatile as a scripting language.

Python for Data Science

Python is an interpreted, dynamically typed language with a concise, readable syntax.
It is open source.
It has a large, active community, so you can easily find experts to help with the challenges you face.
It has a growing set of data analytics libraries, such as NumPy, SciPy, StatsModels, scikit-learn and pandas.

Let’s get started…

The objective of this blog is to get you started with data processing using Python. We will learn the fundamentals of working with text, look at some basic data structures in Python, and read data from a text file and process it to get some insights.

Data structures in Python…
Lists and dictionaries are two important data structures in Python.

List
A list stores values of the same or different data types. It is a container that holds other objects in a given order. Operations such as insertion and deletion can be performed on lists.

Creating and printing a simple list
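Here is a minimal sketch of creating and printing a list, plus the insertion and deletion operations mentioned above. It assumes Python 3, and the values in the list are made up for illustration.

```python
# A list can mix data types and keeps its items in order.
# The values below are illustrative only.
radishes = ["Champion", "Red King", 42, 3.5]

print(radishes)       # print the whole list
print(radishes[0])    # print the first item (indexes start at 0)

radishes.append("Plum Purple")      # insert at the end
radishes.insert(1, "April Cross")   # insert at index 1
radishes.remove(42)                 # delete the first item equal to 42
del radishes[-1]                    # delete the last item by index

print(radishes)
```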

Dictionary
A dictionary stores data as key-value pairs.
Each pair, i.e. a key together with its value, is known as an item.

Creating and printing a dictionary
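A small sketch of creating and printing a dictionary could look like this. It assumes Python 3; the keys and numbers are invented for the example and are not survey data.

```python
# Each item is a key-value pair: here, a vegetable name and a quantity.
# The data is illustrative only.
stock = {"radish": 30, "carrot": 12}

print(stock)             # print the whole dictionary
print(stock["radish"])   # look up a value by its key

stock["leek"] = 5        # add a new item
stock["carrot"] += 1     # update an existing value

# Loop over all items (key-value pairs)
for name, quantity in stock.items():
    print(name, quantity)
```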


Reading from a text file…
The following code snippet gives you an idea of how to read from a text file.
We can then store the extracted data in Python data structures to process it further.
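One possible way to read a text file into a list is sketched below, assuming Python 3. The helper name read_lines and the file name in the usage comment are hypothetical, chosen only for this example.

```python
def read_lines(path):
    """Return the lines of a text file as a list of strings.

    The trailing newline of each line is removed. `read_lines` is a
    hypothetical helper name used for this sketch.
    """
    # "with" closes the file automatically, even if an error occurs
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

# Usage, assuming a file called "survey.txt" exists next to the script:
# lines = read_lines("survey.txt")
# print(len(lines), "lines read")
```

Once the lines are in a list, they can be processed with ordinary loops, or moved into a dictionary as needed.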


The information above is just an overview of the Python components we are going to use in the rest of this blog. You will, no doubt, need more practice on these topics.


Problem solving

Let's solve a problem with the skills we have just picked up and get started with data analysis.

The problem:

Suppose you're a greengrocer, and you run a survey to see which radish varieties your customers prefer the most. You have your assistant type up the survey results into a text file on your computer, so you have 300 lines of survey data in the file (Data File). Each line consists of a name, a hyphen, then a radish variety.
Now we will find out:

1. Which is the most popular radish variety (the one with the most votes)?
2. Which is the least popular radish variety?
3. Was there any fraud (people who voted more than once)?
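Before answering these questions, each line has to be split into its two parts. The sketch below assumes Python 3 and assumes the hyphen is surrounded by spaces, as in "name - variety"; the real file may differ. The function name parse_vote and the sample line are made up for illustration.

```python
def parse_vote(line):
    """Split one survey line into a (name, variety) tuple.

    Assumes the format "name - variety", with spaces around the hyphen.
    Raises ValueError if the separator is missing.
    """
    # Split only on the first " - " so the variety is kept in one piece
    name, variety = line.strip().split(" - ", 1)
    return name.strip(), variety.strip()

# Illustrative line, not taken from the survey file
print(parse_vote("Ana Example - Champion\n"))
```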


Next, we will print how many votes each radish variety received. This will give us some insight into the data.
Refer to the code snippet below.
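A minimal sketch of the counting step, assuming Python 3 and lines of the form "name - variety". The sample list stands in for the lines read from the survey file; its names and votes are invented, so the counts it prints are not the survey results.

```python
def count_votes(lines):
    """Return a dictionary mapping each variety to its number of votes."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, variety = line.split(" - ", 1)
        variety = variety.strip()
        # get() returns 0 the first time a variety is seen
        counts[variety] = counts.get(variety, 0) + 1
    return counts

# Invented sample lines, standing in for the real survey file
sample = ["Ana - Champion", "Ben - Red King", "Cleo - Champion"]
counts = count_votes(sample)

for variety, votes in counts.items():
    print(variety, votes)

# The key with the largest / smallest count
most_popular = max(counts, key=counts.get)
least_popular = min(counts, key=counts.get)
print("Most popular:", most_popular)
print("Least popular:", least_popular)
```

If several varieties tie, max and min return only one of them, so ties would need separate handling.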



You can now see how many votes each variety got.
Our winner is 'Champion', with the highest count of 76 votes.
And the least popular radish is 'Red King'.

Now, who voted more than once?
Refer to the following code snippet.
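One way to find repeat voters is to remember every name seen so far and flag a name when it shows up again. This is a sketch under the same assumptions as before (Python 3, lines of the form "name - variety"). The function name find_repeat_voters and the sample lines are hypothetical.

```python
def find_repeat_voters(lines):
    """Return the names that appear on more than one line, in order found."""
    seen = set()     # names encountered so far
    repeats = []     # names found more than once, without duplicates
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name = line.split(" - ", 1)[0].strip()
        if name in seen and name not in repeats:
            repeats.append(name)
        seen.add(name)
    return repeats

# Invented sample lines, not the real survey data
sample_votes = [
    "Ana - Champion",
    "Ben - Red King",
    "Ana - Red King",
    "Ana - Champion",
]
print(find_repeat_voters(sample_votes))
```

Note that this compares names exactly, so "Ana" and "ana" would count as two different people.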


And here is your answer:







