Thursday, June 12, 2014

Lesson 2 Data Wrangling

Data Manipulation
- from files, databases, and Web APIs.

Dealing with messy data. What tools do you use?

Acquiring data
- Acquiring data is often not fancy.
- Find data on the internet.
- A lot of data is stored in text files and on government websites.

Common data formats:
- CSV
- XML
- JSON (nested structure)
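JSON's nested structure maps naturally onto Python dicts and lists. A minimal sketch with the standard-library json module (the player record here is invented for illustration):

```python
import json

# A small JSON document with nested structure, invented for illustration.
raw = '{"player": {"nameFirst": "Hank", "nameLast": "Aaron", "teams": ["ML1", "ATL"]}}'

parsed = json.loads(raw)  # nested objects/arrays come back as dicts/lists
first_team = parsed["player"]["teams"][0]
full_name = parsed["player"]["nameFirst"] + " " + parsed["player"]["nameLast"]
```

Once parsed, nested fields are reached by chaining ordinary dict and list indexing.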

import pandas

def add_full_name(path_to_csv, path_to_new_csv):
    # Assume you will be reading in a csv file with the same columns that the
    # Lahman baseball data set has -- most importantly, there are columns
    # called 'nameFirst' and 'nameLast'.
    #
    # 1) Write a function that reads a csv located at "path_to_csv" into a
    #    pandas dataframe and adds a new column called 'nameFull' with a
    #    player's full name.
    #
    #    For example: for Hank Aaron, nameFull would be 'Hank Aaron'.
    #
    # 2) Write the data in the pandas dataframe to a new csv file located at
    #    path_to_new_csv.
    baseballData = pandas.read_csv(path_to_csv)
    baseballData['nameFull'] = baseballData['nameFirst'] + ' ' + baseballData['nameLast']
    baseballData.to_csv(path_to_new_csv)

Good job! You updated the CSV file correctly.
The following is the output of your program (showing only the name-related columns):

  0,Hank,Aaron,Hank Aaron
  1,Tommie,Aaron,Tommie Aaron
  2,Don,Aase,Don Aase
  3,Andy,Abad,Andy Abad
  ............

Relational Databases:
Why are relational databases useful?
- It is straightforward to extract aggregated data from complex files.
- Databases scale well.
- They ensure all data is consistently formatted.

Database Schema
Schemas are blueprints: they define each table, its columns, and their types.
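The blueprint idea can be sketched with Python's built-in sqlite3 module: the CREATE TABLE statement declares the structure before any rows exist. The table and columns here are hypothetical, loosely modeled on the aadhaar data used later:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema declares structure up front: column names and types.
conn.execute("""
    CREATE TABLE aadhaar_data (
        district TEXT,
        gender TEXT,
        age INTEGER,
        aadhaar_generated INTEGER
    )
""")

# Rows inserted afterwards must conform to that blueprint.
conn.execute("INSERT INTO aadhaar_data VALUES ('Anand', 'F', 62, 120)")
rows = conn.execute("SELECT district, aadhaar_generated FROM aadhaar_data").fetchall()
```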

Simple Queries

Select * from aadhaar_data;
Select * from aadhaar_data limit 20;
Select district, subdistrict from aadhaar_data limit 20;

Write your own......

import pandas
import pandasql

def select_first_50(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename
    # the columns by replacing spaces with underscores and setting all characters
    # to lowercase, so the column names more closely resemble column names one
    # might find in a table.
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns=lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Select the first 50 values for "registrar" and "enrolment_agency"
    # in the aadhaar_data table using SQL syntax.
    #
    # Note that "enrolment_agency" is spelled with one l. Also, the order
    # of the select does matter: select registrar, then enrolment_agency.
    q = """
    SELECT registrar, enrolment_agency
    FROM aadhaar_data
    LIMIT 50;
    """

    # Execute the SQL command against the pandas frame.
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution


Complex Queries

Select * from aadhaar_data
where state='Gujarat';

Select district, sum(aadhaar_generated)
from aadhaar_data group by district;

Select district, subdistrict, sum(aadhaar_generated)
from aadhaar_data where age>60 group by district, subdistrict;

Write your own....

import pandas
import pandasql

def aggregate_query(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename
    # the columns by replacing spaces with underscores and setting all characters
    # to lowercase, so the column names more closely resemble column names one
    # might find in a table.
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns=lambda x: x.replace(' ', '_').lower(), inplace=True)

    # Write a query that will select from the aadhaar_data table how many men
    # and how many women over the age of 50 have had aadhaar generated for them
    # in each district.
    #
    # Note that in this quiz, the SQL query keywords are case sensitive.
    # For example, if you want to do a sum make sure you type 'sum' rather
    # than 'SUM'.
    #
    # The possible columns to select from aadhaar_data are:
    #     1) registrar
    #     2) enrolment_agency
    #     3) state
    #     4) district
    #     5) sub_district
    #     6) pin_code
    #     7) gender
    #     8) age
    #     9) aadhaar_generated
    #     10) enrolment_rejected
    #     11) residents_providing_email
    #     12) residents_providing_mobile_number
    #
    # You can download a copy of the aadhaar data that we are passing
    # into this exercise here:
    # https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv
    q = """
    select gender, district, sum(aadhaar_generated)
    from aadhaar_data
    where age > 50
    group by gender, district;
    """

    # Execute the SQL command against the pandas frame.
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution

APIs
Files, Databases, APIs
Application Programming Interface

How to Access APIs Efficiently

API Exercise:

import json
import requests

def api_get_request(url):
    # In this exercise, you want to call the last.fm API to get a list of the
    # top artists in Spain. Once you've done this, return the name of the
    # number 1 top artist in Spain.
    data = json.loads(requests.get(url).text)
    # Return the top artist in Spain.
    return data['topartists']['artist'][0]['name']

url = 'http://ws.audioscrobbler.com/2.0/?method=geo.gettopartists&country=spain&api_key=0a5de916777ad83d6de29347308556a1&format=json'
print(api_get_request(url))
 
 
Sanity Checking Data
Does data make sense?
Is there a problem?
Does the data look like I expect it to?
 
baseball = pandas.read_csv('../data/master.csv')
baseball.describe()

describe() reports the count, mean, standard deviation, minimum, 25th/50th/75th percentiles, and maximum for each numeric column.
Missing Values - why are they missing?
1. Occasionally, system errors prevent data from being recorded (occasional failure).
2. Some subset of subjects or event types is systematically missing certain data attributes, or missing entirely (e.g. nonresponse).

Dealing with missing data
- Partial deletion
  (1) Listwise deletion: drop any record with a missing value.
  (2) Pairwise deletion: use all records that have values for the variables in each particular analysis.
- Imputation
   Why impute?
    - Not much data to begin with.
    - Removing data could affect representativeness.
    - Many different imputation techniques exist.
    - Imputation is really hard to get right.
   How to impute:
   (1) Use the mean value. Good: doesn't change the mean across the sample. Bad: lessens correlations between variables.
   (2) Impute using linear regression: from the data we have, fit an equation that predicts the missing variable, then predict the missing values.
       Drawbacks: overemphasizes trends; exact values suggest too much certainty.
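The two deletion strategies above can be sketched with pandas (the toy DataFrame is invented for illustration):

```python
import pandas
import numpy

# Toy data with scattered missing values, invented for illustration.
df = pandas.DataFrame({
    "name":   ["Hank", "Tommie", "Don"],
    "weight": [180.0, numpy.nan, 190.0],
    "height": [72.0, 75.0, numpy.nan],
})

# Listwise deletion: drop every row that has at least one missing value.
listwise = df.dropna()

# Pairwise deletion: each statistic uses all rows available for that column,
# so different statistics may be computed over different subsets of rows.
mean_weight = df["weight"].mean()  # skips the NaN by default
mean_height = df["height"].mean()
```

Listwise deletion keeps only the one fully-complete row, while the pairwise means each use two of the three rows.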

Imputation Exercise:

import pandas
import numpy

def imputation(filename):
    # Pandas dataframes have a method called 'fillna(value)', such that you can
    # pass in a single value to replace any NAs in a dataframe or series. You
    # can call it like this:
    #     dataframe['column'] = dataframe['column'].fillna(value)
    #
    # Using the numpy.mean function, which calculates the mean of a numpy
    # array, impute any missing values in our Lahman baseball
    # data set's 'weight' column by setting them equal to the average weight.
    #
    # You can access the 'weight' column in the baseball data frame by
    # calling baseball['weight'].
    baseball = pandas.read_csv(filename)
    baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))
    return baseball

Imputation:
- Just the tip of the iceberg; more sophisticated methods exist.
- Filling in the mean and linear regression are both simple and relatively effective.
- Both have negative side effects and can obscure or amplify trends in the data.
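Regression imputation, mentioned above, can be sketched with numpy's least-squares polynomial fit: fit weight against height on the complete rows, then predict the missing weights. The toy data is invented and deliberately collinear so the prediction is easy to check:

```python
import numpy
import pandas

# Toy data: weight tracks height; one weight is missing.
df = pandas.DataFrame({
    "height": [70.0, 72.0, 74.0, 76.0],
    "weight": [170.0, 180.0, numpy.nan, 200.0],
})

# Fit weight = slope * height + intercept on the complete rows only.
known = df.dropna()
slope, intercept = numpy.polyfit(known["height"], known["weight"], 1)

# Predict the missing weights from height.
missing = df["weight"].isnull()
df.loc[missing, "weight"] = slope * df.loc[missing, "height"] + intercept
```

This illustrates the drawback noted above: the imputed value sits exactly on the fitted line, overstating how certain we are about it.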

Assignment #2
- Acquire weather data via weather underground API
- Get a sense of data using SQL queries
- Clean and process the data










