Monday, June 23, 2014

Project 3 Analyzing NYC Subway and Weather Data

Exercise 1 - Exploratory Data Analysis

import numpy as np
import pandas
import matplotlib.pyplot as plt
def entries_histogram(turnstile_weather):
    '''
    Before we perform any analysis, it might be useful to take a
    look at the data we're hoping to analyze. More specifically, let's
    examine the hourly entries in our NYC subway data and determine what
    distribution the data follows. This data is stored in a dataframe
    called turnstile_weather under the ['ENTRIESn_hourly'] column.
   
    Let's plot two histograms on the same axes to show hourly
    entries when raining vs. when not raining. Here's an example on how
    to plot histograms with pandas and matplotlib:
    turnstile_weather['column_to_graph'].hist()
   
    Your histogram may look similar to the bar graph in the instructor notes below.
   
    You can read a bit about using matplotlib and pandas to plot histograms here:
    http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
   
    You can see the information contained within the turnstile weather data here:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
   
    plt.figure()
    turnstile_weather['...'] # your code here to plot a histogram of hourly entries when it is raining
    turnstile_weather['...'] # your code here to plot a histogram of hourly entries when it is not raining
    return plt
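
A sketch of the two plotting lines (the 'rain' column is 1 when it rained; the bin count is an arbitrary choice, not part of the assignment):

    turnstile_weather[turnstile_weather['rain'] == 0]['ENTRIESn_hourly'].hist(bins=50, label='no rain')
    turnstile_weather[turnstile_weather['rain'] == 1]['ENTRIESn_hourly'].hist(bins=50, label='rain')
    plt.legend()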


Exercise 2 - Welch's t-test?

Does the entries data from the previous exercise seem normally distributed? No.

Can we run Welch's t-test on the entries data? Yes.

Correct. Good job!


Exercise 3 - Mann Whitney U Test

import numpy as np
import scipy
import scipy.stats
import pandas
def mann_whitney_plus_means(turnstile_weather):
    '''
    This function will consume the turnstile_weather dataframe containing
    our final turnstile weather data.
   
    You will want to take the means and run the Mann Whitney U test on the
    ENTRIESn_hourly column in the turnstile_weather dataframe.
   
    This function should return:
        1) the mean of entries with rain
        2) the mean of entries without rain
        3) the Mann-Whitney U statistic and p-value comparing number of entries
           with rain and the number of entries without rain. 
   
    You should feel free to use scipy's Mann-Whitney implementation, and
    also might find it useful to use numpy's mean function. 
   
    Here is some documentation:
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
   
    You can look at the final turnstile weather data at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
   
   
    return with_rain_mean, without_rain_mean, U, p
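
A sketch of one possible completion (the 'rain' column is 1 on rainy days; note that this version of scipy's mannwhitneyu returns a one-sided p-value):

    with_rain = turnstile_weather[turnstile_weather['rain'] == 1]['ENTRIESn_hourly']
    without_rain = turnstile_weather[turnstile_weather['rain'] == 0]['ENTRIESn_hourly']
    with_rain_mean = np.mean(with_rain)
    without_rain_mean = np.mean(without_rain)
    U, p = scipy.stats.mannwhitneyu(with_rain, without_rain)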


Exercise 4 Rainy Day Ridership vs. Non-Rainy Day Ridership

Is the distribution of the number of entries statistically different between rainy and non-rainy days? Yes.

Describe your results and methods. Rainy days have higher ridership.

Correct. Good job!


Exercise 5 Linear Regression

import numpy as np
import pandas
def normalize_features(array):
    """
    Normalize the features in our data set.
    """
    array_normalized = (array - array.mean()) / array.std()
    mu = array.mean()
    sigma = array.std()
    return array_normalized, mu, sigma
def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, and the values for our thetas.
   
    This should be the same code as the compute_cost function in the lesson #3 exercises. But
    feel free to implement your own.
    """
   
    # your code here
    return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
   
    This is the same gradient descent code as in the lesson #3 exercises. But feel free
    to implement your own.
    """
    m = len(values)
    cost_history = []
    for i in range(num_iterations):
        pass  # your code here
    return theta, pandas.Series(cost_history)
def predictions(dataframe):
    '''
    The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
    Using the information stored in the dataframe, let's predict the ridership of
    the NYC subway using linear regression with gradient descent.
   
    You can look at information contained in the turnstile weather dataframe
    at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv   
   
    Your prediction should have an R^2 value of .40 or better.
   
    Note: due to the memory and CPU limitations of our Amazon EC2 instance, we will
    give you a random subset (~15%) of the data contained in turnstile_data_master_with_weather.csv
   
    If you receive a "server has encountered an error" message, that means you are hitting
    the 30-second limit that's placed on running your program. Try using a smaller number
    for num_iterations if that's the case.
   
    Or if you are using your own algorithm/model, see if you can optimize your code so it
    runs faster.
    '''
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = dataframe[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
    values = dataframe[['ENTRIESn_hourly']]
    m = len(values)
    features, mu, sigma = normalize_features(features)
    features['ones'] = np.ones(m)
    features_array = np.array(features)
    values_array = np.array(values).flatten()
    #Set values for alpha, number of iterations.
    alpha = 0.1 # please feel free to play with this value
    num_iterations = 75 # please feel free to play with this value
    #Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array, values_array, theta_gradient_descent,
                                                            alpha, num_iterations)
    predictions = np.dot(features_array, theta_gradient_descent)
    return predictions


Exercise 6 Plotting Residuals

import numpy as np
import scipy
import matplotlib.pyplot as plt
def plot_residuals(turnstile_weather, predictions):
    '''
    Using the same methods that we used to plot a histogram of entries
    per hour for our data, why don't you make a histogram of the residuals
    (that is, the difference between the original hourly entry data and the predicted values).
    Based on this residual histogram, do you have any insight into how our model
    performed?  Reading a bit on this webpage might be useful:
    http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
    '''
   
    plt.figure()
    (turnstile_weather['''choose the right column here'''] - predictions).hist()
    return plt
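
The "right column" here is presumably ENTRIESn_hourly, since that is what the model predicts; a sketch (the bin count is arbitrary):

    (turnstile_weather['ENTRIESn_hourly'] - predictions).hist(bins=50)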


Exercise 7 Compute R Squared

import numpy as np
import scipy
import matplotlib.pyplot as plt
import sys
def compute_r_squared(data, predictions):
    '''
    In exercise 5, we calculated the R^2 value for you. But why don't you try to
    calculate the R^2 value yourself?
   
    Given a list of original data points, and also a list of predicted data points,
    write a function that will compute and return the coefficient of determination (R^2)
    for this data.  numpy.mean() and numpy.sum() might both be useful here, but are
    not necessary.
    Documentation about numpy.mean() and numpy.sum() below:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
    '''
   
    # your code here
   
    return r_squared


Exercise 8 Non-Gradient-Descent Linear Regression

import numpy as np
import pandas
import scipy
import statsmodels.api as sm
"""
In this optional exercise, you should complete the function called
predictions(turnstile_weather). This function takes in our pandas
turnstile weather dataframe, and returns a set of predicted ridership values,
based on the other information in the dataframe. 
You should attempt to implement another type of linear regression,
that you may have read about, such as ordinary least squares regression:
http://en.wikipedia.org/wiki/Ordinary_least_squares
This is your playground. Go wild!
How does your choice of linear regression compare to linear regression
with gradient descent?
You can look at the information contained in the turnstile_weather dataframe below:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
Note: due to the memory and CPU limitations of our Amazon EC2 instance, we will
give you a random subset (~15%) of the data contained in turnstile_data_master_with_weather.csv
If you receive a "server has encountered an error" message, that means you are hitting
the 30-second limit that's placed on running your program. See if you can optimize your code so it
runs faster.
"""
def predictions(weather_turnstile):
    #
    # Your implementation goes here. Feel free to write additional
    # helper functions
    #
    return prediction
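
A sketch of one approach using ordinary least squares via statsmodels; the feature choice simply mirrors Exercise 5 and is an assumption, not the graded answer:

import pandas
import statsmodels.api as sm

def predictions(weather_turnstile):
    # Same features as the gradient descent version in Exercise 5
    dummy_units = pandas.get_dummies(weather_turnstile['UNIT'], prefix='unit')
    features = weather_turnstile[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
    features = sm.add_constant(features)
    values = weather_turnstile['ENTRIESn_hourly']
    # OLS solves for the coefficients in closed form, so there is no alpha
    # or num_iterations to tune
    results = sm.OLS(values, features).fit()
    prediction = results.predict(features)
    return prediction

Compared with gradient descent, OLS reaches the exact least-squares fit directly, at the cost of building and solving the full design matrix.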






Friday, June 20, 2014

Lesson 3 Data Analysis

Statistical Rigor
    Significance tests
    - using our data, can we disprove an assumption with a pre-defined level of confidence?

Why is statistics useful?
- It provides a formalized framework for comparing and evaluating data.
- It enables us to evaluate whether perceived effects in our dataset reflect differences across the whole population.

Statistical Significance Tests
- many tests make assumptions about the data's distribution.
- a very common distribution is the normal distribution (aka Gaussian distribution, bell curve)

t-test
accept or reject a null hypothesis.
Null hypothesis: a statement we are trying to disprove by running our test.
- two samples came from the same population
- a sample is drawn from a probability distribution
specified in terms of a test statistic

Test Statistic: one number that helps accept or reject the null hypothesis
t test - t

A few different versions depending on assumptions
- equal sample size?
- same variance?

Two sample test

calculate t, calculate nu, calculate p;
p-value: the probability of obtaining a test statistic at least as extreme as ours if the null hypothesis were true.
set p-critical; if p < p-critical, reject the null hypothesis;
otherwise we cannot reject the null hypothesis.

Exercise: Calculate t and nu (the degrees of freedom)

Welch's t-test in Python
Is there a simple way to do this in Python?

import scipy.stats
scipy.stats.ttest_ind(list_1, list_2, equal_var=False)

scipy.stats.ttest_ind assumes a two-sided test. How could we use the output to instead perform a one-sided test?
- "greater" alternative: reject the null if p/2 < p_critical and t > 0
- "less" alternative: reject the null if p/2 < p_critical and t < 0
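
In code (list_1 and list_2 are hypothetical samples):

    t, p = scipy.stats.ttest_ind(list_1, list_2, equal_var=False)  # p is two-sided
    # 'greater' alternative: reject the null if t > 0 and p/2 < p_critical
    # 'less' alternative:    reject the null if t < 0 and p/2 < p_critical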


Exercise Welch's t-test

import numpy
import scipy.stats
import pandas
def compare_averages(filename):
    """
    Performs a t-test on two sets of baseball data (left-handed and right-handed hitters).
    You will be given a csv file that has three columns.  A player's
    name, handedness (L for lefthanded or R for righthanded) and their
    career batting average (called 'avg'). You can look at the csv
    file via the following link:
    https://www.dropbox.com/s/xcn0u2uxm8c4n6l/baseball_data.csv
   
    Write a function that will read the csv file into a pandas data frame,
    and run Welch's t-test on the two cohorts defined by handedness.
   
    One cohort should be a data frame of right-handed batters. And the other
    cohort should be a data frame of left-handed batters.
   
    We have included the scipy.stats library to help you write
    or implement Welch's t-test:
    http://docs.scipy.org/doc/scipy/reference/stats.html
   
    With a significance level of 95%, if there is no difference
    between the two cohorts, return a tuple consisting of
    True, and then the tuple returned by scipy.stats.ttest_ind.
   
    If there is a difference, return a tuple consisting of
    False, and then the tuple returned by scipy.stats.ttest_ind.
   
    For example, the tuple that you return may look like:
    (True, (9.93570222, 0.000023))
    """

Non-Parametric Tests
A statistical test that does not assume our data is drawn from any particular underlying probability distribution.

Mann-Whitney U test: tests the null hypothesis that the two populations are the same.
u, p = scipy.stats.mannwhitneyu(x, y)

Non-normal Data
Shapiro-Wilk test
w, p = scipy.stats.shapiro(data)

Just the tip of the iceberg:
Many different statistical methods!
Data scientists have large toolkits.

Predicting Future Data
How might we use the data we've collected to make predictions about the data we don't have?

Machine Learning: a branch of artificial intelligence that focuses on constructing systems that learn from large amounts of data to make predictions.

Why is machine learning useful?

Statistics vs. Machine Learning
What is the difference between statistics and machine learning?
- Statistics is focused on analyzing existing data, and drawing valid conclusions.
- Machine learning is focused on making predictions.

Types of Machine Learning
Data- Model- Predictions
Supervised Learning: e.g., detecting spam emails
- Have examples with input and output;
- Predict output for future, input-only data;
- Classification;
- Regression.

Unsupervised Learning:
- Trying to understand structure of data
- Clustering

Kurt's favorite machine learning algorithms: clustering and PCA.

Predicting HR: an example of supervised learning
- can we write an equation that takes a bunch of info (e.g., height, weight, birth year, position for baseball players) and predicts HR (home runs)? Using regression.

Linear Regression with Gradient Descent
If we have age, height, weight, and batting average, and are trying to predict batting average, what are the input variables? Age, height, and weight.

Gradient Descent
Cost function: J(theta) - we want to minimize this!
J(theta) = 1/(2m) * sum((predicted - observed)^2), which is what compute_cost below implements.

how: alpha, the learning rate (controls the size of each update step)
update rule: theta := theta - (alpha/m) * features^T * (predicted - observed), repeated num_iterations times

Gradient Descent in Python

import numpy
import pandas
def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, and values for our thetas.
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)
    return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """
    # Write some code here that updates the values of theta a number of times equal to
    # num_iterations.  Every time you have computed the cost for a given set of thetas,
    # you should append it to cost_history.  The function should return both the final
    # values of theta and the cost history.
    # YOUR CODE GOES HERE
    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        predicted_values = numpy.dot(features, theta)
        theta = theta - (alpha / m) * numpy.dot(predicted_values - values, features)

        cost = compute_cost(features, values, theta)
        cost_history.append(cost)

    return theta, pandas.Series(cost_history)

Good job! Your program and algorithm worked

Theta =
[ 45.35759233  -9.02442042  13.69229668]

Cost History =
0     3748.133469
1     3727.492258
2     3707.261946
3     3687.434249
4     3668.001052
5     3648.954405
6     3630.286519
7     3611.989767
8     3594.056675
9     3576.479921
10    3559.252334
11    3542.366888
12    3525.816700
13    3509.595027
14    3493.695263
...
985    2686.730820
986    2686.730290
987    2686.729764
988    2686.729240
989    2686.728720
990    2686.728203
991    2686.727690
992    2686.727179
993    2686.726672
994    2686.726168
995    2686.725668
996    2686.725170
997    2686.724676
998    2686.724185
999    2686.723697
Length: 1000, dtype: float64


Coefficient of Determination: R squared
data: y_1, ..., y_n
predictions: f_1, ..., f_n
average of the data: y_bar

R^2 = 1 - sum((y_i - f_i)^2) / sum((y_i - y_bar)^2)

Calculating R Squared

import numpy as np
def compute_r_squared(data, predictions):
    # Write a function that, given two input numpy arrays, 'data', and 'predictions,'
    # returns the coefficient of determination, R^2, for the model that produced
    # predictions.
    #
    # Numpy has a couple of functions -- np.mean() and np.sum() --
    # that you might find useful, but you don't have to use them.
    # YOUR CODE GOES HERE
   
    # total sum of squares and residual sum of squares
    SST = ((data - np.mean(data))**2).sum()
    SSRes = ((predictions - data)**2).sum()
    r_squared = 1 - SSRes/SST

    return r_squared


Additional Considerations
- other types of linear regression
   - ordinary least squares regression
- parameter estimation: what are the confidence intervals of our parameters? What is the likelihood we would calculate this parameter value if the parameter had no effect?
- under/ over fitting
- multiple local minima: use different random initial thetas; seed random values for repeatability.

Assignment #3
t-test: is subway ridership higher on rainy days or on weekends?
linear regression with gradient descent: how many subway riders on a particular day and time? Features: time of day, day of week, rainy or non-rainy day, ...


Project 2 - Wrangle NYC Subway and Weather Data

Exercise 1 Number of Rainy Days

import pandas
import pandasql

def num_rainy_days(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return one column and
    one row - a count of the number of days in the dataframe where
    the rain column is equal to 1 (i.e., the number of days it
    rained).  The dataframe will be titled 'weather_data'. You'll
    need to provide the SQL query.  You might find SQL's count function
    useful for this exercise.  You can read more about it here:
   
    https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)
    q = """
    your query here
    """
   
    #Execute your SQL command against the pandas frame
    rainy_days = pandasql.sqldf(q.lower(), locals())
    return rainy_days
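
One plausible query (a sketch; pandasql runs SQLite under the hood, and the cast follows the docstring's advice):

    q = """
    SELECT count(*)
    FROM weather_data
    WHERE cast(rain as integer) = 1;
    """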

Exercise 2 Temp on Foggy and Non-Foggy Days

import pandas
import pandasql

def max_temp_aggregate_by_fog(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return two columns and
    two rows - whether it was foggy or not (0 or 1) and the max
    maxtempi for that fog value (i.e., the maximum max temperature
    for both foggy and non-foggy days).  The dataframe will be
    titled 'weather_data'. You'll need to provide the SQL query.
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)
    q = """
    your query here
    """
   
    #Execute your SQL command against the pandas frame
    foggy_days = pandasql.sqldf(q.lower(), locals())
    return foggy_days
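
One plausible query (a sketch):

    q = """
    SELECT fog, max(cast(maxtempi as integer))
    FROM weather_data
    GROUP BY fog;
    """
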
Exercise 3 Mean Temp on Weekends

import pandas
import pandasql
def avg_weekend_temperature(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return one column and
    one row - the average meantempi on days that are a Saturday
    or Sunday (i.e., the average mean temperature on weekends).
    The dataframe will be titled 'weather_data' and you can access
    the date in the dataframe via the 'date' column.
   
    You'll need to provide the SQL query.
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    Also, you can convert dates to days of the week via the 'strftime' keyword in SQL.
    For example, cast (strftime('%w', date) as integer) will return 0 if the date
    is a Sunday or 6 if the date is a Saturday.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)
    q = """
    your query here
    """
   
    #Execute your SQL command against the pandas frame
    mean_temp_weekends = pandasql.sqldf(q.lower(), locals())
    return mean_temp_weekends
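
One plausible query (a sketch; per the docstring, strftime('%w', date) yields 0 for Sunday and 6 for Saturday):

    q = """
    SELECT avg(cast(meantempi as integer))
    FROM weather_data
    WHERE cast(strftime('%w', date) as integer) IN (0, 6);
    """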

Exercise 4 Mean Temp on Rainy Days

import pandas
import pandasql
def avg_min_temperature(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data. More specifically you want to find the average
    minimum temperature on rainy days where the minimum temperature
    is greater than 55 degrees.
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)
    q = """
    your query here
    """
   
    #Execute your SQL command against the pandas frame
    avg_min_temp_rainy = pandasql.sqldf(q.lower(), locals())
    return avg_min_temp_rainy
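
One plausible query (a sketch; the mintempi column name is an assumption, by analogy with maxtempi and meantempi):

    q = """
    SELECT avg(cast(mintempi as integer))
    FROM weather_data
    WHERE cast(rain as integer) = 1 AND cast(mintempi as integer) > 55;
    """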

Exercise 5 Fixing NYC Subway Turnstile Data

import csv
def fix_turnstile_data(filenames):
    '''
    Filenames is a list of MTA Subway turnstile text files. A link to an example
    MTA Subway turnstile text file can be seen at the URL below:
    http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt
   
    As you can see, there are numerous data points included in each row of
    an MTA Subway turnstile text file.
    You want to write a function that will update each row in the text
    file so there is only one entry per row. A few examples below:
    A002,R051,02-00-00,05-28-11,00:00:00,REGULAR,003178521,001100739
    A002,R051,02-00-00,05-28-11,04:00:00,REGULAR,003178541,001100746
    A002,R051,02-00-00,05-28-11,08:00:00,REGULAR,003178559,001100775
   
    Write the updates to a different text file in the format of "updated_" + filename.
    For example:
        1) if you read in a text file called "turnstile_110521.txt"
        2) you should write the updated data to "updated_turnstile_110521.txt"
    The order of the fields should be preserved.
   
    You can see a sample of the turnstile text file that's passed into this function
    and the corresponding updated file in the links below:
   
    Sample input file:
    https://www.dropbox.com/s/mpin5zv4hgrx244/turnstile_110528.txt
    Sample updated file:
    https://www.dropbox.com/s/074xbgio4c39b7h/solution_turnstile_110528.txt
    '''
    for name in filenames:
        # your code here
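        # One possible completion (a sketch). It assumes each raw row starts
        # with the three identifier fields (C/A, UNIT, SCP) followed by
        # repeating five-field readings (DATEn, TIMEn, DESCn, ENTRIESn, EXITSn),
        # as in the sample file.
        with open(name, 'r') as f_in, open('updated_' + name, 'w') as f_out:
            reader = csv.reader(f_in)
            writer = csv.writer(f_out)
            for row in reader:
                for i in range(3, len(row), 5):
                    writer.writerow(row[0:3] + row[i:i+5])
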
Exercise 6 Combining NYC Subway Turnstile Data

def create_master_turnstile_file(filenames, output_file):
    '''
    Write a function that takes the files in the list filenames, which all have the
    columns 'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn', and consolidates
    them into one file located at output_file.  There should be ONE row with the column
    headers, located at the top of the file.
   
    For example, if file_1 has:
    'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
    line 1 ...
    line 2 ...
   
    and another file, file_2 has:
    'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
    line 3 ...
    line 4 ...
    line 5 ...
   
    We need to combine file_1 and file_2 into a master_file like below:
     'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
    line 1 ...
    line 2 ...
    line 3 ...
    line 4 ...
    line 5 ...
    '''
    with open(output_file, 'w') as master_file:
        master_file.write('C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
        for filename in filenames:
            # your code here
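            # One possible completion (a sketch). It assumes each input file
            # repeats the column header on its first line, so that line is
            # skipped; drop the readline() call if the inputs are headerless.
            with open(filename, 'r') as f:
                f.readline()
                for line in f:
                    master_file.write(line)
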
Exercise 7 Filtering "Non Regular" Data

import pandas
def filter_by_regular(filename):
    '''
    This function should read the csv file located at filename into a pandas dataframe,
    and filter the dataframe to only rows where the 'DESCn' column has the value 'REGULAR'.
   
    For example, if the pandas dataframe is as follows:
    ,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
    0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
    1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
    2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
    3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231
   
    The dataframe will look like below after filtering to only rows where DESCn column
    has the value 'REGULAR':
    0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
    2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
    '''
   
    turnstile_data = # your code here
    # more of your code here
    return turnstile_data
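
One possible completion (a sketch):

    turnstile_data = pandas.read_csv(filename)
    turnstile_data = turnstile_data[turnstile_data['DESCn'] == 'REGULAR']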

Exercise 8 Get Hourly Entries

import pandas
def get_hourly_entries(df):
    '''
    The data in the MTA Subway Turnstile data reports on the cumulative
    number of entries and exits per row.  Assume that you have a dataframe
    called df that contains only the rows for a particular turnstile machine
    (i.e., unique SCP, C/A, and UNIT).  This function should change
    these cumulative entry numbers to a count of entries since the last reading
    (i.e., entries since the last row in the dataframe).
   
    More specifically, you want to do two things:
       1) Create a new column called ENTRIESn_hourly
       2) Assign to the column the difference between ENTRIESn of the current row
          and the previous row. If there is any NaN, fill/replace it with 1.
   
    You may find the pandas functions shift() and fillna() to be helpful in this exercise.
   
    Examples of what your dataframe should look like at the end of this exercise:
   
           C/A  UNIT       SCP     DATEn     TIMEn    DESCn  ENTRIESn    EXITSn  ENTRIESn_hourly
    0     A002  R051  02-00-00  05-01-11  00:00:00  REGULAR   3144312   1088151                1
    1     A002  R051  02-00-00  05-01-11  04:00:00  REGULAR   3144335   1088159               23
    2     A002  R051  02-00-00  05-01-11  08:00:00  REGULAR   3144353   1088177               18
    3     A002  R051  02-00-00  05-01-11  12:00:00  REGULAR   3144424   1088231               71
    4     A002  R051  02-00-00  05-01-11  16:00:00  REGULAR   3144594   1088275              170
    5     A002  R051  02-00-00  05-01-11  20:00:00  REGULAR   3144808   1088317              214
    6     A002  R051  02-00-00  05-02-11  00:00:00  REGULAR   3144895   1088328               87
    7     A002  R051  02-00-00  05-02-11  04:00:00  REGULAR   3144905   1088331               10
    8     A002  R051  02-00-00  05-02-11  08:00:00  REGULAR   3144941   1088420               36
    9     A002  R051  02-00-00  05-02-11  12:00:00  REGULAR   3145094   1088753              153
    10    A002  R051  02-00-00  05-02-11  16:00:00  REGULAR   3145337   1088823              243
    ...
    ...
    '''
    #your code here
    return df
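
One possible completion of the body (a sketch; shift(1) pulls in the previous row's cumulative count):

    df['ENTRIESn_hourly'] = (df['ENTRIESn'] - df['ENTRIESn'].shift(1)).fillna(1)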

Exercise 9 Get Hourly Exits

import pandas
def get_hourly_exits(df):
    '''
    The data in the MTA Subway Turnstile data reports on the cumulative
    number of entries and exits per row.  Assume that you have a dataframe
    called df that contains only the rows for a particular turnstile machine
    (i.e., unique SCP, C/A, and UNIT).  This function should change
    these cumulative exit numbers to a count of exits since the last reading
    (i.e., exits since the last row in the dataframe).
   
    More specifically, you want to do two things:
       1) Create a new column called EXITSn_hourly
       2) Assign to the column the difference between EXITSn of the current row
          and the previous row. If there is any NaN, fill/replace it with 0.
   
    You may find the pandas functions shift() and fillna() to be helpful in this exercise.
   
    Example dataframe below:
          Unnamed: 0   C/A  UNIT       SCP     DATEn     TIMEn    DESCn  ENTRIESn    EXITSn  ENTRIESn_hourly  EXITSn_hourly
    0              0  A002  R051  02-00-00  05-01-11  00:00:00  REGULAR   3144312   1088151                0              0
    1              1  A002  R051  02-00-00  05-01-11  04:00:00  REGULAR   3144335   1088159               23              8
    2              2  A002  R051  02-00-00  05-01-11  08:00:00  REGULAR   3144353   1088177               18             18
    3              3  A002  R051  02-00-00  05-01-11  12:00:00  REGULAR   3144424   1088231               71             54
    4              4  A002  R051  02-00-00  05-01-11  16:00:00  REGULAR   3144594   1088275              170             44
    5              5  A002  R051  02-00-00  05-01-11  20:00:00  REGULAR   3144808   1088317              214             42
    6              6  A002  R051  02-00-00  05-02-11  00:00:00  REGULAR   3144895   1088328               87             11
    7              7  A002  R051  02-00-00  05-02-11  04:00:00  REGULAR   3144905   1088331               10              3
    8              8  A002  R051  02-00-00  05-02-11  08:00:00  REGULAR   3144941   1088420               36             89
    9              9  A002  R051  02-00-00  05-02-11  12:00:00  REGULAR   3145094   1088753              153            333
    '''
   
    #your code here
    return df
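
The body mirrors Exercise 8, with EXITSn and a fill value of 0 (a sketch):

    df['EXITSn_hourly'] = (df['EXITSn'] - df['EXITSn'].shift(1)).fillna(0)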

Exercise 10 Time to Hour

import pandas
def time_to_hour(time):
    '''
    Given an input variable time that represents time in the format of:
    00:00:00 (hour:minutes:seconds)
   
    Write a function to extract the hour part from the input variable time
    and return it as an integer. For example:
        1) if hour is 00, your code should return 0
        2) if hour is 01, your code should return 1
        3) if hour is 21, your code should return 21
       
    Please return hour as an integer.
    '''
   
    hour = # your code here
    return hour
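
One possible completion (a sketch):

    hour = int(time.split(':')[0])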

Exercise 11 Reformat Subway Dates

import datetime
def reformat_subway_dates(date):
    '''
    The dates in our subway data are formatted in the format month-day-year.
    The dates in our weather underground data are formatted year-month-day.
   
    In order to join these two data sets together, we'll want the dates formatted
    the same way.  Write a function that takes as its input a date in the MTA Subway
    data format, and returns a date in the weather underground format.
   
    Hint:
    There is a useful function in the datetime library called strptime.
    More info can be seen here:
    http://docs.python.org/2/library/datetime.html#datetime.datetime.strptime
    '''
    date_formatted = # your code here
    return date_formatted
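
One possible completion (a sketch, assuming two-digit years as in the MTA sample rows, e.g. '05-21-11'):

    date_formatted = datetime.datetime.strptime(date, '%m-%d-%y').strftime('%Y-%m-%d')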

Thursday, June 12, 2014

Lesson 2 Data Wrangling

Data Manipulation
- from files, databases, and Web APIs.

Dealing with messy data. What tools do you use?

Acquiring data
- Acquiring data is often not fancy.
- Find data on the internet.
- A lot of data stored in text files and on government websites.

Common data formats:
- CSV
- XML
- JSON(nested structure)

import pandas
def add_full_name(path_to_csv, path_to_new_csv):
    #Assume you will be reading in a csv file with the same columns that the
    #Lahman baseball data set has -- most importantly, there are columns
    #called 'nameFirst' and 'nameLast'.
    #1) Write a function that reads a csv
    #located at "path_to_csv" into a pandas dataframe and adds a new column
    #called 'nameFull' with a player's full name.
    #
    #For example:
    #   for Hank Aaron, nameFull would be 'Hank Aaron',
    #
    #2) Write the data in the pandas dataFrame to a new csv file located at
    #path_to_new_csv
    #WRITE YOUR CODE HERE
    baseballData = pandas.read_csv(path_to_csv)
    baseballData['nameFull'] = baseballData['nameFirst'] + ' ' + baseballData['nameLast']
    baseballData.to_csv(path_to_new_csv)

Good job! You updated the CSV file correctly.
The following is the output by your program (showing only the name related columns):

  0,Hank,Aaron,Hank Aaron
  1,Tommie,Aaron,Tommie Aaron
  2,Don,Aase,Don Aase
  3,Andy,Abad,Andy Abad
  ............

Relational Databases:
Why are relational databases useful?
- It's straightforward to extract aggregated data from complex files.
- Databases scale well.
- They ensure all data is consistently formatted.

Database Schema
schemas=blueprints

Simple Queries
Select * from aadhaar_data;
Select * from aadhaar_data limit 20;
Select district, subdistrict from aadhaar_data limit 20;

Write your own......

import pandas
import pandasql
def select_first_50(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename the columns
    # by replacing spaces with underscores and setting all characters to lowercase, so the
    # column names more closely resemble columns names one might find in a table.
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
    # Select out the first 50 values for "registrar" and "enrolment_agency"
    # in the aadhaar_data table using SQL syntax.
    #
    # Note that "enrolment_agency" is spelled with one l. Also, the order
    # of the select does matter. Make sure you select registrar then enrolment agency
    # in your query.
    q = """
    #YOUR QUERY HERE
   SELECT
   registrar, enrolment_agency
   FROM
   aadhaar_data
   LIMIT 50;
    """
    #Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution   


Complex Queries

Select * from aadhaar_data
where state='Gujarat';

Select district, sum(aadhaar_generated)
from aadhaar_data group by district;

Select district, subdistrict, sum(aadhaar_generated)
from aadhaar_data where age>60 group by district, subdistrict;

Write your own....

import pandas
import pandasql
def aggregate_query(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename the columns
    # by replacing spaces with underscores and setting all characters to lowercase, so the
    # column names more closely resemble columns names one might find in a table.
   
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
    # Write a query that will select from the aadhaar_data table how many men and how
    # many women over the age of 50 have had aadhaar generated for them in each district
    #
    # Note that in this quiz, the SQL query keywords are case sensitive.
    # For example, if you want to do a sum make sure you type 'sum' rather than 'SUM'.
    #
    # The possible columns to select from aadhaar data are:
    #     1) registrar
    #     2) enrolment_agency
    #     3) state
    #     4) district
    #     5) sub_district
    #     6) pin_code
    #     7) gender
    #     8) age
    #     9) aadhaar_generated
    #     10) enrolment_rejected
    #     11) residents_providing_email,
    #     12) residents_providing_mobile_number
    #
    # You can download a copy of the aadhaar data that we are passing
    # into this exercise below:
    # https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv
       
    q = """# your code here
    select gender, district, sum(aadhaar_generated)
    from
    aadhaar_data
    where
    age > 50
    group by
    gender, district;
    """

    # Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution   

APIs
Files, Databases, APIs
Application Programming Interface

How to Access API Efficiently

API Exercise:

import json
import requests
 
def api_get_request(url):
    # In this exercise, you want to call the last.fm API to get a list of the
    # top artists in Spain.
    #
    # Once you've done this, return the name of the number 1 top artist in Spain.
    url = 'http://ws.audioscrobbler.com/2.0/?method=geo.gettopartists&country=spain&api_key=0a5de916777ad83d6de29347308556a1&format=json'
    data = json.loads(requests.get(url).text)

    # return the top artist in Spain
    return data['topartists']['artist'][0]['name']
 
 
Sanity Checking Data
Does data make sense?
Is there a problem?
Does the data look like I expect it to?
 
baseball=pandas.read_csv('../data/master.csv')
baseball.describe()
Count, Mean, Standard Deviation, Minimum, 25%, 50%, 75%, Maximum
Missing Values - why are they missing?
1. Occasionally, system errors prevent data from being recorded.
2. Some subsets of subjects or event types are systematically missing certain data attributes, or are missing entirely.

Occasional failure, nonresponse

Dealing with missing data
- Partial Deletion
  (1) Listwise deletion
  (2) Pairwise deletion
- Imputation
   why impute?
    - Not much data
    - removing data could affect representativeness
    - many different imputation techniques
    - imputation is really hard to get right
- how to deal with it:
   (1) use the mean value: good - doesn't change the mean across the sample; bad - lessens correlations between variables.
   (2) impute using linear regression:
         data we have -> make an equation to predict the missing variable -> predict missing values
   Drawbacks: overemphasizes trends; exact values suggest too much certainty.

Imputation Exercise:

import pandas
import numpy
def imputation(filename):
    # Pandas dataframes have a method called 'fillna(value)', such that you can
    # pass in a single value to replace any NAs in a dataframe or series. You
    # can call it like this:
    #     dataframe['column'] = dataframe['column'].fillna(value)
    #
    # Using the numpy.mean function, which calculates the mean of a numpy
    # array, impute any missing values in our Lahman baseball
    # data sets 'weight' column by setting them equal to the average weight.
    #
    # You can access the 'weight' column in the baseball data frame by
    # calling baseball['weight']
    #YOUR CODE GOES HERE
    baseball = pandas.read_csv(filename)
    baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))
    print numpy.sum(baseball['weight']), numpy.mean(baseball['weight'])
   
    return baseball

Imputation:
- just the tip of the iceberg; more sophisticated methods exist;
- filling in the mean and imputing via linear regression are both simple and relatively effective;
- both have negative side effects and can obscure or amplify trends in the data.

Assignment #2
- Acquire weather data via weather underground API
- Get a sense of data using SQL queries
- Clean and process the data











Project 1 Titanic Data

Simple Heuristic

import numpy
import pandas
import statsmodels.api as sm
def simple_heuristic(file_path):
    '''
    In this exercise, we will perform some rudimentary practices similar to those of
    an actual data scientist.
   
    Part of a data scientist's job is to use her or his intuition and insight to
    write algorithms and heuristics. A data scientist also creates mathematical models
    to make predictions based on some attributes from the data that they are examining.
    We would like for you to take your knowledge and intuition about the Titanic
    and its passengers' attributes to predict whether each passenger survived
    or perished. You can read more about the Titanic and specifics about this dataset at:
    http://en.wikipedia.org/wiki/RMS_Titanic
    http://www.kaggle.com/c/titanic-gettingStarted
       
    In this exercise and the following ones, you are given a list of Titanic passengers
    and their associated information. More information about the data can be seen at the
    link below:
    http://www.kaggle.com/c/titanic-gettingStarted/data.
    For this exercise, you need to write a simple heuristic that will use
    the passengers' gender to predict if that person survived the Titanic disaster.
   
    Your prediction should be 78% accurate or higher.
       
    Here's a simple heuristic to start off:
       1) If the passenger is female, your heuristic should assume that the
       passenger survived.
       2) If the passenger is male, your heuristic should
       assume that the passenger did not survive.
   
    You can access the gender of a passenger via passenger['Sex'].
    If the passenger is male, passenger['Sex'] will return a string "male".
    If the passenger is female, passenger['Sex'] will return a string "female".
    Write your prediction back into the "predictions" dictionary. The
    key of the dictionary should be the passenger's id (which can be accessed
    via passenger["PassengerId"]) and the associated value should be 1 if the
    passenger survived or 0 otherwise.
    For example, if a passenger is predicted to have survived:
    passenger_id = passenger['PassengerId']
    predictions[passenger_id] = 1
    And if a passenger is predicted to have perished in the disaster:
    passenger_id = passenger['PassengerId']
    predictions[passenger_id] = 0
   
    You can also look at the Titanic data that you will be working with
    at the link below:
    https://www.dropbox.com/s/r5f9aos8p9ri9sa/titanic_data.csv
    '''
    predictions = {}
    df = pandas.read_csv(file_path)
    for passenger_index, passenger in df.iterrows():
        #
        # Your code here
        #
    return predictions
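
One possible completion of the loop body (a sketch of the stated heuristic):

    for passenger_index, passenger in df.iterrows():
        if passenger['Sex'] == 'female':
            predictions[passenger['PassengerId']] = 1
        else:
            predictions[passenger['PassengerId']] = 0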

Tuesday, June 10, 2014

Lesson 1 Introduction

What is a data scientist? What do you think data scientists do in their day-to-day work?
A: A data scientist's day-to-day job may vary from company to company and from industry to industry. There is really no right or wrong answer.

Hacking skills, Math & Statistics Knowledge, Substantive Expertise. Data Science sits at the intersection of these three fields.

Basic Data Scientist skills: What does it mean for a data scientist to have "substantive expertise," and why is it important?
A: It is important for data scientists to have substantive expertise so they know what questions to ask and investigate, as well as how to interpret the data under the right context.

How can we solve real world problems with data science?
A: Data science can solve problems you'll expect: Netflix, Social Media, Web Apps (OkCupid, Uber, etc.) And a ton more you might not expect: Bioinformatics, Urban Planning, Astrophysics, Public Health.

Pandas: R + Python = Pandas, 1. useful for manipulating data, 2. R-lite elements in Python.

Dataframes: name, age, fare, survived.
Create a new dataframe

Assignment: Titanic, predict who will survive?
Class Project: Analyzing subway ridership, weather underground data