Monday, June 23, 2014

Project 3 - Analyzing NYC Subway and Weather Data

Exercise 1 - Exploratory Data Analysis

import numpy as np
import pandas
import matplotlib.pyplot as plt
def entries_histogram(turnstile_weather):
    '''
    Before we perform any analysis, it might be useful to take a
    look at the data we're hoping to analyze. More specifically, let's
    examine the hourly entries in our NYC subway data and determine what
    distribution the data follows. This data is stored in a dataframe
    called turnstile_weather under the ['ENTRIESn_hourly'] column.
   
    Let's plot two histograms on the same axes to show hourly
    entries when raining vs. when not raining. Here's an example on how
    to plot histograms with pandas and matplotlib:
    turnstile_weather['column_to_graph'].hist()
   
    Your histogram may look similar to the bar graph in the instructor notes below.
   
    You can read a bit about using matplotlib and pandas to plot histograms here:
    http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
   
    You can see the information contained within the turnstile weather data here:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
   
    plt.figure()
    turnstile_weather['...'] # your code here to plot a histogram for hourly entries when it is raining
    turnstile_weather['...'] # your code here to plot a histogram for hourly entries when it is not raining
    return plt
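
One way the template above might be filled in is sketched below. It assumes the 'rain' column is a 0/1 flag (the column is used as a feature in Exercise 5); the bins parameter, the transparency setting, and the tiny sample frame are my own additions for illustration and are not part of the exercise.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas


def entries_histogram(turnstile_weather, bins=50):
    """Plot overlaid histograms of hourly entries for rainy and dry hours."""
    # Assumption: 'rain' is 1 when it rained and 0 otherwise.
    rainy = turnstile_weather.loc[turnstile_weather['rain'] == 1, 'ENTRIESn_hourly']
    dry = turnstile_weather.loc[turnstile_weather['rain'] == 0, 'ENTRIESn_hourly']

    plt.figure()
    # Both calls draw on the current axes, so the histograms overlap.
    dry.hist(bins=bins, alpha=0.6, label='No rain')
    rainy.hist(bins=bins, alpha=0.6, label='Rain')
    plt.xlabel('ENTRIESn_hourly')
    plt.ylabel('Frequency')
    plt.legend()
    return plt


# Made-up rows, only to show the call shape (not real turnstile data).
sample = pandas.DataFrame({
    'rain': [0, 0, 0, 1, 1],
    'ENTRIESn_hourly': [0.0, 120.0, 950.0, 30.0, 1400.0],
})
figure_module = entries_histogram(sample, bins=5)
```

On the real data the two groups differ a lot in size, so semi-transparent bars (alpha) help keep both visible.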


Exercise 2 - Welch's t-test?

Does the entries data from the previous exercise seem normally distributed? No.

Can we run Welch's t-test on the entries data? Yes.

Correct. Good job!
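
For reference, here is roughly what running Welch's t-test could look like with SciPy. This is only a sketch: the two samples are synthetic, right-skewed stand-ins generated for illustration, not the turnstile data, and the variable names are my own.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
# Synthetic right-skewed samples standing in for hourly entries (not real data).
with_rain = rng.exponential(scale=1100.0, size=500)
without_rain = rng.exponential(scale=1000.0, size=800)

# equal_var=False selects Welch's variant, which does not assume equal variances.
t_stat, p_value = scipy.stats.ttest_ind(with_rain, without_rain, equal_var=False)
print(t_stat, p_value)
```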


Exercise 3 - Mann-Whitney U Test

import numpy as np
import scipy
import scipy.stats
import pandas
def mann_whitney_plus_means(turnstile_weather):
    '''
    This function will consume the turnstile_weather dataframe containing
    our final turnstile weather data.
   
    You will want to take the means and run the Mann Whitney U test on the
    ENTRIESn_hourly column in the turnstile_weather dataframe.
   
    This function should return:
        1) the mean of entries with rain
        2) the mean of entries without rain
        3) the Mann-Whitney U statistic and p-value comparing number of entries
           with rain and the number of entries without rain. 
   
    You should feel free to use scipy's Mann-Whitney implementation, and
    also might find it useful to use numpy's mean function. 
   
    Here is some documentation:
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
   
    You can look at the final turnstile weather data at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
   
   
    return with_rain_mean, without_rain_mean, U, p
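
A possible way to fill in the body is sketched below, again assuming 'rain' is a 0/1 flag. Note that the default sidedness of the p-value returned by scipy.stats.mannwhitneyu depends on the SciPy version (two-sided by default since 1.7), so check which one you are getting. The sample frame is made up.

```python
import numpy as np
import pandas
import scipy.stats


def mann_whitney_plus_means(turnstile_weather):
    """Return the rainy mean, the dry mean, and the Mann-Whitney U statistic and p-value."""
    # Assumption: 'rain' is 1 when it rained and 0 otherwise.
    rainy = turnstile_weather.loc[turnstile_weather['rain'] == 1, 'ENTRIESn_hourly']
    dry = turnstile_weather.loc[turnstile_weather['rain'] == 0, 'ENTRIESn_hourly']

    with_rain_mean = np.mean(rainy)
    without_rain_mean = np.mean(dry)
    # The default 'alternative' differs between SciPy versions.
    U, p = scipy.stats.mannwhitneyu(rainy, dry)
    return with_rain_mean, without_rain_mean, U, p


# Made-up rows, only to show the call shape (not real turnstile data).
sample = pandas.DataFrame({
    'rain': [0, 0, 0, 1, 1],
    'ENTRIESn_hourly': [0.0, 120.0, 950.0, 30.0, 1400.0],
})
with_rain_mean, without_rain_mean, U, p = mann_whitney_plus_means(sample)
print(with_rain_mean, without_rain_mean, U, p)
```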


Exercise 4 - Rainy Day Ridership vs. Non-Rainy Day Ridership

Is the distribution of the number of entries statistically different between rainy and non-rainy days? Yes.

Describe your results and methods. Rainy days have more ridership.

Correct. Good job!


Exercise 5 - Linear Regression

import numpy as np
import pandas
def normalize_features(array):
    """
    Normalize the features in our data set.
    """
    mu = array.mean()
    sigma = array.std()
    array_normalized = (array - mu) / sigma
    return array_normalized, mu, sigma
def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, and the values for our thetas.
   
    This should be the same code as the compute_cost function in the lesson #3 exercises. But
    feel free to implement your own.
    """
   
    # your code here
    return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
   
    This is the same gradient descent code as in the lesson #3 exercises. But feel free
    to implement your own.
    """
    m = len(values)
    cost_history = []
    for i in range(num_iterations):
        # your code here
    return theta, pandas.Series(cost_history)
def predictions(dataframe):
    '''
    The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
    Using the information stored in the dataframe, let's predict the ridership of
    the NYC subway using linear regression with gradient descent.
   
    You can look at information contained in the turnstile weather dataframe
    at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv   
   
    Your prediction should have an R^2 value of .40 or better.
   
    Note: due to the memory and CPU limitations of our Amazon EC2 instance, we will
    give you a random subset (~15%) of the data contained in turnstile_data_master_with_weather.csv
   
    If you receive a "server has encountered an error" message, that means you are hitting
    the 30-second limit that's placed on running your program. Try using a smaller number
    for num_iterations if that's the case.
   
    Or if you are using your own algorithm/models, see if you can optimize your code so it
    runs faster.
    '''
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = dataframe[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
    values = dataframe[['ENTRIESn_hourly']]
    m = len(values)
    features, mu, sigma = normalize_features(features)
    features['ones'] = np.ones(m)
    features_array = np.array(features)
    values_array = np.array(values).flatten()
    #Set values for alpha, number of iterations.
    alpha = 0.1 # please feel free to play with this value
    num_iterations = 75 # please feel free to play with this value
    #Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array, values_array, theta_gradient_descent,
                                                            alpha, num_iterations)
    predictions = np.dot(features_array, theta_gradient_descent)
    return predictions
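
The two "your code here" pieces above could be filled in along the following lines. This is a sketch that assumes the usual half mean squared error cost, J = sum((X·theta - y)^2) / (2m), and a plain batch update; the toy data (y = 1 + 2x) is made up to show that the loop converges.

```python
import numpy as np
import pandas


def compute_cost(features, values, theta):
    """Half mean squared error of the linear model (assumed cost definition)."""
    m = len(values)
    errors = np.dot(features, theta) - values
    return np.sum(errors ** 2) / (2 * m)


def gradient_descent(features, values, theta, alpha, num_iterations):
    """Batch gradient descent; returns the final theta and the cost after each step."""
    m = len(values)
    cost_history = []
    for _ in range(num_iterations):
        errors = np.dot(features, theta) - values
        # Step against the gradient of the cost with respect to theta.
        theta = theta - (alpha / m) * np.dot(features.T, errors)
        cost_history.append(compute_cost(features, values, theta))
    return theta, pandas.Series(cost_history)


# Toy data following y = 1 + 2x, with a trailing column of ones as the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
features = np.column_stack([x, np.ones_like(x)])
values = 1.0 + 2.0 * x
theta, cost_history = gradient_descent(features, values, np.zeros(2),
                                       alpha=0.1, num_iterations=2000)
print(theta)
```

If the cost history goes up instead of down, alpha is probably too large for the scale of the features, which is one reason the template normalizes them first.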


Exercise 6 - Plotting Residuals

import numpy as np
import scipy
import matplotlib.pyplot as plt
def plot_residuals(turnstile_weather, predictions):
    '''
    Using the same methods that we used to plot a histogram of entries
    per hour for our data, why don't you make a histogram of the residuals
    (that is, the difference between the original hourly entry data and the predicted values).
    Based on this residual histogram, do you have any insight into how our model
    performed?  Reading a bit on this webpage might be useful:
    http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
    '''
   
    plt.figure()
    (turnstile_weather['''choose the right column here'''] - predictions).hist()
    return plt
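
A sketch of the completed function, taking 'ENTRIESn_hourly' as the "original hourly entry data" the docstring refers to. The bins parameter, the axis labels, and the sample values are my own additions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas


def plot_residuals(turnstile_weather, predictions, bins=50):
    """Histogram of residuals: actual hourly entries minus predicted values."""
    residuals = turnstile_weather['ENTRIESn_hourly'] - predictions
    plt.figure()
    residuals.hist(bins=bins)
    plt.xlabel('Residual (actual - predicted hourly entries)')
    plt.ylabel('Frequency')
    return plt


# Made-up values, only to show the call shape (not real model output).
sample = pandas.DataFrame({'ENTRIESn_hourly': [100.0, 250.0, 900.0, 40.0]})
sample_predictions = np.array([120.0, 200.0, 700.0, 90.0])
result = plot_residuals(sample, sample_predictions, bins=4)
```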


Exercise 7 - Compute R Squared

import numpy as np
import scipy
import matplotlib.pyplot as plt
import sys
def compute_r_squared(data, predictions):
    '''
    In exercise 5, we calculated the R^2 value for you. But why don't you try to
    calculate the R^2 value yourself?
   
    Given a list of original data points, and also a list of predicted data points,
    write a function that will compute and return the coefficient of determination (R^2)
    for this data.  numpy.mean() and numpy.sum() might both be useful here, but
    not necessary.
    Documentation about numpy.mean() and numpy.sum() below:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
    '''
   
    # your code here
   
    return r_squared
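
One way to write the body, using the standard definition R^2 = 1 - sum((y - f)^2) / sum((y - mean(y))^2), where y is the data and f the predictions. The conversion to float arrays and the example values are my own additions.

```python
import numpy as np


def compute_r_squared(data, predictions):
    """Coefficient of determination for observed data and predicted values."""
    data = np.asarray(data, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    ss_res = np.sum((data - predictions) ** 2)    # residual sum of squares
    ss_tot = np.sum((data - np.mean(data)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


perfect = compute_r_squared([1, 2, 3, 4], [1, 2, 3, 4])            # 1.0
mean_only = compute_r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5])  # 0.0
partial = compute_r_squared([1, 2, 3], [1, 2, 4])                  # 0.5
print(perfect, mean_only, partial)
```

A model that always predicts the mean scores 0, and a model that does worse than that scores below 0.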


Exercise 8 - Non-Gradient Descent Linear Regression

import numpy as np
import pandas
import scipy
import statsmodels
"""
In this optional exercise, you should complete the function called
predictions(turnstile_weather). This function takes in our pandas
turnstile weather dataframe, and returns a set of predicted ridership values,
based on the other information in the dataframe. 
You should attempt to implement another type of linear regression,
that you may have read about, such as ordinary least squares regression:
http://en.wikipedia.org/wiki/Ordinary_least_squares
This is your playground. Go wild!
How does your choice of linear regression compare to linear regression
with gradient descent?
You can look at the information contained in the turnstile_weather dataframe below:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
Note: due to the memory and CPU limitations of our Amazon EC2 instance, we will
give you a random subset (~15%) of the data contained in turnstile_data_master_with_weather.csv
If you receive a "server has encountered an error" message, that means you are hitting
the 30-second limit that's placed on running your program. See if you can optimize your code so it
runs faster.
"""
def predictions(weather_turnstile):
    #
    # Your implementation goes here. Feel free to write additional
    # helper functions
    #
    return prediction
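
As one possible take on this optional exercise, the sketch below fits ordinary least squares with numpy.linalg.lstsq. That is a swap on my part: the template imports statsmodels, which would also work. The feature set is copied from Exercise 5, and the sample rows are made up so that the entries are an exact linear function of Hour and UNIT.

```python
import numpy as np
import pandas


def predictions(weather_turnstile):
    """Ordinary least squares fit via numpy.linalg.lstsq; returns fitted values."""
    dummy_units = pandas.get_dummies(weather_turnstile['UNIT'], prefix='unit')
    features = weather_turnstile[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
    # Add an intercept column. With a full set of dummies the columns are
    # collinear, but lstsq still returns a (minimum-norm) least-squares solution.
    features_array = np.column_stack([features.astype(float).to_numpy(),
                                      np.ones(len(features))])
    values_array = weather_turnstile['ENTRIESn_hourly'].to_numpy(dtype=float)
    theta, _, _, _ = np.linalg.lstsq(features_array, values_array, rcond=None)
    return np.dot(features_array, theta)


# Made-up rows where entries = 100 * Hour + 50 for unit R002 (not real data).
sample = pandas.DataFrame({
    'UNIT': ['R001', 'R001', 'R002', 'R002', 'R001', 'R002'],
    'rain': [0, 1, 0, 1, 0, 1],
    'precipi': [0.0, 0.2, 0.0, 0.1, 0.0, 0.3],
    'Hour': [0, 4, 8, 12, 16, 20],
    'meantempi': [60, 61, 62, 63, 64, 65],
    'ENTRIESn_hourly': [0.0, 400.0, 850.0, 1250.0, 1600.0, 2050.0],
})
predicted = predictions(sample)
print(predicted)
```

Compared with gradient descent, there is no learning rate or iteration count to tune, and no feature normalization is needed for the fit itself.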





