Friday, June 20, 2014

Lesson 3 Data Analysis

Statistical Rigor
    Significance tests
    - using our data, can we disprove an assumption with a pre-defined level of confidence?

Why are statistics useful?
- They provide a formalized framework for comparing and evaluating data.
- They enable us to evaluate whether perceived effects in our dataset reflect differences across the whole population.

Statistical Significance Tests
- many tests make assumptions about the data's distribution.
- a very common distribution: the normal distribution (aka Gaussian distribution, bell curve)

t-test
accept or reject a null hypothesis.
Null hypothesis: a statement we are trying to disprove by running our test.
- two samples came from the same population
- a sample is drawn from a probability distribution
specified in terms of a test statistic

Test Statistic: one number that helps us accept or reject the null hypothesis
for the t-test, the test statistic is t

A few different versions depending on assumptions
- equal sample size?
- same variance?

Two sample test

calculate t, calculate nu (the degrees of freedom), calculate p;
p value: the probability of obtaining a test statistic at least as extreme as ours if the null hypothesis were true.
set p critical; if p < p critical, reject the null hypothesis,
else we cannot reject the null hypothesis.
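
The three steps above (t, nu, p) can be sketched by hand for Welch's version of the test, which does not assume equal variances or equal sample sizes. This is only an illustrative sketch: the function name welch_t_test and the sample numbers are made up for this note, and the degrees of freedom use the standard Welch-Satterthwaite approximation.

```python
import math

import numpy as np
import scipy.stats


def welch_t_test(sample_1, sample_2):
    """Return (t, nu, p) for a two-sided Welch's t-test (hypothetical helper)."""
    a = np.asarray(sample_1, dtype=float)
    b = np.asarray(sample_2, dtype=float)
    n1, n2 = len(a), len(b)
    # Sample variance (ddof=1) of each group divided by its sample size.
    v1 = a.var(ddof=1) / n1
    v2 = b.var(ddof=1) / n2
    # Test statistic: difference of means over its estimated standard error.
    t = (a.mean() - b.mean()) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # Two-sided p value from Student's t distribution with nu degrees of freedom.
    p = 2 * scipy.stats.t.sf(abs(t), nu)
    return t, nu, p


# Made-up measurements, purely for illustration.
sample_1 = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
sample_2 = [4.2, 4.8, 4.4, 5.0, 4.1, 4.6, 4.3]

t, nu, p = welch_t_test(sample_1, sample_2)
p_critical = 0.05
print(t, nu, p)
print("reject null" if p < p_critical else "cannot reject null")
```

The t and p values should agree with scipy.stats.ttest_ind called with equal_var=False on the same two lists.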

Exercise: Calculate t and Nu

Welch's t-test in Python
Is there a simple way to do this in Python?

import scipy.stats
scipy.stats.ttest_ind(list_1, list_2, equal_var=False)

scipy.stats.ttest_ind assumes a two-sided test. How could we use the output to instead perform a one-sided test?
greater than: p/2 < p critical and t > 0
less than: p/2 < p critical and t < 0
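
A minimal sketch of that one-sided rule, assuming we only have the two-sided output of scipy.stats.ttest_ind. The helper name one_sided_welch and the sample values are hypothetical.

```python
import scipy.stats


def one_sided_welch(sample_1, sample_2, p_critical=0.05):
    """Return (greater, less) decisions derived from the two-sided result."""
    t, p_two_sided = scipy.stats.ttest_ind(sample_1, sample_2, equal_var=False)
    # Halve the two-sided p value and use the sign of t to pick the direction.
    greater = bool(p_two_sided / 2 < p_critical and t > 0)
    less = bool(p_two_sided / 2 < p_critical and t < 0)
    return greater, less


# Made-up measurements, purely for illustration.
sample_1 = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
sample_2 = [4.2, 4.8, 4.4, 5.0, 4.1, 4.6, 4.3]

greater, less = one_sided_welch(sample_1, sample_2)
print(greater, less)
```

Swapping the two samples flips the sign of t, so the "greater" and "less" decisions swap as well.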


Exercise Welch's t-test

import numpy
import scipy.stats
import pandas
def compare_averages(filename):
    """
    Performs a t-test on two sets of baseball data (left-handed and right-handed hitters).
    You will be given a csv file that has three columns.  A player's
    name, handedness (L for lefthanded or R for righthanded) and their
    career batting average (called 'avg'). You can look at the csv
    file via the following link:
    https://www.dropbox.com/s/xcn0u2uxm8c4n6l/baseball_data.csv
   
    Write a function that will read the csv file into a pandas data frame,
    and run Welch's t-test on the two cohorts defined by handedness.
   
    One cohort should be a data frame of right-handed batters. And the other
    cohort should be a data frame of left-handed batters.
   
    We have included the scipy.stats library to help you write
    or implement Welch's t-test:
    http://docs.scipy.org/doc/scipy/reference/stats.html
   
    With a significance level of 95%, if there is no difference
    between the two cohorts, return a tuple consisting of
    True, and then the tuple returned by scipy.stats.ttest. 
   
    If there is a difference, return a tuple consisting of
    False, and then the tuple returned by scipy.stats.ttest.
   
    For example, the tuple that you return may look like:
    (True, (9.93570222, 0.000023))
    """

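One possible way to fill in this exercise is sketched below; it is not the official course solution. It assumes the handedness column is literally named 'handedness' (the docstring only confirms 'avg'), and it uses a tiny made-up table in place of the real csv file so that it runs on its own.

```python
import io

import pandas
import scipy.stats


def compare_averages_sketch(filename, p_critical=0.05):
    """Welch's t-test on left- vs right-handed batting averages (a sketch)."""
    baseball = pandas.read_csv(filename)
    # Assumed column names: 'handedness' ('L' or 'R') and 'avg'.
    left = baseball[baseball['handedness'] == 'L']
    right = baseball[baseball['handedness'] == 'R']
    t, p = scipy.stats.ttest_ind(left['avg'], right['avg'], equal_var=False)
    # First element is True when we cannot reject "no difference".
    return bool(p > p_critical), (float(t), float(p))


# Made-up rows standing in for baseball_data.csv.
csv_text = io.StringIO(
    "name,handedness,avg\n"
    "A,L,0.300\nB,L,0.310\nC,L,0.295\nD,L,0.305\n"
    "E,R,0.250\nF,R,0.240\nG,R,0.255\nH,R,0.245\n"
)
result = compare_averages_sketch(csv_text)
print(result)
```

pandas.read_csv accepts either a path or a file-like object, so the same function should work when given the downloaded csv file's name.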
Non Parametric Tests
A statistical test that does not assume our data is drawn from any particular underlying probability distribution.

Mann-Whitney U test: tests the null hypothesis that the two populations are the same.
u, p = scipy.stats.mannwhitneyu(x, y)
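
A small runnable illustration of the call above, on made-up samples. The alternative='two-sided' argument is passed explicitly here because the default has differed between SciPy versions.

```python
import scipy.stats

# Made-up samples: every value in x is smaller than every value in y.
x = [1.1, 2.3, 2.9, 3.8, 4.0, 4.4, 5.2]
y = [6.1, 6.8, 7.5, 8.0, 8.3, 9.1, 9.9]

u, p = scipy.stats.mannwhitneyu(x, y, alternative='two-sided')
p_critical = 0.05
print(u, p)
print("reject null" if p < p_critical else "cannot reject null")
```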

Non-normal Data
Shapiro-Wilk test: tests the null hypothesis that the data was drawn from a normal distribution.
w, p = scipy.stats.shapiro(data)
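
As a rough illustration, the Shapiro-Wilk test can be run on one roughly normal sample and one clearly skewed sample. Both samples are synthetic and the seed is arbitrary; a small p suggests the data is unlikely to be normal.

```python
import numpy as np
import scipy.stats

# Synthetic data with a fixed seed so the run is repeatable.
rng = np.random.RandomState(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_data = rng.exponential(scale=1.0, size=500)

w_normal, p_normal = scipy.stats.shapiro(normal_data)
w_skewed, p_skewed = scipy.stats.shapiro(skewed_data)

print("normal sample:", w_normal, p_normal)
print("skewed sample:", w_skewed, p_skewed)
```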

Just the tip of the iceberg
Many different statistical methods!
Data scientists have large toolkits.

Predicting Future Data
How might we use the data we've collected to make predictions about the data we don't have?

Machine Learning: a branch of artificial intelligence focused on constructing systems that learn from large amounts of data to make predictions.

Why is machine learning useful?

Statistics vs. Machine Learning
What is the difference between statistics and machine learning?
- Statistics is focused on analyzing existing data, and drawing valid conclusions.
- Machine learning is focused on making predictions.

Types of Machine Learning
Data -> Model -> Predictions
Supervised Learning: e.g., detecting spam emails
- Have examples with input and output;
- Predict output for future, input-only data;
- Classification;
- Regression.

Unsupervised Learning:
- Trying to understand structure of data
- Clustering

Kurt's favorite machine learning algorithms: clustering and PCA (principal component analysis).

Predicting HR: an example of supervised learning
- can we write an equation that takes a bunch of info (e.g., height, weight, birth year, position, for baseball players) and predicts HR (home runs)? Yes, using regression.

Linear Regression with Gradient Descent
If we have age, height, weight, and batting average, and are trying to predict batting average, what are the input variables? Age, height, and weight.

Gradient Descent
Cost function: J(theta), we want to minimize this!

How: alpha, the learning rate, sets how big a step we take down the slope of the cost function on each update.
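
For reference, the squared-error cost and the update rule can be written out as below. The notation is an assumption chosen to match the Python in these notes: m is the number of examples, h_theta(x) is the linear prediction, and alpha is the learning rate.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,
\qquad h_\theta(x) = \theta^{T} x

\theta_j \leftarrow \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m}
    \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```

Every theta_j is updated at the same time, using predictions made with the previous theta.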

Gradient Descent in Python

import numpy
import pandas
def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, and values for our thetas.
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)
    return cost


def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """
    # Write some code here that updates the values of theta a number of times equal to
    # num_iterations.  Everytime you have computed the cost for a given set of thetas,
    # you should append it to cost_history.  The function should return both the final
    # values of theta and the cost history.
    # YOUR CODE GOES HERE
    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        # Predictions under the current theta, then one step against the gradient.
        predicted_values = numpy.dot(features, theta)
        theta = theta - alpha / m * numpy.dot((predicted_values - values), features)

        # Record the cost after this update.
        cost = compute_cost(features, values, theta)
        cost_history.append(cost)

    return theta, pandas.Series(cost_history)

Good job! Your program and algorithm worked

Theta =
[ 45.35759233  -9.02442042  13.69229668]

Cost History =
0     3748.133469
1     3727.492258
2     3707.261946
3     3687.434249
4     3668.001052
5     3648.954405
6     3630.286519
7     3611.989767
8     3594.056675
9     3576.479921
10    3559.252334
11    3542.366888
12    3525.816700
13    3509.595027
14    3493.695263
...
985    2686.730820
986    2686.730290
987    2686.729764
988    2686.729240
989    2686.728720
990    2686.728203
991    2686.727690
992    2686.727179
993    2686.726672
994    2686.726168
995    2686.725668
996    2686.725170
997    2686.724676
998    2686.724185
999    2686.723697
Length: 1000, dtype: float64
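
To try the same update rule outside the course grader, here is a self-contained sketch on synthetic data. The data (an exact line y = 2 + 3x), the learning rate, and the iteration count are all made up for illustration; the two functions restate the ones above in compact form, with a plain list for the cost history.

```python
import numpy as np


def compute_cost(features, values, theta):
    # Half the mean squared error of the linear predictions.
    m = len(values)
    return np.square(np.dot(features, theta) - values).sum() / (2 * m)


def gradient_descent(features, values, theta, alpha, num_iterations):
    m = len(values)
    cost_history = []
    for _ in range(num_iterations):
        predicted = np.dot(features, theta)
        # Step against the gradient of the cost, scaled by alpha.
        theta = theta - alpha / m * np.dot(predicted - values, features)
        cost_history.append(compute_cost(features, values, theta))
    return theta, cost_history


# Synthetic, noise-free data: y = 2 + 3x, with a column of ones for the intercept.
x = np.linspace(-1.0, 1.0, 50)
features = np.column_stack([np.ones_like(x), x])
values = 2.0 + 3.0 * x

theta, cost_history = gradient_descent(
    features, values, np.zeros(2), alpha=0.1, num_iterations=2000)
print(theta)
print(cost_history[0], cost_history[-1])
```

With these settings theta should end up very close to [2, 3], and the cost should shrink toward zero because the data has no noise.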


Coefficient of Determination: R squared
data: y1, ..., yn
predictions: f1, ..., fn
average of data = ybar

R^2 (R squared) = 1 - sum_i (yi - fi)^2 / sum_i (yi - ybar)^2

Calculating R Squared

import numpy as np
def compute_r_squared(data, predictions):
    # Write a function that, given two input numpy arrays, 'data', and 'predictions,'
    # returns the coefficient of determination, R^2, for the model that produced
    # predictions.
    #
    # Numpy has a couple of functions -- np.mean() and np.sum() --
    # that you might find useful, but you don't have to use them.
    # YOUR CODE GOES HERE
   
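    # Total sum of squares: spread of the data around its mean.
    SST=((data-np.mean(data))**2).sum()
    # Residual sum of squares: squared error of the predictions.
    SSRes=((predictions-data)**2).sum()
    r_squared=1-SSRes/SST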

    return r_squared
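
A quick sanity check for an R squared calculation, using made-up numbers: perfect predictions should score 1, and always predicting the mean of the data should score 0. The function name r_squared is just for this sketch.

```python
import numpy as np


def r_squared(data, predictions):
    # Total sum of squares: spread of the data around its mean.
    ss_total = ((data - np.mean(data)) ** 2).sum()
    # Residual sum of squares: squared error of the predictions.
    ss_residual = ((data - predictions) ** 2).sum()
    return 1 - ss_residual / ss_total


data = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(data, data)                               # predictions equal the data
baseline = r_squared(data, np.full_like(data, data.mean()))   # always predict the mean
print(perfect, baseline)
```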


Additional Considerations
- other types of linear regression
   - ordinary least squares regression
- parameter estimation: what are the confidence intervals of our parameters? What is the likelihood we would calculate this parameter value if the parameter had no effect?
- under/ over fitting
- multiple local minima: use different random initial thetas; seed random values for repeatability.
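
As one illustration of the ordinary least squares item above, NumPy can solve the least squares problem directly instead of iterating. This is only a sketch on made-up, noise-free data (y = 2 + 3x), not part of the lesson's code.

```python
import numpy as np

# Synthetic data: y = 2 + 3x, with a column of ones for the intercept.
x = np.linspace(-1.0, 1.0, 50)
features = np.column_stack([np.ones_like(x), x])
values = 2.0 + 3.0 * x

# lstsq returns the least squares solution plus some diagnostics.
theta_ols, residuals, rank, singular_values = np.linalg.lstsq(
    features, values, rcond=None)
print(theta_ols)
```

On this data the solution should be very close to [2, 3], which is also where gradient descent would be heading.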

Assignment #3
t-test: is subway ridership higher on rainy days or on weekends?
linear regression with gradient descent: how many subway riders are there on a particular day and time? Inputs: time of day, day of week, rainy or not rainy day, ...

