Statistical Rigor
Significance tests
- using our data, can we disprove an assumption with a pre-defined level of confidence?
Why are statistics useful?
- They provide a formalized framework for comparing and evaluating data.
- They enable us to evaluate whether perceived effects in our dataset reflect differences across the whole population.
Statistical Significance Tests
- many tests make assumptions about the distribution of the data.
- a very common distribution: the normal distribution (aka Gaussian distribution, bell curve)
t-test
accept or reject a null hypothesis.
Null hypothesis: a statement we are trying to disprove by running our test.
- two samples came from the same population
- a sample is drawn from a probability distribution
specified in terms of a test statistic
Test Statistic: one number that helps accept or reject the null hypothesis
t-test: the test statistic is t
A few different versions depending on assumptions
- equal sample size?
- same variance?
Two-sample test
calculate t, calculate nu (the degrees of freedom), calculate p.
p-value: the probability of obtaining a test statistic at least as extreme as ours if the null hypothesis were true.
set p critical; if p < p critical, reject the null hypothesis,
else we cannot reject the null hypothesis.
Exercise: Calculate t and Nu
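A minimal sketch of the calculation, assuming the version meant here is Welch's t-test (unequal variances, possibly unequal sample sizes). The function name welch_t_and_nu and the sample numbers are made up for illustration. It uses only the standard library and stops at t and nu; turning them into a p-value needs the t distribution, which is not implemented here.

```python
import math
from statistics import mean, variance  # variance() is the sample variance (divides by n - 1)


def welch_t_and_nu(sample_1, sample_2):
    """Return Welch's t statistic and the approximate degrees of freedom nu."""
    n1, n2 = len(sample_1), len(sample_2)
    # Squared standard error of each sample mean.
    se1 = variance(sample_1) / n1
    se2 = variance(sample_2) / n2
    t = (mean(sample_1) - mean(sample_2)) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    nu = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, nu


# Made-up numbers, purely for illustration.
t, nu = welch_t_and_nu([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print(t, nu)
```

A negative t here only says that the first sample's mean is below the second's; whether the gap is significant depends on the p-value.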
Welch's t-test in Python
Is there a simple way to do this in Python?
import scipy.stats
scipy.stats.ttest_ind(list_1, list_2, equal_var=False)
scipy.stats.ttest_ind assumes a two-sided test. How could we use the output to instead perform a one-sided test?
greater than: p/2 < p critical and t > 0
less than: p/2 < p critical and t < 0
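The one-sided rule can be sketched as a small helper. The name one_sided_reject and its parameters are hypothetical; t and p_two_sided stand for the two values returned by a two-sided test, and the numbers in the calls are invented.

```python
def one_sided_reject(t, p_two_sided, p_critical=0.05, alternative="greater"):
    """Return True if the one-sided test rejects the null hypothesis."""
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    # The sign of t must point in the direction of the alternative.
    right_direction = t > 0 if alternative == "greater" else t < 0
    # Halving the two-sided p-value keeps only one tail of the distribution.
    return right_direction and (p_two_sided / 2) < p_critical


print(one_sided_reject(2.1, 0.06))                      # True: 0.03 < 0.05 and t > 0
print(one_sided_reject(2.1, 0.06, alternative="less"))  # False: t has the wrong sign
```

Note that the first call rejects the null hypothesis even though the two-sided p-value (0.06) would not, which is why the direction has to be chosen before looking at the data. Recent SciPy releases also accept an alternative argument on ttest_ind, which avoids the manual halving.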
Exercise Welch's t-test
import numpy
import scipy.stats
import pandas

def compare_averages(filename):
    """
    Performs a t-test on two sets of baseball data (left-handed and right-handed hitters).

    You will be given a csv file that has three columns: a player's
    name, handedness (L for lefthanded or R for righthanded) and their
    career batting average (called 'avg'). You can look at the csv
    file via the following link:
    https://www.dropbox.com/s/xcn0u2uxm8c4n6l/baseball_data.csv

    Write a function that will read the csv file into a pandas data frame,
    and run Welch's t-test on the two cohorts defined by handedness.
    One cohort should be a data frame of right-handed batters, and the other
    cohort should be a data frame of left-handed batters.

    We have included the scipy.stats library to help you write
    or implement Welch's t-test:
    http://docs.scipy.org/doc/scipy/reference/stats.html

    With a confidence level of 95% (p critical = 0.05), if there is no difference
    between the two cohorts, return a tuple consisting of
    True, and then the tuple returned by scipy.stats.ttest_ind.
    If there is a difference, return a tuple consisting of
    False, and then the tuple returned by scipy.stats.ttest_ind.

    For example, the tuple that you return may look like:
    (True, (9.93570222, 0.000023))
    """
Non-Parametric Tests
A statistical test that does not assume our data is drawn from any particular underlying probability distribution.
Mann-Whitney U test: tests null hypothesis that the two populations are the same.
u, p = scipy.stats.mannwhitneyu(x, y)
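To show what the U statistic measures, here is a dependency-free sketch that counts, over all pairs, how often a value from one sample exceeds a value from the other (ties count as half). The name mann_whitney_u and the data are made up for illustration, and the sketch does not compute the p-value that the SciPy function also returns.

```python
def mann_whitney_u(x, y):
    """U statistic for x: number of (x, y) pairs where x is larger; ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


x = [1, 4, 6, 9]
y = [2, 3, 5]
u_x = mann_whitney_u(x, y)
u_y = mann_whitney_u(y, x)
print(u_x, u_y)  # 8.0 4.0, which together cover all 4 * 3 = 12 pairs
```

If the two populations were the same, each U would be expected to sit near half the number of pairs; a U far from that is evidence against the null hypothesis.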
Non-normal Data
Shapiro-Wilk test: tests the null hypothesis that the data was drawn from a normal distribution.
w, p = scipy.stats.shapiro(data)
Just the tip of the iceberg
Many different statistical methods!
Data scientists have large toolkits.
Predicting Future Data
How might we use the data we've collected to make predictions about the data we don't have?
Machine Learning: a branch of artificial intelligence that focuses on constructing systems that learn from large amounts of data to make predictions.
Why is machine learning useful?
Statistics vs. Machine Learning
What is the difference between statistics and machine learning?
- Statistics is focused on analyzing existing data, and drawing valid conclusions.
- Machine learning is focused on making predictions.
Types of Machine Learning
Data -> Model -> Predictions
Supervised Learning: e.g., detecting spam emails
- Have examples with input and output;
- Predict output for future, input-only data;
- Classification;
- Regression.
Unsupervised Learning:
- Trying to understand structure of data
- Clustering
Kurt's favorite machine learning algorithms: clustering and PCA.
Predicting HR: an example of supervised learning
- can we write an equation that takes a bunch of info (e.g., height, weight, birth year, position, for baseball players) and predicts HR (home runs)? Yes, using regression.
Linear Regression with Gradient Descent
If we have age, height, weight, and batting average, and are trying to predict batting average, what are the input variables? Age, height, and weight.
Gradient Descent
Cost function: J(θ); we want to minimize this!
How: repeatedly adjust θ in the direction that lowers J(θ). alpha is the learning rate: it sets how big a step we take down the slope on each update.
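Written out, the cost function and the update rule take the following form. This is reconstructed to match the compute_cost and gradient_descent code in these notes; the notation (m data points, x^(i) for the features of point i, y^(i) for its observed value) is an assumption, not taken from the lecture slides.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^{2}

\theta_j \leftarrow \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( \theta^{T} x^{(i)} - y^{(i)} \right) x_j^{(i)}
```

The sum in the update is the partial derivative of J with respect to theta_j (up to the factor 1/m), so each step moves every theta_j a little way downhill, with alpha scaling the step.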
Gradient Descent in Python
import numpy
import pandas

def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, and values for our thetas.
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)
    return cost

def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """
    # Write some code here that updates the values of theta a number of times equal to
    # num_iterations. Every time you have computed the cost for a given set of thetas,
    # you should append it to cost_history. The function should return both the final
    # values of theta and the cost history.
    # YOUR CODE GOES HERE
    m = len(values)
    cost_history = []
    for i in range(num_iterations):
        predicted_values = numpy.dot(features, theta)
        # Step against the gradient of the cost, scaled by the learning rate alpha.
        theta = theta - alpha / m * numpy.dot((predicted_values - values), features)
        cost = compute_cost(features, values, theta)
        cost_history.append(cost)
    return theta, pandas.Series(cost_history)
Good job! Your program and algorithm worked
Theta =
[ 45.35759233 -9.02442042 13.69229668]
Cost History =
0 3748.133469
1 3727.492258
2 3707.261946
3 3687.434249
4 3668.001052
5 3648.954405
6 3630.286519
7 3611.989767
8 3594.056675
9 3576.479921
10 3559.252334
11 3542.366888
12 3525.816700
13 3509.595027
14 3493.695263
...
985 2686.730820
986 2686.730290
987 2686.729764
988 2686.729240
989 2686.728720
990 2686.728203
991 2686.727690
992 2686.727179
993 2686.726672
994 2686.726168
995 2686.725668
996 2686.725170
997 2686.724676
998 2686.724185
999 2686.723697
Length: 1000, dtype: float64
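To see the same shrinking cost on a problem whose answer is known in advance, here is a dependency-free sketch with one feature plus an intercept. The function name gradient_descent_1d, the toy data, and the choice of alpha and iteration count are all assumptions made for illustration; this is not the exercise's data set.

```python
def gradient_descent_1d(xs, ys, alpha, num_iterations):
    """Fit y = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    cost_history = []
    for _ in range(num_iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each theta.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
        # Record the cost after the update, as the exercise code does.
        new_errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        cost_history.append(sum(e * e for e in new_errors) / (2 * m))
    return (theta0, theta1), cost_history


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x
theta, history = gradient_descent_1d(xs, ys, alpha=0.1, num_iterations=1000)
print(theta)                     # close to (1.0, 2.0)
print(history[0], history[-1])   # the cost falls toward zero
```

Because the toy data lies exactly on a line, the cost can reach almost zero. In the exercise output above the cost levels off near 2686 instead, which is what to expect when no line fits the data perfectly.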
Coefficient of Determination: R squared
data: y_1, ..., y_n
predictions: f_1, ..., f_n
average of data: \bar{y}
R^2 (R squared): R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}
Calculating R Squared
import numpy as np

def compute_r_squared(data, predictions):
    # Write a function that, given two input numpy arrays, 'data', and 'predictions,'
    # returns the coefficient of determination, R^2, for the model that produced
    # predictions.
    #
    # Numpy has a couple of functions -- np.mean() and np.sum() --
    # that you might find useful, but you don't have to use them.
    # YOUR CODE GOES HERE
    # Total sum of squares: spread of the data around its mean.
    SST = ((data - np.mean(data))**2).sum()
    # Residual sum of squares: squared prediction errors.
    SSRes = ((predictions - data)**2).sum()
    r_squared = 1 - SSRes / SST
    return r_squared
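A quick way to build intuition for R squared is to try it on predictions whose quality is obvious. The sketch below is a plain-Python version of the same formula; the name r_squared and the numbers are made up for illustration.

```python
def r_squared(data, predictions):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(data) / len(data)
    ss_tot = sum((y - mean_y) ** 2 for y in data)
    ss_res = sum((y - f) ** 2 for y, f in zip(data, predictions))
    return 1 - ss_res / ss_tot


data = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(data, data)                    # predictions equal the data
baseline = r_squared(data, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
decent = r_squared(data, [1.5, 2.0, 3.0, 3.5])     # small errors at both ends
print(perfect, baseline, decent)
```

Perfect predictions give 1, always predicting the mean gives 0, and the third case lands at about 0.9. Predictions worse than the mean would push the value below 0.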
Additional Considerations
- other types of linear regression
- ordinary least squares regression
- parameter estimation: what are the confidence intervals of our parameters? What is the likelihood we would calculate this parameter value if the parameter had no effect?
- under/overfitting
- multiple local minima: use different random initial thetas; seed random values for repeatability.
Assignment #3
t-test: is subway ridership higher on rainy days or on weekends?
linear regression with gradient descent: how many subway riders are there on a particular day and time? Inputs: time of day, day of week, rainy or not rainy day,...