Exercise 1 - Exploratory Data Analysis
import numpy as np
import pandas
import matplotlib.pyplot as plt
def entries_histogram(turnstile_weather):
'''
Before we perform any analysis, it might be useful to take a
look at the data we're hoping to analyze. More specifically, let's
examine the hourly entries in our NYC subway data and determine what
distribution the data follows. This data is stored in a dataframe
called turnstile_weather under the ['ENTRIESn_hourly'] column.
Let's plot two histograms on the same axes to show hourly
entries when raining vs. when not raining. Here's an example on how
to plot histograms with pandas and matplotlib:
turnstile_weather['column_to_graph'].hist()
Your histograph may look similar to bar graph in the instructor notes below.
You can read a bit about using matplotlib and pandas to plot histograms here:
http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
You can see the information contained within the turnstile weather data here:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
'''
plt.figure()
turnstile_weather['...'] # your code here to plot a historgram for hourly entries when it is raining
turnstile_weather['...'] # your code here to plot a historgram for hourly entries when it is not raining
return plt
Exercise 2 - Welch's t-test?
Does entries data from previous exercise seem normally distributed? No
Can we run Welch's T Test on entries data? Yes.
Correct. Good job!
Exercise 3 - Mann Whitney U Test
import numpy as np
import scipy
import scipy.stats
import pandas
def mann_whitney_plus_means(turnstile_weather):
'''
This function will consume the turnstile_weather dataframe containing
our final turnstile weather data.
You will want to take the means and run the Mann Whitney U test on the
ENTRIESn_hourly column in the turnstile_weather dataframe.
This function should return:
1) the mean of entries with rain
2) the mean of entries without rain
3) the Mann-Whitney U statistic and p-value comparing number of entries
with rain and the number of entries without rain.
You should feel free to use scipy's Mann-Whitney implementation, and
also might find it useful to use numpy's mean function.
Here are some documentations:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
You can look at the final turnstile weather data at the link below:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
'''
return with_rain_mean, without_rain_mean, U, p
Exercise 4 Rainy Day Ridership vs. Non Rainy Ridership
Is the distribution of the number of entries statistically different between rainy and non rainy days? Yes.
Described your results and methods. Rainy days have more ridership.
Correct. Good job!
Exercise 5 Linear Regression
import numpy as np
import pandas
def normalize_features(array):
"""
Normalize the features in our data set.
"""
array_normalized = (array-array.mean())/array.std()
mu = array.mean()
sigma = array.std()
return array_normalized, mu, sigma
def compute_cost(features, values, theta):
"""
Compute the cost function given a set of features / values, and the values for our thetas.
This should be the same code as the compute_cost function in the lesson #3 exercises. But
feel free to implement your own.
"""
# your code here
return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
"""
Perform gradient descent given a data set with an arbitrary number of features.
This is the same gradient descent code as in the lesson #3 exercises. But feel free
to implement your own.
"""
m = len(values)
cost_history = []
for i in range(num_iterations):
# your code here
return theta, pandas.Series(cost_history)
def predictions(dataframe):
'''
The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
Using the information stored in the dataframe, lets predict the ridership of
the NYC subway using linear regression with gradient descent.
You can look at information contained in the turnstile weather dataframe
at the link below:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
Your prediction should have a R^2 value of .40 or better.
Note: due to the memory and CPU limitation of our amazon EC2 instance, we will
give you a random subet (~15%) of the data contained in turnstile_data_master_with_weather.csv
If you receive a "server has encountered an error" message, that means you are hitting
the 30 second limit that's placed on running your program. Try using a smaller number
for num_iterations if that's the case.
Or if you are using your own algorithm/modesl, see if you can optimize your code so it
runs faster.
'''
dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
features = dataframe[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
values = dataframe[['ENTRIESn_hourly']]
m = len(values)
features, mu, sigma = normalize_features(features)
features['ones'] = np.ones(m)
features_array = np.array(features)
values_array = np.array(values).flatten()
#Set values for alpha, number of iterations.
alpha = 0.1 # please feel free to play with this value
num_iterations = 75 # please feel free to play with this value
#Initialize theta, perform gradient descent
theta_gradient_descent = np.zeros(len(features.columns))
theta_gradient_descent, cost_history = gradient_descent(features_array, values_array, theta_gradient_descent,
alpha, num_iterations)
predictions = np.dot(features_array, theta_gradient_descent)
return predictions
Exercise 6 Plotting Residuals
import numpy as np
import scipy
import matplotlib.pyplot as plt
def plot_residuals(turnstile_weather, predictions):
'''
Using the same methods that we used to plot a histogram of entries
per hour for our data, why don't you make a histogram of the residuals
(that is, the difference between the original hourly entry data and the predicted values).
Based on this residual histogram, do you have any insight into how our model
performed? Reading a bit on this webpage might be useful:
http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
'''
plt.figure()
(turnstile_weather['''choose the right column here'''] - predictions).hist()
return plt
Exercise 7 Compute R Squared
import numpy as np
import scipy
import matplotlib.pyplot as plt
import sys
def compute_r_squared(data, predictions):
'''
In exercise 5, we calculated the R^2 value for you. But why don't you try and
and calculate the R^2 value yourself.
Given a list of original data points, and also a list of predicted data points,
write a function that will compute and return the coefficient of determination (R^2)
for this data. numpy.mean() and numpy.sum() might both be useful here, but
not necessary.
Documentation about numpy.mean() and numpy.sum() below:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
'''
# your code here
return r_squared
Exercise 8 Non Gradient Decent Linear Regression
import numpy as np
import pandas
import scipy
import statsmodels
"""
In this optional exercise, you should complete the function called
predictions(turnstile_weather). This function takes in our pandas
turnstile weather dataframe, and returns a set of predicted ridership values,
based on the other information in the dataframe.
You should attempt to implement another type of linear regression,
that you may have read about, such as ordinary least squares regression:
http://en.wikipedia.org/wiki/Ordinary_least_squares
This is your playground. Go wild!
How does your choice of linear regression compare to linear regression
with gradient descent?
You can look at the information contained in the turnstile_weather dataframe below:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
Note: due to the memory and CPU limitation of our amazon EC2 instance, we will
give you a random subset (~15%) of the data contained in turnstile_data_master_with_weather.csv
If you receive a "server has encountered an error" message, that means you are hitting
the 30 second limit that's placed on running your program. See if you can optimize your code so it
runs faster.
"""
def predictions(weather_turnstile):
#
# Your implementation goes here. Feel free to write additional
# helper functions
#
return prediction
Monday, June 23, 2014
Friday, June 20, 2014
Lesson 3 Data Analysis
Statistical Rigor
Significant tests
- using our data, can we disprove an assumption with a pre-defined level of confidence?
Why is statistics useful?
- They provide a formalized framework for comparing and evaluating data.
- They enable us to evaluate whether perceiving effects in our dataset reflect differences across the whole population.
Statistical Significant Tests
- many tests might make assumptions about data's distribution.
- very common distribution - normal distribution (aka Gaussian distribution, bell curve)
t-test
accept or reject a null hypothesis.
Null hypothesis: a statement we are trying to disprove by running our test.
- two samples can from the same population
- a sample is drawn from a probability distribution
specified in terms of a test statistic
Test Statistic: one number that helps accept or reject the null hypothesis
t test - t
A few different versions depending on assumptions
- equal sample size?
- same variance?
Two sample test
calculate t, calculate nu, calculate p;
p value: probability of obtaining a test statistic at least as extreme as ours if null hypothesis was true.
set p critical, if p<p critical, reject null hypothesis
else cannot reject null hypothesis.
Exercise: Calculate t and Nu
Welch's t-test in Python
Is there a simple way to do this in Python?
import scipy.stats
scipy.stats ttest_ind(list_1, list_2, equal_var_false)
scipy.stats ttest_ind assumes a two-side test. How could we use the output to instead perform a one-side test?
great, p/2 < p ritical, t>0
less, p/2 >p critical, t<0
Exercise Welch's t-test
import numpy
import scipy.stats
import pandas
def compare_averages(filename):
"""
Performs a t-test on two sets of baseball data (left-handed and right-handed hitters).
You will be given a csv file that has three columns. A player's
name, handedness (L for lefthanded or R for righthanded) and their
career batting average (called 'avg'). You can look at the csv
file via the following link:
https://www.dropbox.com/s/xcn0u2uxm8c4n6l/baseball_data.csv
Write a function that will read that the csv file into a pandas data frame,
and run Welch's t-test on the two cohorts defined by handedness.
One cohort should be a data frame of right-handed batters. And the other
cohort should be a data frame of left-handed batters.
We have included the scipy.stats library to help you write
or implement Welch's t-test:
http://docs.scipy.org/doc/scipy/reference/stats.html
With a significance level of 95%, if there is no difference
between the two cohorts, return a tuple consisting of
True, and then the tuple returned by scipy.stats.ttest.
If there is a difference, return a tuple consisting of
False, and then the tuple returned by scipy.stats.ttest.
For example, the tuple that you return may look like:
(True, (9.93570222, 0.000023))
"""
Non Parametric Tests
A statistical test that does not assume our data is drawn from any particular underlying probability distribution.
Mann-Whitney U test: tests null hypothesis that the two populations are the same.
u,p=scipy.stats.Mannwhitneyu(x,y)
Non-normal Data
Shapro-Wilk test
w,p=scipy.stats.Shapiro(data)
Just a tip of iceberg
Many different statistical methods!
Data scientists have large toolkits.
Predicting Future Data
How might we use the data we've collected to make predictions the data we don't have?
Machine Learning: a branch of artificial intelligence focuses on constructing systems that learn from large amounts of data to make predictions.
Why machine learning is useful?
Statistics vs. Machine Learning
What is the difference between statistics and machine learning?
- Statistics is focused on analyzing existing data, and drawing valid conclusions.
- Machine learning is focused on making predictions.
Types of Machine Learning
Data- Model- Predictions
Supervising Learning: such detect spam emails
- Have examples with input and output;
- Predict output for future, input-only data;
- Classification;
- Regression.
Unsupervised Learning:
- Trying to understand structure of data
- Clustering
Kurt's favorite machine learning algorithm: clustering and CPA.
Predicting HR: example of supervising learning
- can we write an equation that takes a bunch of info (e.g., height, weight, birth year, position, for baseball players) and predict HR (homerun)? Using regression.
Linear Regression with Gradient Decent
If we have age, height, weight, and batting average, and are trying to predict batting average, what are input variables? Age, Height, and Weight.
Gradient Decent
Cost function: J(0), we want to minimize this!
how: apha, learning rate (it's like a decreasing slope rate)
Gradient Decent in Python
import numpy
import pandas
def compute_cost(features, values, theta):
"""
Compute the cost function given a set of features / values, and values for our thetas.
"""
m = len(values)
sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
cost = sum_of_square_errors / (2*m)
return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
"""
Perform gradient descent given a data set with an arbitrary number of features.
"""
# Write some code here that updates the values of theta a number of times equal to
# num_iterations. Everytime you have computed the cost for a given set of thetas,
# you should append it to cost_history. The function should return both the final
# values of theta and the cost history.
# YOUR CODE GOES HERE
m=len(values)
cost_history=[]
for I in range (num_iterations):
predicted_values=numpy.dot(features, theta)
theta=theta - alpha/m*numpy.dot((predicted_values - values), features)
cost=compute_cost(features, values, theta)
cost_history.append(cost)
return theta, pandas.Series(cost_history)
Good job! Your program and algorithm worked
Theta =
[ 45.35759233 -9.02442042 13.69229668]
Cost History =
0 3748.133469
1 3727.492258
2 3707.261946
3 3687.434249
4 3668.001052
5 3648.954405
6 3630.286519
7 3611.989767
8 3594.056675
9 3576.479921
10 3559.252334
11 3542.366888
12 3525.816700
13 3509.595027
14 3493.695263
...
985 2686.730820
986 2686.730290
987 2686.729764
988 2686.729240
989 2686.728720
990 2686.728203
991 2686.727690
992 2686.727179
993 2686.726672
994 2686.726168
995 2686.725668
996 2686.725170
997 2686.724676
998 2686.724185
999 2686.723697
Length: 1000, dtype: float64
Coefficient of Determination: R squared
data: yi,.......,yn
preductions: fi,......,fn
average of data= y-
R2 (R squared)
Calculating R Squared
import numpy as np
def compute_r_squared(data, predictions):
# Write a function that, given two input numpy arrays, 'data', and 'predictions,'
# returns the coefficient of determination, R^2, for the model that produced
# predictions.
#
# Numpy has a couple of functions -- np.mean() and np.sum() --
# that you might find useful, but you don't have to use them.
# YOUR CODE GOES HERE
SST=((data-np.mean(data)**2).sum()
SSReg=((predictions-data)**2).sum()
r_squared=1-SSReg/SST
return r_squared
Additional Considerations
- other types of linear regression
- ordinary least squares regression
- parameter estimation: what are the confidence intervals of our parameters? What is the likelihood we would calculate this parameter value if the parameter had no effect?
- under/ over fitting
- multiple local minima: use different random initial thetas; seed random values for repeatability.
Assignment #3
t-test: does subway ridership more in rainy or weekend?
linear regression with gradient decent: how many subway riders n a particular day and time? time of day, weekday of week, rainy or not rainy day,...
Significant tests
- using our data, can we disprove an assumption with a pre-defined level of confidence?
Why is statistics useful?
- They provide a formalized framework for comparing and evaluating data.
- They enable us to evaluate whether perceiving effects in our dataset reflect differences across the whole population.
Statistical Significant Tests
- many tests might make assumptions about data's distribution.
- very common distribution - normal distribution (aka Gaussian distribution, bell curve)
t-test
accept or reject a null hypothesis.
Null hypothesis: a statement we are trying to disprove by running our test.
- two samples can from the same population
- a sample is drawn from a probability distribution
specified in terms of a test statistic
Test Statistic: one number that helps accept or reject the null hypothesis
t test - t
A few different versions depending on assumptions
- equal sample size?
- same variance?
Two sample test
calculate t, calculate nu, calculate p;
p value: probability of obtaining a test statistic at least as extreme as ours if null hypothesis was true.
set p critical, if p<p critical, reject null hypothesis
else cannot reject null hypothesis.
Exercise: Calculate t and Nu
Welch's t-test in Python
Is there a simple way to do this in Python?
import scipy.stats
scipy.stats ttest_ind(list_1, list_2, equal_var_false)
scipy.stats ttest_ind assumes a two-side test. How could we use the output to instead perform a one-side test?
great, p/2 < p ritical, t>0
less, p/2 >p critical, t<0
Exercise Welch's t-test
import numpy
import scipy.stats
import pandas
def compare_averages(filename):
"""
Performs a t-test on two sets of baseball data (left-handed and right-handed hitters).
You will be given a csv file that has three columns. A player's
name, handedness (L for lefthanded or R for righthanded) and their
career batting average (called 'avg'). You can look at the csv
file via the following link:
https://www.dropbox.com/s/xcn0u2uxm8c4n6l/baseball_data.csv
Write a function that will read that the csv file into a pandas data frame,
and run Welch's t-test on the two cohorts defined by handedness.
One cohort should be a data frame of right-handed batters. And the other
cohort should be a data frame of left-handed batters.
We have included the scipy.stats library to help you write
or implement Welch's t-test:
http://docs.scipy.org/doc/scipy/reference/stats.html
With a significance level of 95%, if there is no difference
between the two cohorts, return a tuple consisting of
True, and then the tuple returned by scipy.stats.ttest.
If there is a difference, return a tuple consisting of
False, and then the tuple returned by scipy.stats.ttest.
For example, the tuple that you return may look like:
(True, (9.93570222, 0.000023))
"""
Non Parametric Tests
A statistical test that does not assume our data is drawn from any particular underlying probability distribution.
Mann-Whitney U test: tests null hypothesis that the two populations are the same.
u,p=scipy.stats.Mannwhitneyu(x,y)
Non-normal Data
Shapro-Wilk test
w,p=scipy.stats.Shapiro(data)
Just a tip of iceberg
Many different statistical methods!
Data scientists have large toolkits.
Predicting Future Data
How might we use the data we've collected to make predictions the data we don't have?
Machine Learning: a branch of artificial intelligence focuses on constructing systems that learn from large amounts of data to make predictions.
Why machine learning is useful?
Statistics vs. Machine Learning
What is the difference between statistics and machine learning?
- Statistics is focused on analyzing existing data, and drawing valid conclusions.
- Machine learning is focused on making predictions.
Types of Machine Learning
Data- Model- Predictions
Supervising Learning: such detect spam emails
- Have examples with input and output;
- Predict output for future, input-only data;
- Classification;
- Regression.
Unsupervised Learning:
- Trying to understand structure of data
- Clustering
Kurt's favorite machine learning algorithm: clustering and CPA.
Predicting HR: example of supervising learning
- can we write an equation that takes a bunch of info (e.g., height, weight, birth year, position, for baseball players) and predict HR (homerun)? Using regression.
Linear Regression with Gradient Decent
If we have age, height, weight, and batting average, and are trying to predict batting average, what are input variables? Age, Height, and Weight.
Gradient Decent
Cost function: J(0), we want to minimize this!
how: apha, learning rate (it's like a decreasing slope rate)
Gradient Decent in Python
import numpy
import pandas
def compute_cost(features, values, theta):
"""
Compute the cost function given a set of features / values, and values for our thetas.
"""
m = len(values)
sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
cost = sum_of_square_errors / (2*m)
return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
"""
Perform gradient descent given a data set with an arbitrary number of features.
"""
# Write some code here that updates the values of theta a number of times equal to
# num_iterations. Everytime you have computed the cost for a given set of thetas,
# you should append it to cost_history. The function should return both the final
# values of theta and the cost history.
# YOUR CODE GOES HERE
m=len(values)
cost_history=[]
for I in range (num_iterations):
predicted_values=numpy.dot(features, theta)
theta=theta - alpha/m*numpy.dot((predicted_values - values), features)
cost=compute_cost(features, values, theta)
cost_history.append(cost)
return theta, pandas.Series(cost_history)
Good job! Your program and algorithm worked
Theta =
[ 45.35759233 -9.02442042 13.69229668]
Cost History =
0 3748.133469
1 3727.492258
2 3707.261946
3 3687.434249
4 3668.001052
5 3648.954405
6 3630.286519
7 3611.989767
8 3594.056675
9 3576.479921
10 3559.252334
11 3542.366888
12 3525.816700
13 3509.595027
14 3493.695263
...
985 2686.730820
986 2686.730290
987 2686.729764
988 2686.729240
989 2686.728720
990 2686.728203
991 2686.727690
992 2686.727179
993 2686.726672
994 2686.726168
995 2686.725668
996 2686.725170
997 2686.724676
998 2686.724185
999 2686.723697
Length: 1000, dtype: float64
Coefficient of Determination: R squared
data: yi,.......,yn
preductions: fi,......,fn
average of data= y-
R2 (R squared)
Calculating R Squared
import numpy as np
def compute_r_squared(data, predictions):
# Write a function that, given two input numpy arrays, 'data', and 'predictions,'
# returns the coefficient of determination, R^2, for the model that produced
# predictions.
#
# Numpy has a couple of functions -- np.mean() and np.sum() --
# that you might find useful, but you don't have to use them.
# YOUR CODE GOES HERE
SST=((data-np.mean(data)**2).sum()
SSReg=((predictions-data)**2).sum()
r_squared=1-SSReg/SST
return r_squared
Additional Considerations
- other types of linear regression
- ordinary least squares regression
- parameter estimation: what are the confidence intervals of our parameters? What is the likelihood we would calculate this parameter value if the parameter had no effect?
- under/ over fitting
- multiple local minima: use different random initial thetas; seed random values for repeatability.
Assignment #3
t-test: does subway ridership more in rainy or weekend?
linear regression with gradient decent: how many subway riders n a particular day and time? time of day, weekday of week, rainy or not rainy day,...
Project 2 - Wrangle NYC Subway and Weather Data
Exercise 1 Number of Rainy Days
import pandas
import pandasql
def num_rainy_days(filename):
'''
This function should run a SQL query on a dataframe of
weather data. The SQL query should return one column and
one row - a count of the number of days in the dataframe where
the rain column is equal to 1 (i.e., the number of days it
rained). The dataframe will be titled 'weather_data'. You'll
need to provide the SQL query. You might find SQL's count function
useful for this exercise. You can read more about it here:
https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
You might also find that interpreting numbers as integers or floats may not
work initially. In order to get around this issue, it may be equal to cast
these numbers as integers. This can be done by writing cast(column as integer).
So for example, if we wanted to cast the maxtempi column as an integer, we would actually
write something like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi = 76.
You can see the weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data = pandas.read_csv(filename)
q = """
your query here
"""
#Execute your SQL command against the pandas frame
rainy_days = pandasql.sqldf(q.lower(), locals())
return rainy_days
Exercise 2 Temp on Foggy and Non Foggy Days
import pandas
import pandasql
def max_temp_aggregate_by_fog(filename):
'''
This function should run a SQL query on a dataframe of
weather data. The SQL query should return two columns and
two rows - whether it was foggy or not (0 or 1) and the max
maxtempi for that fog value (i.e., the maximum max temperature
for both foggy and non-foggy days). The dataframe will be
titled 'weather_data'. You'll need to provide the SQL query.
You might also find that interpreting numbers as integers or floats may not
work initially. In order to get around this issue, it may be useful to cast
these numbers as integers. This can be done by writing cast(column as integer).
So for example, if we wanted to cast the maxtempi column as an integer, we would actually
write something like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi = 76.
You can see the weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data = pandas.read_csv(filename)
q = """
your query here
"""
#Execute your SQL command against the pandas frame
rainy_days = pandasql.sqldf(q.lower(), locals())
return rainy_days
Exercise 3 Mean Temp on Weekends
import pandas
import pandasql
def avg_min_temperature(filename):
'''
This function should run a SQL query on a dataframe of
weather data. The SQL query should return one column and
one row - the average meantempi on days that are a Saturday
or Sunday (i.e., the the average mean temperature on weekends).
The dataframe will be titled 'weather_data' and you can access
the date in the dataframe via the 'date' column.
You'll need to provide the SQL query.
You might also find that interpreting numbers as integers or floats may not
work initially. In order to get around this issue, it may be equal to cast
these numbers as integers. This can be done by writing cast(column as integer).
So for example, if we wanted to cast the maxtempi column as an integer, we would actually
write something like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi = 76.
Also, you can convert dates to days of the week via the 'strftime' keyword in SQL.
For example, cast (strftime('%w', date) as integer) will return 0 if the date
is a Sunday or 6 if the date is a Saturday.
You can see the weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data = pandas.read_csv(filename)
q = """
your query here
"""
#Execute your SQL command against the pandas frame
mean_temp_weekends = pandasql.sqldf(q.lower(), locals())
return mean_temp_weekends
Exercise 4 Mean Temp on Rainy Days
import pandas
import pandasql
def avg_min_temperature(filename):
'''
This function should run a SQL query on a dataframe of
weather data. More specifically you want to find the average
minimum temperature on rainy days where the minimum temperature
is greater than 55 degrees.
You might also find that interpreting numbers as integers or floats may not
work initially. In order to get around this issue, it may be equal to cast
these numbers as integers. This can be done by writing cast(column as integer).
So for example, if we wanted to cast the maxtempi column as an integer, we would actually
write something like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi = 76.
You can see the weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data = pandas.read_csv(filename)
q = """
your query here
"""
#Execute your SQL command against the pandas frame
mean_temp_weekends = pandasql.sqldf(q.lower(), locals())
return mean_temp_weekends
Exercise 5 Fixing NYC Subway Turnstile Data
import csv
def fix_turnstile_data(filenames):
'''
Filenames is a list of MTA Subway turnstile text files. A link to an example
MTA Subway turnstile text file can be seen at the URL below:
http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt
As you can see, there are numerous data points included in each row of the
a MTA Subway turnstile text file.
You want to write a function that will update each row in the text
file so there is only one entry per row. A few examples below:
A002,R051,02-00-00,05-28-11,00:00:00,REGULAR,003178521,001100739
A002,R051,02-00-00,05-28-11,04:00:00,REGULAR,003178541,001100746
A002,R051,02-00-00,05-28-11,08:00:00,REGULAR,003178559,001100775
Write the updates to a different text file in the format of "updated_" + filename.
For example:
1) if you read in a text file called "turnstile_110521.txt"
2) you should write the updated data to "updated_turnstile_110521.txt"
The order of the fields should be preserved.
You can see a sample of the turnstile text file that's passed into this function
and the the corresponding updated file in the links below:
Sample input file:
https://www.dropbox.com/s/mpin5zv4hgrx244/turnstile_110528.txt
Sample updated file:
https://www.dropbox.com/s/074xbgio4c39b7h/solution_turnstile_110528.txt
'''
for name in filenames:
# your code here
Exercise 6 Combining NYC Subway Turnstile Data
def create_master_turnstile_file(filenames, output_file):
'''
Write a function that takes the files in the list filenames, which all have the
columns 'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn', and consolidates
them into one file located at output_file. There should be ONE row with the column
headers, located at the top of the file.
For example, if file_1 has:
'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
line 1 ...
line 2 ...
and another file, file_2 has:
'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
line 3 ...
line 4 ...
line 5 ...
We need to combine file_1 and file_2 into a master_file like below:
'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
line 1 ...
line 2 ...
line 3 ...
line 4 ...
line 5 ...
'''
with open(output_file, 'w') as master_file:
master_file.write('C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
for filename in filenames:
# your code here
Exercise 7 Filtering "Non Regular" Data
import pandas
def filter_by_regular(filename):
'''
This function should read the csv file located at filename into a pandas dataframe,
and filter the dataframe to only rows where the 'DESCn' column has the value 'REGULAR'.
For example, if the pandas dataframe is as follows:
,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231
The dataframe will look like below after filtering to only rows where DESCn column
has the value 'REGULAR':
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
'''
turnstile_data = # your code here
# more of your code here
return turnstile_data
Exercise 8 Get Hourly Entries
import pandas
def get_hourly_entries(df):
'''
The data in the MTA Subway Turnstile data reports on the cumulative
number of entries and exits per row. Assume that you have a dataframe
called df that contains only the rows for a particular turnstile machine
(i.e., unique SCP, C/A, and UNIT). This function should change
these cumulative entry numbers to a count of entries since the last reading
(i.e., entries since the last row in the dataframe).
More specifically, you want to do two things:
1) Create a new column called ENTRIESn_hourly
2) Assign to the column the difference between ENTRIESn of the current row
and the previous row. If there is any NaN, fill/replace it with 1.
You may find the pandas functions shift() and fillna() to be helpful in this exercise.
Examples of what your dataframe should look like at the end of this exercise:
C/A UNIT SCP DATEn TIMEn DESCn ENTRIESn EXITSn ENTRIESn_hourly
0 A002 R051 02-00-00 05-01-11 00:00:00 REGULAR 3144312 1088151 1
1 A002 R051 02-00-00 05-01-11 04:00:00 REGULAR 3144335 1088159 23
2 A002 R051 02-00-00 05-01-11 08:00:00 REGULAR 3144353 1088177 18
3 A002 R051 02-00-00 05-01-11 12:00:00 REGULAR 3144424 1088231 71
4 A002 R051 02-00-00 05-01-11 16:00:00 REGULAR 3144594 1088275 170
5 A002 R051 02-00-00 05-01-11 20:00:00 REGULAR 3144808 1088317 214
6 A002 R051 02-00-00 05-02-11 00:00:00 REGULAR 3144895 1088328 87
7 A002 R051 02-00-00 05-02-11 04:00:00 REGULAR 3144905 1088331 10
8 A002 R051 02-00-00 05-02-11 08:00:00 REGULAR 3144941 1088420 36
9 A002 R051 02-00-00 05-02-11 12:00:00 REGULAR 3145094 1088753 153
10 A002 R051 02-00-00 05-02-11 16:00:00 REGULAR 3145337 1088823 243
...
...
'''
#your code here
return df
Exercise 9 Get Hourly Exists
import pandas
def get_hourly_exits(df):
'''
The data in the MTA Subway Turnstile data reports on the cumulative
number of entries and exits per row. Assume that you have a dataframe
called df that contains only the rows for a particular turnstile machine
(i.e., unique SCP, C/A, and UNIT). This function should change
these cumulative exit numbers to a count of exits since the last reading
(i.e., exits since the last row in the dataframe).
More specifically, you want to do two things:
1) Create a new column called EXITSn_hourly
2) Assign to the column the difference between EXITSn of the current row
and the previous row. If there is any NaN, fill/replace it with 0.
You may find the pandas functions shift() and fillna() to be helpful in this exercise.
Example dataframe below:
Unnamed: 0 C/A UNIT SCP DATEn TIMEn DESCn ENTRIESn EXITSn ENTRIESn_hourly EXITSn_hourly
0 0 A002 R051 02-00-00 05-01-11 00:00:00 REGULAR 3144312 1088151 0 0
1 1 A002 R051 02-00-00 05-01-11 04:00:00 REGULAR 3144335 1088159 23 8
2 2 A002 R051 02-00-00 05-01-11 08:00:00 REGULAR 3144353 1088177 18 18
3 3 A002 R051 02-00-00 05-01-11 12:00:00 REGULAR 3144424 1088231 71 54
4 4 A002 R051 02-00-00 05-01-11 16:00:00 REGULAR 3144594 1088275 170 44
5 5 A002 R051 02-00-00 05-01-11 20:00:00 REGULAR 3144808 1088317 214 42
6 6 A002 R051 02-00-00 05-02-11 00:00:00 REGULAR 3144895 1088328 87 11
7 7 A002 R051 02-00-00 05-02-11 04:00:00 REGULAR 3144905 1088331 10 3
8 8 A002 R051 02-00-00 05-02-11 08:00:00 REGULAR 3144941 1088420 36 89
9 9 A002 R051 02-00-00 05-02-11 12:00:00 REGULAR 3145094 1088753 153 333
'''
#your code here
return df
Exercise 10 Time to Hour
import pandas
def time_to_hour(time):
'''
Given an input variable time that represents time in the format of:
00:00:00 (hour:minutes:seconds)
Write a function to extract the hour part from the input variable time
and return it as an integer. For example:
1) if hour is 00, your code should return 0
2) if hour is 01, your code should return 1
3) if hour is 21, your code should return 21
Please return hour as an integer.
'''
hour = # your code here
return hour
Exercise 11 Reformat Subway Dates
import datetime
def reformat_subway_dates(date):
'''
The dates in our subway data are formatted in the format month-day-year.
The dates in our weather underground data are formatted year-month-day.
In order to join these two data sets together, we'll want the dates formatted
the same way. Write a function that takes as its input a date in the MTA Subway
data format, and returns a date in the weather underground format.
Hint:
There is a useful function in the datetime library called strptime.
More info can be seen here:
http://docs.python.org/2/library/datetime.html#datetime.datetime.strptime
'''
date_formatted = # your code here
return date_formatted
import pandas
import pandasql
def num_rainy_days(filename):
'''
This function should run a SQL query on a dataframe of
weather data. The SQL query should return one column and
one row - a count of the number of days in the dataframe where
the rain column is equal to 1 (i.e., the number of days it
rained). The dataframe will be titled 'weather_data'. You'll
need to provide the SQL query. You might find SQL's count function
useful for this exercise. You can read more about it here:
https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
You might also find that interpreting numbers as integers or floats may not
work initially. In order to get around this issue, it may be equal to cast
these numbers as integers. This can be done by writing cast(column as integer).
So for example, if we wanted to cast the maxtempi column as an integer, we would actually
write something like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi = 76.
You can see the weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data = pandas.read_csv(filename)
q = """
your query here
"""
#Execute your SQL command against the pandas frame
rainy_days = pandasql.sqldf(q.lower(), locals())
return rainy_days
Exercise 2 Temp on Foggy and Non Foggy Days
import pandas
import pandasql
def max_temp_aggregate_by_fog(filename):
'''
This function should run a SQL query on a dataframe of
weather data. The SQL query should return two columns and
two rows - whether it was foggy or not (0 or 1) and the max
maxtempi for that fog value (i.e., the maximum max temperature
for both foggy and non-foggy days). The dataframe will be
titled 'weather_data'. You'll need to provide the SQL query.
You might also find that interpreting numbers as integers or floats may not
work initially. In order to get around this issue, it may be useful to cast
these numbers as integers. This can be done by writing cast(column as integer).
So for example, if we wanted to cast the maxtempi column as an integer, we would actually
write something like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi = 76.
You can see the weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data = pandas.read_csv(filename)
q = """
your query here
"""
#Execute your SQL command against the pandas frame
rainy_days = pandasql.sqldf(q.lower(), locals())
return rainy_days
Exercise 3 Mean Temp on Weekends
import pandas
import pandasql
def avg_min_temperature(filename):
'''
This function should run a SQL query on a dataframe of
weather data. The SQL query should return one column and
one row - the average meantempi on days that are a Saturday
or Sunday (i.e., the the average mean temperature on weekends).
The dataframe will be titled 'weather_data' and you can access
the date in the dataframe via the 'date' column.
You'll need to provide the SQL query.
You might also find that interpreting numbers as integers or floats may not
work initially. In order to get around this issue, it may be equal to cast
these numbers as integers. This can be done by writing cast(column as integer).
So for example, if we wanted to cast the maxtempi column as an integer, we would actually
write something like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi = 76.
Also, you can convert dates to days of the week via the 'strftime' keyword in SQL.
For example, cast (strftime('%w', date) as integer) will return 0 if the date
is a Sunday or 6 if the date is a Saturday.
You can see the weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data = pandas.read_csv(filename)
q = """
your query here
"""
#Execute your SQL command against the pandas frame
mean_temp_weekends = pandasql.sqldf(q.lower(), locals())
return mean_temp_weekends
Exercise 4 Mean Temp on Rainy Days
import pandas
import pandasql
def avg_min_temperature(filename):
'''
This function should run a SQL query on a dataframe of
weather data. More specifically you want to find the average
minimum temperature on rainy days where the minimum temperature
is greater than 55 degrees.
You might also find that interpreting numbers as integers or floats may not
work initially. In order to get around this issue, it may be equal to cast
these numbers as integers. This can be done by writing cast(column as integer).
So for example, if we wanted to cast the maxtempi column as an integer, we would actually
write something like where cast(maxtempi as integer) = 76, as opposed to simply
where maxtempi = 76.
You can see the weather data that we are passing in below:
https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
'''
weather_data = pandas.read_csv(filename)
q = """
your query here
"""
#Execute your SQL command against the pandas frame
mean_temp_weekends = pandasql.sqldf(q.lower(), locals())
return mean_temp_weekends
Exercise 5 Fixing NYC Subway Turnstile Data
import csv
def fix_turnstile_data(filenames):
'''
Filenames is a list of MTA Subway turnstile text files. A link to an example
MTA Subway turnstile text file can be seen at the URL below:
http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt
As you can see, there are numerous data points included in each row of the
a MTA Subway turnstile text file.
You want to write a function that will update each row in the text
file so there is only one entry per row. A few examples below:
A002,R051,02-00-00,05-28-11,00:00:00,REGULAR,003178521,001100739
A002,R051,02-00-00,05-28-11,04:00:00,REGULAR,003178541,001100746
A002,R051,02-00-00,05-28-11,08:00:00,REGULAR,003178559,001100775
Write the updates to a different text file in the format of "updated_" + filename.
For example:
1) if you read in a text file called "turnstile_110521.txt"
2) you should write the updated data to "updated_turnstile_110521.txt"
The order of the fields should be preserved.
You can see a sample of the turnstile text file that's passed into this function
and the the corresponding updated file in the links below:
Sample input file:
https://www.dropbox.com/s/mpin5zv4hgrx244/turnstile_110528.txt
Sample updated file:
https://www.dropbox.com/s/074xbgio4c39b7h/solution_turnstile_110528.txt
'''
for name in filenames:
# your code here
Exercise 6 Combining NYC Subway Turnstile Data
def create_master_turnstile_file(filenames, output_file):
'''
Write a function that takes the files in the list filenames, which all have the
columns 'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn', and consolidates
them into one file located at output_file. There should be ONE row with the column
headers, located at the top of the file.
For example, if file_1 has:
'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
line 1 ...
line 2 ...
and another file, file_2 has:
'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
line 3 ...
line 4 ...
line 5 ...
We need to combine file_1 and file_2 into a master_file like below:
'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
line 1 ...
line 2 ...
line 3 ...
line 4 ...
line 5 ...
'''
with open(output_file, 'w') as master_file:
master_file.write('C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
for filename in filenames:
# your code here
Exercise 7 Filtering "Non Regular" Data
import pandas
def filter_by_regular(filename):
'''
This function should read the csv file located at filename into a pandas dataframe,
and filter the dataframe to only rows where the 'DESCn' column has the value 'REGULAR'.
For example, if the pandas dataframe is as follows:
,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231
The dataframe will look like below after filtering to only rows where DESCn column
has the value 'REGULAR':
0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
'''
turnstile_data = # your code here
# more of your code here
return turnstile_data
Exercise 8 Get Hourly Entries
import pandas
def get_hourly_entries(df):
'''
The data in the MTA Subway Turnstile data reports on the cumulative
number of entries and exits per row. Assume that you have a dataframe
called df that contains only the rows for a particular turnstile machine
(i.e., unique SCP, C/A, and UNIT). This function should change
these cumulative entry numbers to a count of entries since the last reading
(i.e., entries since the last row in the dataframe).
More specifically, you want to do two things:
1) Create a new column called ENTRIESn_hourly
2) Assign to the column the difference between ENTRIESn of the current row
and the previous row. If there is any NaN, fill/replace it with 1.
You may find the pandas functions shift() and fillna() to be helpful in this exercise.
Examples of what your dataframe should look like at the end of this exercise:
C/A UNIT SCP DATEn TIMEn DESCn ENTRIESn EXITSn ENTRIESn_hourly
0 A002 R051 02-00-00 05-01-11 00:00:00 REGULAR 3144312 1088151 1
1 A002 R051 02-00-00 05-01-11 04:00:00 REGULAR 3144335 1088159 23
2 A002 R051 02-00-00 05-01-11 08:00:00 REGULAR 3144353 1088177 18
3 A002 R051 02-00-00 05-01-11 12:00:00 REGULAR 3144424 1088231 71
4 A002 R051 02-00-00 05-01-11 16:00:00 REGULAR 3144594 1088275 170
5 A002 R051 02-00-00 05-01-11 20:00:00 REGULAR 3144808 1088317 214
6 A002 R051 02-00-00 05-02-11 00:00:00 REGULAR 3144895 1088328 87
7 A002 R051 02-00-00 05-02-11 04:00:00 REGULAR 3144905 1088331 10
8 A002 R051 02-00-00 05-02-11 08:00:00 REGULAR 3144941 1088420 36
9 A002 R051 02-00-00 05-02-11 12:00:00 REGULAR 3145094 1088753 153
10 A002 R051 02-00-00 05-02-11 16:00:00 REGULAR 3145337 1088823 243
...
...
'''
#your code here
return df
Exercise 9 Get Hourly Exists
import pandas
def get_hourly_exits(df):
'''
The data in the MTA Subway Turnstile data reports on the cumulative
number of entries and exits per row. Assume that you have a dataframe
called df that contains only the rows for a particular turnstile machine
(i.e., unique SCP, C/A, and UNIT). This function should change
these cumulative exit numbers to a count of exits since the last reading
(i.e., exits since the last row in the dataframe).
More specifically, you want to do two things:
1) Create a new column called EXITSn_hourly
2) Assign to the column the difference between EXITSn of the current row
and the previous row. If there is any NaN, fill/replace it with 0.
You may find the pandas functions shift() and fillna() to be helpful in this exercise.
Example dataframe below:
Unnamed: 0 C/A UNIT SCP DATEn TIMEn DESCn ENTRIESn EXITSn ENTRIESn_hourly EXITSn_hourly
0 0 A002 R051 02-00-00 05-01-11 00:00:00 REGULAR 3144312 1088151 0 0
1 1 A002 R051 02-00-00 05-01-11 04:00:00 REGULAR 3144335 1088159 23 8
2 2 A002 R051 02-00-00 05-01-11 08:00:00 REGULAR 3144353 1088177 18 18
3 3 A002 R051 02-00-00 05-01-11 12:00:00 REGULAR 3144424 1088231 71 54
4 4 A002 R051 02-00-00 05-01-11 16:00:00 REGULAR 3144594 1088275 170 44
5 5 A002 R051 02-00-00 05-01-11 20:00:00 REGULAR 3144808 1088317 214 42
6 6 A002 R051 02-00-00 05-02-11 00:00:00 REGULAR 3144895 1088328 87 11
7 7 A002 R051 02-00-00 05-02-11 04:00:00 REGULAR 3144905 1088331 10 3
8 8 A002 R051 02-00-00 05-02-11 08:00:00 REGULAR 3144941 1088420 36 89
9 9 A002 R051 02-00-00 05-02-11 12:00:00 REGULAR 3145094 1088753 153 333
'''
#your code here
return df
Exercise 10 Time to Hour
import pandas
def time_to_hour(time):
'''
Given an input variable time that represents time in the format of:
00:00:00 (hour:minutes:seconds)
Write a function to extract the hour part from the input variable time
and return it as an integer. For example:
1) if hour is 00, your code should return 0
2) if hour is 01, your code should return 1
3) if hour is 21, your code should return 21
Please return hour as an integer.
'''
hour = # your code here
return hour
Exercise 11 Reformat Subway Dates
import datetime
def reformat_subway_dates(date):
'''
The dates in our subway data are formatted in the format month-day-year.
The dates in our weather underground data are formatted year-month-day.
In order to join these two data sets together, we'll want the dates formatted
the same way. Write a function that takes as its input a date in the MTA Subway
data format, and returns a date in the weather underground format.
Hint:
There is a useful function in the datetime library called strptime.
More info can be seen here:
http://docs.python.org/2/library/datetime.html#datetime.datetime.strptime
'''
date_formatted = # your code here
return date_formatted
Thursday, June 12, 2014
Lesson 2 Data Wrangling
Data Manipulation
- from files, databases, and Web APIs.
Dealing with messy data. What tools do you use?
Acquiring data
- Acquiring data is often not funcy.
- Find data on the internet.
- A lot of data stored in text files and on government websites.
Common data formats:
- CSV
- XML
- JSON(nested structure)
import pandas
def add_full_name(path_to_csv, path_to_new_csv):
#Assume you will be reading in a csv file with the same columns that the
#Lahman baseball data set has -- most importantly, there are columns
#called 'nameFirst' and 'nameLast'.
#1) Write a function that reads a csv
#located at "path_to_csv" into a pandas dataframe and adds a new column
#called 'nameFull' with a player's full name.
#
#For example:
# for Hank Aaron, nameFull would be 'Hank Aaron',
#
#2) Write the data in the pandas dataFrame to a new csv file located at
#path_to_new_csv
#WRITE YOUR CODE HERE
baseballData = pandas.read_csv(path_to_csv)
baseballData['nameFull'] = baseballData['nameFirst'] + ' ' + baseballData['nameLast']
baseballData.to_csv(path_to_new_csv)
Good job! You updated the CSV file correctly.
The following is the output by your program (showing only the name related columns):
0,Hank,Aaron,Hank Aaron
1,Tommie,Aaron,Tommie Aaron
2,Don,Aase,Don Aase
3,Andy,Abad,Andy Abad
............
Relational Databases:
Why relational database is useful?
- Straight-forward to extract aggregated data from complex files.
- Database scales well.
- It ensures all data is consistently formatted.
Database Schema
schemas=blueprints
Simple Queries
Select * from aadhaar_data;
Limit 20;
Select district subdistrict from aadhaar_data limit 20;
Write your own......
import pandas
import pandasql
def select_first_50(filename):
# Read in our aadhaar_data csv to a pandas dataframe. Afterwards, we rename the columns
# by replacing spaces with underscores and setting all characters to lowercase, so the
# column names more closely resemble columns names one might find in a table.
aadhaar_data = pandas.read_csv(filename)
aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
# Select out the first 50 values for "registrar" and "enrolment_agency"
# in the aadhaar_data table using SQL syntax.
#
# Note that "enrolment_agency" is spelled with one l. Also, the order
# of the select does matter. Make sure you select registrar then enrolment agency
# in your query.
q = """
#YOUR QUERY HERE
SELECT
registrar, enrolment_agency
FROM
aadhaar_data
LIMIT 50;
"""
#Execute your SQL command against the pandas frame
aadhaar_solution = pandasql.sqldf(q.lower(), locals())
return aadhaar_solution
Complex Queries
Select * from aadhaar_data
where state='Gujarat';
Select district, sum(aadhaar_generated)
from aadhaar_data group by district;
Select district, subdistrict sum(aadhaar_generated)
from aadhaar_data where age>60 group by district, subdistrict;
Write your own....
import pandas
import pandasql
def aggregate_query(filename):
# Read in our aadhaar_data csv to a pandas dataframe. Afterwards, we rename the columns
# by replacing spaces with underscores and setting all characters to lowercase, so the
# column names more closely resemble columns names one might find in a table.
aadhaar_data = pandas.read_csv(filename)
aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
# Write a query that will select from the aadhaar_data table how many men and how
# many women over the age of 50 have had aadhaar generated for them in each district
#
# Note that in this quiz, the SQL query keywords are case sensitive.
# For example, if you want to do a sum make sure you type 'sum' rather than 'SUM'.
#
# The possible columns to select from aadhaar data are:
# 1) registrar
# 2) enrolment_agency
# 3) state
# 4) district
# 5) sub_district
# 6) pin_code
# 7) gender
# 8) age
# 9) aadhaar_generated
# 10) enrolment_rejected
# 11) residents_providing_email,
# 12) residents_providing_mobile_number
#
# You can download a copy of the aadhaar data that we are passing
# into this exercise below:
# https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv
q = """# your code here
select gender, district, sum(aadhaar_generated)
from
aadhaar_data
where
age > 50
group by
gender, district;
"""
# Execute your SQL command against the pandas frame
aadhaar_solution = pandasql.sqldf(q.lower(), locals())
return aadhaar_solution
APIs
Files, Databases, APIs
Application Programming Interface
How to Access API Efficiently
API Exercise:
import json
# return the top artist in Spain
1. Occasionally system error prevent data from being recorded.
2. Some subset of subjects or event types are systematically missing certain data attributes, or missing entirely.
Occasional failure, nonresponse
Dealing with missing data
- Partial Deletion
(1) Listwise deletion
(2) Pairwise deletion
- Imputation
why impute?
- Not much data
- removing data could affect representatives
- many different imputation techniques
- imputation is really hard to get right
- how to deal with
(1) use mean value, good- doesn't change mean across sample; bad - lessens correlations between variables.
(2) impute using linear regression,
data we have ->make an equation to predict variable missing -> predict missing values
Drawbacks: Over emphasize trends; Exact value suggest too much certainty
Imputaton Exercise:
from pandas import *
import numpy
def imputation(filename):
# Pandas dataframes have a method called 'fillna(value)', such that you can
# pass in a single value to replace any NAs in a dataframe or series. You
# can call it like this:
# dataframe['column'] = dataframe['column'].fillna(value)
#
# Using the numpy.mean function, which calculates the mean of a numpy
# array, impute any missing values in our Lahman baseball
# data sets 'weight' column by setting them equal to the average weight.
#
# You can access the 'weight' colum in the baseball data frame by
# calling baseball['weight']
#baseball = pandas.read_csv(filename)
#YOUR CODE GOES HERE
baseball=pandas.read_csv('../data/Master.csv')
baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))
print numpy.sum(baseball['weight']), numpy.mean(baseball['weight'])
return baseball
Imputation:
- just the tip of the iceberg; more sophisticated methods exist;
- fill in mean and
- linear regression , both simple and relatively effective;
- both have negative side effects and can obscure amplify the trends in data;
Assignment #2
- Acquire weather data via weather underground API
- Get a sense of data using SQL queries
- Clean and process the data
- from files, databases, and Web APIs.
Dealing with messy data. What tools do you use?
Acquiring data
- Acquiring data is often not funcy.
- Find data on the internet.
- A lot of data stored in text files and on government websites.
Common data formats:
- CSV
- XML
- JSON(nested structure)
import pandas
def add_full_name(path_to_csv, path_to_new_csv):
#Assume you will be reading in a csv file with the same columns that the
#Lahman baseball data set has -- most importantly, there are columns
#called 'nameFirst' and 'nameLast'.
#1) Write a function that reads a csv
#located at "path_to_csv" into a pandas dataframe and adds a new column
#called 'nameFull' with a player's full name.
#
#For example:
# for Hank Aaron, nameFull would be 'Hank Aaron',
#
#2) Write the data in the pandas dataFrame to a new csv file located at
#path_to_new_csv
#WRITE YOUR CODE HERE
baseballData = pandas.read_csv(path_to_csv)
baseballData['nameFull'] = baseballData['nameFirst'] + ' ' + baseballData['nameLast']
baseballData.to_csv(path_to_new_csv)
Good job! You updated the CSV file correctly.
The following is the output by your program (showing only the name related columns):
0,Hank,Aaron,Hank Aaron
1,Tommie,Aaron,Tommie Aaron
2,Don,Aase,Don Aase
3,Andy,Abad,Andy Abad
............
Relational Databases:
Why relational database is useful?
- Straight-forward to extract aggregated data from complex files.
- Database scales well.
- It ensures all data is consistently formatted.
Database Schema
schemas=blueprints
Simple Queries
Select * from aadhaar_data;
Limit 20;
Select district subdistrict from aadhaar_data limit 20;
Write your own......
import pandas
import pandasql
def select_first_50(filename):
# Read in our aadhaar_data csv to a pandas dataframe. Afterwards, we rename the columns
# by replacing spaces with underscores and setting all characters to lowercase, so the
# column names more closely resemble columns names one might find in a table.
aadhaar_data = pandas.read_csv(filename)
aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
# Select out the first 50 values for "registrar" and "enrolment_agency"
# in the aadhaar_data table using SQL syntax.
#
# Note that "enrolment_agency" is spelled with one l. Also, the order
# of the select does matter. Make sure you select registrar then enrolment agency
# in your query.
q = """
#YOUR QUERY HERE
SELECT
registrar, enrolment_agency
FROM
aadhaar_data
LIMIT 50;
"""
#Execute your SQL command against the pandas frame
aadhaar_solution = pandasql.sqldf(q.lower(), locals())
return aadhaar_solution
Complex Queries
Select * from aadhaar_data
where state='Gujarat';
Select district, sum(aadhaar_generated)
from aadhaar_data group by district;
Select district, subdistrict sum(aadhaar_generated)
from aadhaar_data where age>60 group by district, subdistrict;
Write your own....
import pandas
import pandasql
def aggregate_query(filename):
# Read in our aadhaar_data csv to a pandas dataframe. Afterwards, we rename the columns
# by replacing spaces with underscores and setting all characters to lowercase, so the
# column names more closely resemble columns names one might find in a table.
aadhaar_data = pandas.read_csv(filename)
aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
# Write a query that will select from the aadhaar_data table how many men and how
# many women over the age of 50 have had aadhaar generated for them in each district
#
# Note that in this quiz, the SQL query keywords are case sensitive.
# For example, if you want to do a sum make sure you type 'sum' rather than 'SUM'.
#
# The possible columns to select from aadhaar data are:
# 1) registrar
# 2) enrolment_agency
# 3) state
# 4) district
# 5) sub_district
# 6) pin_code
# 7) gender
# 8) age
# 9) aadhaar_generated
# 10) enrolment_rejected
# 11) residents_providing_email,
# 12) residents_providing_mobile_number
#
# You can download a copy of the aadhaar data that we are passing
# into this exercise below:
# https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv
q = """# your code here
select gender, district, sum(aadhaar_generated)
from
aadhaar_data
where
age > 50
group by
gender, district;
"""
# Execute your SQL command against the pandas frame
aadhaar_solution = pandasql.sqldf(q.lower(), locals())
return aadhaar_solution
APIs
Files, Databases, APIs
Application Programming Interface
How to Access API Efficiently
API Exercise:
import json
import requests
def api_get_request(url):
# In this exercise, you want to call the last.fm API to get a list of the
# top artists in Spain.
#
# Once you've done this, return the name of the number 1 top artist in Spain.
url = 'http://ws.audioscrobbler.com/2.0/?method=geo.gettopartists&country=spain&api_key=0a5de916777ad83d6de29347308556a1&format=json'
data = requests.get(url).text
data = json.loads(data)
print data['topartists']['artist'][0]['name']
Sanity Checking Data
Missing Values - why are they missing?
Does data make sense?
Is there a problem?
Does the data look like I expect it to?
baseball=pandas.read_csv('../data/master.csv')
baseball.describe()
Count, Mean, Standard Deviation, Minimum, Maximum , 25%, 50%, 75%
1. Occasionally system error prevent data from being recorded.
2. Some subset of subjects or event types are systematically missing certain data attributes, or missing entirely.
Occasional failure, nonresponse
Dealing with missing data
- Partial Deletion
(1) Listwise deletion
(2) Pairwise deletion
- Imputation
why impute?
- Not much data
- removing data could affect representatives
- many different imputation techniques
- imputation is really hard to get right
- how to deal with
(1) use mean value, good- doesn't change mean across sample; bad - lessens correlations between variables.
(2) impute using linear regression,
data we have ->make an equation to predict variable missing -> predict missing values
Drawbacks: Over emphasize trends; Exact value suggest too much certainty
Imputaton Exercise:
from pandas import *
import numpy
def imputation(filename):
# Pandas dataframes have a method called 'fillna(value)', such that you can
# pass in a single value to replace any NAs in a dataframe or series. You
# can call it like this:
# dataframe['column'] = dataframe['column'].fillna(value)
#
# Using the numpy.mean function, which calculates the mean of a numpy
# array, impute any missing values in our Lahman baseball
# data sets 'weight' column by setting them equal to the average weight.
#
# You can access the 'weight' colum in the baseball data frame by
# calling baseball['weight']
#baseball = pandas.read_csv(filename)
#YOUR CODE GOES HERE
baseball=pandas.read_csv('../data/Master.csv')
baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))
print numpy.sum(baseball['weight']), numpy.mean(baseball['weight'])
return baseball
Imputation:
- just the tip of the iceberg; more sophisticated methods exist;
- fill in mean and
- linear regression , both simple and relatively effective;
- both have negative side effects and can obscure amplify the trends in data;
Assignment #2
- Acquire weather data via weather underground API
- Get a sense of data using SQL queries
- Clean and process the data
Project 1 Titanic Data
Simple Heuristic
import numpy
import pandas
import statsmodels.api as sm
def simple_heuristic(file_path):
'''
In this exercise, we will perform some rudimentary practices similar to those of
an actual data scientist.
Part of a data scientist's job is to use her or his intuition and insight to
write algorithms and heuristics. A data scientist also creates mathematical models
to make predictions based on some attributes from the data that they are examining.
We would like for you to take your knowledge and intuition about the Titanic
and its passengers' attributes to predict whether or not the passengers survived
or perished. You can read more about the Titanic and specifics about this dataset at:
http://en.wikipedia.org/wiki/RMS_Titanic
http://www.kaggle.com/c/titanic-gettingStarted
In this exercise and the following ones, you are given a list of Titantic passengers
and their associated information. More information about the data can be seen at the
link below:
http://www.kaggle.com/c/titanic-gettingStarted/data.
For this exercise, you need to write a simple heuristic that will use
the passengers' gender to predict if that person survived the Titanic diaster.
You prediction should be 78% accurate or higher.
Here's a simple heuristic to start off:
1) If the passenger is female, your heuristic should assume that the
passenger survived.
2) If the passenger is male, you heuristic should
assume that the passenger did not survive.
You can access the gender of a passenger via passenger['Sex'].
If the passenger is male, passenger['Sex'] will return a string "male".
If the passenger is female, passenger['Sex'] will return a string "female".
Write your prediction back into the "predictions" dictionary. The
key of the dictionary should be the passenger's id (which can be accessed
via passenger["PassengerId"]) and the associated value should be 1 if the
passenger survied or 0 otherwise.
For example, if a passenger is predicted to have survived:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 1
And if a passenger is predicted to have perished in the disaster:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 0
You can also look at the Titantic data that you will be working with
at the link below:
https://www.dropbox.com/s/r5f9aos8p9ri9sa/titanic_data.csv
'''
predictions = {}
df = pandas.read_csv(file_path)
for passenger_index, passenger in df.iterrows():
#
# Your code here
#
return predictions
import numpy
import pandas
import statsmodels.api as sm
def simple_heuristic(file_path):
'''
In this exercise, we will perform some rudimentary practices similar to those of
an actual data scientist.
Part of a data scientist's job is to use her or his intuition and insight to
write algorithms and heuristics. A data scientist also creates mathematical models
to make predictions based on some attributes from the data that they are examining.
We would like for you to take your knowledge and intuition about the Titanic
and its passengers' attributes to predict whether or not the passengers survived
or perished. You can read more about the Titanic and specifics about this dataset at:
http://en.wikipedia.org/wiki/RMS_Titanic
http://www.kaggle.com/c/titanic-gettingStarted
In this exercise and the following ones, you are given a list of Titantic passengers
and their associated information. More information about the data can be seen at the
link below:
http://www.kaggle.com/c/titanic-gettingStarted/data.
For this exercise, you need to write a simple heuristic that will use
the passengers' gender to predict if that person survived the Titanic diaster.
You prediction should be 78% accurate or higher.
Here's a simple heuristic to start off:
1) If the passenger is female, your heuristic should assume that the
passenger survived.
2) If the passenger is male, you heuristic should
assume that the passenger did not survive.
You can access the gender of a passenger via passenger['Sex'].
If the passenger is male, passenger['Sex'] will return a string "male".
If the passenger is female, passenger['Sex'] will return a string "female".
Write your prediction back into the "predictions" dictionary. The
key of the dictionary should be the passenger's id (which can be accessed
via passenger["PassengerId"]) and the associated value should be 1 if the
passenger survied or 0 otherwise.
For example, if a passenger is predicted to have survived:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 1
And if a passenger is predicted to have perished in the disaster:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 0
You can also look at the Titantic data that you will be working with
at the link below:
https://www.dropbox.com/s/r5f9aos8p9ri9sa/titanic_data.csv
'''
predictions = {}
df = pandas.read_csv(file_path)
for passenger_index, passenger in df.iterrows():
#
# Your code here
#
return predictions
Tuesday, June 10, 2014
Lesson 1 Introduction
What is data scientist? What do you think data scientist do in their day-to-day work?
A: A data scientists day to day job may vary from company to company and from industry to industry. There is really no right or wrong answer.
Hacking skills, Math & Statistics Knowledge, Substantive Expertise. Data Science is the center of three fields.
Basic Data Scientist skills: What does it mean a data scientist to have "substantive expertise" and why is it important?
A: It is important for data scientists to have substantive expertise so they know what questions to ask and investigate, as well as how to interpret the data under the right context.
How can we solve real world problems with data science?
A: Data science can solve problems you'll expect: Netflix, Social Media, Web Apps (OkCupid, Uber, etc.) And a ton more you might not expect: Bioinformatics, Urban Planning, Astrophysics, Public Health.
Pandas: R + Python = Pandas, 1. useful for manipulating data, 2. R-lite elements in Python.
Dataframes: name, age, fare, survived.
Create a new dataframe
Assignment: Titanic, predict who will survive?
Class Project: Analyzing subway ridership, weather underground data
A: A data scientists day to day job may vary from company to company and from industry to industry. There is really no right or wrong answer.
Hacking skills, Math & Statistics Knowledge, Substantive Expertise. Data Science is the center of three fields.
Basic Data Scientist skills: What does it mean a data scientist to have "substantive expertise" and why is it important?
A: It is important for data scientists to have substantive expertise so they know what questions to ask and investigate, as well as how to interpret the data under the right context.
How can we solve real world problems with data science?
A: Data science can solve problems you'll expect: Netflix, Social Media, Web Apps (OkCupid, Uber, etc.) And a ton more you might not expect: Bioinformatics, Urban Planning, Astrophysics, Public Health.
Pandas: R + Python = Pandas, 1. useful for manipulating data, 2. R-lite elements in Python.
Dataframes: name, age, fare, survived.
Create a new dataframe
Assignment: Titanic, predict who will survive?
Class Project: Analyzing subway ridership, weather underground data
Subscribe to:
Posts (Atom)