Monday, September 1, 2014

Google Analytics Platform Principles

Instructor Justin Cutroni introduces the topics covered in the Google Analytics Platform Principles course

Unit 1 - Platform Fundamentals
Course overview

The platform components
Collection
For example, you’ll use the JavaScript tracking code to collect data from a website, but a Software Development Kit, called an SDK, to collect data from a mobile app.

Processing and Configuration
You can also configure Google Analytics to import data directly into your reports from other Google products, like Google AdWords, Google AdSense and Google Webmaster Tools. You can even configure Google Analytics to import data from non-Google sources, like your own internal data.

Reporting
Using the API you can build your own reporting tools or extract your data directly into third-party reporting tools.

The data model
Users: visitors
Sessions
Interactions: Individual interactions are called hits.

Each interaction that Google Analytics tracks belongs to a session, and each session is associated with a user.
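To make the hierarchy concrete, here is a small illustrative sketch (my own, not Google Analytics' internal schema) of how hits nest inside sessions and sessions inside a user; all field names and values are hypothetical:

# Illustrative only: hypothetical structure showing hits -> sessions -> user.
user = {
    'client_id': '555.12345',          # hypothetical anonymous ID stored on the device
    'sessions': [
        {
            'started_at': '2014-09-01 10:00:00',
            'hits': [
                {'type': 'pageview', 'page': '/home'},
                {'type': 'event', 'category': 'video', 'action': 'play'},
            ],
        },
    ],
}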


Unit 2 - Collection
Data collection overview
  • it’s the tracking code that gathers and sends the data back to your account for reporting
  • Using hits to collect and send data
  • In the example, everything after the question mark is called a parameter. Each parameter carries a piece of information back to Google’s analytic servers.
  • utmul=, which shows the language that the user’s browser is set to.
  • In addition to creating hits, the tracking code also performs another critical function. It identifies new users and returning users. 
<!-- Google Analytics -->

<script>

(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');

ga('create', 'UA-12345-6', 'auto');
ga('send', 'pageview');

</script>

<!-- End Google Analytics -->
For your own account, you can find the tracking ID in the account administrative settings.

Website data collection
Understand how the Google Analytics JavaScript website tracking code collects data.

To track data from a website, Google Analytics provides a standard snippet of JavaScript tracking code. This snippet references a JavaScript library called analytics.js that controls what data is collected.

  • Adding the Google Analytics JavaScript code to your website
  1. Simply add the standard code snippet before the closing </head> tag in the HTML of every web page you want to track.
  • Functions of the web tracking code
  1. The Google Analytics tracking code executes JavaScript asynchronously, meaning that the JavaScript runs in the background while the browser performs other tasks.


Mobile app data collection
  • Using the Google Analytics mobile SDKs
  1. Instead of using JavaScript to collect data like you do on a website, you'll use an SDK, or Software Development Kit, to collect data from your mobile app.
  • Dispatching
  1. When a user navigates through an app, the Google Analytics SDK stores the hits locally on the device and then sends them to your Google Analytics account later in a batch process called dispatching.
  2. For these reasons, the SDKs automatically dispatch hits every 30 minutes for Android devices and every two minutes for iOS devices.
The mobile SDKs provide a simple way to track user activity from an app, and collect most of the data you’ll need without any customization. But keep in mind, there are many ways to modify your code to collect additional information about your users, their sessions and their interactions with your app.


Measurement Protocol data collection
But what if you want to collect data from some other kind of device? For example, you might want to track a point-of-sale system or user activity on a kiosk.

  • Collecting and sending data with the Measurement Protocol
  1. For instance, let’s say we want to collect data from a kiosk. Here’s a sample hit that will track when a user views a screen on the kiosk:
  2. http://www.google-analytics.com/collect?v=1&tid=UA-XXXX-Y&cid=555&sr=800x600&t=pageview&dh=mydemo.com&dp=/home&dt=homepage
  3. Notice there is a parameter in the hit that contains the screen resolution. This particular parameter will become the dimension Screen Resolution during processing. 

With the Measurement Protocol, you can use Google Analytics to collect data from any kind of device.
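As an illustration, here is a minimal Python sketch that sends a pageview hit like the one above to the Measurement Protocol endpoint. It assumes the third-party requests package is installed; the tid and cid values are placeholders you would replace with your own:

import requests  # assumption: the requests package is available

payload = {
    'v': 1,                 # protocol version
    'tid': 'UA-XXXX-Y',     # tracking ID (placeholder)
    'cid': '555',           # anonymous client ID (placeholder)
    't': 'pageview',        # hit type
    'dh': 'mydemo.com',     # document hostname
    'dp': '/home',          # document path
    'dt': 'homepage',       # document title
    'sr': '800x600',        # screen resolution
}
response = requests.post('http://www.google-analytics.com/collect', data=payload)
# Note: the collect endpoint answers with a 2xx status whether or not the hit is valid.
print response.status_code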


Unit 3 - Processing & Configuration
Processing & configuration overview

Processing data and applying your configuration settings
  1. Google Analytics organizes the hits you’ve collected into users and sessions.
  2. data from other sources can be joined with data collected via the tracking code. For example, you can configure Google Analytics to import data from Google AdWords, Google AdSense or Google Webmaster Tools. 
  3. Google Analytics processing will modify your data according to any configuration rules you’ve added. 
  4. the data goes through a process called “aggregation.” 
Understanding how Google Analytics transforms raw data during processing, and how your configuration settings can control what happens during processing, will help you better interpret and manage the data in your reports.

Processing hits into sessions & users
  • How hits are organized by users
  1. Google Analytics creates a random, unique ID that is associated with the device. Each unique ID is considered to be a unique user in Google Analytics.
  2. It’s possible for these IDs to get reset or erased. This happens if a user clears their cookies in a web browser, or uninstalls and then reinstalls a mobile app.
  • How hits are organized into sessions
  1. A session in Google Analytics is a collection of interactions, or hits, from a specific user during a defined period of time.
  2. By default, a session ends after 30 minutes of inactivity. 
Users and sessions are a critical part of the digital analytics data model. All of the reports you see in Google Analytics depend on this model to organize the data.
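As a rough illustration of the 30-minute inactivity rule, here is a small Python sketch (my own simplification, not GA's actual processing code) that groups one user's hit timestamps into sessions:

from datetime import datetime, timedelta

# Hypothetical hit timestamps for a single user, in chronological order.
hits = [
    datetime(2014, 9, 1, 10, 0),
    datetime(2014, 9, 1, 10, 10),
    datetime(2014, 9, 1, 11, 5),   # more than 30 minutes after the previous hit
    datetime(2014, 9, 1, 11, 20),
]

timeout = timedelta(minutes=30)
sessions = []

for hit in hits:
    # Start a new session if there is no open session or the gap exceeds the timeout.
    if not sessions or hit - sessions[-1][-1] > timeout:
        sessions.append([hit])
    else:
        sessions[-1].append(hit)

print len(sessions)  # 2 sessions under the default 30-minute timeout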

Importing data into Google Analytics
The most common way to get data into Google Analytics is through your tracking code, but you can also add data from other sources.

  • Importing data into Google Analytics
  • Account linking
You can link various Google products directly to Google Analytics via your account settings. This includes:
  1. Google AdWords
  2. Google AdSense
  3. Google Webmaster Tools
  • Data Import
There are two ways to import data into Google Analytics:

(1). Dimension Widening
  1. if you’re a publisher you might want to segment your data based on the author and topic of your online articles. 
  2. With Dimension Widening, you could import author and topic as new dimensions for your content pages. You could use each article’s page URL as the “key” that links the new data to your existing Google Analytics data. 
  3. You can add data using Dimension Widening either by uploading a file or by using the Google Analytics APIs. Uploading a file, like a spreadsheet or .CSV, is easy, but it can be time consuming if you need to add data often. To save time, you can build a program that uses the APIs to automatically send data into Google Analytics on a regular basis. A hypothetical upload file is sketched below.
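For illustration only, a Dimension Widening upload keyed on page URL might look something like this (the column names are hypothetical placeholders, not the exact headers your property expects):

key_page_url,author,topic
/articles/mapreduce-intro,Jane Smith,Big Data
/articles/ggplot-basics,John Doe,Visualization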

(2). Cost Data Import
  1. You use this feature specifically to add data that shows the amount of money you spent on your non-Google advertising. Importing cost data lets Google Analytics calculate the return-on-investment of your non-Google ads.

Although you’ll collect most of your data using the tracking code, account linking and data import are two powerful ways to add more context to your Google Analytics data. By choosing the right data sources to link or import into Google Analytics, you can better measure the performance of your business.


Transforming & aggregating data
  • The role of configuration settings during processing
  1. including data, excluding data, or modifying how data appears in a reporting View
  2. Filters, Goals, and Grouping.
  • Common configuration settings: Filters
  1. Filters provide a flexible way to modify the data within each view.
  2. For example, create a filter to exclude traffic from a particular IP address or to convert messy page URLs into readable text.
  • Common configuration settings: Goals
  1. Goals let you specify which pageviews, screen views or other hits should be used to calculate conversions. 
  • Common configuration settings: Channel Grouping and Content Grouping
  1. aggregate certain pieces of data together so you can analyze the collective performance.
  2. You can create two types of groups in Google Analytics: Channel groups and Content groups.
  3. A Channel Group is a collection of common marketing activities. For example, Display Advertising, Social media, Email marketing, and Paid Search are four common channel groups that are each a roll-up of several marketing activities.
  4. Content Groups are like Channel Groups, except you use them to create and analyze a collection of content. 
  • Data aggregation
  1. During aggregation, Google Analytics creates and organizes your report dimensions into tables, called aggregate tables. 


Unit 4 - Reporting

Reporting overview
  • All Google Analytics reports are based on different combinations of dimensions and metrics. 
  • In addition to the reporting interface, you can use an API, or an Application Programming Interface, to extract your data directly from Google Analytics.
  • Most of the time, when you request data from the reporting interface or the APIs you’ll receive your data almost immediately. But in some cases, where you request complex data, Google Analytics uses a process called sampling.


Building reports with dimensions & metrics
Understand how metrics and dimensions are combined to build reports and which dimensions and metrics can be used together
  • Dimensions in Google Analytics
  1. a dimension of a session is the traffic source that brought the user to your site.
  • Metrics in Google Analytics
  1. Metrics are the quantitative measurements of your data. They count how often things happen, like the total number of users on a website or app. 
  • Combining dimensions and metrics in reports
  1. Most commonly, you’ll see dimensions and metrics reported in a table, with the first column containing the values for one particular dimension, and the rest of the columns displaying the corresponding metrics.
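For example, a hypothetical report table pairing the Source dimension with two metrics might look like this (the values are invented for illustration):

Source        Sessions   Pageviews
google           1,200       3,400
direct             650       1,100
newsletter         240         780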

The reporting APIs

  • To use the reporting APIs, you have to build your own application. This application needs to be able to write and send a query to the reporting API. The API uses the query to retrieve data from the aggregate tables, and then sends a response back to your application containing the data that was requested.
  • the reporting APIs give you the power to automate and streamline complex reporting tasks for your business.
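As a rough sketch (not the official client library), an application could query the Core Reporting API over HTTP from Python. This assumes you already have an OAuth 2.0 access token and a view (profile) ID; all parameter values below are placeholders:

import requests  # assumption: the requests package is available

params = {
    'ids': 'ga:12345678',            # placeholder view (profile) ID
    'start-date': '2014-08-01',
    'end-date': '2014-08-31',
    'metrics': 'ga:sessions,ga:pageviews',
    'dimensions': 'ga:source',
    'access_token': 'ya29.EXAMPLE',  # placeholder OAuth 2.0 access token
}
response = requests.get('https://www.googleapis.com/analytics/v3/data/ga', params=params)
report = response.json()
for row in report.get('rows', []):
    print row  # each row pairs the dimension value with its metric values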


Report sampling
  • When does sampling happen?
  • Google Analytics checks to see how many sessions should be included in your request. If the number of sessions is small enough, Google Analytics can calculate the data for your request using all of the sessions. If the number of sessions is too large, Google Analytics uses a sample to fulfill the request
  • Adjusting the sample size
  • The sampling limit
  • Google Analytics sets a maximum number of sessions that can be used to calculate your reports. If you go over that limit, your data gets sampled.


Final Assessment


Project 5 - MapReduce on NYC Subway and Weather Data

import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():
    """
    The input to this mapper will be the final Subway-MTA dataset, the same as
    in the previous exercise.  You can check out the csv and its structure below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv

    For each line of input, the mapper output should PRINT (not return) the UNIT as
    the key, the number of ENTRIESn_hourly as the value, and separate the key and
    the value by a tab. For example: 'R002\t105105.0'

    Since you are printing the output of your program, printing a debug
    statement will interfere with the operation of the grader. Instead,
    use the logging module, which we've configured to log to a file printed
    when you click "Test Run". For example:
    logging.info("My debugging message")
   
    The logging module can be used to give you more control over your debugging
    or other messages than you can get by printing them. In this exercise, print
    statements from your mapper will go to your reducer, and print statements
    from your reducer will be considered your final output. By contrast, messages
    logged via the loggers we configured will be saved to two files, one
    for the mapper and one for the reducer. If you click "Test Run", then we
    will show the contents of those files once your program has finished running.
    The logging module also has other capabilities; see
    https://docs.python.org/2/library/logging.html for more information.
    """

    for line in sys.stdin:
        # your code here


mapper()



import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    '''
    Given the output of the mapper for this exercise, the reducer should PRINT 
    (not return) one line per UNIT along with the total number of ENTRIESn_hourly 
    over the course of May (which is the duration of our data), separated by a tab.
    An example output row from the reducer might look like this: 'R001\t500625.0'

    You can assume that the input to the reducer is sorted such that all rows
    corresponding to a particular UNIT are grouped together.

    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    '''

    for line in sys.stdin:
        # your code here

        
reducer()

Lesson 5 - MapReduce

Big Data and MapReduce
Catalog and index all books in the world.

Scenarios for MapReduce
In which of the situations below do you think MapReduce may have been used?
1. discover new oil resources.
2. power an e-commerce website.
3. identify malware and cyber attack patterns for online security.
4. help doctors answer questions about patients' health.

Basics of MapReduce

Counting words serially

import logging
import sys
import string

from util import logfile

logging.basicConfig(filename=logfile, format='%(message)s',
                   level=logging.INFO, filemode='w')


def word_count():
    # For this exercise, write a program that serially counts the number of occurrences
    # of each word in the book Alice in Wonderland.
    #
    # The text of Alice in Wonderland will be fed into your program line-by-line.
    # Your program needs to take each line and do the following:
    # 1) Tokenize the line into string tokens by whitespace
    #    Example: "Hello, World!" should be converted into "Hello," and "World!"
    #    (This part has been done for you.)
    #
    # 2) Remove all punctuation
    #    Example: "Hello," and "World!" should be converted into "Hello" and "World"
    #
    # 3) Make all letters lowercase
    #    Example: "Hello" and "World" should be converted to "hello" and "world"
    #
    # Store the number of times that a word appears in Alice in Wonderland
    # in the word_counts dictionary, and then *print* (don't return) that dictionary
    #
    # In this exercise, print statements will be considered your final output. Because
    # of this, printing a debug statement will cause the grader to break. Instead,
    # you can use the logging module which we've configured for you.
    #
    # For example:
    # logging.info("My debugging message")
    #
    # The logging module can be used to give you more control over your
    # debugging or other messages than you can get by printing them. Messages
    # logged via the logger we configured will be saved to a
    # file. If you click "Test Run", then you will see the contents of that file
    # once your program has finished running.
    #
    # The logging module also has other capabilities; see
    # https://docs.python.org/2/library/logging.html
    # for more information.

    word_counts = {}

    for line in sys.stdin:
        data = line.strip().split(" ")

        for i in data:
            # Remove punctuation and lowercase each token (Python 2 str.translate)
            key = i.translate(string.maketrans("", ""), string.punctuation).lower()
            if key in word_counts:
                word_counts[key] += 1
            else:
                word_counts[key] = 1

    print word_counts

word_count()

Counting words in MapReduce

Mapper stage

Reducer stage
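The lesson notes above don't spell the two stages out, so here is a minimal Hadoop-streaming-style sketch of what a word-count mapper and reducer could look like in Python. It mirrors the pattern used in the Aadhaar exercise below; in practice the two functions would live in separate mapper and reducer scripts, with the reducer's input sorted by key:

import sys
import string

def mapper():
    # Emit "word<tab>1" for every token read from stdin.
    for line in sys.stdin:
        for token in line.strip().split():
            word = token.translate(string.maketrans("", ""), string.punctuation).lower()
            if word:
                print "{0}\t{1}".format(word, 1)

def reducer():
    # Sum the counts for each word; assumes the input is sorted by word.
    word_count = 0
    old_word = None
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        this_word, count = data
        if old_word and old_word != this_word:
            print "{0}\t{1}".format(old_word, word_count)
            word_count = 0
        old_word = this_word
        word_count += int(count)
    if old_word is not None:
        print "{0}\t{1}".format(old_word, word_count)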

Using MapReduce with Aadhaar Data

import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():

    #Also make sure to fill out the reducer code before clicking "Test Run" or "Submit".

    #Each line will be a comma-separated list of values. The
    #header row WILL be included. Tokenize each row using the
    #commas, and emit (i.e. print) a key-value pair containing the
    #district (not state) and Aadhaar generated, separated by a tab.
    #Skip rows without the correct number of tokens and also skip
    #the header row.

    #You can see a copy of the input Aadhaar data
    #in the link below:
    #https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv

    #Since you are printing the output of your program, printing a debug
    #statement will interfere with the operation of the grader. Instead,
    #use the logging module, which we've configured to log to a file printed
    #when you click "Test Run". For example:
    #logging.info("My debugging message")

    for line in sys.stdin:
        # Tokenize each comma-separated row
        data = line.strip().split(",")
        # Skip rows without the correct number of tokens and skip the header row
        if len(data) != 12 or data[0] == 'Registrar':
            continue
        # Emit district and Aadhaar generated, separated by a tab
        print "{0}\t{1}".format(data[3], data[8])

mapper()



import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    
    #Also make sure to fill out the mapper code before clicking "Test Run" or "Submit".

    #Each line will be a key-value pair separated by a tab character.
    #Print out each key once, along with the total number of Aadhaar 
    #generated, separated by a tab. Make sure each key-value pair is 
    #formatted correctly! Here's a sample final key-value pair: 'Gujarat\t5.0'

    #Since you are printing the output of your program, printing a debug 
    #statement will interfere with the operation of the grader. Instead, 
    #use the logging module, which we've configured to log to a file printed 
    #when you click "Test Run". For example:
    #logging.info("My debugging message")
        
    aadhaar_generated = 0
    old_key = None

    for line in sys.stdin:
        # Each mapper output line is "district<tab>aadhaar_generated"
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        this_key, count = data
        # When the key changes, emit the running total for the previous key
        if old_key and old_key != this_key:
            print "{0}\t{1}".format(old_key, aadhaar_generated)
            aadhaar_generated = 0
        old_key = this_key
        aadhaar_generated += float(count)

    # Emit the total for the final key
    if old_key is not None:
        print "{0}\t{1}".format(old_key, aadhaar_generated)


reducer()


More Complex MapReduce
How do we do more complex things with MapReduce? So far we have only been:
- counting words
- aggregating Aadhaar generated

MapReduce Ecosystem
MapReduce Programming Model
Hadoop:
Hive - Facebook
Pig - Yahoo


Using MapReduce with Subway Data








Project 4 Visualize NYC Subway and Weather Data

from pandas import *
from ggplot import *

def plot_weather_data(turnstile_weather):
    '''
    plot_weather_data is passed a dataframe called turnstile_weather.
    Use turnstile_weather along with ggplot to make a data visualization
    focused on the MTA and weather data we used in Project 3.
   
    You should feel free to implement something that we discussed in class
    (e.g., scatterplots, line plots, or histograms) or attempt to implement
    something more advanced if you'd like.

    Here are some suggestions for things to investigate and illustrate:
     * Ridership by time-of-day or day-of-week
     * How ridership varies by subway station
     * Which stations have more exits or entries at different times of day

    If you'd like to learn more about ggplot and its capabilities, take
    a look at the documentation at:
    https://pypi.python.org/pypi/ggplot/
   
    You can check out the link
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    to see all the columns and data points included in the turnstile_weather
    dataframe.
   
    However, due to the limitation of our Amazon EC2 server, we will give you only
    about 1/3 of the actual data in the turnstile_weather dataframe.
    '''

    plot = # your code here
    return plot

Lesson 4 - Data Visualization

Show your findings in a visualization, in Python.

What is information visualization?
1. Effective communication of complex quantitative ideas.
 - clarity, precision, efficiency.
2. Helps you notice things about data (correlations, trends, etc.) that might go unnoticed.
3. Can highlight aspects of data, or "tell a story".

Napoleon's March on Russia.

What information is depicted in this visualization?
 size of army, location of army, direction of army, temperature on various dates during the retreat.

Components of effective visualization:
1. visual cues
2. coordinate systems
3. scale / data types
4. context

Mathematical and statistical rigor. Tell a story.

Visual Encoding (cues)
Position, Length, Angle, Direction, Shape, Area/Volume, Color (hue, saturation; when combining, limit the number of hues).

Perception of visual cues (most to least accurate): 1. position, 2. angle, 3. area, 4. saturation.
1985 AT&T Lab paper on graphical perception:
position, length, angle, direction, area, volume, saturation, hue.
Hue and saturation are the least accurate.

Plotting in Python:
Many packages:
1. Matplotlib
2. ggplot

We use ggplot. Why? It looks nicer and follows the grammar of graphics.
First step: create plot.
Second step: represent data with geometric objects.
Third step: add labels.
print ggplot(data, aes(xvar, yvar)) + geom_point(color='coral') + geom_line(color='coral') + ggtitle('title') + xlab('x-lab') + ylab('y-lab')


from pandas import *
from ggplot import *

import pandas

def lineplot(hr_year_csv):
    # A csv file will be passed in as an argument which
    # contains two columns -- 'HR' (the number of homerun hits)
    # and 'yearID' (the year in which the homeruns were hit).
    #
    # Fill out the body of this function, lineplot, to use the
    # passed-in csv file, hr_year.csv, and create a
    # chart with points connected by lines, both colored 'red',
    # showing the number of HR by year.
    #
    # You will want to first load the csv file into a pandas dataframe
    # and use the pandas dataframe along with ggplot to create your visualization
    #
    # You can check out the data in the csv file at the link below:
    # https://www.dropbox.com/s/awgdal71hc1u06d/hr_year.csv
    #
    # You can read more about ggplot at the following link:
    # https://github.com/yhat/ggplot/
 
 
    # Load the csv into a pandas dataframe, then build the ggplot chart
    hr_year = pandas.read_csv('hr_year.csv')
    gg = ggplot(hr_year, aes('yearID', 'HR')) + \
        geom_point(color='red') + geom_line(color='red') + \
        ggtitle('Total HRs by Year') + xlab('Year') + ylab('HR')
    print gg

    return gg


Data Type - Numeric Data
a measurement (e.g. height, weight) or a count (e.g. HR or hits)
discrete and continuous:
discrete - only certain values, i.e. whole numbers (10, 34, 25)
continuous - any number in a range (e.g. 0.25, 0.357, 0.511 batting average)


Data Type - Categorical Data
represent characteristic data (e.g. position, team, hometown, handedness)
can take numerical values but they don't have mathematical meaning.
ordinal data - categories with some order or ranking (e.g. Very Low, Low, High, Very High).


Data Type - Time Series Data
Data collected via repeated measurements over time.
Example: Average HR/player over many years.


Scales
Categorical
HR vs. Position
HR vs. Months/Years

Improper use of scales
When used incorrectly, scales can mislead or confuse readers.


Plotting Line Chart
from pandas import *
from ggplot import *

import pandas

def lineplot_compare(hr_by_team_year_sf_la_csv):
    # Write a function, lineplot_compare, that will read a csv file
    # called hr_by_team_year_sf_la.csv and plot it using pandas and ggplot2.
    #
    # This csv file has three columns: yearID, HR, and teamID. The data in the
    # file gives the total number of home runs hit each year by the SF Giants
    # (teamID == 'SFN') and the LA Dodgers (teamID == "LAN"). Produce a
    # visualization comparing the total home runs by year of the two teams.
    #
    # You can see the data in hr_by_team_year_sf_la_csv
    # at the link below:
    # https://www.dropbox.com/s/wn43cngo2wdle2b/hr_by_team_year_sf_la.csv
    #
    # Note that to differentiate between multiple categories on the
    # same plot in ggplot, we can pass color in with the other arguments
    # to aes, rather than in our geometry functions. For example,
    # ggplot(data, aes(xvar, yvar, color=category_var)). This should help you
    # in this exercise.
 
    # Load the csv; color by teamID to compare the two teams on the same plot
    hr_year = pandas.read_csv('hr_by_team_year_sf_la.csv')
    gg = ggplot(hr_year, aes('yearID', 'HR', color='teamID')) + \
        geom_point() + geom_line() + \
        ggtitle('Total HRs by Year') + xlab('Year') + ylab('HR')
    print gg

    return gg


if __name__ == '__main__':
    lineplot_compare()


Visualizing Time Series Data
- Format for MTA + Weather Data
   - Scatter Plot vs. Additional Bells & Whistle (e.g. LOESS curve)
- Boston Red Sox Winning Percentage, 1960 - 2010

Scatter Plot

Line Chart
- mitigates some shortcomings of the scatter plot
- emphasize the trends
- focus on the year to year variability, not overall trends.

LOESS Curve
- emphasize long term trends
- LOESS weighted regression
- easier to take a quick look at the chart and understand the big picture.

Multivariate
- how to incorporate more variables?
- use an additional encoding,
   -size,
   - color/saturation


Google "data visualization" to find more blogs and materials.






Monday, June 23, 2014

Project 3 Analyzing NYC Subway and Weather Data

Exercise 1 - Exploratory Data Analysis

import numpy as np
import pandas
import matplotlib.pyplot as plt
def entries_histogram(turnstile_weather):
    '''
    Before we perform any analysis, it might be useful to take a
    look at the data we're hoping to analyze. More specifically, let's
    examine the hourly entries in our NYC subway data and determine what
    distribution the data follows. This data is stored in a dataframe
    called turnstile_weather under the ['ENTRIESn_hourly'] column.
   
    Let's plot two histograms on the same axes to show hourly
    entries when raining vs. when not raining. Here's an example on how
    to plot histograms with pandas and matplotlib:
    turnstile_weather['column_to_graph'].hist()
   
    Your histogram may look similar to the bar graph in the instructor notes below.
   
    You can read a bit about using matplotlib and pandas to plot histograms here:
    http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
   
    You can see the information contained within the turnstile weather data here:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
   
    plt.figure()
    turnstile_weather['...'] # your code here to plot a histogram for hourly entries when it is raining
    turnstile_weather['...'] # your code here to plot a histogram for hourly entries when it is not raining
    return plt


Exercise 2 - Welch's t-test?

Does the entries data from the previous exercise seem normally distributed? No.

Can we run Welch's t-test on the entries data? Yes.

Correct. Good job!


Exercise 3 - Mann Whitney U Test

import numpy as np
import scipy
import scipy.stats
import pandas
def mann_whitney_plus_means(turnstile_weather):
    '''
    This function will consume the turnstile_weather dataframe containing
    our final turnstile weather data.
   
    You will want to take the means and run the Mann Whitney U test on the
    ENTRIESn_hourly column in the turnstile_weather dataframe.
   
    This function should return:
        1) the mean of entries with rain
        2) the mean of entries without rain
        3) the Mann-Whitney U statistic and p-value comparing number of entries
           with rain and the number of entries without rain. 
   
    You should feel free to use scipy's Mann-Whitney implementation, and
    also might find it useful to use numpy's mean function. 
   
    Here are some documentations:
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
   
    You can look at the final turnstile weather data at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
   
   
    return with_rain_mean, without_rain_mean, U, p


Exercise 4 Rainy Day Ridership vs. Non Rainy Ridership

Is the distribution of the number of entries statistically different between rainy and non rainy days? Yes.

Describe your results and methods. Rainy days have more ridership.

Correct. Good job!


Exercise 5 Linear Regression

import numpy as np
import pandas
def normalize_features(array):
   """
   Normalize the features in our data set.
   """
   array_normalized = (array-array.mean())/array.std()
   mu = array.mean()
   sigma = array.std()
   return array_normalized, mu, sigma
def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, and the values for our thetas.
   
    This should be the same code as the compute_cost function in the lesson #3 exercises. But
    feel free to implement your own.
    """
   
    # your code here
    return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
   
    This is the same gradient descent code as in the lesson #3 exercises. But feel free
    to implement your own.
    """
    m = len(values)
    cost_history = []
    for i in range(num_iterations):
        # your code here
    return theta, pandas.Series(cost_history)
def predictions(dataframe):
    '''
    The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
    Using the information stored in the dataframe, lets predict the ridership of
    the NYC subway using linear regression with gradient descent.
   
    You can look at information contained in the turnstile weather dataframe
    at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv   
   
    Your prediction should have a R^2 value of .40 or better.
   
    Note: due to the memory and CPU limitation of our amazon EC2 instance, we will
    give you a random subset (~15%) of the data contained in turnstile_data_master_with_weather.csv
   
    If you receive a "server has encountered an error" message, that means you are hitting
    the 30 second  limit that's placed on running your program. Try using a smaller number
    for num_iterations if that's the case.
   
    Or if you are using your own algorithm/models, see if you can optimize your code so it
    runs faster.
    '''
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = dataframe[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
    values = dataframe[['ENTRIESn_hourly']]
    m = len(values)
    features, mu, sigma = normalize_features(features)
    features['ones'] = np.ones(m)
    features_array = np.array(features)
    values_array = np.array(values).flatten()
    #Set values for alpha, number of iterations.
    alpha = 0.1 # please feel free to play with this value
    num_iterations = 75 # please feel free to play with this value
    #Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array, values_array, theta_gradient_descent,
                                                            alpha, num_iterations)
    predictions = np.dot(features_array, theta_gradient_descent)
    return predictions


Exercise 6 Plotting Residuals

import numpy as np
import scipy
import matplotlib.pyplot as plt
def plot_residuals(turnstile_weather, predictions):
    '''
    Using the same methods that we used to plot a histogram of entries
    per hour for our data, why don't you make a histogram of the residuals
    (that is, the difference between the original hourly entry data and the predicted values).
    Based on this residual histogram, do you have any insight into how our model
    performed?  Reading a bit on this webpage might be useful:
    http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
    '''
   
    plt.figure()
    (turnstile_weather['''choose the right column here'''] - predictions).hist()
    return plt


Exercise 7 Compute R Squared

import numpy as np
import scipy
import matplotlib.pyplot as plt
import sys
def compute_r_squared(data, predictions):
    '''
    In exercise 5, we calculated the R^2 value for you. But why don't you try and
    and calculate the R^2 value yourself.
   
    Given a list of original data points, and also a list of predicted data points,
    write a function that will compute and return the coefficient of determination (R^2)
    for this data.  numpy.mean() and numpy.sum() might both be useful here, but
    not necessary.
    Documentation about numpy.mean() and numpy.sum() below:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
    '''
   
    # your code here
   
    return r_squared


Exercise 8 Non Gradient Descent Linear Regression

import numpy as np
import pandas
import scipy
import statsmodels
"""
In this optional exercise, you should complete the function called
predictions(turnstile_weather). This function takes in our pandas
turnstile weather dataframe, and returns a set of predicted ridership values,
based on the other information in the dataframe. 
You should attempt to implement another type of linear regression,
that you may have read about, such as ordinary least squares regression:
http://en.wikipedia.org/wiki/Ordinary_least_squares
This is your playground. Go wild!
How does your choice of linear regression compare to linear regression
with gradient descent?
You can look at the information contained in the turnstile_weather dataframe below:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
Note: due to the memory and CPU limitation of our amazon EC2 instance, we will
give you a random subset (~15%) of the data contained in turnstile_data_master_with_weather.csv
If you receive a "server has encountered an error" message, that means you are hitting
the 30 second limit that's placed on running your program. See if you can optimize your code so it
runs faster.
"""
def predictions(weather_turnstile):
    #
    # Your implementation goes here. Feel free to write additional
    # helper functions
    #
    return prediction






Friday, June 20, 2014

Lesson 3 Data Analysis

Statistical Rigor
    Significance tests
    - using our data, can we disprove an assumption with a pre-defined level of confidence?

Why is statistics useful?
- They provide a formalized framework for comparing and evaluating data.
- They enable us to evaluate whether effects we perceive in our dataset reflect differences across the whole population.

Statistical Significance Tests
- many tests might make assumptions about data's distribution.
- very common distribution - normal distribution (aka Gaussian distribution, bell curve)

t-test
accept or reject a null hypothesis.
Null hypothesis: a statement we are trying to disprove by running our test.
- two samples came from the same population
- a sample is drawn from a probability distribution
specified in terms of a test statistic

Test Statistic: one number that helps accept or reject the null hypothesis
t test - t

A few different versions depending on assumptions
- equal sample size?
- same variance?

Two sample test

calculate t, calculate nu, calculate p;
p-value: the probability of obtaining a test statistic at least as extreme as ours if the null hypothesis were true.
set p-critical; if p < p-critical, reject the null hypothesis,
otherwise we cannot reject the null hypothesis.
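For reference, Welch's t statistic and the approximate degrees of freedom nu (from the Welch-Satterthwaite equation) are:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}
\qquad
\nu \approx \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}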

Exercise: Calculate t and Nu

Welch's t-test in Python
Is there a simple way to do this in Python?

import scipy.stats
scipy.stats.ttest_ind(list_1, list_2, equal_var=False)

scipy.stats.ttest_ind performs a two-sided test. How could we use the output to instead perform a one-sided test?
greater: reject the null if p/2 < p_critical and t > 0
less: reject the null if p/2 < p_critical and t < 0
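A small illustrative sketch of that one-sided decision rule in Python (the sample data here is made up):

import scipy.stats

list_1 = [20.1, 22.4, 19.8, 23.0, 21.7]   # hypothetical sample 1
list_2 = [18.2, 19.5, 17.9, 20.1, 18.8]   # hypothetical sample 2
p_critical = 0.05

t, p_two_sided = scipy.stats.ttest_ind(list_1, list_2, equal_var=False)

# One-sided "greater" alternative: mean of list_1 > mean of list_2
if t > 0 and p_two_sided / 2 < p_critical:
    print "reject the null hypothesis in favor of 'greater'"
else:
    print "cannot reject the null hypothesis"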


Exercise Welch's t-test

import numpy
import scipy.stats
import pandas
def compare_averages(filename):
    """
    Performs a t-test on two sets of baseball data (left-handed and right-handed hitters).
    You will be given a csv file that has three columns.  A player's
    name, handedness (L for lefthanded or R for righthanded) and their
    career batting average (called 'avg'). You can look at the csv
    file via the following link:
    https://www.dropbox.com/s/xcn0u2uxm8c4n6l/baseball_data.csv
   
    Write a function that will read that the csv file into a pandas data frame,
    and run Welch's t-test on the two cohorts defined by handedness.
   
    One cohort should be a data frame of right-handed batters. And the other
    cohort should be a data frame of left-handed batters.
   
    We have included the scipy.stats library to help you write
    or implement Welch's t-test:
    http://docs.scipy.org/doc/scipy/reference/stats.html
   
    With a significance level of 95%, if there is no difference
    between the two cohorts, return a tuple consisting of
    True, and then the tuple returned by scipy.stats.ttest. 
   
    If there is a difference, return a tuple consisting of
    False, and then the tuple returned by scipy.stats.ttest.
   
    For example, the tuple that you return may look like:
    (True, (9.93570222, 0.000023))
    """

Non Parametric Tests
A statistical test that does not assume our data is drawn from any particular underlying probability distribution.

Mann-Whitney U test: tests null hypothesis that the two populations are the same.
u, p = scipy.stats.mannwhitneyu(x, y)

Non-normal Data
Shapiro-Wilk test
w, p = scipy.stats.shapiro(data)

Just the tip of the iceberg
Many different statistical methods!
Data scientists have large toolkits.

Predicting Future Data
How might we use the data we've collected to make predictions about data we don't have?

Machine Learning: a branch of artificial intelligence focused on constructing systems that learn from large amounts of data to make predictions.

Why is machine learning useful?

Statistics vs. Machine Learning
What is the difference between statistics and machine learning?
- Statistics is focused on analyzing existing data, and drawing valid conclusions.
- Machine learning is focused on making predictions.

Types of Machine Learning
Data -> Model -> Predictions
Supervised Learning: e.g., detecting spam emails
- Have examples with input and output;
- Predict output for future, input-only data;
- Classification;
- Regression.

Unsupervised Learning:
- Trying to understand structure of data
- Clustering

Kurt's favorite machine learning algorithms: clustering and PCA.

Predicting HR: an example of supervised learning
- can we write an equation that takes a bunch of info (e.g., height, weight, birth year, position, for baseball players) and predicts HR (home runs)? Using regression.

Linear Regression with Gradient Descent
If we have age, height, weight, and batting average, and are trying to predict batting average, what are the input variables? Age, height, and weight.

Gradient Descent
Cost function: J(θ) = 1/(2m) * Σ (h_θ(x_i) - y_i)², and we want to minimize this!

how: alpha, the learning rate (controls how big a step we take down the slope)

Gradient Descent in Python

import numpy
import pandas
def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, and values for our thetas.
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)
    return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """
    # Write some code here that updates the values of theta a number of times equal to
    # num_iterations.  Everytime you have computed the cost for a given set of thetas,
    # you should append it to cost_history.  The function should return both the final
    # values of theta and the cost history.
    # YOUR CODE GOES HERE
    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        # Compute predictions with the current theta, then take one gradient step
        predicted_values = numpy.dot(features, theta)
        theta = theta - alpha / m * numpy.dot((predicted_values - values), features)

        # Track the cost after each update
        cost = compute_cost(features, values, theta)
        cost_history.append(cost)

    return theta, pandas.Series(cost_history)

Good job! Your program and algorithm worked

Theta =
[ 45.35759233  -9.02442042  13.69229668]

Cost History =
0     3748.133469
1     3727.492258
2     3707.261946
3     3687.434249
4     3668.001052
5     3648.954405
6     3630.286519
7     3611.989767
8     3594.056675
9     3576.479921
10    3559.252334
11    3542.366888
12    3525.816700
13    3509.595027
14    3493.695263
...
985    2686.730820
986    2686.730290
987    2686.729764
988    2686.729240
989    2686.728720
990    2686.728203
991    2686.727690
992    2686.727179
993    2686.726672
994    2686.726168
995    2686.725668
996    2686.725170
997    2686.724676
998    2686.724185
999    2686.723697
Length: 1000, dtype: float64


Coefficient of Determination: R squared
data: y_1, ..., y_n
predictions: f_1, ..., f_n
average of the data: y-bar

R^2 = 1 - Σ(y_i - f_i)² / Σ(y_i - y-bar)²

Calculating R Squared

import numpy as np
def compute_r_squared(data, predictions):
    # Write a function that, given two input numpy arrays, 'data', and 'predictions,'
    # returns the coefficient of determination, R^2, for the model that produced
    # predictions.
    #
    # Numpy has a couple of functions -- np.mean() and np.sum() --
    # that you might find useful, but you don't have to use them.
    # YOUR CODE GOES HERE
   
    # Total sum of squares and residual sum of squares
    SST = ((data - np.mean(data)) ** 2).sum()
    SSRes = ((predictions - data) ** 2).sum()
    r_squared = 1 - SSRes / SST

    return r_squared


Additional Considerations
- other types of linear regression
   - ordinary least squares regression
- parameter estimation: what are the confidence intervals of our parameters? What is the likelihood we would calculate this parameter value if the parameter had no effect?
- under/over-fitting
- multiple local minima: use different random initial thetas; seed random values for repeatability.

Assignment #3
t-test: is subway ridership higher on rainy days or on weekends?
linear regression with gradient descent: how many subway riders on a particular day and time? Features: time of day, day of the week, rainy or non-rainy day, ...


Project 2 - Wrangle NYC Subway and Weather Data

Exercise 1 Number of Rainy Days

import pandas
import pandasql

def num_rainy_days(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return one column and
    one row - a count of the number of days in the dataframe where
    the rain column is equal to 1 (i.e., the number of days it
    rained).  The dataframe will be titled 'weather_data'. You'll
    need to provide the SQL query.  You might find SQL's count function
    useful for this exercise.  You can read more about it here:
   
    https://dev.mysql.com/doc/refman/5.1/en/counting-rows.html
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)
    q = """
    your query here
    """
   
    #Execute your SQL command against the pandas frame
    rainy_days = pandasql.sqldf(q.lower(), locals())
    return rainy_days

Exercise 2 Temp on Foggy and Non Foggy Days

import pandas
import pandasql

def max_temp_aggregate_by_fog(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return two columns and
    two rows - whether it was foggy or not (0 or 1) and the max
    maxtempi for that fog value (i.e., the maximum max temperature
    for both foggy and non-foggy days).  The dataframe will be
    titled 'weather_data'. You'll need to provide the SQL query.
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)
    q = """
    your query here
    """
   
    #Execute your SQL command against the pandas frame
    rainy_days = pandasql.sqldf(q.lower(), locals())
    return rainy_days

Exercise 3 Mean Temp on Weekends

import pandas
import pandasql
def avg_min_temperature(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data.  The SQL query should return one column and
    one row - the average meantempi on days that are a Saturday
    or Sunday (i.e., the average mean temperature on weekends).
    The dataframe will be titled 'weather_data' and you can access
    the date in the dataframe via the 'date' column.
   
    You'll need to provide  the SQL query.
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    Also, you can convert dates to days of the week via the 'strftime' keyword in SQL.
    For example, cast (strftime('%w', date) as integer) will return 0 if the date
    is a Sunday or 6 if the date is a Saturday.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)
    q = """
    your query here
    """
   
    #Execute your SQL command against the pandas frame
    mean_temp_weekends = pandasql.sqldf(q.lower(), locals())
    return mean_temp_weekends

Exercise 4 Mean Temp on Rainy Days

import pandas
import pandasql
def avg_min_temperature(filename):
    '''
    This function should run a SQL query on a dataframe of
    weather data. More specifically you want to find the average
    minimum temperature on rainy days where the minimum temperature
    is greater than 55 degrees.
   
    You might also find that interpreting numbers as integers or floats may not
    work initially.  In order to get around this issue, it may be useful to cast
    these numbers as integers.  This can be done by writing cast(column as integer).
    So for example, if we wanted to cast the maxtempi column as an integer, we would actually
    write something like where cast(maxtempi as integer) = 76, as opposed to simply
    where maxtempi = 76.
   
    You can see the weather data that we are passing in below:
    https://www.dropbox.com/s/7sf0yqc9ykpq3w8/weather_underground.csv
    '''
    weather_data = pandas.read_csv(filename)
    q = """
    your query here
    """
   
    #Execute your SQL command against the pandas frame
    mean_temp_weekends = pandasql.sqldf(q.lower(), locals())
    return mean_temp_weekends

Exercise 5 Fixing NYC Subway Turnstile Data

import csv
def fix_turnstile_data(filenames):
    '''
    Filenames is a list of MTA Subway turnstile text files. A link to an example
    MTA Subway turnstile text file can be seen at the URL below:
    http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt
   
    As you can see, there are numerous data points included in each row of
    an MTA Subway turnstile text file.
    You want to write a function that will update each row in the text
    file so there is only one entry per row. A few examples below:
    A002,R051,02-00-00,05-28-11,00:00:00,REGULAR,003178521,001100739
    A002,R051,02-00-00,05-28-11,04:00:00,REGULAR,003178541,001100746
    A002,R051,02-00-00,05-28-11,08:00:00,REGULAR,003178559,001100775
   
    Write the updates to a different text file in the format of "updated_" + filename.
    For example:
        1) if you read in a text file called "turnstile_110521.txt"
        2) you should write the updated data to "updated_turnstile_110521.txt"
    The order of the fields should be preserved.
   
    You can see a sample of the turnstile text file that's passed into this function
    and the corresponding updated file in the links below:
   
    Sample input file:
    https://www.dropbox.com/s/mpin5zv4hgrx244/turnstile_110528.txt
    Sample updated file:
    https://www.dropbox.com/s/074xbgio4c39b7h/solution_turnstile_110528.txt
    '''
    for name in filenames:
        # your code here

Exercise 6 Combining NYC Subway Turnstile Data

def create_master_turnstile_file(filenames, output_file):
    '''
    Write a function that takes the files in the list filenames, which all have the
    columns 'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn', and consolidates
    them into one file located at output_file.  There should be ONE row with the column
    headers, located at the top of the file.
   
    For example, if file_1 has:
    'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
    line 1 ...
    line 2 ...
   
    and another file, file_2 has:
    'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
    line 3 ...
    line 4 ...
    line 5 ...
   
    We need to combine file_1 and file_2 into a master_file like below:
     'C/A, UNIT, SCP, DATEn, TIMEn, DESCn, ENTRIESn, EXITSn'
    line 1 ...
    line 2 ...
    line 3 ...
    line 4 ...
    line 5 ...
    '''
    with open(output_file, 'w') as master_file:
       master_file.write('C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn\n')
       for filename in filenames:
                # your code here

Exercise 7 Filtering "Non Regular" Data

import pandas
def filter_by_regular(filename):
    '''
    This function should read the csv file located at filename into a pandas dataframe,
    and filter the dataframe to only rows where the 'DESCn' column has the value 'REGULAR'.
   
    For example, if the pandas dataframe is as follows:
    ,C/A,UNIT,SCP,DATEn,TIMEn,DESCn,ENTRIESn,EXITSn
    0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
    1,A002,R051,02-00-00,05-01-11,04:00:00,DOOR,3144335,1088159
    2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
    3,A002,R051,02-00-00,05-01-11,12:00:00,DOOR,3144424,1088231
   
    The dataframe will look like below after filtering to only rows where DESCn column
    has the value 'REGULAR':
    0,A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,3144312,1088151
    2,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177
    '''
   
    turnstile_data = # your code here
    # more of your code here
    return turnstile_data

Exercise 8 Get Hourly Entries

import pandas
def get_hourly_entries(df):
    '''
    The data in the MTA Subway Turnstile data reports on the cumulative
    number of entries and exits per row.  Assume that you have a dataframe
    called df that contains only the rows for a particular turnstile machine
    (i.e., unique SCP, C/A, and UNIT).  This function should change
    these cumulative entry numbers to a count of entries since the last reading
    (i.e., entries since the last row in the dataframe).
   
    More specifically, you want to do two things:
       1) Create a new column called ENTRIESn_hourly
       2) Assign to the column the difference between ENTRIESn of the current row
          and the previous row. If there is any NaN, fill/replace it with 1.
   
    You may find the pandas functions shift() and fillna() to be helpful in this exercise.
   
    Examples of what your dataframe should look like at the end of this exercise:
   
           C/A  UNIT       SCP     DATEn     TIMEn    DESCn  ENTRIESn    EXITSn  ENTRIESn_hourly
    0     A002  R051  02-00-00  05-01-11  00:00:00  REGULAR   3144312   1088151                1
    1     A002  R051  02-00-00  05-01-11  04:00:00  REGULAR   3144335   1088159               23
    2     A002  R051  02-00-00  05-01-11  08:00:00  REGULAR   3144353   1088177               18
    3     A002  R051  02-00-00  05-01-11  12:00:00  REGULAR   3144424   1088231               71
    4     A002  R051  02-00-00  05-01-11  16:00:00  REGULAR   3144594   1088275              170
    5     A002  R051  02-00-00  05-01-11  20:00:00  REGULAR   3144808   1088317              214
    6     A002  R051  02-00-00  05-02-11  00:00:00  REGULAR   3144895   1088328               87
    7     A002  R051  02-00-00  05-02-11  04:00:00  REGULAR   3144905   1088331               10
    8     A002  R051  02-00-00  05-02-11  08:00:00  REGULAR   3144941   1088420               36
    9     A002  R051  02-00-00  05-02-11  12:00:00  REGULAR   3145094   1088753              153
    10    A002  R051  02-00-00  05-02-11  16:00:00  REGULAR   3145337   1088823              243
    ...
    ...
    '''
    #your code here
    return df

Exercise 9 Get Hourly Exits

import pandas
def get_hourly_exits(df):
    '''
    The data in the MTA Subway Turnstile data reports on the cumulative
    number of entries and exits per row.  Assume that you have a dataframe
    called df that contains only the rows for a particular turnstile machine
    (i.e., unique SCP, C/A, and UNIT).  This function should change
    these cumulative exit numbers to a count of exits since the last reading
    (i.e., exits since the last row in the dataframe).
   
    More specifically, you want to do two things:
       1) Create a new column called EXITSn_hourly
       2) Assign to the column the difference between EXITSn of the current row
          and the previous row. If there is any NaN, fill/replace it with 0.
   
    You may find the pandas functions shift() and fillna() to be helpful in this exercise.
   
    Example dataframe below:
          Unnamed: 0   C/A  UNIT       SCP     DATEn     TIMEn    DESCn  ENTRIESn    EXITSn  ENTRIESn_hourly  EXITSn_hourly
    0              0  A002  R051  02-00-00  05-01-11  00:00:00  REGULAR   3144312   1088151                0              0
    1              1  A002  R051  02-00-00  05-01-11  04:00:00  REGULAR   3144335   1088159               23              8
    2              2  A002  R051  02-00-00  05-01-11  08:00:00  REGULAR   3144353   1088177               18             18
    3              3  A002  R051  02-00-00  05-01-11  12:00:00  REGULAR   3144424   1088231               71             54
    4              4  A002  R051  02-00-00  05-01-11  16:00:00  REGULAR   3144594   1088275              170             44
    5              5  A002  R051  02-00-00  05-01-11  20:00:00  REGULAR   3144808   1088317              214             42
    6              6  A002  R051  02-00-00  05-02-11  00:00:00  REGULAR   3144895   1088328               87             11
    7              7  A002  R051  02-00-00  05-02-11  04:00:00  REGULAR   3144905   1088331               10              3
    8              8  A002  R051  02-00-00  05-02-11  08:00:00  REGULAR   3144941   1088420               36             89
    9              9  A002  R051  02-00-00  05-02-11  12:00:00  REGULAR   3145094   1088753              153            333
    '''
   
    #your code here
    return df
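
A matching sketch for the exits version, again assuming shift() and fillna() are the intended tools:

import pandas

def get_hourly_exits(df):
    # Same idea as the entries exercise, but on EXITSn and filling the
    # first row's NaN with 0, as the docstring asks.
    df['EXITSn_hourly'] = (df['EXITSn'] - df['EXITSn'].shift(1)).fillna(0)
    return df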

Exercise 10 Time to Hour

import pandas
def time_to_hour(time):
    '''
    Given an input variable time that represents time in the format of:
    00:00:00 (hour:minutes:seconds)
   
    Write a function to extract the hour part from the input variable time
    and return it as an integer. For example:
        1) if hour is 00, your code should return 0
        2) if hour is 01, your code should return 1
        3) if hour is 21, your code should return 21
       
    Please return hour as an integer.
    '''
   
    hour = # your code here
    return hour
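
One possible way to fill this in, using plain string handling (a sketch, not the official solution):

def time_to_hour(time):
    # '21:00:00' -> 21; the hour is everything before the first colon.
    hour = int(time.split(':')[0])
    return hour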

Exercise 11 Reformat Subway Dates

import datetime
def reformat_subway_dates(date):
    '''
    The dates in our subway data are formatted in the format month-day-year.
    The dates in our weather underground data are formatted year-month-day.
   
    In order to join these two data sets together, we'll want the dates formatted
    the same way.  Write a function that takes as its input a date in the MTA Subway
    data format, and returns a date in the weather underground format.
   
    Hint:
    There is a useful function in the datetime library called strptime.
    More info can be seen here:
    http://docs.python.org/2/library/datetime.html#datetime.datetime.strptime
    '''
    date_formatted = # your code here
    return date_formatted
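
A sketch of one way to do this with strptime/strftime, assuming the Weather Underground dates use a four-digit year (e.g. '2011-05-01'):

import datetime

def reformat_subway_dates(date):
    # '05-01-11' (month-day-year) -> '2011-05-01' (year-month-day).
    date_formatted = datetime.datetime.strptime(date, '%m-%d-%y').strftime('%Y-%m-%d')
    return date_formatted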

Thursday, June 12, 2014

Lesson 2 Data Wrangling

Data Manipulation
- from files, databases, and Web APIs.

Dealing with messy data. What tools do you use?

Acquiring data
- Acquiring data is often not fancy.
- Find data on the internet.
- A lot of data is stored in text files and on government websites.

Common data formats:
- CSV
- XML
- JSON (nested structure)
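
A small sketch of reading each of these formats with the Python standard library (file names here are just placeholders):

import csv
import json
import xml.etree.ElementTree as ET

# CSV: flat rows and columns
with open('example.csv') as f:
    rows = list(csv.reader(f))

# XML: a tree of tagged elements
root = ET.parse('example.xml').getroot()

# JSON: nested dictionaries and lists
with open('example.json') as f:
    record = json.load(f)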

import pandas
def add_full_name(path_to_csv, path_to_new_csv):
    #Assume you will be reading in a csv file with the same columns that the
    #Lahman baseball data set has -- most importantly, there are columns
    #called 'nameFirst' and 'nameLast'.
    #1) Write a function that reads a csv
    #located at "path_to_csv" into a pandas dataframe and adds a new column
    #called 'nameFull' with a player's full name.
    #
    #For example:
    #   for Hank Aaron, nameFull would be 'Hank Aaron',
    #
    #2) Write the data in the pandas dataFrame to a new csv file located at
    #path_to_new_csv
    #WRITE YOUR CODE HERE
    baseballData = pandas.read_csv(path_to_csv)
    baseballData['nameFull'] = baseballData['nameFirst'] + ' ' + baseballData['nameLast']
    baseballData.to_csv(path_to_new_csv)

Good job! You updated the CSV file correctly.
The following is the output of your program (showing only the name-related columns):

  0,Hank,Aaron,Hank Aaron
  1,Tommie,Aaron,Tommie Aaron
  2,Don,Aase,Don Aase
  3,Andy,Abad,Andy Abad
  ............

Relational Databases:
Why are relational databases useful?
- It is straightforward to extract aggregated data from complex files.
- Databases scale well.
- They ensure all data is consistently formatted.

Database Schema
A schema is the blueprint of a database.

Simple Queries
Select * from aadhaar_data
limit 20;

Select district, subdistrict from aadhaar_data limit 20;

Write your own......

import pandas
import pandasql
def select_first_50(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename the columns
    # by replacing spaces with underscores and setting all characters to lowercase, so the
    # column names more closely resemble those one might find in a database table.
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
    # Select out the first 50 values for "registrar" and "enrolment_agency"
    # in the aadhaar_data table using SQL syntax.
    #
    # Note that "enrolment_agency" is spelled with one l. Also, the order
    # of the select does matter. Make sure you select registrar then enrolment agency
    # in your query.
    q = """
    #YOUR QUERY HERE
   SELECT
   registrar, enrolment_agency
   FROM
   aadhaar_data
   LIMIT 50;
    """
    #Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution   


Complex Queries

Select * from aadhaar_data
where state='Gujarat';

Select district, sum(aadhaar_generated)
from aadhaar_data group by district;

Select district, subdistrict, sum(aadhaar_generated)
from aadhaar_data where age>60 group by district, subdistrict;

Write your own....

import pandas
import pandasql
def aggregate_query(filename):
    # Read in our aadhaar_data csv to a pandas dataframe.  Afterwards, we rename the columns
    # by replacing spaces with underscores and setting all characters to lowercase, so the
    # column names more closely resemble those one might find in a database table.
   
    aadhaar_data = pandas.read_csv(filename)
    aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
    # Write a query that will select from the aadhaar_data table how many men and how
    # many women over the age of 50 have had aadhaar generated for them in each district
    #
    # Note that in this quiz, the SQL query keywords are case sensitive.
    # For example, if you want to do a sum make sure you type 'sum' rather than 'SUM'.
    #
    # The possible columns to select from aadhaar data are:
    #     1) registrar
    #     2) enrolment_agency
    #     3) state
    #     4) district
    #     5) sub_district
    #     6) pin_code
    #     7) gender
    #     8) age
    #     9) aadhaar_generated
    #     10) enrolment_rejected
    #     11) residents_providing_email,
    #     12) residents_providing_mobile_number
    #
    # You can download a copy of the aadhaar data that we are passing
    # into this exercise below:
    # https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv
       
    q = """# your code here
    select gender, district, sum(aadhaar_generated)
    from
    aadhaar_data
    where
    age > 50
    group by
    gender, district;
    """

    # Execute your SQL command against the pandas frame
    aadhaar_solution = pandasql.sqldf(q.lower(), locals())
    return aadhaar_solution   

APIs
Files, Databases, APIs
Application Programming Interface

How to Access APIs Efficiently

API Exercise:

import json
import requests
 
def api_get_request(url):
    # In this exercise, you want to call the last.fm API to get a list of the
    # top artists in Spain.
    #
    # Once you've done this, return the name of the number 1 top artist in Spain.
    url = 'http://ws.audioscrobbler.com/2.0/?method=geo.gettopartists&country=spain&api_key=0a5de916777ad83d6de29347308556a1&format=json'
    data = requests.get(url).text
    data = json.loads(data)

    # return the name of the number 1 top artist in Spain
    return data['topartists']['artist'][0]['name']
 
 
Sanity Checking Data
Does data make sense?
Is there a problem?
Does the data look like I expect it to?
 
baseball=pandas.read_csv('../data/master.csv')
baseball.describe()
Count, Mean, Standard Deviation, Minimum, Maximum, 25%, 50%, 75%
Missing Values - why are they missing?
1. Occasionally, system errors prevent data from being recorded.
2. Some subset of subjects or event types are systematically missing certain data attributes, or missing entirely.

Occasional failure, nonresponse

Dealing with missing data
- Partial Deletion
  (1) Listwise deletion
  (2) Pairwise deletion
- Imputation
   why impute?
    - Not much data
    - removing data could affect representativeness
    - many different imputation techniques exist
    - imputation is really hard to get right
- How to deal with missing data (see the sketch after this list):
   (1) Use the mean value: good - doesn't change the mean across the sample; bad - lessens correlations between variables.
   (2) Impute using linear regression: take the data we have -> fit an equation that predicts the missing variable -> predict the missing values.
       Drawbacks: overemphasizes trends; exact values suggest too much certainty.
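
A short pandas sketch of these options, reusing the Lahman baseball data from the exercises (the 'weight' and 'height' column names are assumed to be present in Master.csv):

import pandas

baseball = pandas.read_csv('../data/Master.csv')

# Partial deletion (listwise): drop every row that is missing weight or height.
complete_rows = baseball.dropna(subset=['weight', 'height'])

# Pairwise deletion: only drop rows for the specific pair of columns being analyzed,
# e.g. when computing a correlation between weight and height.
pair = baseball[['weight', 'height']].dropna()

# Imputation with the mean: keeps the sample mean, but weakens correlations.
baseball['weight'] = baseball['weight'].fillna(baseball['weight'].mean())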

Imputation Exercise:

import pandas
import numpy
def imputation(filename):
    # Pandas dataframes have a method called 'fillna(value)', such that you can
    # pass in a single value to replace any NAs in a dataframe or series. You
    # can call it like this:
    #     dataframe['column'] = dataframe['column'].fillna(value)
    #
    # Using the numpy.mean function, which calculates the mean of a numpy
    # array, impute any missing values in our Lahman baseball
    # data sets 'weight' column by setting them equal to the average weight.
    #
    # You can access the 'weight' column in the baseball data frame by
    # calling baseball['weight']

    #YOUR CODE GOES HERE
    baseball = pandas.read_csv(filename)
    baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))
    print numpy.sum(baseball['weight']), numpy.mean(baseball['weight'])
   
    return baseball

Imputation:
- just the tip of the iceberg; more sophisticated methods exist;
- filling in the mean and linear regression (see the sketch below) are both simple and relatively effective;
- both have negative side effects and can obscure or amplify trends in the data;
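
A minimal sketch of the linear-regression idea, assuming the Lahman Master.csv with 'weight' and 'height' columns and using numpy.polyfit as the regression (one of many possible approaches, with the drawbacks noted above):

import pandas
import numpy

baseball = pandas.read_csv('../data/Master.csv')

# Fit a line weight ~ height on the rows where both values are present.
known = baseball[['height', 'weight']].dropna()
slope, intercept = numpy.polyfit(known['height'], known['weight'], 1)

# Predict weight from height for the rows where weight is missing but height is not.
missing = baseball['weight'].isnull() & baseball['height'].notnull()
baseball.loc[missing, 'weight'] = slope * baseball.loc[missing, 'height'] + intercept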

Assignment #2
- Acquire weather data via weather underground API
- Get a sense of data using SQL queries
- Clean and process the data
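
A rough sketch of how those steps could fit together with the tools from this lesson; the URL, JSON key, and column names below are placeholders, not the real Weather Underground API:

import json
import requests
import pandas
import pandasql

# 1) Acquire weather data via an API (placeholder URL and response structure).
response = requests.get('http://example.com/weather_api.json')
weather = pandas.DataFrame(json.loads(response.text)['observations'])  # 'observations' is hypothetical

# 2) Get a sense of the data with a SQL query (column names are hypothetical).
q = """
select date, avg(temperature)
from weather
group by date;
"""
daily_temps = pandasql.sqldf(q, locals())

# 3) Clean and process: fill any missing temperatures with the column mean.
weather['temperature'] = weather['temperature'].fillna(weather['temperature'].mean())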