Monday, September 1, 2014

Google Analytics Platform Principles

Instructor Justin Cutroni introduces the topics covered in the Google Analytics Platform Principles course.

Unit 1 - Platform Fundamentals
Course overview

The platform components
Collection
For example, you’ll use the JavaScript tracking code to collect data from a website, but a Software Development Kit, called an SDK, to collect data from a mobile app.

Processing and Configuration
You can also configure Google Analytics to import data directly into your reports from other Google products, like Google AdWords, Google AdSense and Google Webmaster Tools. You can even configure Google Analytics to import data from non-Google sources, like your own internal data.

Reporting
Using the API you can build your own reporting tools or extract your data directly into third-party reporting tools.

The data model
Users: visitors
Sessions
Interactions: Individual interactions are called hits.

Each interaction that Google Analytics tracks belongs to a session, and each session is associated with a user.
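
As a toy illustration of this hierarchy (the names and values below are made up, not an actual Google Analytics export), the relationship can be pictured as nested records:

# Hypothetical sketch of the data model: hits belong to sessions,
# and sessions belong to a user.
user = {
    'client_id': '555',
    'sessions': [
        {
            'session_id': 1,
            'hits': [
                {'type': 'pageview', 'page': '/home'},
                {'type': 'event', 'category': 'video', 'action': 'play'},
            ],
        },
    ],
}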


Unit 2 - Collection
Data collection overview
  • it’s the tracking code that gathers and sends the data back to your account for reporting
  • Using hits to collect and send data
  • In the example, everything after the question mark is called a parameter. Each parameter carries a piece of information back to Google’s analytic servers.
  • utmul=, which shows the language that the user’s browser is set to.
  • In addition to creating hits, the tracking code also performs another critical function. It identifies new users and returning users. 
<!-- Google Analytics -->

<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');

ga('create', 'UA-12345-6', 'auto');
ga('send', 'pageview');
</script>

<!-- End Google Analytics -->
For your own account, you can find the tracking ID in the account administrative settings.

Website data collection
Understand how the Google Analytics JavaScript website tracking code collects data.

To track data from a website, Google Analytics provides a standard snippet of JavaScript tracking code. This snippet references a JavaScript library called analytics.js that controls what data is collected.

  • Adding the Google Analytics JavaScript code to your website
  1. Simply add the standard code snippet before the closing </head> tag in the HTML of every web page you want to track.
  • Functions of the web tracking code
  1. The Google Analytics tracking code executes JavaScript asynchronously, meaning that the JavaScript runs in the background while the browser performs other tasks.


Mobile app data collection
  • Using the Google Analytics mobile SDKs
  1. Instead of using JavaScript to collect data like you do on a website, you'll use an SDK, or Software Development Kit, to collect data from your mobile app.
  • Dispatching
  1. When a user navigates through an app, the Google Analytics SDK stores the hits locally on the device and then sends them to your Google Analytics account later in a batch process called dispatching.
  2. For these reasons, the SDKs automatically dispatch hits every 30 minutes for Android devices and every two minutes for iOS devices.
The mobile SDKs provide a simple way to track user activity from an app, and collect most of the data you’ll need without any customization. But keep in mind, there are many ways to modify your code to collect additional information about your users, their sessions and their interactions with your app.


Measurement Protocol data collection
But what if you want to collect data from some other kind of device? For example, you might want to track a point-of-sale system or user activity on a kiosk.

  • Collecting and sending data with the Measurement Protocol
  1. For instance, let’s say we want to collect data from a kiosk. Here’s a sample hit that will track when a user views a screen on the kiosk:
  2. http://www.google-analytics.com/collect?v=1&tid=UA-XXXX-Y&cid=555&sr=800x600&t=pageview&dh=mydemo.com&dp=/home&dt=homepage
  3. Notice there is a parameter in the hit that contains the screen resolution. This particular parameter will become the dimension Screen Resolution during processing. 

With the Measurement Protocol, you can use Google Analytics to collect data from any kind of device.
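
As a minimal sketch (assuming the Python requests library; the parameter values mirror the sample kiosk hit above, and the tracking ID is a placeholder), the same hit could be sent like this:

import requests

payload = {
    'v': 1,                  # protocol version
    'tid': 'UA-XXXX-Y',      # tracking ID (placeholder)
    'cid': '555',            # client ID identifying this kiosk
    'sr': '800x600',         # screen resolution
    't': 'pageview',         # hit type
    'dh': 'mydemo.com',      # document hostname
    'dp': '/home',           # document path
    'dt': 'homepage',        # document title
}

# Send the hit to the Measurement Protocol endpoint
requests.post('http://www.google-analytics.com/collect', data=payload)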


Unit 3 - Processing & Configuration
Processing & configuration overview

Processing data and applying your configuration settings
  1. Google Analytics organizes the hits you’ve collected into users and sessions.
  2. data from other sources can be joined with data collected via the tracking code. For example, you can configure Google Analytics to import data from Google AdWords, Google AdSense or Google Webmaster Tools. 
  3. Google Analytics processing will modify your data according to any configuration rules you’ve added. 
  4. the data goes through a process called “aggregation.” 
Understanding how Google Analytics transforms raw data during processing, and how your configuration settings can control what happens during processing, will help you better interpret and manage the data in your reports.

Processing hits into sessions & users
  • How hits are organized by users
  1. Google Analytics creates a random, unique ID that is associated with the device. Each unique ID is considered to be a unique user in Google Analytics.
  2. It’s possible for these IDs to get reset or erased. This happens if a user clears their cookies in a web browser, or uninstalls and then reinstalls a mobile app.
  • How hits are organized into sessions
  1. A session in Google Analytics is a collection of interactions, or hits, from a specific user during a defined period of time.
  2. By default, a session ends after 30 minutes of inactivity. 
Users and sessions are a critical part of the digital analytics data model. All of the reports you see in Google Analytics depend on this model to organize the data.
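
A simplified sketch of that idea (an illustration of the 30-minute inactivity rule only, not Google's actual processing code):

from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

# One user's hit timestamps
hit_times = [datetime(2014, 9, 1, 9, 0), datetime(2014, 9, 1, 9, 10),
             datetime(2014, 9, 1, 10, 0), datetime(2014, 9, 1, 10, 5)]

sessions = []
for t in sorted(hit_times):
    if not sessions or t - sessions[-1][-1] > TIMEOUT:
        sessions.append([t])        # more than 30 minutes of inactivity: new session
    else:
        sessions[-1].append(t)      # still the same session

print len(sessions)                 # 2: the 50-minute gap starts a second session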

Importing data into Google Analytics
The most common way to get data into Google Analytics is through your tracking code, but you can also add data from other sources.

  • Importing data into Google Analytics
  • Account linking
You can link various Google products directly to Google Analytics via your account settings. This includes:
  1. Google AdWords
  2. Google AdSense
  3. Google Webmaster Tools
  • Data Import
There are two ways to import data into Google Analytics:

(1). Dimension Widening
  1. if you’re a publisher you might want to segment your data based on the author and topic of your online articles. 
  2. With Dimension Widening, you could import author and topic as new dimensions for your content pages. You could use each article’s page URL as the “key” that links the new data to your existing Google Analytics data. 
  3. You can add data using Dimension Widening either by uploading a file or by using the Google Analytics APIs. Uploading a file, like a spreadsheet or .CSV, is easy, but it can be time consuming if you need to add data often. To save time, you can build a program that uses the APIs to automatically send data into Google Analytics on a regular basis (see the sketch at the end of this section).

(2). Cost Data Import
  1. You use this feature specifically to add data that shows the amount of money you spent on your non-Google advertising. Importing cost data lets Google Analytics calculate the return-on-investment of your non-Google ads.

Although you’ll collect most of your data using the tracking code, account linking and data import are two powerful ways to add more context to your Google Analytics data. By choosing the right data sources to link or import into Google Analytics, you can better measure the performance of your business.
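
A hypothetical sketch of what a Dimension Widening upload file might contain, using the page URL as the key that joins the new dimensions to existing data (the column names and rows here are illustrative only, not the exact headers Google Analytics expects):

import csv

rows = [
    ('/articles/platform-principles', 'Justin Cutroni', 'Analytics'),
    ('/articles/mapreduce-basics',    'Jane Doe',       'Big Data'),
]

with open('dimension_widening.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Page', 'Author', 'Topic'])   # key column first, then the new dimensions
    writer.writerows(rows)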


Transforming & aggregating data
  • The role of configuration settings during processing
  1. including data, excluding data, or modifying how data appears in a reporting View
  2. Filters, Goals, and Grouping.
  • Common configuration settings: Filters
  1. Filters provide a flexible way you can modify the data within each view.
  2. For example, create a filter to exclude traffic from a particular IP address or to convert messy page URLs into readable text.
  • Common configuration settings: Goals
  1. Goals let you specify which pageviews, screen views or other hits should be used to calculate conversions. 
  • Common configuration settings: Channel Grouping and Content Grouping
  1. aggregate certain pieces of data together so you can analyze the collective performance.
  2. You can create two types of groups in Google Analytics: Channel groups and Content groups.
  3. A Channel Group is a collection of common marketing activities. For example, Display Advertising, Social media, Email marketing, and Paid Search are four common channel groups that are each a roll-up of several marketing activities.
  4. Content Groups are like Channel Groups, except you use them to create and analyze a collection of content. 
  • Data aggregation
  1. During aggregation, Google Analytics creates and organizes your report dimensions into tables, called aggregate tables. 
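
A loose analogy for aggregation (a sketch with made-up hit data; real aggregate tables are internal to Google Analytics): group hits by a dimension value and count them.

import pandas as pd

# Toy hit-level data: one row per pageview hit
hits = pd.DataFrame({
    'pagePath': ['/home', '/home', '/pricing', '/home'],
    'source':   ['google', 'direct', 'google', 'email'],
})

# Roll the hits up by a dimension, producing a small "aggregate table"
aggregate_table = hits.groupby('pagePath').size()
print aggregate_table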


Unit 4 - Reporting

Reporting overview
  • All Google Analytics reports are based on different combinations of dimensions and metrics. 
  • In addition to the reporting interface, you can use an API, or an Application Programming Interface, to extract your data directly from Google Analytics.
  • Most of the time, when you request data from the reporting interface or the APIs you’ll receive your data almost immediately. But in some cases, where you request complex data, Google Analytics uses a process called sampling.


Building reports with dimensions & metrics
Understand how metrics and dimensions are combined to build reports and which dimensions and metrics can be used together
  • Dimensions in Google Analytics
  1. a dimension of a session is the traffic source that brought the user to your site.
  • Metrics in Google Analytics
  1. Metrics are the quantitative measurements of your data. They count how often things happen, like the total number of users on a website or app. 
  • Combining dimensions and metrics in reports
  1. Most commonly, you’ll see dimensions and metrics reported in a table, with the first column containing the values for one particular dimension, and the rest of the columns displaying the corresponding metrics.

The reporting APIs

  • To use the reporting APIs, you have to build your own application. This application needs to be able to write and send a query to the reporting API. The API uses the query to retrieve data from the aggregate tables, and then sends a response back to your application containing the data that was requested.
  • the reporting APIs give you the power to automate and streamline complex reporting tasks for your business.
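
A rough sketch of such an application (this assumes the Core Reporting API v3 endpoint and an OAuth 2.0 access token obtained separately; the view ID and token are placeholders):

import requests

params = {
    'ids': 'ga:12345678',                # view (profile) ID -- placeholder
    'start-date': '2014-08-01',
    'end-date': '2014-08-31',
    'metrics': 'ga:sessions',
    'dimensions': 'ga:source',
    'access_token': 'YOUR_ACCESS_TOKEN', # obtained via OAuth 2.0 beforehand
}

# Query the aggregate tables and print the rows of the response
response = requests.get('https://www.googleapis.com/analytics/v3/data/ga', params=params)
print response.json().get('rows')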


Report sampling
  • When does sampling happen?
  • Google Analytics checks to see how many sessions should be included in your request. If the number of sessions is small enough, Google Analytics can calculate the data for your request using all of the sessions. If the number of sessions is too large, Google Analytics uses a sample to fulfill the request
  • Adjusting the sample size
  • The sampling limit
  • Google Analytics sets a maximum number of sessions that can be used to calculate your reports. If you go over that limit, your data gets sampled.
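
A conceptual sketch of that behavior (purely illustrative; the real limit and sampling algorithm are Google's, and the numbers below are made up):

import random

sessions = range(1000000)          # all sessions matching the report query (toy stand-in)
SAMPLING_LIMIT = 500000            # hypothetical cap on sessions used for the calculation

if len(sessions) <= SAMPLING_LIMIT:
    sample = sessions                                   # small enough: use every session
else:
    sample = random.sample(sessions, SAMPLING_LIMIT)    # too large: calculate from a sample

sampling_ratio = float(len(sample)) / len(sessions)
# Metrics computed from the sample would be scaled back up by 1 / sampling_ratio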


Final Assessment


Project 5 - MapReduce on NYC Subway and Weather Data

import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():
    """
    The input to this mapper will be the final Subway-MTA dataset, the same as
    in the previous exercise.  You can check out the csv and its structure below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv

    For each line of input, the mapper output should PRINT (not return) the UNIT as
    the key, the number of ENTRIESn_hourly as the value, and separate the key and
    the value by a tab. For example: 'R002\t105105.0'

    Since you are printing the output of your program, printing a debug
    statement will interfere with the operation of the grader. Instead,
    use the logging module, which we've configured to log to a file printed
    when you click "Test Run". For example:
    logging.info("My debugging message")
   
    The logging module can be used to give you more control over your debugging
    or other messages than you can get by printing them. In this exercise, print
    statements from your mapper will go to your reducer, and print statements
    from your reducer will be considered your final output. By contrast, messages
    logged via the loggers we configured will be saved to two files, one
    for the mapper and one for the reducer. If you click "Test Run", then we
    will show the contents of those files once your program has finished running.
    The logging module also has other capabilities; see
    https://docs.python.org/2/library/logging.html for more information.
    """

    # One possible implementation (assumes the csv header row is included in
    # the input): locate the UNIT and ENTRIESn_hourly columns from the header,
    # then print "UNIT\tENTRIESn_hourly" for every data row.
    unit_idx, entries_idx = None, None

    for line in sys.stdin:
        data = line.strip().split(",")
        if 'ENTRIESn_hourly' in data:          # header row
            unit_idx = data.index('UNIT')
            entries_idx = data.index('ENTRIESn_hourly')
            continue
        if unit_idx is None or len(data) <= max(unit_idx, entries_idx):
            continue
        print "{0}\t{1}".format(data[unit_idx], data[entries_idx])


mapper()



import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    '''
    Given the output of the mapper for this exercise, the reducer should PRINT 
    (not return) one line per UNIT along with the total number of ENTRIESn_hourly 
    over the course of May (which is the duration of our data), separated by a tab.
    An example output row from the reducer might look like this: 'R001\t500625.0'

    You can assume that the input to the reducer is sorted such that all rows
    corresponding to a particular UNIT are grouped together.

    Since you are printing the output of your program, printing a debug 
    statement will interfere with the operation of the grader. Instead, 
    use the logging module, which we've configured to log to a file printed 
    when you click "Test Run". For example:
    logging.info("My debugging message")
    '''

    # One possible implementation (the same pattern as the Aadhaar reducer later
    # in these notes): the input is sorted by UNIT, so emit the running total
    # whenever the key changes.
    old_key, total = None, 0

    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        this_key, count = data
        if old_key and old_key != this_key:
            print "{0}\t{1}".format(old_key, total)
            total = 0
        old_key = this_key
        total += float(count)

    if old_key is not None:
        print "{0}\t{1}".format(old_key, total)

        
reducer()

Lesson 5 - MapReduce

Big Data and MapReduce
Catalog and index all books in the world.

Scenarios for MapReduce
In which of the situations below do you think MapReduce may have been used?
1. discover new oil resources
2. power an e-commerce website
3. identify malware and cyber attack patterns for online security
4. help doctors answer questions about patients' health

Basics of MapReduce

Counting words serially

import logging
import sys
import string

from util import logfile

logging.basicConfig(filename=logfile, format='%(message)s',
                   level=logging.INFO, filemode='w')


def word_count():
    # For this exercise, write a program that serially counts the number of occurrences
    # of each word in the book Alice in Wonderland.
    #
    # The text of Alice in Wonderland will be fed into your program line-by-line.
    # Your program needs to take each line and do the following:
    # 1) Tokenize the line into string tokens by whitespace
    #    Example: "Hello, World!" should be converted into "Hello," and "World!"
    #    (This part has been done for you.)
    #
    # 2) Remove all punctuation
    #    Example: "Hello," and "World!" should be converted into "Hello" and "World"
    #
    # 3) Make all letters lowercase
    #    Example: "Hello" and "World" should be converted to "hello" and "world"
    #
    # Store the the number of times that a word appears in Alice in Wonderland
    # in the word_counts dictionary, and then *print* (don't return) that dictionary
    #
    # In this exercise, print statements will be considered your final output. Because
    # of this, printing a debug statement will cause the grader to break. Instead,
    # you can use the logging module which we've configured for you.
    #
    # For example:
    # logging.info("My debugging message")
    #
    # The logging module can be used to give you more control over your
    # debugging or other messages than you can get by printing them. Messages
    # logged via the logger we configured will be saved to a
    # file. If you click "Test Run", then you will see the contents of that file
    # once your program has finished running.
    #
    # The logging module also has other capabilities; see
    # https://docs.python.org/2/library/logging.html
    # for more information.

    word_counts = {}

    for line in sys.stdin:
        data = line.strip().split(" ")

        # Strip punctuation, lowercase each token, and tally it
        for i in data:
            key = i.translate(string.maketrans("", ""), string.punctuation).lower()
            if key in word_counts:
                word_counts[key] += 1
            else:
                word_counts[key] = 1

    print word_counts

word_count()

Counting words in MapReduce

Mapper stage

Reducer stage
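
The notes don't include code for these two stages, so here is a minimal word-count mapper/reducer pair in the same Hadoop Streaming style used in the exercises below (assumes whitespace tokenization and lowercased words):

# mapper: emit each word with a count of 1
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print "{0}\t{1}".format(word, 1)

# reducer: input arrives sorted by key, so emit the total when the key changes
import sys

old_key, total = None, 0
for line in sys.stdin:
    data = line.strip().split("\t")
    if len(data) != 2:
        continue
    key, count = data
    if old_key and old_key != key:
        print "{0}\t{1}".format(old_key, total)
        total = 0
    old_key = key
    total += int(count)

if old_key is not None:
    print "{0}\t{1}".format(old_key, total)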

Using MapReduce with Aadhaar Data

import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():

    #Also make sure to fill out the reducer code before clicking "Test Run" or "Submit".

    #Each line will be a comma-separated list of values. The
    #header row WILL be included. Tokenize each row using the
    #commas, and emit (i.e. print) a key-value pair containing the
    #district (not state) and Aadhaar generated, separated by a tab.
    #Skip rows without the correct number of tokens and also skip
    #the header row.

    #You can see a copy of the the input Aadhaar data
    #in the link below:
    #https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv

    #Since you are printing the output of your program, printing a debug
    #statement will interfere with the operation of the grader. Instead,
    #use the logging module, which we've configured to log to a file printed
    #when you click "Test Run". For example:
    #logging.info("My debugging message")

    for line in sys.stdin:
        data = line.strip().split(",")
        # Skip rows without the expected number of tokens and the header row
        if len(data) != 12 or data[0] == 'Registrar':
            continue
        print "{0}\t{1}".format(data[3], data[8])

mapper()



import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    
    #Also make sure to fill out the mapper code before clicking "Test Run" or "Submit".

    #Each line will be a key-value pair separated by a tab character.
    #Print out each key once, along with the total number of Aadhaar 
    #generated, separated by a tab. Make sure each key-value pair is 
    #formatted correctly! Here's a sample final key-value pair: 'Gujarat\t5.0'

    #Since you are printing the output of your program, printing a debug 
    #statement will interfere with the operation of the grader. Instead, 
    #use the logging module, which we've configured to log to a file printed 
    #when you click "Test Run". For example:
    #logging.info("My debugging message")
        
    aadhaar_generated = 0
    old_key = None

    for line in sys.stdin:
        data = line.strip().split("\t")
        # Skip malformed key-value pairs
        if len(data) != 2:
            continue
        this_key, count = data
        # When the key changes, emit the total for the previous key
        if old_key and old_key != this_key:
            print "{0}\t{1}".format(old_key, aadhaar_generated)
            aadhaar_generated = 0
        old_key = this_key
        aadhaar_generated += float(count)

    # Emit the total for the last key
    if old_key is not None:
        print "{0}\t{1}".format(old_key, aadhaar_generated)


reducer()


More Complex MapReduce
How do we do more complex things with MapReduce? So far we have seen:
- counting words
- aggregating Aadhaar generated counts

MapReduce EcoSystem
MapReduce Programming Model
Hadoop:
Hive - Facebook
Pig - Yahoo


Using MapReduce with Subway Data








Project 4 Visualize NYC Subway and Weather Data

from pandas import *
from ggplot import *

def plot_weather_data(turnstile_weather):
    '''
    plot_weather_data is passed a dataframe called turnstile_weather.
    Use turnstile_weather along with ggplot to make a data visualization
    focused on the MTA and weather data we used in Project 3.
   
    You should feel free to implement something that we discussed in class
    (e.g., scatterplots, line plots, or histograms) or attempt to implement
    something more advanced if you'd like.

    Here are some suggestions for things to investigate and illustrate:
     * Ridership by time-of-day or day-of-week
     * How ridership varies by subway station
     * Which stations have more exits or entries at different times of day

    If you'd like to learn more about ggplot and its capabilities, take
    a look at the documentation at:
    https://pypi.python.org/pypi/ggplot/
   
    You can check out the link
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    to see all the columns and data points included in the turnstile_weather
    dataframe.
   
    However, due to the limitation of our Amazon EC2 server, we will give you only
    about 1/3 of the actual data in the turnstile_weather dataframe.
    '''

    # One possible plot (a sketch; assumes the 'Hour' and 'ENTRIESn_hourly'
    # columns): ridership by time of day
    plot = ggplot(turnstile_weather, aes('Hour', 'ENTRIESn_hourly')) + \
        geom_point(color='coral') + \
        ggtitle('Ridership by Hour of Day') + xlab('Hour') + ylab('Hourly Entries')
    return plot

Lesson 4 - Data Visualization

Show your findings with visualizations, in Python.

What is information visualization?
1. Effective communication of complex quantitative ideas.
 - clarity, precision, efficiency.
2. Helps you notice things about data (correlations, trends, etc.) that might go unnoticed.
3. Can highlight aspects of data, or "tell a story".

Napoleon's March on Russia.

What information is depicted in this visualization?
 size of army, location of army, direction of army, temperature at various dates during the retreat.

Components of effective visualization:
1. visual cues
2. coordinate systems
3. scale / data types
4. context

Mathematical and statistical rigor. Tell a story.

Visual Encoding (cues)
Position, Length, Angle, Direction, Shape, Area/volume, Color (hue, saturation; use combinations carefully and limit the number of hues).

Perception of visual cues: 1. position, 2. angle, 3. area, 4. saturation.
1985 AT&T Lab paper on graphical perception:
position, length, angle, direction, area, volumes, saturation, hue.
Hue and saturation least accurate.

Plotting in Python:
Many packages:
1. Matplotlib
2. ggplot

We use ggplot. Why? It looks nicer, and it implements the grammar of graphics.
First step: create plot.
Second step: represent data with geometric objects.
Third step: add labels.
print ggplot(data, aes(xvar, yvar)) + geom_point(color='coral') + geom_line(color='coral') + ggtitle('title') + xlab('x-lab') + ylab('y-lab')


from pandas import *
from ggplot import *

import pandas

def lineplot(hr_year_csv):
    # A csv file will be passed in as an argument which
    # contains two columns -- 'HR' (the number of homerun hits)
    # and 'yearID' (the year in which the homeruns were hit).
    #
    # Fill out the body of this function, lineplot, to use the
    # passed-in csv file, hr_year.csv, and create a
    # chart with points connected by lines, both colored 'red',
    # showing the number of HR by year.
    #
    # You will want to first load the csv file into a pandas dataframe
    # and use the pandas dataframe along with ggplot to create your visualization
    #
    # You can check out the data in the csv file at the link below:
    # https://www.dropbox.com/s/awgdal71hc1u06d/hr_year.csv
    #
    # You can read more about ggplot at the following link:
    # https://github.com/yhat/ggplot/
 
 
    hr_year = pandas.read_csv('hr_year.csv')
    gg = ggplot(hr_year, aes('yearID', 'HR')) + \
        geom_point(color='red') + geom_line(color='red') + \
        ggtitle('Total HRs by Year') + xlab('Year') + ylab('HR')

    return gg


Data Type - Numeric Data
a measurement (e.g. height, weight) or a count (e.g. HR or hits)
discrete and continuous:
discrete - whole numbers only (e.g. 10, 34, 25)
continuous - any number within a range (e.g. batting averages of 0.25, 0.357, 0.511)


Data Type - Categorical Data
represent characteristic data (e.g. position, team, hometown, handedness)
can take numerical values but they don't have mathematical meaning.
ordinal data - category with some order or ranking power.
Very Low, Low, High, Very High.


Data Type - Time Series Data
Data collected via repeated measurements over time.
Example: Average HR/player over many years.


Scales
Categorical
HR vs. Position
HR vs. Months/Years

Improper use of scales
When used incorrectly, scales can mislead or confuse readers.


Plotting Line Chart
from pandas import *
from ggplot import *

import pandas

def lineplot_compare(hr_by_team_year_sf_la_csv):
    # Write a function, lineplot_compare, that will read a csv file
    # called hr_by_team_year_sf_la.csv and plot it using pandas and ggplot2.
    #
    # This csv file has three columns: yearID, HR, and teamID. The data in the
    # file gives the total number of home runs hit each year by the SF Giants
    # (teamID == 'SFN') and the LA Dodgers (teamID == "LAN"). Produce a
    # visualization comparing the total home runs by year of the two teams.
    #
    # You can see the data in hr_by_team_year_sf_la_csv
    # at the link below:
    # https://www.dropbox.com/s/wn43cngo2wdle2b/hr_by_team_year_sf_la.csv
    #
    # Note that to differentiate between multiple categories on the
    # same plot in ggplot, we can pass color in with the other arguments
    # to aes, rather than in our geometry functions. For example,
    # ggplot(data, aes(xvar, yvar, color=category_var)). This should help you
    # in this exercise.
 
    hr_year = pandas.read_csv('hr_by_team_year_sf_la.csv')
    gg = ggplot(hr_year, aes('yearID', 'HR', color='teamID')) + \
        geom_point() + geom_line() + \
        ggtitle('Total HRs by Year') + xlab('Year') + ylab('HR')
    return gg


if __name__ == '__main__':
    lineplot_compare('hr_by_team_year_sf_la.csv')


Visualizing Time Series Data
- Format for MTA + Weather Data
   - Scatter Plot vs. Additional Bells & Whistle (e.g. LOESS curve)
- Boston Red Sox Winning Percentage, 1960 - 2010

Scatter Plot

Line Chart
- mitigate some shortcomings of the scatter plot
- emphasize the trends
- focus on the year to year variability, not overall trends.

LOESS Curve
- emphasize long term trends
- LOESS weighted regression
- easier to take a quick look at chart and understand big picture.

Multivariate
- how to incorporate more variables?
- use an additional encoding,
   -size,
   - color/saturation


Google "data visualization" to find more blogs and materials.






Monday, June 23, 2014

Project 3 Analyzing NYC Subway and Weather Data

Exercise 1 - Exploratory Data Analysis

import numpy as np
import pandas
import matplotlib.pyplot as plt
def entries_histogram(turnstile_weather):
    '''
    Before we perform any analysis, it might be useful to take a
    look at the data we're hoping to analyze. More specifically, let's
    examine the hourly entries in our NYC subway data and determine what
    distribution the data follows. This data is stored in a dataframe
    called turnstile_weather under the ['ENTRIESn_hourly'] column.
   
    Let's plot two histograms on the same axes to show hourly
    entries when raining vs. when not raining. Here's an example on how
    to plot histograms with pandas and matplotlib:
    turnstile_weather['column_to_graph'].hist()
   
    Your histogram may look similar to the bar graph in the instructor notes below.
   
    You can read a bit about using matplotlib and pandas to plot histograms here:
    http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
   
    You can see the information contained within the turnstile weather data here:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
   
    plt.figure()
    # One possible implementation (assumes the 'rain' and 'ENTRIESn_hourly' columns):
    turnstile_weather[turnstile_weather['rain'] == 1]['ENTRIESn_hourly'].hist()  # histogram of hourly entries when it is raining
    turnstile_weather[turnstile_weather['rain'] == 0]['ENTRIESn_hourly'].hist()  # histogram of hourly entries when it is not raining
    return plt


Exercise 2 - Welch's t-test?

Does the entries data from the previous exercise seem normally distributed? No.

Can we run Welch's T Test on entries data? Yes.

Correct. Good job!


Exercise 3 - Mann Whitney U Test

import numpy as np
import scipy
import scipy.stats
import pandas
def mann_whitney_plus_means(turnstile_weather):
    '''
    This function will consume the turnstile_weather dataframe containing
    our final turnstile weather data.
   
    You will want to take the means and run the Mann Whitney U test on the
    ENTRIESn_hourly column in the turnstile_weather dataframe.
   
    This function should return:
        1) the mean of entries with rain
        2) the mean of entries without rain
        3) the Mann-Whitney U statistic and p-value comparing number of entries
           with rain and the number of entries without rain. 
   
    You should feel free to use scipy's Mann-Whitney implementation, and
    also might find it useful to use numpy's mean function. 
   
    Here are some documentations:
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
   
    You can look at the final turnstile weather data at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
    '''
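    # One possible implementation (a sketch; assumes the 'rain' and
    # 'ENTRIESn_hourly' columns used elsewhere in these notes):
    with_rain = turnstile_weather[turnstile_weather['rain'] == 1]['ENTRIESn_hourly']
    without_rain = turnstile_weather[turnstile_weather['rain'] == 0]['ENTRIESn_hourly']
    with_rain_mean = np.mean(with_rain)
    without_rain_mean = np.mean(without_rain)
    U, p = scipy.stats.mannwhitneyu(with_rain, without_rain)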
   
   
    return with_rain_mean, without_rain_mean, U, p


Exercise 4 Rainy Day Ridership vs. Non Rainy Ridership

Is the distribution of the number of entries statistically different between rainy and non rainy days? Yes.

Describe your results and methods. Rainy days have more ridership.

Correct. Good job!


Exercise 5 Linear Regression

import numpy as np
import pandas
def normalize_features(array):
   """
   Normalize the features in our data set.
   """
   array_normalized = (array-array.mean())/array.std()
   mu = array.mean()
   sigma = array.std()
   return array_normalized, mu, sigma
def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, and the values for our thetas.
   
    This should be the same code as the compute_cost function in the lesson #3 exercises. But
    feel free to implement your own.
    """
   
    # Same as the compute_cost function from the Lesson 3 exercise
    m = len(values)
    sum_of_square_errors = np.square(np.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2 * m)
    return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
   
    This is the same gradient descent code as in the lesson #3 exercises. But feel free
    to implement your own.
    """
    m = len(values)
    cost_history = []
    for i in range(num_iterations):
        # Same update step as the Lesson 3 exercise
        predicted_values = np.dot(features, theta)
        theta = theta - alpha / m * np.dot((predicted_values - values), features)
        cost_history.append(compute_cost(features, values, theta))
    return theta, pandas.Series(cost_history)
def predictions(dataframe):
    '''
    The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
    Using the information stored in the dataframe, lets predict the ridership of
    the NYC subway using linear regression with gradient descent.
   
    You can look at information contained in the turnstile weather dataframe
    at the link below:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv   
   
    Your prediction should have a R^2 value of .40 or better.
   
    Note: due to the memory and CPU limitation of our amazon EC2 instance, we will
    give you a random subset (~15%) of the data contained in turnstile_data_master_with_weather.csv
   
    If you receive a "server has encountered an error" message, that means you are hitting
    the 30 second  limit that's placed on running your program. Try using a smaller number
    for num_iterations if that's the case.
   
    Or if you are using your own algorithm/models, see if you can optimize your code so it
    runs faster.
    '''
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit')
    features = dataframe[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
    values = dataframe[['ENTRIESn_hourly']]
    m = len(values)
    features, mu, sigma = normalize_features(features)
    features['ones'] = np.ones(m)
    features_array = np.array(features)
    values_array = np.array(values).flatten()
    #Set values for alpha, number of iterations.
    alpha = 0.1 # please feel free to play with this value
    num_iterations = 75 # please feel free to play with this value
    #Initialize theta, perform gradient descent
    theta_gradient_descent = np.zeros(len(features.columns))
    theta_gradient_descent, cost_history = gradient_descent(features_array, values_array, theta_gradient_descent,
                                                            alpha, num_iterations)
    predictions = np.dot(features_array, theta_gradient_descent)
    return predictions


Exercise 6 Plotting Residuals

import numpy as np
import scipy
import matplotlib.pyplot as plt
def plot_residuals(turnstile_weather, predictions):
    '''
    Using the same methods that we used to plot a histogram of entries
    per hour for our data, why don't you make a histogram of the residuals
    (that is, the difference between the original hourly entry data and the predicted values).
    Based on this residual histogram, do you have any insight into how our model
    performed?  Reading a bit on this webpage might be useful:
    http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
    '''
   
    plt.figure()
    # Assumes 'ENTRIESn_hourly' holds the original hourly entry data
    (turnstile_weather['ENTRIESn_hourly'] - predictions).hist()
    return plt


Exercise 7 Compute R Squared

import numpy as np
import scipy
import matplotlib.pyplot as plt
import sys
def compute_r_squared(data, predictions):
    '''
    In exercise 5, we calculated the R^2 value for you. But why don't you try and
    and calculate the R^2 value yourself.
   
    Given a list of original data points, and also a list of predicted data points,
    write a function that will compute and return the coefficient of determination (R^2)
    for this data.  numpy.mean() and numpy.sum() might both be useful here, but
    not necessary.
    Documentation about numpy.mean() and numpy.sum() below:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
    '''
   
    # Same approach as the Lesson 3 exercise: R^2 = 1 - SS_res / SS_tot
    SS_res = ((data - predictions) ** 2).sum()
    SS_tot = ((data - np.mean(data)) ** 2).sum()
    r_squared = 1 - SS_res / SS_tot
   
    return r_squared


Exercise 8 Non Gradient Decent Linear Regression

import numpy as np
import pandas
import scipy
import statsmodels
"""
In this optional exercise, you should complete the function called
predictions(turnstile_weather). This function takes in our pandas
turnstile weather dataframe, and returns a set of predicted ridership values,
based on the other information in the dataframe. 
You should attempt to implement another type of linear regression,
that you may have read about, such as ordinary least squares regression:
http://en.wikipedia.org/wiki/Ordinary_least_squares
This is your playground. Go wild!
How does your choice of linear regression compare to linear regression
with gradient descent?
You can look at the information contained in the turnstile_weather dataframe below:
https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv
Note: due to the memory and CPU limitation of our amazon EC2 instance, we will
give you a random subset (~15%) of the data contained in turnstile_data_master_with_weather.csv
If you receive a "server has encountered an error" message, that means you are hitting
the 30 second limit that's placed on running your program. See if you can optimize your code so it
runs faster.
"""
def predictions(weather_turnstile):
    # One possible implementation (a sketch): ordinary least squares via
    # statsmodels, reusing the same feature set as the gradient-descent
    # version in Exercise 5 above.
    import statsmodels.api as sm

    dummy_units = pandas.get_dummies(weather_turnstile['UNIT'], prefix='unit')
    features = weather_turnstile[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
    features = sm.add_constant(features)
    values = weather_turnstile['ENTRIESn_hourly']

    results = sm.OLS(values, features).fit()
    prediction = results.predict(features)
    return prediction






Friday, June 20, 2014

Lesson 3 Data Analysis

Statistical Rigor
    Significant tests
    - using our data, can we disprove an assumption with a pre-defined level of confidence?

Why is statistics useful?
- They provide a formalized framework for comparing and evaluating data.
- They enable us to evaluate whether effects we observe in our dataset reflect differences across the whole population.

Statistical Significant Tests
- many tests make assumptions about the data's distribution.
- very common distribution - normal distribution (aka Gaussian distribution, bell curve)

t-test
accept or reject a null hypothesis.
Null hypothesis: a statement we are trying to disprove by running our test.
- two samples came from the same population
- a sample is drawn from a probability distribution
specified in terms of a test statistic

Test Statistic: one number that helps accept or reject the null hypothesis
t test - t

A few different versions depending on assumptions
- equal sample size?
- same variance?

Two sample test

calculate t, calculate nu (degrees of freedom), calculate p;
p value: the probability of obtaining a test statistic at least as extreme as ours if the null hypothesis were true.
set p-critical; if p < p-critical, reject the null hypothesis,
else we cannot reject the null hypothesis.

Exercise: Calculate t and Nu

Welch's t-test in Python
Is there a simple way to do this in Python?

import scipy.stats
scipy.stats.ttest_ind(list_1, list_2, equal_var=False)

scipy.stats.ttest_ind performs a two-sided test. How could we use the output to instead perform a one-sided test?
"greater": reject if p/2 < p-critical and t > 0
"less": reject if p/2 < p-critical and t < 0


Exercise Welch's t-test

import numpy
import scipy.stats
import pandas
def compare_averages(filename):
    """
    Performs a t-test on two sets of baseball data (left-handed and right-handed hitters).
    You will be given a csv file that has three columns.  A player's
    name, handedness (L for lefthanded or R for righthanded) and their
    career batting average (called 'avg'). You can look at the csv
    file via the following link:
    https://www.dropbox.com/s/xcn0u2uxm8c4n6l/baseball_data.csv
   
    Write a function that will read the csv file into a pandas data frame,
    and run Welch's t-test on the two cohorts defined by handedness.
   
    One cohort should be a data frame of right-handed batters. And the other
    cohort should be a data frame of left-handed batters.
   
    We have included the scipy.stats library to help you write
    or implement Welch's t-test:
    http://docs.scipy.org/doc/scipy/reference/stats.html
   
    With a significance level of 95%, if there is no difference
    between the two cohorts, return a tuple consisting of
    True, and then the tuple returned by scipy.stats.ttest. 
   
    If there is a difference, return a tuple consisting of
    False, and then the tuple returned by scipy.stats.ttest.
   
    For example, the tuple that you return may look like:
    (True, (9.93570222, 0.000023))
    """

Non Parametric Tests
A statistical test that does not assume our data is drawn from any particular underlying probability distribution.

Mann-Whitney U test: tests null hypothesis that the two populations are the same.
u, p = scipy.stats.mannwhitneyu(x, y)

Non-normal Data
Shapiro-Wilk test
w, p = scipy.stats.shapiro(data)

Just the tip of the iceberg
Many different statistical methods!
Data scientists have large toolkits.

Predicting Future Data
How might we use the data we've collected to make predictions about data we don't have?

Machine Learning: a branch of artificial intelligence that focuses on constructing systems that learn from large amounts of data to make predictions.

Why is machine learning useful?

Statistics vs. Machine Learning
What is the difference between statistics and machine learning?
- Statistics is focused on analyzing existing data, and drawing valid conclusions.
- Machine learning is focused on making predictions.

Types of Machine Learning
Data- Model- Predictions
Supervised Learning: e.g., detecting spam emails
- Have examples with input and output;
- Predict output for future, input-only data;
- Classification;
- Regression.

Unsupervised Learning:
- Trying to understand structure of data
- Clustering

Kurt's favorite machine learning algorithms: clustering and PCA.

Predicting HR: an example of supervised learning
- can we write an equation that takes a bunch of info (e.g., height, weight, birth year, position for baseball players) and predicts HR (home runs)? Use regression.

Linear Regression with Gradient Descent
If we have age, height, weight, and batting average, and are trying to predict batting average, what are the input variables? Age, Height, and Weight.

Gradient Descent
Cost function: J(θ), we want to minimize this!

how: alpha, the learning rate (it controls the size of each step down the slope)

Gradient Descent in Python

import numpy
import pandas
def compute_cost(features, values, theta):
    """
    Compute the cost function given a set of features / values, and values for our thetas.
    """
    m = len(values)
    sum_of_square_errors = numpy.square(numpy.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)
    return cost
def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """
    # Write some code here that updates the values of theta a number of times equal to
    # num_iterations.  Everytime you have computed the cost for a given set of thetas,
    # you should append it to cost_history.  The function should return both the final
    # values of theta and the cost history.
    # YOUR CODE GOES HERE
    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        predicted_values = numpy.dot(features, theta)
        theta = theta - alpha / m * numpy.dot((predicted_values - values), features)

        cost = compute_cost(features, values, theta)
        cost_history.append(cost)

    return theta, pandas.Series(cost_history)

Good job! Your program and algorithm worked

Theta =
[ 45.35759233  -9.02442042  13.69229668]

Cost History =
0     3748.133469
1     3727.492258
2     3707.261946
3     3687.434249
4     3668.001052
5     3648.954405
6     3630.286519
7     3611.989767
8     3594.056675
9     3576.479921
10    3559.252334
11    3542.366888
12    3525.816700
13    3509.595027
14    3493.695263
...
985    2686.730820
986    2686.730290
987    2686.729764
988    2686.729240
989    2686.728720
990    2686.728203
991    2686.727690
992    2686.727179
993    2686.726672
994    2686.726168
995    2686.725668
996    2686.725170
997    2686.724676
998    2686.724185
999    2686.723697
Length: 1000, dtype: float64


Coefficient of Determination: R squared
data: y_1, ..., y_n
predictions: f_1, ..., f_n
average of the data = ȳ

R² (R squared) = 1 - Σ(y_i - f_i)² / Σ(y_i - ȳ)²

Calculating R Squared

import numpy as np
def compute_r_squared(data, predictions):
    # Write a function that, given two input numpy arrays, 'data', and 'predictions,'
    # returns the coefficient of determination, R^2, for the model that produced
    # predictions.
    #
    # Numpy has a couple of functions -- np.mean() and np.sum() --
    # that you might find useful, but you don't have to use them.
    # YOUR CODE GOES HERE
   
    SST = ((data - np.mean(data)) ** 2).sum()
    SSRes = ((predictions - data) ** 2).sum()
    r_squared = 1 - SSRes / SST

    return r_squared


Additional Considerations
- other types of linear regression
   - ordinary least squares regression
- parameter estimation: what are the confidence intervals of our parameters? What is the likelihood we would calculate this parameter value if the parameter had no effect?
- under/ over fitting
- multiple local minima: use different random initial thetas; seed random values for repeatability.

Assignment #3
t-test: is subway ridership higher on rainy days or on weekends?
linear regression with gradient descent: how many subway riders on a particular day and time? Features: time of day, day of the week, rainy or non-rainy day, ...