Monday, September 1, 2014

Lesson 4 - Data Visualization

Show your findings as visualizations, in Python.

What is information visualization?
1. Effective communication of complex quantitative ideas.
 - clarity, precision, efficiency.
2. Helps you notice things about data (correlations, trends, etc.) that might go unnoticed.
3. Can highlight aspects of data, or "tell a story".

Napoleon's March on Russia.

What information is depicted in this visualization?
 Size of army, location of army, direction of army, and temperature on various dates during the retreat.

Components of effective visualization:
1. visual cues
2. coordinate systems
3. scale / data types
4. context

Mathematical and statistical rigor. Tell a story.

Visual Encoding (cues)
Position, Length, Angle, Direction, Shape, Area/volume, Color (hue, saturation, combination, limit hue).

Perception of visual cues, most to least accurate: 1. position, 2. angle, 3. area, 4. saturation.
1985 AT&T Bell Labs paper on graphical perception ranks the cues in this order:
position, length, angle, direction, area, volumes, saturation, hue.
Hue and saturation least accurate.

Plotting in Python:
Many packages:
1. Matplotlib
2. ggplot

We use ggplot. Why? It looks nicer and follows the grammar of graphics.
First step: create plot.
Second step: represent data with geometric objects.
Third step: add labels.
print(ggplot(data, aes(xvar, yvar)) + geom_point(color='coral') + geom_line(color='coral') + ggtitle('title') + xlab('x-lab') + ylab('y-lab'))


import pandas
from ggplot import *

def lineplot(hr_year_csv):
    # A csv file will be passed in as an argument which
    # contains two columns -- 'HR' (the number of homerun hits)
    # and 'yearID' (the year in which the homeruns were hit).
    #
    # Fill out the body of this function, lineplot, to use the
    # passed-in csv file, hr_year.csv, and create a
    # chart with points connected by lines, both colored 'red',
    # showing the number of HR by year.
    #
    # You will want to first load the csv file into a pandas dataframe
    # and use the pandas dataframe along with ggplot to create your visualization
    #
    # You can check out the data in the csv file at the link below:
    # https://www.dropbox.com/s/awgdal71hc1u06d/hr_year.csv
    #
    # You can read more about ggplot at the following link:
    # https://github.com/yhat/ggplot/
 
 
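    # Load the csv passed in as the argument into a pandas dataframe.
    hr_year = pandas.read_csv(hr_year_csv)

    # Build the plot: points and connecting lines, both red, plus labels.
    gg = (ggplot(hr_year, aes('yearID', 'HR'))
          + geom_point(color='red')
          + geom_line(color='red')
          + ggtitle('Total HRs by Year')
          + xlab('Year')
          + ylab('HR'))

    return gg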


Data Type - Numeric Data
A measurement (e.g. height, weight) or a count (e.g. HR or hits).
Numeric data is either discrete or continuous:
discrete - whole numbers only (e.g. 10, 34, 25)
continuous - any number in a range (e.g. batting averages of 0.25, 0.357, 0.511)


Data Type - Categorical Data
Represents characteristics (e.g. position, team, hometown, handedness).
Can take numerical values, but they don't have mathematical meaning.
Ordinal data - categories with some order or ranking,
e.g. Very Low, Low, High, Very High.
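
A small sketch in plain Python of why the order of ordinal categories has to be stated explicitly: sorting the labels as text gives alphabetical order, not the ranking. The observations below are made up for illustration.

```python
# Ordinal levels listed from lowest to highest rank.
LEVELS = ["Very Low", "Low", "High", "Very High"]
RANK = {level: i for i, level in enumerate(LEVELS)}

# Hypothetical observations (illustrative only).
observations = ["High", "Very Low", "Very High", "Low", "High"]

alphabetical = sorted(observations)          # treats labels as plain text
by_rank = sorted(observations, key=RANK.get)  # respects the ordinal ranking

print(alphabetical)
print(by_rank)
```

The same idea applies when plotting: the axis or legend should follow the ranking, not the alphabet.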


Data Type - Time Series Data
Data collected via repeated measurements over time.
Example: Average HR/player over many years.
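
A minimal sketch of building such a series in plain Python: group per-player records by year and average them. The records and the player IDs are made up for illustration; only the column names yearID and HR come from the exercises in these notes.

```python
from collections import defaultdict

# Hypothetical (yearID, playerID, HR) records; illustrative values only.
records = [
    (2001, "p1", 30), (2001, "p2", 10),
    (2002, "p1", 25), (2002, "p2", 15), (2002, "p3", 5),
    (2000, "p1", 20),
]


def average_hr_by_year(rows):
    """Return [(year, average HR per player)] sorted by year."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for year, _player, hr in rows:
        totals[year] += hr
        counts[year] += 1
    return [(year, totals[year] / float(counts[year]))
            for year in sorted(totals)]


series = average_hr_by_year(records)
for year, avg in series:
    print("%d  %.1f" % (year, avg))
```

Sorting by year matters: a time series is only meaningful when the measurements are in time order.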


Scales
Categorical: HR vs. Position
Time: HR vs. Months/Years

Improper use of scales
When used incorrectly, scales can mislead or confuse readers.
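
One common example is a bar chart whose value axis does not start at zero. A rough sketch in plain Python, with made-up home run totals, of how much a truncated baseline exaggerates a small difference (the function name is hypothetical):

```python
def drawn_bar_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights for values a and b when the
    value axis starts at `baseline` instead of zero."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (a - baseline) / float(b - baseline)


# Hypothetical season totals for two teams (illustrative numbers only).
team_a_hr, team_b_hr = 110, 100

honest = drawn_bar_ratio(team_a_hr, team_b_hr)         # axis starts at 0
truncated = drawn_bar_ratio(team_a_hr, team_b_hr, 90)  # axis starts at 90

print("axis from 0:  bar A is %.2fx bar B" % honest)
print("axis from 90: bar A is %.2fx bar B" % truncated)
```

A 10% difference in the data is drawn as a bar twice as tall once the axis starts at 90.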


Plotting Line Chart
import pandas
from ggplot import *

def lineplot_compare(hr_by_team_year_sf_la_csv):
    # Write a function, lineplot_compare, that will read a csv file
    # called hr_by_team_year_sf_la.csv and plot it using pandas and ggplot.
    #
    # This csv file has three columns: yearID, HR, and teamID. The data in the
    # file gives the total number of home runs hit each year by the SF Giants
    # (teamID == 'SFN') and the LA Dodgers (teamID == "LAN"). Produce a
    # visualization comparing the total home runs by year of the two teams.
    #
    # You can see the data in hr_by_team_year_sf_la_csv
    # at the link below:
    # https://www.dropbox.com/s/wn43cngo2wdle2b/hr_by_team_year_sf_la.csv
    #
    # Note that to differentiate between multiple categories on the
    # same plot in ggplot, we can pass color in with the other arguments
    # to aes, rather than in our geometry functions. For example,
    # ggplot(data, aes(xvar, yvar, color=category_var)). This should help you
    # in this exercise.
 
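    # Load the csv passed in as the argument into a pandas dataframe.
    hr_year = pandas.read_csv(hr_by_team_year_sf_la_csv)

    # Map teamID to color inside aes so each team gets its own colored line.
    gg = (ggplot(hr_year, aes('yearID', 'HR', color='teamID'))
          + geom_point()
          + geom_line()
          + ggtitle('Total HRs by Year')
          + xlab('Year')
          + ylab('HR'))

    return gg


if __name__ == '__main__':
    print(lineplot_compare('hr_by_team_year_sf_la.csv'))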


Visualizing Time Series Data
- Format for MTA + Weather Data
   - Scatter Plot vs. Additional Bells & Whistles (e.g. LOESS curve)
- Boston Red Sox Winning Percentage, 1960 - 2010

Scatter Plot

Line Chart
- mitigates some shortcomings of the scatter plot
- emphasize the trends
- focus on the year to year variability, not overall trends.

LOESS Curve
- emphasize long term trends
- LOESS is a locally weighted regression
- easier to take a quick look at chart and understand big picture.
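
To make "locally weighted regression" concrete, here is a simplified LOESS sketch in plain Python. It is not the implementation used by ggplot or any other library: it fits one weighted straight line per point with a tricube kernel and skips the robustness iterations of full LOESS. The `frac` parameter and the sample winning percentages are assumptions made up for illustration.

```python
def tricube(u):
    """Tricube kernel: 1 at u = 0, falling smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess(xs, ys, frac=0.5):
    """Return one smoothed y per x using locally weighted linear fits.

    frac (an assumed default, not from the lesson) is the share of
    points that sets the width of each local neighbourhood.
    """
    n = len(xs)
    k = min(n, max(3, int(round(frac * n))))
    smoothed = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
        sw = sx = sxx = sy = sxy = 0.0
        for x, y in zip(xs, ys):
            dx = x - x0
            w = tricube(dx / h)  # nearby points count more
            sw += w
            sx += w * dx
            sxx += w * dx * dx
            sy += w * y
            sxy += w * dx * y
        denom = sw * sxx - sx * sx
        if abs(denom) < 1e-12:
            # Too few distinct neighbours for a line: use the weighted mean.
            smoothed.append(sy / sw)
        else:
            # Intercept of the weighted line, i.e. its value at x0 (dx = 0).
            smoothed.append((sxx * sy - sx * sxy) / denom)
    return smoothed


# Made-up winning percentages for 11 seasons (illustrative only).
seasons = list(range(1, 12))
win_pct = [0.42, 0.47, 0.40, 0.49, 0.45, 0.38, 0.44, 0.57, 0.53, 0.54, 0.52]
trend = loess(seasons, win_pct, frac=0.6)
for season, raw, smooth in zip(seasons, win_pct, trend):
    print("%2d  raw=%.3f  loess=%.3f" % (season, raw, smooth))
```

A larger `frac` uses more neighbours per fit and gives a smoother curve; a smaller one follows the year-to-year variation more closely.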

Multivariate
- how to incorporate more variables?
- use an additional encoding:
   - size
   - color/saturation
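
A rough sketch, in plain Python, of turning two extra variables into size and saturation values that a plotting library could then draw. The helper names, the `max_radius` default, and the player numbers are all hypothetical. Size is scaled so that marker area, not radius, is proportional to the value, since readers tend to compare areas.

```python
import math


def marker_radius(value, max_value, max_radius=20.0):
    """Radius whose circle AREA is proportional to value.

    max_radius is a hypothetical display size for the largest value.
    """
    return max_radius * math.sqrt(value / float(max_value))


def saturation(value, lo, hi):
    """Map value linearly onto a 0..1 color saturation."""
    return (value - lo) / float(hi - lo)


# Hypothetical players: (HR, batting average, games, age); made-up numbers.
players = [(30, 0.300, 160, 25), (10, 0.250, 40, 35), (20, 0.275, 90, 30)]
max_games = max(p[2] for p in players)
ages = [p[3] for p in players]

for hr, avg, games, age in players:
    # x = HR, y = batting average, size = games played, saturation = age.
    print("x=%d y=%.3f radius=%.1f saturation=%.2f" % (
        hr, avg,
        marker_radius(games, max_games),
        saturation(age, min(ages), max(ages))))
```

Since hue and saturation are the least accurately perceived cues, they suit the variable where only a rough comparison is needed.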


Search Google for "data visualization" to find more blogs and materials.





