Teaching Materials

POL 211 - Introductory Graduate Methods

Here you can find discussion section materials for POL 211 (Fall 2025) at UC Davis. This content is adapted from materials prepared and shared by Professors Yu-Shiuan (Lily) Huang (National Chengchi University) and Ryan Hubert (London School of Economics) for prior iterations of this course.

For today’s discussion, we’ll first introduce LaTeX and R Markdown/Quarto, two very useful typesetting systems for producing neat documents for your problem sets and research projects. The second part of discussion will review some basic R skills for exploring a new dataset.

1. LaTeX

LaTeX is a high-quality typesetting system; it includes features designed for the production of technical and scientific documentation. LaTeX is the de facto standard for the communication and publication of scientific documents.

If you prefer not to download a LaTeX compiler and editor to your laptop, you can use Overleaf, an online LaTeX editor that is easy to use and makes it simple to work on projects with multiple collaborators. It also provides many templates for you to choose from. Here is a brief introduction video on how to use Overleaf.

  • To create a nice table in Overleaf, TablesGenerator is very helpful. You can also use the stargazer package in R to create regression tables in LaTeX form and copy them into Overleaf. Here is a handout on how to use the stargazer package (we will talk more about this in the future).
  • To insert math equations in Overleaf, the LaTeX Equation Editor would be your go-to.

I will send around an Overleaf template for the first problem set if you want to use it for your work. Alternatively…
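For reference, a minimal LaTeX document has a preamble (document class and packages) followed by a document body. A bare-bones problem set file might look like this sketch, where the packages, title, and equation are just placeholders:

```latex
\documentclass{article}
\usepackage{amsmath} % extra math environments

\title{Problem Set 1}
\author{Your Name}

\begin{document}
\maketitle

\section*{Question 1}
Inline math like $\bar{x}$ goes between dollar signs; displayed equations
can use the equation environment:
\begin{equation}
  \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\end{equation}

\end{document}
```

In Overleaf you would paste this into a blank project and click Recompile to see the typeset PDF.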

2. Quarto/R Markdown

Another way to submit both your answers and your R code in a single PDF document is by using R Markdown or Quarto. As R Markdown is no longer being updated, we will be working primarily with Quarto and Quarto documents (.qmd files).

Quarto is an open-source publishing system that allows you to create dynamic documents in the R programming language, as well as other languages. With Quarto, you can easily combine text, code, and visualizations in a single document. Quarto documents can be rendered in a variety of output formats, including HTML, PDF, Microsoft Word, and more.

Quarto provides a way to weave together narrative text, code, and results into a single document that can be easily shared and reproduced. It allows you to write code and see the results of that code inline, making it easy to explain complex statistical analyses or data visualizations.
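A minimal Quarto document is just plain text with a YAML header at the top and code chunks in the body. Here is a sketch; the title, author, and chunk contents are placeholders:

````markdown
---
title: "Problem Set 1"
author: "Your Name"
format: pdf
---

## Question 1

Narrative text (your written answers) goes here.

```{r}
# R code chunks are executed when the document is rendered,
# and their output appears directly below them
mean(c(2, 4, 6))
```
````

When you render this file, Quarto runs the chunk and weaves the result into the output document.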

Here is a thorough introduction to how to use Quarto.

For the following topic on exploring a new dataset using R, all the materials will be shown in Quarto format. You are also more than welcome to upload your problem set:

  • by rendering your Quarto document to HTML first, then opening it in a browser and saving it as a PDF (just click the Render button!).
  • by rendering your Quarto document to PDF directly, if you have already installed a LaTeX compiler on your computer.

When rendering to PDF, Quarto typesets all of the contents with LaTeX automatically. If you don’t have a LaTeX compiler installed on your machine, you will encounter an error message when attempting to render your Quarto document to PDF. If you need more help, here is a guide I found online that provides step-by-step instructions on how to do it.

I will also be uploading a Quarto template for the problem sets to the Canvas page.

3. Exploring a New Dataset using R (Part I)

a. Load R packages

There are many useful packages in R that can be installed for free. The first time you use a package, there are two steps:

  • Install it.
  • Load it with the library function (which tells R you are going to use it).

Once you have installed a package, you don’t have to install it again every time you open RStudio. However, you do still have to load it with the library function in every new session.

# Install packages
# install.packages("tidyverse") # manipulate data 
# install.packages("Matrix") # calculate rank
## these are the packages we will use for today 

# load packages
library(tidyverse)
library(Matrix) 

The tidyverse is a powerful collection of R packages for transforming and visualizing data. All packages in the tidyverse share an underlying design philosophy and common APIs.

The core packages within the tidyverse are:

  • ggplot2 implements the grammar of graphics. You can use it to visualize your data.
  • dplyr is a grammar of data manipulation. You can use it to solve the most common data manipulation challenges.
  • tidyr helps you to create tidy data: data where each variable is a column, each observation is a row, and each value is a cell.
  • readr is a fast and friendly way to read rectangular data.
  • purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.
  • tibble is a modern re-imagining of the data frame.
  • stringr provides a cohesive set of functions designed to make working with strings as easy as possible.
  • forcats provides a suite of useful tools that solve common problems with factors.

I highly recommend this data wrangling cheat sheet for dplyr and tidyr in R!

b. Import data

For today, we are using USDC-DATASET-JOP.csv. You can download this csv file from Canvas (in the Discussion Section 2025 folder) or from here.

# Import data
df <- read_csv("POL 211/data/USDC-DATASET-JOP.csv") # Here we assign the dataset as an object called “df”

## Note: you create new objects by assigning them with the `<-` operator. You can also use an equals sign `=`.

Typically, we want to store all of our data, files, and code in an R Project. This avoids a lot of the messiness of working directories, file paths, and losing track of where you saved things.

However, if you are working outside an R Project, it might be helpful to understand how working directories work!

Here are some tips for how to find the file path of the data you are going to use.

  • First, where did you store your data? In which folder? In my case, I stored the data in the data folder within the POL 211 folder, all within the folder for this R Project.

  • Second, use the getwd function to get a sense of what a file path looks like. The getwd function returns an absolute file path representing the current working directory of the R process.

  getwd()
[1] "C:/Users/alexc/Desktop/Website Session/cohen-site"

As you can see, my current working directory is "C:/Users/alexc/Desktop/Website Session/cohen-site", which is where this website and R Project are stored on my computer. Since I know that I stored the data in the data folder, which is in the POL 211 folder, I can simply change the file path to "C:/Users/alexc/Desktop/Website Session/cohen-site/POL 211/data/USDC-DATASET-JOP.csv" and put it in the read_csv function so R knows where to find the file.

I can also manually change the working directory to tell R which folder on my computer I want to import files from and export files to. You can do this with the setwd() command, or through the drop-down menu (Session > Set Working Directory > Choose Directory).

If you import the data successfully, you should be able to see df popping up in the environment section.

Sometimes the dataset we are interested in is stored in a different format, such as dta, spss, xlsx, rds, rdata, etc. You can find the corresponding code here.
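For example, the haven package reads Stata and SPSS files, and the readxl package reads Excel files; rds and RData files can be read with base R. A sketch, where the file names below are placeholders:

```r
# install.packages("haven")   # read Stata (.dta) and SPSS (.sav) files
# install.packages("readxl")  # read Excel (.xlsx) files
library(haven)
library(readxl)

df_stata <- read_dta("data/mydata.dta")     # Stata
df_spss  <- read_sav("data/mydata.sav")     # SPSS
df_excel <- read_excel("data/mydata.xlsx")  # Excel

df_rds <- readRDS("data/mydata.rds")        # R's native single-object format
load("data/mydata.RData")                   # loads objects under their saved names
```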

c. Basic information about the dataset

What is the size of the dataset? First, how many observations (rows) do we have?

# the number of row is:
nrow(df)
[1] 97725

And how many variables (columns) do we have?

# the number of column is:
ncol(df)
[1] 54

You can also use the dim function to get the size of the dataset.

dim(df) 
[1] 97725    54
## this function gives you the number of rows and columns at the same time

The size of this dataset is 97,725 x 54. As a reminder, datasets read into R through the tidyverse (known as tibbles) report this information in the console.

df
# A tibble: 97,725 × 54
   file     OUTCOME jrepublican jpresident jid_anon jyears APPEAL judge_replaced
   <chr>    <chr>         <dbl> <chr>         <dbl>  <dbl>  <dbl>          <dbl>
 1 /docket… jud_de…           1 Reagan        90059     13      0              0
 2 /docket… dism_o…           1 Reagan        90048      9      0              0
 3 /docket… dism_p…           1 Reagan        90012     11      0              0
 4 /docket… dism_o…           1 Bush41        90016      3      0              0
 5 /docket… dism_v…           1 Reagan        90096     10      0              0
 6 /docket… dism_v…           1 Reagan        90158     11      0              0
 7 /docket… settle…           1 Reagan        90158     11      0              0
 8 /docket… settle…           1 Ford          90218     19      0              0
 9 /docket… jud_mo…           1 Nixon         90148     24      0              0
10 /docket… dism_j…           1 Reagan        90005     13      0              0
# ℹ 97,715 more rows
# ℹ 46 more variables: jsenior <dbl>, court <chr>, division <dbl>, YEAR <dbl>,
#   QUARTER <chr>, nature_of_suit <dbl>, SECTION <chr>, ORIGIN <dbl>,
#   JURIS <chr>, jury_demand <chr>, PROSE <dbl>, COUNTY <chr>, pla_count <dbl>,
#   pla_count_prose <dbl>, pla_count_anonymous <dbl>, pla_count_IND <dbl>,
#   pla_count_BUS <dbl>, pla_count_GOV <dbl>, pla_count_LAW <dbl>,
#   pla_count_OTH <dbl>, pla_count_repeat <dbl>, pla_acount <dbl>, …

If you want to look at the whole dataset, use the view(df) command and it will open in a new tab in RStudio. (Note: the view(df) and View(df) functions will both achieve this. The lowercase view is part of the tidyverse, and the uppercase comes from base R.)

d. Exploring the variables

Most people prefer to look at variables using their names.

How do we ask R to tell us the name of each variable?

# Name of each column
colnames(df)
 [1] "file"                "OUTCOME"             "jrepublican"        
 [4] "jpresident"          "jid_anon"            "jyears"             
 [7] "APPEAL"              "judge_replaced"      "jsenior"            
[10] "court"               "division"            "YEAR"               
[13] "QUARTER"             "nature_of_suit"      "SECTION"            
[16] "ORIGIN"              "JURIS"               "jury_demand"        
[19] "PROSE"               "COUNTY"              "pla_count"          
[22] "pla_count_prose"     "pla_count_anonymous" "pla_count_IND"      
[25] "pla_count_BUS"       "pla_count_GOV"       "pla_count_LAW"      
[28] "pla_count_OTH"       "pla_count_repeat"    "pla_acount"         
[31] "pla_acount_repeat"   "def_count"           "def_count_prose"    
[34] "def_count_anonymous" "def_count_IND"       "def_count_BUS"      
[37] "def_count_GOV"       "def_count_LAW"       "def_count_OTH"      
[40] "def_count_repeat"    "def_acount"          "def_acount_repeat"  
[43] "oth_count"           "oth_count_prose"     "oth_count_anonymous"
[46] "oth_count_IND"       "oth_count_BUS"       "oth_count_GOV"      
[49] "oth_count_LAW"       "oth_count_OTH"       "oth_count_repeat"   
[52] "oth_acount"          "oth_acount_repeat"   "ifp_denial"         

How do we call out one specific variable from the dataset? There are multiple ways to look at an arbitrary variable:

##tidyverse

df %>% 
  select(OUTCOME)

##base R

## Use column number
df[, 2] 
#Alternatively, df[[2]]

## Use variable name
df[, "OUTCOME"]
#Alternatively, df[["OUTCOME"]]

## Calling a variable using $
df$OUTCOME ## This is the most common way to do it

What kind of variable is the OUTCOME variable? It looks like a categorical variable! We can (and often should) check the variable class.

##tidyverse
df %>% 
  select(OUTCOME) %>% 
  map_chr(type_sum)
OUTCOME 
  "chr" 
##base R
class(df$OUTCOME)
[1] "character"
#OR

type_sum(df$OUTCOME)
[1] "chr"

The next question is: what are the categories (called “levels” in R) of this variable?

# This asks R to give us a list of all the values used in this variable.

##tidyverse
df %>% 
  distinct(OUTCOME)
# A tibble: 34 × 1
   OUTCOME              
   <chr>                
 1 jud_default_plaintiff
 2 dism_other           
 3 dism_pros            
 4 dism_vol             
 5 settlement           
 6 jud_motion_defendant 
 7 dism_juris           
 8 remand               
 9 jud_misc_NA          
10 jud_misc_both        
# ℹ 24 more rows
##base R
unique(df$OUTCOME)
 [1] "jud_default_plaintiff"  "dism_other"             "dism_pros"             
 [4] "dism_vol"               "settlement"             "jud_motion_defendant"  
 [7] "dism_juris"             "remand"                 "jud_misc_NA"           
[10] "jud_misc_both"          "jud_motion_plaintiff"   "jud_misc_defendant"    
[13] "jud_jury_defendant"     "transfer"               "jud_bench_defendant"   
[16] "jud_misc_plaintiff"     "jud_consent_plaintiff"  "jud_jury_plaintiff"    
[19] "jud_default_both"       "jud_motion_both"        "jud_default_defendant" 
[22] "jud_bench_plaintiff"    "jud_jury_both"          "jud_consent_both"      
[25] "jud_bench_both"         "jud_directed_defendant" "jud_consent_defendant" 
[28] "jud_motion_NA"          "jud_jury_NA"            "jud_bench_NA"          
[31] "jud_consent_NA"         "jud_directed_plaintiff" "jud_default_NA"        
[34] "jud_directed_NA"       

There are 34 categories in this variable.

Here’s an issue: when R loaded the dataset, it didn’t code this variable as a categorical variable. Since the OUTCOME variable is coded as strings/text in the dataset, its type in R is character.

# Look at what this says about the variable type:
type_sum(df$OUTCOME)
[1] "chr"
# When you summarize it in R, R doesn’t provide you with meaningful information
summary(df$OUTCOME)
   Length     Class      Mode 
    97725 character character 

The good news is that we can transform the type of a variable in R.

# let's change the type to a categorical variable, known as a "factor" variable in R:

##tidyverse
df <- df %>% 
  mutate(OUTCOME = as.factor(OUTCOME))

##base R
df$OUTCOME <- as.factor(df$OUTCOME) 

## In order to override the original variable, we have to assign the new version of the variable to the original one.

# After factorizing, R can summarize in a more meaningful way: what is the number of each category?
summary(df$OUTCOME)
            dism_juris             dism_other              dism_pros 
                  1321                  16198                   3123 
              dism_vol         jud_bench_both    jud_bench_defendant 
                 17805                     10                    192 
          jud_bench_NA    jud_bench_plaintiff       jud_consent_both 
                    22                    110                    199 
 jud_consent_defendant         jud_consent_NA  jud_consent_plaintiff 
                    48                     93                    391 
      jud_default_both  jud_default_defendant         jud_default_NA 
                     5                     69                      7 
 jud_default_plaintiff jud_directed_defendant        jud_directed_NA 
                   228                     34                      1 
jud_directed_plaintiff          jud_jury_both     jud_jury_defendant 
                     6                     73                   1260 
           jud_jury_NA     jud_jury_plaintiff          jud_misc_both 
                   121                    528                    399 
    jud_misc_defendant            jud_misc_NA     jud_misc_plaintiff 
                  1355                   2239                    413 
       jud_motion_both   jud_motion_defendant          jud_motion_NA 
                   428                  10510                    821 
  jud_motion_plaintiff                 remand             settlement 
                   629                   4546                  32627 
              transfer 
                  1914 

e. Exploring the observations

Each row (observation) in this dataset is a court case. We can also use R to look at an arbitrary one.

df[5,] # This means that we want to call out the fifth row of the data.
# A tibble: 1 × 54
  file      OUTCOME jrepublican jpresident jid_anon jyears APPEAL judge_replaced
  <chr>     <fct>         <dbl> <chr>         <dbl>  <dbl>  <dbl>          <dbl>
1 /dockets… dism_v…           1 Reagan        90096     10      0              0
# ℹ 46 more variables: jsenior <dbl>, court <chr>, division <dbl>, YEAR <dbl>,
#   QUARTER <chr>, nature_of_suit <dbl>, SECTION <chr>, ORIGIN <dbl>,
#   JURIS <chr>, jury_demand <chr>, PROSE <dbl>, COUNTY <chr>, pla_count <dbl>,
#   pla_count_prose <dbl>, pla_count_anonymous <dbl>, pla_count_IND <dbl>,
#   pla_count_BUS <dbl>, pla_count_GOV <dbl>, pla_count_LAW <dbl>,
#   pla_count_OTH <dbl>, pla_count_repeat <dbl>, pla_acount <dbl>,
#   pla_acount_repeat <dbl>, def_count <dbl>, def_count_prose <dbl>, …
# If you want to see all the variables for that observation, here are some options:
t(df[5,]) ## downside: this treats all the values like strings!
                    [,1]                                  
file                "/dockets-cacd/1995/199501-00068.html"
OUTCOME             "dism_vol"                            
jrepublican         "1"                                   
jpresident          "Reagan"                              
jid_anon            "90096"                               
jyears              "10"                                  
APPEAL              "0"                                   
judge_replaced      "0"                                   
jsenior             "0"                                   
court               "cacd"                                
division            "2"                                   
YEAR                "1995"                                
QUARTER             "1995Q1"                              
nature_of_suit      "442"                                 
SECTION             "1983"                                
ORIGIN              "1"                                   
JURIS               "federal_question"                    
jury_demand         "Plaintiff"                           
PROSE               NA                                    
COUNTY              "06037"                               
pla_count           "1"                                   
pla_count_prose     "0"                                   
pla_count_anonymous "0"                                   
pla_count_IND       "0.7538"                              
pla_count_BUS       "0.1453"                              
pla_count_GOV       "0.0485"                              
pla_count_LAW       "0.0315"                              
pla_count_OTH       "0.021"                               
pla_count_repeat    "0"                                   
pla_acount          "1"                                   
pla_acount_repeat   "1"                                   
def_count           "4"                                   
def_count_prose     "0"                                   
def_count_anonymous "1"                                   
def_count_IND       "2.7986"                              
def_count_BUS       "0.7653"                              
def_count_GOV       "0.2055"                              
def_count_LAW       "0.1402"                              
def_count_OTH       "0.0903"                              
def_count_repeat    "1"                                   
def_acount          "0"                                   
def_acount_repeat   "0"                                   
oth_count           "0"                                   
oth_count_prose     "0"                                   
oth_count_anonymous "0"                                   
oth_count_IND       "0"                                   
oth_count_BUS       "0"                                   
oth_count_GOV       "0"                                   
oth_count_LAW       "0"                                   
oth_count_OTH       "0"                                   
oth_count_repeat    "0"                                   
oth_acount          "0"                                   
oth_acount_repeat   "0"                                   
ifp_denial          "0"                                   
as.data.frame(df[5,]) ## downside: sort of ugly and hard to see
                                  file  OUTCOME jrepublican jpresident jid_anon
1 /dockets-cacd/1995/199501-00068.html dism_vol           1     Reagan    90096
  jyears APPEAL judge_replaced jsenior court division YEAR QUARTER
1     10      0              0       0  cacd        2 1995  1995Q1
  nature_of_suit SECTION ORIGIN            JURIS jury_demand PROSE COUNTY
1            442    1983      1 federal_question   Plaintiff    NA  06037
  pla_count pla_count_prose pla_count_anonymous pla_count_IND pla_count_BUS
1         1               0                   0        0.7538        0.1453
  pla_count_GOV pla_count_LAW pla_count_OTH pla_count_repeat pla_acount
1        0.0485        0.0315         0.021                0          1
  pla_acount_repeat def_count def_count_prose def_count_anonymous def_count_IND
1                 1         4               0                   1        2.7986
  def_count_BUS def_count_GOV def_count_LAW def_count_OTH def_count_repeat
1        0.7653        0.2055        0.1402        0.0903                1
  def_acount def_acount_repeat oth_count oth_count_prose oth_count_anonymous
1          0                 0         0               0                   0
  oth_count_IND oth_count_BUS oth_count_GOV oth_count_LAW oth_count_OTH
1             0             0             0             0             0
  oth_count_repeat oth_acount oth_acount_repeat ifp_denial
1                0          0                 0          0
## To draw out an element from a data frame, you have to specify a row number and a column number.
df[5, 1]
# A tibble: 1 × 1
  file                                
  <chr>                               
1 /dockets-cacd/1995/199501-00068.html

Usually, you only need to see some of the variables. For example: suppose we want to know which judge heard this case (jid_anon), what the case outcome was (OUTCOME) and the party of the judge’s appointing president (jrepublican).

df[5, c("jid_anon", "OUTCOME", "jrepublican")]
# A tibble: 1 × 3
  jid_anon OUTCOME  jrepublican
     <dbl> <fct>          <dbl>
1    90096 dism_vol           1

A neater way to write the code is by using the tidyverse package.

# using pipes to do the same thing; use select function from tidyverse to pick variables 
df[5, ] %>% select(jid_anon, OUTCOME, jrepublican)
# A tibble: 1 × 3
  jid_anon OUTCOME  jrepublican
     <dbl> <fct>          <dbl>
1    90096 dism_vol           1

%>% is called a “pipe.” Pipes take the output from one function and feed it to the first argument of the next function.
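As a small illustration, these two calls are equivalent; the pipe just moves the left-hand side into the function’s first argument slot:

```r
sqrt(16)        # the usual way: the argument goes inside the call
16 %>% sqrt()   # piped: 16 is fed to sqrt's first argument; both return 4

# Pipes are most useful for chaining several steps without nesting:
df %>%
  filter(jrepublican == 1) %>%    # first keep Republican-appointed judges
  select(jid_anon, OUTCOME)       # then keep only two columns
```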

We can also find a row by “filtering” the data based on some criteria.

What if I want to find a case by its ID number? E.g., “/dockets-cacd/2007/200738-00195.html”

# use filter function from tidyverse to set the condition 
df %>% filter(file == "/dockets-cacd/2007/200738-00195.html")
# A tibble: 1 × 54
  file      OUTCOME jrepublican jpresident jid_anon jyears APPEAL judge_replaced
  <chr>     <fct>         <dbl> <chr>         <dbl>  <dbl>  <dbl>          <dbl>
1 /dockets… settle…           1 Bush43        90104      0      0              0
# ℹ 46 more variables: jsenior <dbl>, court <chr>, division <dbl>, YEAR <dbl>,
#   QUARTER <chr>, nature_of_suit <dbl>, SECTION <chr>, ORIGIN <dbl>,
#   JURIS <chr>, jury_demand <chr>, PROSE <dbl>, COUNTY <chr>, pla_count <dbl>,
#   pla_count_prose <dbl>, pla_count_anonymous <dbl>, pla_count_IND <dbl>,
#   pla_count_BUS <dbl>, pla_count_GOV <dbl>, pla_count_LAW <dbl>,
#   pla_count_OTH <dbl>, pla_count_repeat <dbl>, pla_acount <dbl>,
#   pla_acount_repeat <dbl>, def_count <dbl>, def_count_prose <dbl>, …

What if I want to find all cases heard by judges appointed by Bush 43?

df %>% filter(jpresident == "Bush43")
# A tibble: 17,622 × 54
   file     OUTCOME jrepublican jpresident jid_anon jyears APPEAL judge_replaced
   <chr>    <fct>         <dbl> <chr>         <dbl>  <dbl>  <dbl>          <dbl>
 1 /docket… jud_mo…           1 Bush43        90131      1      0              0
 2 /docket… dism_v…           1 Bush43        90131      0      0              0
 3 /docket… dism_p…           1 Bush43        90131      0      0              0
 4 /docket… dism_j…           1 Bush43        90131      0      0              0
 5 /docket… dism_o…           1 Bush43        90159      0      0              0
 6 /docket… dism_p…           1 Bush43        90159      0      0              0
 7 /docket… dism_o…           1 Bush43        90159      0      0              0
 8 /docket… dism_o…           1 Bush43        90159      0      0              0
 9 /docket… dism_o…           1 Bush43        90131      0      1              0
10 /docket… dism_o…           1 Bush43        90159      0      1              1
# ℹ 17,612 more rows
# ℹ 46 more variables: jsenior <dbl>, court <chr>, division <dbl>, YEAR <dbl>,
#   QUARTER <chr>, nature_of_suit <dbl>, SECTION <chr>, ORIGIN <dbl>,
#   JURIS <chr>, jury_demand <chr>, PROSE <dbl>, COUNTY <chr>, pla_count <dbl>,
#   pla_count_prose <dbl>, pla_count_anonymous <dbl>, pla_count_IND <dbl>,
#   pla_count_BUS <dbl>, pla_count_GOV <dbl>, pla_count_LAW <dbl>,
#   pla_count_OTH <dbl>, pla_count_repeat <dbl>, pla_acount <dbl>, …

The reason we need quotation marks when setting the condition in the filter function is that both the file and jpresident variables are coded as character (categorical) in R. If we filter cases based on whether the judge was appointed by a Republican president (jrepublican == 1), we don’t need quotation marks in the condition, because this variable is coded as a double (dbl) in R, which is a class of numeric.

df %>% filter(jrepublican == 1)
# A tibble: 42,679 × 54
   file     OUTCOME jrepublican jpresident jid_anon jyears APPEAL judge_replaced
   <chr>    <fct>         <dbl> <chr>         <dbl>  <dbl>  <dbl>          <dbl>
 1 /docket… jud_de…           1 Reagan        90059     13      0              0
 2 /docket… dism_o…           1 Reagan        90048      9      0              0
 3 /docket… dism_p…           1 Reagan        90012     11      0              0
 4 /docket… dism_o…           1 Bush41        90016      3      0              0
 5 /docket… dism_v…           1 Reagan        90096     10      0              0
 6 /docket… dism_v…           1 Reagan        90158     11      0              0
 7 /docket… settle…           1 Reagan        90158     11      0              0
 8 /docket… settle…           1 Ford          90218     19      0              0
 9 /docket… jud_mo…           1 Nixon         90148     24      0              0
10 /docket… dism_j…           1 Reagan        90005     13      0              0
# ℹ 42,669 more rows
# ℹ 46 more variables: jsenior <dbl>, court <chr>, division <dbl>, YEAR <dbl>,
#   QUARTER <chr>, nature_of_suit <dbl>, SECTION <chr>, ORIGIN <dbl>,
#   JURIS <chr>, jury_demand <chr>, PROSE <dbl>, COUNTY <chr>, pla_count <dbl>,
#   pla_count_prose <dbl>, pla_count_anonymous <dbl>, pla_count_IND <dbl>,
#   pla_count_BUS <dbl>, pla_count_GOV <dbl>, pla_count_LAW <dbl>,
#   pla_count_OTH <dbl>, pla_count_repeat <dbl>, pla_acount <dbl>, …

f. Some measurement issues

What kinds of variables do we have?

There are slight differences between the way a computer stores/understands data and the way we conceptualize it. In this dataset, we have some variables on cardinal scales.

For example: How many plaintiffs are named in each case? (Why is this on a cardinal scale?)

head(df$pla_count)
[1] 1 1 1 1 1 1

What is the data type? It’s “dbl”, which means it’s a number that need not be an integer (it can be fractional, e.g., 3.51).

type_sum(df$pla_count)
[1] "dbl"
#Remember, we can also do this through tidyverse, although it takes more code
df %>% 
  select(pla_count) %>% 
  map_chr(type_sum)
pla_count 
    "dbl" 
#If we use the base R class function here, we get a slightly different result

class(df$pla_count)
[1] "numeric"

Note: this variable actually consists of integers. We could convert it if we want, but this doesn’t really matter. (Why? 11.0 = 11.)

df$pla_count <- as.integer(df$pla_count)
type_sum(df$pla_count)
[1] "int"
class(df$pla_count)
[1] "integer"

We also have some variables that are ordinal.

For example, in what years were cases filed? (Why is this on an ordinal scale?)

head(df$YEAR)
[1] 1995 1995 1995 1995 1995 1995

What is the data type? It’s also “dbl”.

type_sum(df$YEAR)
[1] "dbl"

Admittedly, this is bad! We do not want R to treat this as a cardinal variable.

For example, it (incorrectly) allows us to add them:

df$YEAR[1] + df$YEAR[2]
[1] 3990
## Notice here the change in indexing! Remember that when you call out one variable from the data, it is a vector. To draw out an element from a vector, you only have to specify one number inside [ ].

#If we wanted to do this in tidyverse...

df %>% 
  select(YEAR) %>% 
  slice(1,2) %>% 
  sum()
[1] 3990

So, we should change it to a factor variable so that we don’t make this mistake.

df$YEAR <- as.factor(df$YEAR)
df$YEAR[1] + df$YEAR[2] 
[1] NA
#tidyverse
df <- df %>% 
  mutate(YEAR = as.factor(YEAR))

We got a warning (and an NA) when we tried to do something silly, which is good!

In the analysis for Ryan’s paper, they had to convert the OUTCOME variable to dummies. How can we do it in R?

Recall that there are 34 unique categories in the OUTCOME variable.

unique(df$OUTCOME) 
 [1] jud_default_plaintiff  dism_other             dism_pros             
 [4] dism_vol               settlement             jud_motion_defendant  
 [7] dism_juris             remand                 jud_misc_NA           
[10] jud_misc_both          jud_motion_plaintiff   jud_misc_defendant    
[13] jud_jury_defendant     transfer               jud_bench_defendant   
[16] jud_misc_plaintiff     jud_consent_plaintiff  jud_jury_plaintiff    
[19] jud_default_both       jud_motion_both        jud_default_defendant 
[22] jud_bench_plaintiff    jud_jury_both          jud_consent_both      
[25] jud_bench_both         jud_directed_defendant jud_consent_defendant 
[28] jud_motion_NA          jud_jury_NA            jud_bench_NA          
[31] jud_consent_NA         jud_directed_plaintiff jud_default_NA        
[34] jud_directed_NA       
34 Levels: dism_juris dism_other dism_pros dism_vol ... transfer

When we make each category of this variable into a dummy, we need to create 34 new dummy variables (34 new columns in the dataset). For example, the new dummy variable dism_other will be coded as 1 if the OUTCOME variable is coded as dism_other, and 0 otherwise. Let’s use the ifelse function to create this dummy!

df$dism_other <- ifelse(df$OUTCOME == "dism_other", 1, 0)

#tidyverse

df <- df %>% 
  mutate(dism_other = case_when(OUTCOME == "dism_other" ~ 1, 
                                TRUE ~ 0))

Let’s check if we did it correctly.

df %>% select(OUTCOME, dism_other) %>% head()
# A tibble: 6 × 2
  OUTCOME               dism_other
  <fct>                      <dbl>
1 jud_default_plaintiff          0
2 dism_other                     1
3 dism_pros                      0
4 dism_other                     1
5 dism_vol                       0
6 dism_vol                       0

But there are 34 categories in the OUTCOME variable, and it would be a waste of time to retype the above for each category one by one. Luckily, R is here to help us! We can run a for loop to ask R to do the same thing for all 34 categories.

for(v in levels(df$OUTCOME)){
  print(v)
  df[[v]] <- ifelse(df$OUTCOME == v, 1, 0) ## df[[v]] is a way to call out/create a variable
} ## This is called a "for loop."
[1] "dism_juris"
[1] "dism_other"
[1] "dism_pros"
[1] "dism_vol"
[1] "jud_bench_both"
[1] "jud_bench_defendant"
[1] "jud_bench_NA"
[1] "jud_bench_plaintiff"
[1] "jud_consent_both"
[1] "jud_consent_defendant"
[1] "jud_consent_NA"
[1] "jud_consent_plaintiff"
[1] "jud_default_both"
[1] "jud_default_defendant"
[1] "jud_default_NA"
[1] "jud_default_plaintiff"
[1] "jud_directed_defendant"
[1] "jud_directed_NA"
[1] "jud_directed_plaintiff"
[1] "jud_jury_both"
[1] "jud_jury_defendant"
[1] "jud_jury_NA"
[1] "jud_jury_plaintiff"
[1] "jud_misc_both"
[1] "jud_misc_defendant"
[1] "jud_misc_NA"
[1] "jud_misc_plaintiff"
[1] "jud_motion_both"
[1] "jud_motion_defendant"
[1] "jud_motion_NA"
[1] "jud_motion_plaintiff"
[1] "remand"
[1] "settlement"
[1] "transfer"
#we could also do this using tidyverse
#in this case, it's more complicated and has some weirder syntax

for(v in levels(df$OUTCOME)){
  print(v)
  df <- df %>% 
    mutate({{v}} := case_when(OUTCOME == {{v}} ~ 1,
                         TRUE ~ 0))
}
[1] "dism_juris"
[1] "dism_other"
[1] "dism_pros"
[1] "dism_vol"
[1] "jud_bench_both"
[1] "jud_bench_defendant"
[1] "jud_bench_NA"
[1] "jud_bench_plaintiff"
[1] "jud_consent_both"
[1] "jud_consent_defendant"
[1] "jud_consent_NA"
[1] "jud_consent_plaintiff"
[1] "jud_default_both"
[1] "jud_default_defendant"
[1] "jud_default_NA"
[1] "jud_default_plaintiff"
[1] "jud_directed_defendant"
[1] "jud_directed_NA"
[1] "jud_directed_plaintiff"
[1] "jud_jury_both"
[1] "jud_jury_defendant"
[1] "jud_jury_NA"
[1] "jud_jury_plaintiff"
[1] "jud_misc_both"
[1] "jud_misc_defendant"
[1] "jud_misc_NA"
[1] "jud_misc_plaintiff"
[1] "jud_motion_both"
[1] "jud_motion_defendant"
[1] "jud_motion_NA"
[1] "jud_motion_plaintiff"
[1] "remand"
[1] "settlement"
[1] "transfer"
ncol(df)
[1] 88

You can see that the number of columns increased from 54 to 88 after creating the 34 dummies.
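As an aside, base R’s model.matrix() can build all the dummies in a single call, without an explicit loop. A minimal sketch on a toy factor (a stand-in for df$OUTCOME, which isn’t rebuilt here):

```r
## Toy factor standing in for df$OUTCOME
outcome <- factor(c("dism_other", "settlement", "dism_other", "remand"))

## The formula ~ outcome - 1 drops the intercept, so model.matrix()
## returns one 0/1 indicator column per factor level
dummies <- model.matrix(~ outcome - 1)

colnames(dummies)               ## one column per level
dummies[, "outcomedism_other"]  ## 1 0 1 0
```

Note that the column names are prefixed with the variable name; they can be cleaned up with sub() if needed.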

g. The rank of a matrix

The rank of a dataset, viewed as a matrix, is the number of “linearly independent” variables it contains.

library(Matrix) ## This package has a function for calculating the rank of a matrix

# Let's create a fake dataset
df2 <- data.frame(var1 = seq(1, 100, 3)) ## Creates 34 rows

df2$var2 <- (df2$var1/pi) ## create a new variable called var2 in df2
pi ## pi is a built-in R constant, approximately equal to 3.141593
[1] 3.141593
df2$var3 <- 6 
df2
   var1       var2 var3
1     1  0.3183099    6
2     4  1.2732395    6
3     7  2.2281692    6
4    10  3.1830989    6
5    13  4.1380285    6
6    16  5.0929582    6
7    19  6.0478878    6
8    22  7.0028175    6
9    25  7.9577472    6
10   28  8.9126768    6
11   31  9.8676065    6
12   34 10.8225361    6
13   37 11.7774658    6
14   40 12.7323954    6
15   43 13.6873251    6
16   46 14.6422548    6
17   49 15.5971844    6
18   52 16.5521141    6
19   55 17.5070437    6
20   58 18.4619734    6
21   61 19.4169031    6
22   64 20.3718327    6
23   67 21.3267624    6
24   70 22.2816920    6
25   73 23.2366217    6
26   76 24.1915513    6
27   79 25.1464810    6
28   82 26.1014107    6
29   85 27.0563403    6
30   88 28.0112700    6
31   91 28.9661996    6
32   94 29.9211293    6
33   97 30.8760590    6
34  100 31.8309886    6
#tidyverse
df2 <- tibble(var1 = seq(1, 100, 3))
df2 <- df2 %>% 
  mutate(var2 = var1/pi, var3 = 6)
df2
# A tibble: 34 × 3
    var1  var2  var3
   <dbl> <dbl> <dbl>
 1     1 0.318     6
 2     4 1.27      6
 3     7 2.23      6
 4    10 3.18      6
 5    13 4.14      6
 6    16 5.09      6
 7    19 6.05      6
 8    22 7.00      6
 9    25 7.96      6
10    28 8.91      6
# ℹ 24 more rows

What is the rank of this matrix (the fake dataset we created)?

rankMatrix(as.matrix(df2)) ## The input must be a numeric matrix, so we have to convert df2 from a data frame to a matrix
[1] 2
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 7.549517e-15

There are 3 variables in df2, but there are only 2 linearly independent variables since the generation of var2 is based on var1.
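If you prefer not to load the Matrix package, base R’s qr() reports the same rank as a by-product of the QR decomposition. A sketch that rebuilds the same toy matrix:

```r
## Same toy data as above: var2 is var1/pi, var3 is constant
m <- cbind(var1 = seq(1, 100, 3),
           var2 = seq(1, 100, 3) / pi,
           var3 = 6)

## qr() factors the matrix; $rank is its numerical rank
qr(m)$rank  ## 2, since var2 is an exact multiple of var1
```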

Download .qmd for Discussion Section I


2. Exploring a New Dataset using R (Part II)

For today, we are using USDC-DATASET-JOP.csv. You can download this csv file from Canvas or from here

## Load packages
library(tidyverse)

## Import data
df <- read_csv("POL 211/data/USDC-DATASET-JOP.csv")

a. Summary Statistics

In the following, we will compute summary statistics for the univariate distribution of the variable that quantifies the number of defendants named in each court case (it’s named def_count).

  1. Calculate the mean

    ## This is the easy way to calculate the mean:
    mean(df$def_count, na.rm = TRUE)
    [1] 4.184753
    ## Tidyverse syntax
    
    df %>% 
      summarize(mean.count = mean(def_count, na.rm = TRUE))
    # A tibble: 1 × 1
      mean.count
           <dbl>
    1       4.18
    ## How to calculate it in the long/manual way?
    
    sum(df$def_count)/length(df$def_count)
    [1] 4.184753
    ## Try out!

    Recall the sample mean equation from the lecture slides:

    \[\bar{x} = \frac{1}{N}\sum_{i=1}^{N}x_{i}=\frac{x_{1}+x_{2}+...+x_{N}}{N}\]

  2. Calculate the median

    ## This is the easy way to calculate the median:
    median(df$def_count, na.rm = TRUE)
    [1] 3
    ## How to calculate it in the long/manual way?
    ## First, sort the distribution from smallest to largest values
    
    ## Next, identify the "middle" value, which is (N+1)/2  
    
    sort(df$def_count)[(length(df$def_count)+1)/2]
    [1] 3
    #If we wanted to do this in tidyverse:
    
    df %>% 
      arrange(def_count) %>% 
      select(def_count) %>% 
      slice((nrow(df)+1)/2)
    # A tibble: 1 × 1
      def_count
          <dbl>
    1         3
    ## Try out!

    Since the distribution of def_count has an odd number of observations, we have one median. If instead we had an even number, notice that R automatically takes the average of the two middle values. Technically, as Gailmard (2014) points out, any number weakly between the two middle values is a median.

    fake.data <- c(3, 0, 2, 10, 7, 1) ## notice: even number of values!
    
    sort(fake.data)
    [1]  0  1  2  3  7 10
    median(fake.data) 
    [1] 2.5

    R tells us the median is 2.5, but technically it’s any number on the interval \([2, 3]\).
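You can check this claim numerically: the median minimizes the sum of absolute deviations, and on this fake data every value in \([2, 3]\) achieves the same minimum. A quick sketch:

```r
fake.data <- c(3, 0, 2, 10, 7, 1)

## Sum of absolute deviations from a candidate center m
sad <- function(m) sum(abs(fake.data - m))

sad(2); sad(2.5); sad(3)  ## all equal 17
sad(1.9); sad(3.1)        ## both larger
```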

  3. Calculate the variance and standard deviation

    ## This is the easy way to calculate the variance and std. dev.:
    var(df$def_count, na.rm = TRUE)
    [1] 62.53998
    sd(df$def_count, na.rm = TRUE)
    [1] 7.908222
    ## How to calculate them in the long/manual way?
    ## Sample variance
    
    sum((df$def_count-mean(df$def_count))^2)/(length(df$def_count)-1) 
    [1] 62.53998
    ## Sample standard deviation
    
    sqrt(sum((df$def_count-mean(df$def_count))^2)/(length(df$def_count)-1))
    [1] 7.908222
    #In Tidyverse
    
    #Sample variance
    
    df %>% 
      summarize(variance = sum((def_count - mean(def_count))^2)/(nrow(df)-1))
    # A tibble: 1 × 1
      variance
         <dbl>
    1     62.5
    #Sample standard deviation
    
    df %>% 
      summarize(sd = sqrt(sum((def_count - mean(def_count))^2)/(nrow(df)-1)))
    # A tibble: 1 × 1
         sd
      <dbl>
    1  7.91
    ## Try out!

    Recall the sample variance equation from the lecture slides:

    \[s^2_{x}=\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2\]

Now that you know how to manually compute the sample mean, median, variance, and standard deviation in R, let’s create a custom function that can calculate these four descriptive statistics all at once, without relying on the built-in R functions like mean(), median(), var(), or sd().

Please manually create a function named des_stat by using function():

## Create your own function

des_stat <- function(x){
  
  x <- x[!is.na(x)] ## remove missing values from x
  
  mean.value <- sum(x)/length(x) ## sample mean
  median.value <- sort(x)[(length(x)+1)/2] ## sample median
  variance.value <- sum((x-mean(x))^2)/(length(x)-1) ## sample variance
  sd.value <- sqrt(variance.value) ## sample standard deviation
  
  summary.data <- data.frame(mean = mean.value,
                             median = median.value,
                             variance = variance.value,
                             sd = sd.value)
  
  print(summary.data)
}

des_stat(df$def_count)
      mean median variance       sd
1 4.184753      3 62.53998 7.908222
## Try out!

How to create a function in R?

While applying built-in functions facilitates many common tasks, often we need to create our own function to automate the performance of a particular task. To declare a user-defined function in R, we use the keyword function. The syntax is as follows:

function_name <- function(parameters){
  function_body 
}

Above, the main components of an R function are: function name, function parameters, and function_body. Let’s take a look at each of them separately.

function name: This is the name of the function object that will be stored in the R environment after the function definition and used for calling that function. It should be concise but clear and meaningful so that the user who reads our code can easily understand what exactly this function does.

function parameters: Sometimes, they are called formal arguments. Function parameters are the variables in the function definition placed inside the parentheses and separated with a comma that will be set to actual values (called arguments) each time we call the function.

function_body: The function body is a set of commands inside the curly braces that are run in a predefined order every time we call the function. In other words, in the function body, we place what exactly we need the function to do.

For example:

circumference <- function(r){
  2*pi*r
}

circumference(2)
[1] 12.56637

Above, we created a function to calculate the circumference of a circle with a known radius using the formula \(C = 2\pi r\), so the function has a single parameter, r. After defining the function, we called it with the radius equal to 2 (hence, with the argument 2). The function body then does the calculation for us (i.e., \(2 \times \pi \times 2\)).

Here is an example where your function has more than one parameter (argument):

sum_two_nums <- function(x, y){
    x + y
}
sum_two_nums(1, 2)
[1] 3
sum_two_nums(pi, 5)
[1] 8.141593
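One more feature worth knowing: parameters can be given default values, which are used whenever the caller omits the corresponding argument. A small sketch (the function name power_of is made up for illustration):

```r
## p defaults to 2 when not supplied by the caller
power_of <- function(x, p = 2){
  x^p
}

power_of(3)     ## 9, using the default p = 2
power_of(3, 3)  ## 27, overriding the default
```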

Download .qmd for Discussion Section II


2. Relationships in Data

When we want to describe the relationship between two sets of data, we can plot the data sets in a scatter plot and look at four characteristics:

  • Direction: Are the data points sloping upwards or downwards?
  • Form: Do the data points form a straight line or a curved line?
  • Strength: Are the data points tightly clustered or spread out?
  • Outliers: Are there data points far away from the main body of data?

Today, our focus will be on the first three characteristics. We will delve into covariance, the correlation coefficient, and linear regression. These are three fundamental statistics that allow us to describe the relationships between the variables of interest.

a. Visualizing Relationships – Scatter Plot

For today, we are using Covid19.csv. You can download this csv file from Canvas or from here.

## load packages and import data
library(tidyverse)
  
cf <- read_csv("POL 211/data/Covid19.csv")

Let’s visualize the relationship between the politics of a county (dem16vs) and the share of people who say they always wear a mask (mask_always) in a scatter plot.

  • dem16vs: Democratic vote share in 2016 presidential election.
  • mask_always: The estimated share of people in a county who would say they always wear a mask in public when they expect to be within six feet of another person.
ggplot(cf, aes(x = dem16vs, y = mask_always)) + 
  geom_point(color = "black", fill = "white", 
             stroke = 1, shape = 1) + 
             ## Use the stroke aesthetic to modify the width of the border
             ## The shape of points can be adjusted by specifying the shape argument
  xlab("Democratic Vote Share, 2016 Presidential") + 
  ylab("% Reporting Always Wearing Mask") + 
  ylim(0,1) + 
  xlim(0,1) + 
  labs(title = "NY Times Covid-19 Survey", 
       subtitle = "All Counties") + 
  theme_bw()

The plot above is hard to see since all the dots are on top of each other. So instead of plotting all the datapoints, let’s draw a random sample of 500 of the rows of the dataset and plot them.

set.seed(110) ## this allows us to draw the *same* random sample every time.
    
samp <- sample(1:nrow(cf), 500) 
## This means randomly taking a sample of 500 from the elements of the total rows of cf 
glimpse(samp)
 int [1:500] 2008 2790 2483 336 2403 2921 1659 772 1427 2931 ...

samp is the rows we randomly sample from all the rows in cf. For example, the first element in samp is 2008, which means we’re going to draw the 2008th row out of the original dataset.
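If you prefer to stay in the tidyverse, dplyr’s slice_sample() draws the rows in one step, with no manual index vector. A sketch on a toy tibble (not cf, which isn’t rebuilt here):

```r
library(tidyverse)

set.seed(110)
toy <- tibble(id = 1:1000, x = rnorm(1000))

## slice_sample() randomly keeps n rows of the data frame
toy_samp <- toy %>% slice_sample(n = 500)
nrow(toy_samp)  ## 500
```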

Now let’s replot the scatter plot again with only 500 datapoints we randomly drew from the whole dataset.

ggplot(cf[samp,], aes(x = dem16vs, y = mask_always)) + 
  ## Specify the random sample within the bracket of cf and remember to put it on the row position!
  geom_point(color = "black", fill = "white", 
             stroke = 1, shape = 1) + 
  xlab("Democratic Vote Share, 2016 Presidential") + 
  ylab("% Reporting Always Wearing Mask") + 
  ylim(0,1) + 
  xlim(0,1) + 
  labs(title = "NY Times Covid-19 Survey", 
       subtitle = "500 Randomly Selected Counties") + 
  theme_bw()

Be cautious about the axis scaling! Choosing how you scale the x and y axis doesn’t change the distribution or correlation, but it can mislead people!

p1 <- ggplot(cf[samp,], aes(x = dem16vs, y = mask_always)) + 
  geom_point(color = "black", fill = "white", 
             stroke = 1, shape = 1) + 
  xlab("Democratic Vote Share, 2016 Presidential") + 
  ylab("% Reporting Always Wearing Mask") + 
  ylim(-10,10) + ## change the scale of y axis
  xlim(0,1) + 
  labs(title = "NY Times Covid-19 Survey", 
       subtitle = "500 Randomly Selected Counties") + 
  theme_bw()

p2 <- ggplot(cf[samp,], aes(x = dem16vs, y = mask_always)) + 
  geom_point(color = "black", fill = "white", 
             stroke = 1, shape = 1) + 
  xlab("Democratic Vote Share, 2016 Presidential") + 
  ylab("% Reporting Always Wearing Mask") + 
  xlim(-2,2) + ## change the scale of x-axis
  ylim(0,1) + 
  labs(title = "NY Times Covid-19 Survey", 
       subtitle = "500 Randomly Selected Counties") + 
  theme_bw()

ggpubr::ggarrange(p1, p2, ncol = 2, nrow = 1)

b. Visualizing Relationships – Crosstab/Contingency Table

A crosstab or contingency table is also a way to observe the relationship between two variables.

Let’s look at the relationship between Covid-19 deaths and the political breakdown of counties (whether the county voted for Trump or Clinton in the 2016 presidential election). The goal is to replicate the table from the lecture slides (counties by vote choice and deaths/capita).

First, let’s create some new variables.

  1. Looking at the variable on Covid-19 deaths, some of these counties are bigger, so they will have more deaths. To account for this, we have to rescale the deaths data to be per capita.

    cf <- cf %>%
      mutate(deaths.percap = c19deaths/population)
    
    cf %>% 
      select(county_name, c19deaths, population, deaths.percap) %>%
      head()
    # A tibble: 6 × 4
      county_name             c19deaths population deaths.percap
      <chr>                       <dbl>      <dbl>         <dbl>
    1 Autauga County, Alabama      1563      55869       0.0280 
    2 Baldwin County, Alabama      1858     223234       0.00832
    3 Barbour County, Alabama       331      24686       0.0134 
    4 Bibb County, Alabama          257      22394       0.0115 
    5 Blount County, Alabama        228      57826       0.00394
    6 Bullock County, Alabama      1025      10101       0.101  
  2. Generate a dummy variable indicating whether the county voted for Trump

    cf <- cf %>%
      mutate(trump = ifelse(dem16vs < 0.5, "Voted Trump", "Voted Clinton"))
    
    cf %>% 
      select(county_name, dem16vs, trump) %>% 
      head() ## check the new var to see if we create it correctly
    # A tibble: 6 × 3
      county_name             dem16vs trump        
      <chr>                     <dbl> <chr>        
    1 Autauga County, Alabama  0.238  Voted Trump  
    2 Baldwin County, Alabama  0.194  Voted Trump  
    3 Barbour County, Alabama  0.465  Voted Trump  
    4 Bibb County, Alabama     0.212  Voted Trump  
    5 Blount County, Alabama   0.0843 Voted Trump  
    6 Bullock County, Alabama  0.749  Voted Clinton
  3. Generate a categorical variable indicating which deaths per capita quartile a county is in.

    n25 <- ceiling(nrow(cf) * 0.25) ## The first quartile is in the 778th row 
    n50 <- ceiling(nrow(cf) * 0.50) ## The second quartile is in the 1555th row
    n75 <- ceiling(nrow(cf) * 0.75) ## The third quartile is in the 2333rd row
    
    cf <- cf %>%
      arrange(deaths.percap) %>% ## Arrange the data based on the order of values in deaths.percap
      mutate(case_id = row_number()) %>%
      mutate(deaths.quartile = case_when(
        case_id >= 1 & case_id <= n25 ~ "1st Quartile Deaths/Capita",
        case_id > n25 & case_id <= n50 ~ "2nd Quartile Deaths/Capita",
        case_id > n50 & case_id <= n75 ~ "3rd Quartile Deaths/Capita",
        case_id > n75 ~ "4th Quartile Deaths/Capita",
      ))
  4. Now, let’s create the crosstab: count the number of counties under different quartile deaths/capita and which candidate they voted for (Trump or Clinton).

    crosstab <- cf %>%
      count(deaths.quartile, trump) %>%
      spread(trump, n) %>% ## After the count operation, this spreads the trump variable into separate columns, one for each unique value in the trump variable. It places the counts (n) for each trump value into their respective columns. 
      mutate(Total = `Voted Clinton` + `Voted Trump`) ## Generate a column for the row total
    
    Total <- c("Total", unname(colSums(crosstab[, 2:4]))) ## Generate a row for the column total
    
    crosstab <- crosstab %>% rbind(Total)
    
    knitr::kable(crosstab, "simple", align = "llll")
    deaths.quartile              Voted Clinton   Voted Trump   Total
    ---------------------------  --------------  ------------  -----
    1st Quartile Deaths/Capita   42              736           778
    2nd Quartile Deaths/Capita   60              717           777
    3rd Quartile Deaths/Capita   88              690           778
    4th Quartile Deaths/Capita   212             565           777
    Total                        402             2708          3110

A crosstab does not always make the relationship easier to see. When the two variables of interest are cardinal (both dem16vs and deaths.percap are continuous variables), a scatter plot is a more straightforward way to see the relationship. In general, a crosstab/contingency table is more useful when both variables of interest are categorical.
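One way to make the pattern in a crosstab easier to read is to convert the counts to row shares with base R’s prop.table(). A sketch using the counts from the table above:

```r
## Counts copied from the crosstab above
counts <- matrix(c( 42, 736,
                    60, 717,
                    88, 690,
                   212, 565),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0(c("1st", "2nd", "3rd", "4th"),
                                        " Quartile Deaths/Capita"),
                                 c("Voted Clinton", "Voted Trump")))

## margin = 1 rescales each row to sum to 1
round(prop.table(counts, margin = 1), 2)
## The Clinton share rises from about 0.05 in the 1st quartile
## to about 0.27 in the 4th
```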

ggplot(cf[samp,], aes(x = dem16vs, y = deaths.percap)) + 
  geom_point(color = "black", fill = "white", 
             stroke = 1, shape = 1) + 
  xlab("Democratic Vote Share, 2016 Presidential") + 
  ylab("Deaths Per Capita") + 
  xlim(0,1) + 
  labs(title = "NY Times Covid-19 Survey", 
       subtitle = "500 Randomly Selected Counties") + 
  theme_bw()

c. Covariance, Correlation Coefficient, Linear Regression

In addition to visualizing data using scatter plots or contingency tables to grasp relationships, how can we employ statistical methods to describe these relationships more formally?

  1. Covariance

Covariance is a useful measure for describing the direction of the linear association between two quantitative variables.

\[Cov(x, y) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})\] As we can see from the equation, the covariance sums the term \((x_i-\bar{x})(y_i-\bar{y})\) for each data point, where \(\bar{x}\) is the average \(x\) value, and \(\bar{y}\) is the average \(y\) value. The term becomes more positive if both \(x\) and \(y\) are larger than the average values in the data set, and becomes more negative if smaller. As the covariance accounts for every data point in the set, a positive covariance must mean that most, if not all, data points are in sync with respect to \(x\) and \(y\) (small \(y\) when \(x\) is small or large \(y\) when \(x\) is large). Conversely, a negative covariance must mean that most, if not all, data points are out of sync with respect to \(x\) and \(y\) (small \(y\) when \(x\) is large or large \(y\) when \(x\) is small).

Let’s start by using a fake dataset to practice how to compute covariance.

    tf <- data.frame(x = c(3, 2, 1, 7, -1), 
                     y = c(8, 10, 13, 2, 8))
    
    knitr::kable(tf, "simple", align = "cc")
x y
3 8
2 10
1 13
7 2
-1 8
    ## Covariance by hand
    
    #I've started the necessary code below...
  
    n <- length(tf$x)
    mean.x <- mean(tf$x, na.rm = TRUE)
    mean.y <- mean(tf$y, na.rm = TRUE)

    #How would I write this out?

    covariance.tf <- (1/(n-1))*sum((tf$x - mean.x)*(tf$y - mean.y))

You can use the canned function cov() to double check your answer.

    ## Covariance
    cov(tf$x, tf$y)
[1] -8.85

Covariance is a useful measure for describing the direction of the linear association between two quantitative variables, but it has two weaknesses: a larger covariance does not always mean a stronger relationship, and we cannot compare covariances across different sets of relationships. For example, if we find that the covariance between variables \(x\) and \(y\) is larger than the covariance between \(x\) and \(z\), we can only conclude that both \(y\) and \(z\) are positively associated with \(x\). However, we cannot determine whether the relationship between \(x\) and \(y\) is stronger than the relationship between \(x\) and \(z\) based solely on their covariances.
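The unit-dependence is easy to demonstrate with the fake dataset: rescaling \(x\) rescales the covariance by the same factor, while the correlation is unchanged.

```r
tf <- data.frame(x = c(3, 2, 1, 7, -1),
                 y = c(8, 10, 13, 2, 8))

## Multiplying x by 100 (e.g., a change of units) scales the covariance...
cov(tf$x, tf$y)        ## -8.85
cov(tf$x * 100, tf$y)  ## -885

## ...but leaves the correlation untouched
cor(tf$x, tf$y)        ## -0.7412154
cor(tf$x * 100, tf$y)  ## -0.7412154
```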

  2. Correlation Coefficient

To account for the weakness, we normalize the covariance by the standard deviation of the \(x\) values and \(y\) values, to get the correlation coefficient. The correlation coefficient is a value between -1 and 1, and measures both the direction and the strength of the linear association. One important distinction to note is that correlation does not measure the slope of the relationship — a large correlation only speaks to the strength of the relationship. Some key points on correlation are:

  • Correlation measures the direction and strength of the linear association between two quantitative variables.
  • Positive and negative indicates direction, large and small indicates the strength.
  • Correlation has symmetry: correlation of x and y is the same as correlation of y and x.
  • Correlation is unitless and normalized.

\[r_{xy}=\frac{Cov(x, y)}{s_xs_y}\]

Please compute the correlation coefficient between \(x\) and \(y\) using the formula above with the fake dataset

    ## Correlation by hand
    
    #How would I finish it?
    
    covariance.tf <- (1/(n-1))*sum((tf$x - mean.x)*(tf$y - mean.y))

    correlation.tf <- covariance.tf/(sd(tf$x)*sd(tf$y))

You can also use the canned function cor() to double check your answer.

    ## Correlation
    cor(tf$x, tf$y)
[1] -0.7412154
  3. Linear Regression

Correlation and covariance are quantitative measures of the strength and direction of the relationship between two variables, but they do not account for the slope of the relationship. In other words, we do not know how a change in one variable would impact the other variable. Regression is the technique that fills this void — it allows us to make the best guess at how one variable affects the other. The simplest linear regression, usually estimated by the ordinary least squares (OLS) method, allows us to fit a “line of best fit” to the scatter plot and use that line (or model) to describe the relationship between the two variables. The OLS approach fits the line that minimizes the sum of squared residuals. The equation for that line is:

\[\hat{y} = a + bx\]

The equations of \(a\) and \(b\) are:

\[b = \frac{r_{xy}s_{y}}{s_{x}}=\frac{Cov(x, y)}{s^2_{x}}\]

\[a = \bar{y}-b\bar{x}\]

Note: A more detailed introduction regarding linear regression will be discussed in POL 212 and 213.

What are the intercept (\(a\)) and slope (\(b\)) between \(x\) and \(y\) in the fake dataset? Please compute in a manual way!

    ##Calculating linear regression coefficients by hand:

    ## Slope

    beta <- ((1/(n-1))*sum((tf$x - mean.x)*(tf$y - mean.y)))/var(tf$x)
    
    beta <- cov(tf$x, tf$y)/var(tf$x)
    
    ## Intercept
    
    alpha <- mean(tf$y) - beta*mean(tf$x)
    
    beta
[1] -1.005682
    alpha
[1] 10.61364
You can also use the canned function `lm()` to check your answer.
    reg <- lm(y ~ x, data = tf)
    stargazer::stargazer(reg, type = "text", digits = 4)

===============================================
                        Dependent variable:    
                    ---------------------------
                                 y             
-----------------------------------------------
x                             -1.0057          
                             (0.5258)          
                                               
Constant                     10.6136**         
                             (1.8813)          
                                               
-----------------------------------------------
Observations                     5             
R2                            0.5494           
Adjusted R2                   0.3992           
Residual Std. Error       3.1198 (df = 3)      
F Statistic             3.6578 (df = 1; 3)     
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
d. Practice

Now that you know how to use covariance, correlation coefficients, and linear regression to describe relationships in data, let’s apply what you’ve learned to the Covid19.csv dataset to compute the covariance, correlation coefficient, and regression line between dem16vs and mask_always.

  1. What is the covariance and correlation coefficient between dem16vs and mask_always? Please compute them in a manual way.
    ## Covariance

    
    ## Correlation
  2. What are the intercept and slope for the regression line between dem16vs and mask_always? Note that in this practice, we are calculating the regression line based on the 500 randomly selected counties!
    ## Please find a and b! 
  3. Now that you have the regression line, please add it to the scatter plot using geom_abline()!
    ## Add the regression line! 

Download .qmd for Discussion Section III


2. Learning from Data I

Note: Most of the materials for today’s discussion draw on STAT 414: Introduction to Probability Theory.

  1. Data Generating Process (DGP)

    The contemporary world is full of data, and it is becoming ever easier to transform what we observe into data. Data analysis is about making good use of data for certain purposes, such as predicting or making inferences about the world. For these tasks, a first step is to have a sound understanding of how the data was generated, i.e., which processes led to its creation. The analytical strategy for data analysis depends heavily on which processes generated the data under scrutiny.

    Overall, in the following methods sequence classes, we will only consider probability-based (stochastic) data generating processes. In a probability-based DGP, data is generated by a probabilistic process, i.e., each element of the population is selected with a (known) probability to form part of the sample data. One possible way to achieve this is by using a random mechanism when generating the data.

Example:

  • Research Question: What percentage of U.S. adults support the death penalty?

  • Steps in Statistical Investigation:

    1. Produce Data: Determine what to measure, then collect the data. The poll selected 1,082 U.S. adults at random. Each adult answered this question: “Do you favor or oppose the death penalty for a person convicted of murder?”

    2. Explore the Data: Analyze and summarize the data. In the sample, 65% favored the death penalty.

    3. Draw a Conclusion: Use the data, probability, and statistical inference to draw a conclusion about the population.

    However, there is a cost that has to be taken into account: by drawing a sample, we have not observed all of the individuals in the population. It is necessary, then, to account for the uncertainty implied in the sampling process that generated the data. In addition to the sampling uncertainty, there are other sources of uncertainty that lead us to assume our data is generated by a stochastic DGP: theoretical and fundamental uncertainty.

    Let’s consider a scenario where we are not only interested in understanding the percentage of adults who support the death penalty in the U.S. but also whether there is a difference in support based on political affiliation, particularly Republicans. In this case, we have a theoretical framework that focuses on the relationship between a person’s political alignment and their stance on the death penalty. However, it’s possible that other factors, such as how individuals value a person’s life, could also influence their views on the death penalty. Unfortunately, we didn’t include these additional factors in our theory, which could potentially impact our data collection. Consequently, the results derived from our sample data might be subject to theoretical uncertainty. Additionally, there is the possibility of fundamental uncertainty within our sample data. This could arise when respondents pay minimal attention to the survey questions and opt for random responses regarding their opinions on the death penalty. Such factors are beyond the control of researchers and can introduce further uncertainty into our findings.

    In summary, statistical inferences always involve an element of uncertainty. As a researcher, our aim is to account for this uncertainty by incorporating probability statements as an integral part of our conclusions in our findings.

  2. Probability Distribution

    How do we model a stochastic data generating process? Any time we want to answer a research question that involves using a sample to draw a conclusion about some larger population, we need to answer questions like “how likely is it that we get a sample proportion of 65% supporting the death penalty?” or “what is the probability of…?”. To answer such questions, we need to use probability to quantify the uncertainty of the outcomes of interest.

    Just a reminder, when we employ a probability distribution to model a stochastic DGP, we are essentially playing God, and saying that the data for the variable of interest is generated by this probability distribution. However, in real-life, what we can directly observe is the empirical distribution (or frequency distribution) derived from our available sample data. We make an assumption that this empirical distribution from the sample data reflects the underlying theoretical probability distribution.
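This distinction is easy to simulate. In the sketch below we play God and fix a hypothetical true support rate (0.63 is made up for illustration), generate one poll-sized sample from that DGP, and compare the empirical proportion to the truth:

```r
set.seed(211)

## Assume (as the modeler) that the true share of supporters is 0.63
true_p <- 0.63

## Each respondent is a Bernoulli draw; the poll has 1,082 of them
poll <- rbinom(n = 1082, size = 1, prob = true_p)

## The sample proportion approximates, but rarely equals, true_p
mean(poll)
```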

    1. Discrete Random Variables
    • What is a Random Variable \(X\)?

      Given a random experiment with sample space \(S\), a random variable \(X\) is a set function that assigns one and only one real number to each element \(s\) that belongs in the sample space \(S\). The set of all possible values of the random variable \(X\), denoted \(x\), is called the support, or space, of \(X\). For example:

      A rat is selected at random from a cage of male (\(M\)) and female rats (\(F\)). Once selected, the gender of the selected rat is noted. The sample space is thus: \[S = \left \{ M, F \right \}\]

      Let’s further define the random variable \(X\) as follows:

      • Let \(X = 0\) if the rat is male.
      • Let \(X = 1\) if the rat is female.

      Note that the random variable \(X\) assigns one and only one real number (0 and 1) to each element of the sample space (\(M\) and \(F\)). The support, or space, of \(X\) is \(\left \{ 0, 1 \right \}\).

      Also note that we don’t necessarily need to use the numbers 0 and 1 as the support. For example, we could have alternatively (and perhaps arbitrarily) used the numbers 5 and 15, respectively. In that case, our random variable would be defined as \(X = 5\) if the rat is male, and \(X = 15\) if the rat is female.

    • What is the Probability Mass Function (PMF) of a discrete random variable?

      The probability that a discrete random variable \(X\) takes on a particular value \(x\), that is, \(P(X = x)\), is frequently denoted \(f(x)\). The function \(f(x)\) is typically called the probability mass function (PMF).

      The probability mass function, \(P(X = x) = f(x)\), of a discrete random variable \(X\) is a function that satisfies the following properties:

      • \(P(X = x) = f(x) > 0\) if \(x \in\) the support \(S\)

        For every element in the support \(S\), all of the probabilities must be positive. Note that if \(x\) does not belong in the support \(S\), then \(f(x) = 0\).

      • \(\sum_{x \in S}^{}f(x)=1\)

        If you add up the probabilities for all of the possible \(x\) values in the support \(S\), then the sum must equal 1.

      • \(P(X \in A) = \sum_{x \in A}^{}f(x)\)

        To determine the probability associated with the event \(A\), you just sum up the probabilities of the \(x\) values in \(A\).

      Exercise 1:

      I toss a fair four-sided die, and let \(X\) be defined as the outcome on the die I observe, \(\left \{ 1, 2, 3, 4 \right \}\). What is the sample space of \(X\), the possible outcomes of \(X\), as well as its probability mass function \(f(x)\)?

      We can also plot the PMF in R.

        x <- 1:4
        n <- length(x)
        fx <- rep(1/4, n)
        plot(x = x, y = fx, pch = 19, 
            xaxt="n",
            ylim = c(0, 0.5),
            xlab = "x",
            ylab = "f(x)",
            main = "Probability Mass Function")
        axis(1, at = seq(min(x), max(x), by = 1))

    • What is the Cumulative Distribution Function (CDF) of a discrete random variable?

      The cumulative distribution function (CDF) of the discrete random variable \(X\) has the following definition:

      \[F(x) = P(X \leq x) = \sum_{a \leq x}^{}f(a)\]

      in which \(F(x)\) is a non-decreasing step function.

      We can also plot the CDF in R. Let’s plot the CDF from Exercise 1.

        x <- 1:4
        n <- length(x)
        fx <- rep(1/4, n)
        Fx <- cumsum(fx)
        ## Let's make an empty plot
        plot(x = NA, y = NA, pch = NA, 
            xaxt="n",
            xlim = c(min(x), max(x)),
            ylim = c(0, 1),
            xlab = "x",
            ylab = "F(x)",
            main = "Cumulative Distribution Function")
        axis(1, at = seq(min(x), max(x), by = 1))
    
        ## Add closed circles, open circles, and finally the horizontal lines
        points(x = x, y = Fx, pch = 19) ## closed circles
        points(x = x[-1], y = Fx[-n], pch = 1) ## open circles
        for(i in 1:(n-1)) {
          points(x = x[i+0:1], 
                 y = Fx[c(i,i)], 
                 type = "l")
         }

    Exercise 2:

    Suppose \(X\) is a discrete random variable. Let the PMF of \(X\) be equal to:

    \[f(x) = \frac{5-x}{10}, x = 1, 2, 3, 4. \]

    The CDF of \(X\) is \(F(x) = P(X \leq x) = \sum_{a \leq x}^{}f(a)\).

    2.1 What is the cumulative probability when \(X \leq 2\)?

    2.2 Is \(P(X \leq 2)\) equal to \(P(X = 2)\)?
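    If you want to check your answers numerically, a quick R sketch tabulates this PMF and accumulates it with `cumsum()`:

    ```r
    x <- 1:4
    fx <- (5 - x)/10            # PMF: f(x) = (5 - x)/10, so f = 0.4, 0.3, 0.2, 0.1
    sum(fx)                     # a valid PMF sums to 1
    Fx <- cumsum(fx)            # CDF: F(x) = P(X <= x)
    Fx[2]                       # 2.1: P(X <= 2) = 0.4 + 0.3 = 0.7
    fx[2]                       # 2.2: P(X = 2) = 0.3, which differs from F(2)
    ```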

  1. Continuous Random Variables

    A continuous random variable takes on an uncountably infinite number of possible values. For a discrete random variable \(X\) that takes on a finite or countably infinite number of possible values, we determined \(P(X=x)\) for all of the possible values of \(X\), and called it the probability mass function (PMF).

    For continuous random variables, the probability that \(X\) takes on any particular value \(x\) is 0. That is, finding \(P(X=x)\) for a continuous random variable \(X\) is not going to work. Instead, we’ll need to find the probability that \(X\) falls in some interval \((a, b)\), that is, we’ll need to find \(P(a<X<b)\). We’ll do that using a probability density function (PDF).

    • What is the Probability Density Function (PDF) of a continuous random variable?

      The probability density function of a continuous random variable \(X\) with support \(S\) is an integrable function \(f(x)\) satisfying the following:

      1. \(f(x)\) is positive everywhere in the support \(S\), that is, \(f(x)>0\), for all \(x\) in \(S\).

      2. The area under the curve f(x) in the support \(S\) is 1, that is:

        \[\int_{S}^{}f(x)dx=1\]

      3. If \(f(x)\) is the PDF of \(x\), then the probability that \(X\) belongs to \(A\), where \(A\) is some interval, is given by the integral of \(f(x)\) over that interval, that is:

        \[P(X \in A) = \int_{A}^{}f(x)dx\]

      As you can see, the definition for the PDF of a continuous random variable differs from the definition for the PMF of a discrete random variable by simply changing the summations that appeared in the discrete case to integrals in the continuous case.

      Exercise 3:

      Let \(X\) be a continuous random variable whose PDF is:

      \(f(x) = 3x^2\), \(0<x<1\)

      First, note again that \(f(x) \neq P(X=x)\). For example, \(f(0.9)=3(0.9)^{2}=2.43\), which is clearly not a probability! In the continuous case, \(f(x)\) is instead the height of the curve at \(X=x\), so that the total area under the curve is 1. In the continuous case, it is areas under the curves that define the probabilities.

      Now let’s first start by verifying that \(f(x)\) is a valid probability density function.

      • \(f(x) > 0\) at any value of \(x\).

      • \(\int_{0}^{1}3x^2dx=x^3 |_{x=0}^{x=1}=1^3-0^3=1\)

      3.1 What is the probability that \(X\) falls between \(\frac{1}{2}\) and 1? That is, what is \(P(\frac{1}{2}<X<1)\)?

      3.2 Please use integration to show that \(P(X=\frac{1}{2})=0\).
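      These areas can also be checked numerically in R with `integrate()` (a sketch; `integrate()` returns a numerical approximation, not an exact value):

      ```r
      f <- function(x) 3 * x^2                       # PDF of X on (0, 1)
      integrate(f, lower = 0, upper = 1)$value       # total area under the curve: 1
      # 3.1: P(1/2 < X < 1) is the area between 1/2 and 1
      integrate(f, lower = 0.5, upper = 1)$value     # 1 - (1/2)^3 = 0.875
      # 3.2: a zero-width interval carries zero area
      integrate(f, lower = 0.5, upper = 0.5)$value   # 0
      ```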

      Exercise 4:

      Let \(X\) be a continuous random variable whose probability density function is:

      \[f(x) = \frac{x^3}{4}\]

      for \(0<x<c\). What is the value of the constant \(c\) that makes \(f(x)\) a valid probability density function?

    • What is the Cumulative Distribution Function (CDF) of a continuous random variable?

      You might recall that the cumulative distribution function is defined for discrete random variables as:

      \[F(x) = P(X \leq x) = \sum_{a \leq x}^{}f(a)\]

      Again, \(F(x)\) accumulates all of the probability less than or equal to \(x\). The cumulative distribution function for continuous random variables is just a straightforward extension of that of the discrete case. All we need to do is replace the summation with an integral.

      The cumulative distribution function of a continuous random variable \(X\) is defined as:

      \[F(x) = P(X \leq x) = \int_{-\infty}^{x}f(a)da\]

      Recall that for discrete random variables, \(F(x)\) is in general a non-decreasing step function. For continuous random variables, \(F(x)\) is a non-decreasing continuous function.

      Exercise 5:

      Let’s return to the example in which \(X\) has the following probability density function:

      \(f(x) = 3x^2\), \(0<x<1\)

      What is the cumulative distribution function \(F(x)\)?

  1. Multivariate Probability Distributions

    Now we know the probability distribution of both discrete and continuous random variables, respectively. Let’s extend the concept of a probability distribution of one random variable \(X\) to a joint probability distribution of two discrete random variables \(X\) and \(Y\). For example, suppose \(X\) denotes the number of significant others a randomly selected person has, and \(Y\) denotes the number of arguments the person has each week. We might want to know if there is a relationship between \(X\) and \(Y\). Or, we might want to know the probability that \(X\) takes on a particular value \(x\) and \(Y\) takes on a particular value \(y\). That is, we might want to know \(P(X=x, Y=y)\).

    Suppose we toss a pair of fair, four-sided dice, in which one of the dice is red and the other is black. We’ll let:

    • \(X\) = the outcome on the red die = \(\left \{ 1, 2, 3, 4 \right \}\)
    • \(Y\) = the outcome on the black die = \(\left \{ 1, 2, 3, 4 \right \}\)

    What is the support of \(f(x, y)\)? What is the probability that \(X\) takes on a particular value \(x\), and \(Y\) takes on a particular value \(y\)? That is, what is \(P(X=x, Y=y)\)? Please calculate the probability for each cell in the table below. Note that columns represent the outcomes of the black die, whereas rows represent the outcomes of the red die.

    \(f(x,y)\) \(Y=1\) \(Y=2\) \(Y=3\) \(Y=4\) \(f_X(x)\)
    \(X=1\)
    \(X=2\)
    \(X=3\)
    \(X=4\)
    \(f_Y(y)\)

    Now that we’ve found our first joint probability mass function, let’s formally define it now.

    Let \(X\) and \(Y\) be two discrete random variables, and let \(S\) denote the two-dimensional support of \(X\) and \(Y\). Then, the function \(f(x, y) = P(X=x, Y=y)\) is a joint probability mass function if it satisfies the following three conditions:

    1. \(0 \leq f(x, y) \leq 1\)

      Each probability must be a valid probability number between 0 and 1 (inclusive).

    2. \(\sum_{}^{}\sum_{(x, y) \in S}^{}f(x, y)=1\)

      The sum of the probabilities over the entire support \(S\) must equal 1.

    3. \(P[(X, Y)\in A]=\sum_{}^{}\sum_{(x, y)\in A}^{}f(x, y)\) where \(A\) is a subset of \(S\)

      In order to determine the probability of an event \(A\), you simply sum up the probabilities of the \((x, y)\) values in \(A\).
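    As a sketch, the two-dice example above can be laid out in R as a matrix and checked against these conditions:

    ```r
    # Joint PMF of the red die (rows, X) and black die (columns, Y):
    # the 16 equally likely (x, y) pairs each get probability 1/16
    fxy <- outer(rep(1/4, 4), rep(1/4, 4))
    rownames(fxy) <- paste0("X=", 1:4)
    colnames(fxy) <- paste0("Y=", 1:4)
    fxy                      # every cell is 1/16 (condition 1)
    sum(fxy)                 # probabilities over the support sum to 1 (condition 2)
    ```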

  2. Marginal Distribution

    Now, if you take a look back at the representation of our joint p.m.f. in tabular form, you can see that the last column contains the probability mass function of \(X\) alone, and the last row contains the probability mass function of \(Y\) alone. Those two functions, \(f(x)\) and \(f(y)\), which in this setting are typically referred to as marginal probability mass functions, are obtained by simply summing the probabilities over the support of the other variable. That is, to find the probability mass function of \(X\), we sum, for each \(x\), the probabilities when \(y = 1, 2, 3,\) and \(4\). That is, for each \(x\), we sum \(f(x, 1)\), \(f(x, 2)\), \(f(x, 3)\), and \(f(x, 4)\). Now that we’ve seen the two marginal probability mass functions in our example, let’s give a formal definition of a marginal probability mass function.

    Let \(X\) be a discrete random variable with support \(S_1\), and let \(Y\) be a discrete random variable with support \(S_2\). Let \(X\) and \(Y\) have the joint probability mass function \(f(x, y)\) with support \(S\). Then, the probability mass function of \(X\) alone, which is called the marginal probability mass function of \(X\), is defined by:

    \[f_X(x)=\sum_{y \in S_2}f(x, y)=P(X=x), \quad x \in S_1\]

    where, for each \(x\) in the support \(S_1\), the summation is taken over all possible values of \(y\). Similarly, the probability mass function of \(Y\) alone, which is called the marginal probability mass function of \(Y\), is defined by:

    \[f_Y(y)=\sum_{x \in S_1}f(x, y)=P(Y=y), \quad y \in S_2\]

    where, for each \(y\) in the support \(S_2\), the summation is taken over all possible values of \(x\).

    If you again take a look back at the representation of our joint p.m.f. in tabular form, you might notice that the following holds true:

    \[P(X=x, Y=y) = \frac{1}{16} = P(X=x) \cdot P(Y=y) = \frac{1}{4} \cdot \frac{1}{4} = \frac{1}{16}\]

    for all \(x \in S_1\), \(y \in S_2\). When this happens, we say that \(X\) and \(Y\) are independent. A formal definition of the independence of two random variables \(X\) and \(Y\) follows.

    The random variables \(X\) and \(Y\) are independent if and only if:

    \[P(X=x, Y=y) = P(X=x) \times P(Y=y)\]

    for all \(x \in S_1\), \(y \in S_2\). Otherwise, \(X\) and \(Y\) are said to be dependent.

    Exercise 6:

    Please find the marginal distributions of \(X\) and \(Y\), respectively.

    \(f(x, y)\) \(X=0\) \(X=1\) \(X=2\) \(f_Y(y)\)
    \(Y=1\) \(0.1\) \(0.2\) \(0.3\)
    \(Y=2\) \(0.05\) \(0.15\) \(0.2\)
    \(f_X(x)\)

    Are \(X\) and \(Y\) independent of each other or not?
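    One way to check your answer in R is to recover the marginals with `colSums()`/`rowSums()` and then compare each cell of the joint PMF to the product of its marginals (a sketch):

    ```r
    # Joint PMF from the table: rows are Y = 1, 2; columns are X = 0, 1, 2
    fxy <- rbind("Y=1" = c(0.1, 0.2, 0.3),
                 "Y=2" = c(0.05, 0.15, 0.2))
    colnames(fxy) <- paste0("X=", 0:2)
    colSums(fxy)   # marginal PMF of X: 0.15, 0.35, 0.50
    rowSums(fxy)   # marginal PMF of Y: 0.60, 0.40
    # Independence requires f(x, y) = fX(x) * fY(y) in every cell
    isTRUE(all.equal(c(fxy), c(outer(rowSums(fxy), colSums(fxy)))))  # FALSE: dependent
    ```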

  3. Conditional Distribution

    Marginal distributions tell us the distribution of a particular variable ignoring the other variable(s). In contrast, the conditional distribution of \(X\) given \(Y\) gives you a distribution over \(X\) when holding \(Y\) fixed at specific values. Formally:

    \[f_{X|Y}(x|y)=\frac{f(x, y)}{f_Y(y)}\]

    The conditional distribution of \(Y\) given \(X\) is defined by:

    \[f_{Y|X}(y|x)=\frac{f(x, y)}{f_X(x)}\]

    Recall the example we had for joint distribution.

    \(f(x,y)\) \(Y=1\) \(Y=2\) \(Y=3\) \(Y=4\) \(f_X(x)\)
    \(X=1\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{4}{16}\)
    \(X=2\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{4}{16}\)
    \(X=3\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{4}{16}\)
    \(X=4\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{1}{16}\) \(\frac{4}{16}\)
    \(f_Y(y)\) \(\frac{4}{16}\) \(\frac{4}{16}\) \(\frac{4}{16}\) \(\frac{4}{16}\) \(1\)

    What is the conditional distribution of \(Y\) given \(X=2\)?

    When \(Y=1\): \(f_{Y|X}(y|x=2)=\frac{f(x, y)}{f_X(x=2)}=\frac{\frac{1}{16}}{\frac{4}{16}}=\frac{1}{4}\)

    When \(Y=2\): \(f_{Y|X}(y|x=2)=\frac{f(x, y)}{f_X(x=2)}=\frac{\frac{1}{16}}{\frac{4}{16}}=\frac{1}{4}\)

    When \(Y=3\): \(f_{Y|X}(y|x=2)=\frac{f(x, y)}{f_X(x=2)}=\frac{\frac{1}{16}}{\frac{4}{16}}=\frac{1}{4}\)

    When \(Y=4\): \(f_{Y|X}(y|x=2)=\frac{f(x, y)}{f_X(x=2)}=\frac{\frac{1}{16}}{\frac{4}{16}}=\frac{1}{4}\)

    So,

    \(f_{Y|X}(y|x=2)= \frac{1}{4}\) for \(y = 1, 2, 3, 4\)

    We will get the same answers for the conditional probability of \(Y\) given \(X=1\), \(X=3\), or \(X=4\).

    From this example of finding the conditional probability of \(Y\) given \(X=2\), we again find that \(X\) and \(Y\) are independent from each other, as the conditional distribution over \(Y\) does not change as \(X\) changes. That is,

    \[f(x, y) = f_X(x)\cdot f_Y(y) \Rightarrow f_{X|Y}(x|y) = f_X(x) \text{ and } f_{Y|X}(y|x) = f_Y(y)\]

    Exercise 7:

    Find the conditional distribution of \(X\) given \(Y=2\).

    \(f(x, y)\) \(X=0\) \(X=1\) \(X=2\) \(f_Y(y)\)
    \(Y=1\) \(0.1\) \(0.2\) \(0.3\) \(0.6\)
    \(Y=2\) \(0.05\) \(0.15\) \(0.2\) \(0.4\)
    \(f_X(x)\) \(0.15\) \(0.35\) \(0.5\) \(1\)
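    As a check in R, conditioning on \(Y=2\) just rescales that row of the joint PMF by the marginal \(f_Y(2)\) (a sketch):

    ```r
    fxy <- rbind("Y=1" = c(0.1, 0.2, 0.3),
                 "Y=2" = c(0.05, 0.15, 0.2))
    colnames(fxy) <- paste0("X=", 0:2)
    fy <- rowSums(fxy)                        # marginal PMF of Y
    fx_given_y2 <- fxy["Y=2", ] / fy["Y=2"]   # f(x | y = 2) = f(x, y) / fY(2)
    fx_given_y2                               # 0.125, 0.375, 0.500
    sum(fx_given_y2)                          # a proper conditional PMF sums to 1
    ```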
  4. Central Tendency of Probability Distributions

    The expected value, or mathematical expectation, of a discrete random variable \(X\) is:

    \[E(X)=\sum_{x\in X}{}xf(x)\]

    where the sum is taken over all possible outcomes \(x\) of the discrete random variable \(X\), and \(f(x)\) is the probability mass function of \(X\).

    For example, what is the average toss of a fair six-sided die?

    If the random variable \(X\) is the top face of a tossed, fair, six-sided die, then the PMF of \(X\) is:

    \[f(x) = \frac{1}{6}\]

    for \(x = 1, 2, 3, 4, 5,\) and \(6\). Therefore, the average toss, that is the expected value of \(X\), is:

    \[E(X) = 1(\frac{1}{6}) + 2(\frac{1}{6}) + 3(\frac{1}{6}) + 4(\frac{1}{6}) + 5(\frac{1}{6}) + 6(\frac{1}{6}) = 3.5\]

    Hmm… if we toss a fair, six-sided die once, should we expect the toss to be 3.5? No, of course not! All the expected value tells us is what we would expect the average of a large number of tosses to be in the long run. If we toss a fair, six-sided die a thousand times, say, and calculate the average of the tosses, will the average of the 1000 tosses be exactly 3.5? No, probably not! But, we can certainly expect it to be close to 3.5. It is important to keep in mind that the expected value is a theoretical average, not an actual, realized one! Let’s run a simulation to observe this more directly.

    x <- 1:6
    
    s1 <- sample(x, replace = TRUE, size = 100)
    mean(s1)
    [1] 3.25
    s2 <- sample(x, replace = TRUE, size = 1000)
    mean(s2)
    [1] 3.45
    s3 <- sample(x, replace = TRUE, size = 10000)
    mean(s3)
    [1] 3.508

    Expectations have some nice properties:

    1. The expected value of a constant \(a\) is that constant: \(E(a) = a\).

    2. The expected value of a constant \(b\) times a random variable \(X\) is that constant times the expected value of \(X\): \(E(bX) = bE(X)\).

    3. Given two random variables, \(X\) and \(Y\), \(E(X+Y) = E(X) + E(Y)\).

    4. \(E(X)\) always falls between the minimum and maximum values that \(X\) can take: \(\min X \leq E(X) \leq \max X\).

    Exercise 8:

    Suppose the PMF of the discrete random variable \(X\) is:

    \(x\) \(0\) \(1\) \(2\) \(3\)
    \(f(x)\) \(0.2\) \(0.1\) \(0.4\) \(0.3\)

    8.1 What is the expected value of \(X\), that is, \(E(X)\)?

    8.2 Suppose that there is another discrete random variable \(Y\), and the expected value of \(Y\) is \(3.2\), that is, \(E(Y) = 3.2\). What is the expected value of \(2X+3Y\), that is, \(E(2X+3Y)\)?
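    To check Exercise 8 in R, the definition of the expectation translates directly into a weighted sum (a sketch):

    ```r
    x <- 0:3
    fx <- c(0.2, 0.1, 0.4, 0.3)
    EX <- sum(x * fx)     # E(X) = sum over x of x * f(x) = 1.8
    EX
    # 8.2 uses linearity: E(2X + 3Y) = 2 E(X) + 3 E(Y), with E(Y) = 3.2
    2 * EX + 3 * 3.2      # 13.2
    ```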

    Exercise 9:

    9.1 Find the expected values of the random variables from their marginal distributions.

    9.2 Find the expected value of \(X\) given \(Y=2\).

    \(f(x, y)\) \(X=0\) \(X=1\) \(X=2\) \(f_Y(y)\)
    \(Y=1\) \(0.1\) \(0.2\) \(0.3\) \(0.6\)
    \(Y=2\) \(0.05\) \(0.15\) \(0.2\) \(0.4\)
    \(f_X(x)\) \(0.15\) \(0.35\) \(0.5\) \(1\)

    All these nice properties can also be applied to a continuous random variable. The expected value of a continuous random variable is:

    \[E(X) = \int_{x\in X}xf(x)dx\]

    Let’s use an example to practice. Let the random variable \(X\) denote the time a person waits for an elevator to arrive. Suppose the longest one would need to wait for the elevator is 2 minutes, so that the possible values of \(X\) (in minutes) are given by the interval \([0,2]\). A possible PDF for \(X\) is given by:

    \[f(x) = \begin{cases} x, & \text{for } 0 < x \leq 1\\ 2-x, & \text{for } 1 < x \leq 2 \\ 0, & \text{otherwise } \end{cases} \] What is the expected value of \(X\)?

    \[E(X) = \int_{0}^{1}x\cdot xdx + \int_{1}^{2}x\cdot(2-x)dx = \int_{0}^{1}x^2dx + \int_{1}^{2}(2x-x^2)dx = \frac{1}{3} + \frac{2}{3} = 1\]

    Thus, we expect a person will wait 1 minute for the elevator on average.
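    The same integral can be evaluated numerically in R; a sketch using `integrate()` on the piecewise PDF:

    ```r
    # Piecewise PDF of the elevator wait time
    f <- function(x) ifelse(x > 0 & x <= 1, x,
                     ifelse(x > 1 & x <= 2, 2 - x, 0))
    # E(X) = integral of x * f(x) over the support [0, 2]
    integrate(function(x) x * f(x), lower = 0, upper = 2)$value   # 1
    ```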

  5. Dispersion of Probability Distributions

    The variance of a random variable \(X\) is:

    \[ Var(X) = E((X-E(X))^2) = E(X^2) - E(X)^2\]

    This formula can be applied to both discrete and continuous random variables.

    The variance of a random variable also has nice properties:

    1. The variance of a constant \(a\) is \(0\): \(Var(a) = 0\).

    2. \(Var(X)\) is always weakly positive: \(Var(X) \geq 0\).

    3. The variance of a constant \(b\) times a random variable \(X\) is that constant squared times the variance of \(X\): \(Var(bX) = b^2 Var(X)\).

    4. If \(X\) and \(Y\) are stochastically independent, then \(Var(X+Y) = Var(X) + Var(Y)\).

    Exercise 10:

    10.1 What is the variance of \(X\), that is, \(Var(X)\)?

    10.2 What is \(Var(2X)\)?

    \(x\) \(0\) \(1\) \(2\) \(3\)
    \(f(x)\) \(0.2\) \(0.1\) \(0.4\) \(0.3\)
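    To check Exercise 10 in R, apply \(Var(X) = E(X^2) - E(X)^2\) directly to the table (a sketch):

    ```r
    x <- 0:3
    fx <- c(0.2, 0.1, 0.4, 0.3)
    EX  <- sum(x * fx)      # E(X)   = 1.8
    EX2 <- sum(x^2 * fx)    # E(X^2) = 4.4
    VarX <- EX2 - EX^2      # Var(X) = 4.4 - 1.8^2 = 1.16
    VarX
    4 * VarX                # Var(2X) = 2^2 * Var(X) = 4.64
    ```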

    Exercise 11:

    Recall the elevator question, please compute the variance of \(X\).

    \[f(x) = \begin{cases} x, & \text{for } 0 < x \leq 1\\ 2-x, & \text{for } 1 < x \leq 2 \\ 0, & \text{otherwise } \end{cases} \]
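    As with the expectation, the variance here can be checked numerically with `integrate()`, again using \(Var(X) = E(X^2) - E(X)^2\) (a sketch):

    ```r
    f <- function(x) ifelse(x > 0 & x <= 1, x,
                     ifelse(x > 1 & x <= 2, 2 - x, 0))
    EX  <- integrate(function(x) x   * f(x), lower = 0, upper = 2)$value  # E(X)   = 1
    EX2 <- integrate(function(x) x^2 * f(x), lower = 0, upper = 2)$value  # E(X^2) = 7/6
    EX2 - EX^2                                                            # Var(X) = 1/6
    ```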

Download .qmd for Discussion Section IV

1. Hypothesis Testing

Hypothesis testing is the process of making a choice between two conflicting hypotheses. It is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted by \(H_0\). An alternative hypothesis (denoted \(H_a\)), which is the opposite of what is stated in the null hypothesis, is then defined. The hypothesis-testing procedure involves using sample data to determine whether or not \(H_0\) can be rejected. If \(H_0\) is rejected, the statistical conclusion is that the alternative hypothesis \(H_a\) is true.

  • Null hypothesis (\(H_0\)): The null hypothesis states the “status quo”. This hypothesis is assumed to be true until there is evidence to suggest otherwise.

  • Alternative hypothesis (\(H_a\)): This is the statement that one wants to conclude. It is also called the research hypothesis.

The goal of hypothesis testing is to see if there is enough evidence against the null hypothesis. In other words, to see if there is enough evidence to reject the null hypothesis. If there is not enough evidence, then we fail to reject the null hypothesis.

We want to know the answer to a research question. We determine our null and alternative hypotheses. Now it is time to make a decision.

The decision is either going to be:

  1. reject the null hypothesis or …

  2. fail to reject the null hypothesis.

Note: Why can’t we say we “accept the null”? The reason is that we are assuming the null hypothesis is true and trying to see if there is evidence against it. Therefore, the conclusion should be in terms of rejecting the null.

What happens when we do not make the correct decision? When doing hypothesis testing, two types of mistakes may be made and we call them Type I error and Type II error. If we reject the null hypothesis when it is true, then we made a type I error. If the null hypothesis is false and we failed to reject it, we made another error called a Type II error.

The “reality”, or truth, about the null hypothesis is unknown and therefore we do not know if we have made the correct decision or if we committed an error. We can, however, define the likelihood of these events.

  • \(\alpha\): The probability of committing a Type I error. Also known as the significance level.

  • \(\beta\): The probability of committing a Type II error.

  • Power: Power is the probability the null hypothesis is rejected given that it is false (i.e., \(1-\beta\)).

\(\alpha\) and \(\beta\) are probabilities of committing an error so we want these values to be low. However, we cannot decrease both. As \(\alpha\) increases, \(\beta\) decreases.

Note: Type I error is also thought of as the event that we reject the null hypothesis GIVEN the null is true. In other words, Type I error is a conditional event and \(\alpha\) is a conditional probability. The same idea applies to Type II error and \(\beta\).

Trial Example

A man goes to trial and is tried for the murder of his ex-wife. He is either guilty or innocent.

Let’s set up the null and alternative hypotheses.

\[ \begin{cases} H_0: \text{The man is innocent.} \\ H_a: \text{The man is guilty.} \end{cases} \] Remember that we assume the null hypothesis is true and try to see if we have evidence against the null. Therefore, it makes sense in this example to assume the man is innocent and test to see if there is evidence that he is guilty.

  • What is the Type I Error in this example?

Type I error is committed if we reject \(H_0\) when it is true. In other words, when the man is innocent but found guilty. \(\alpha\) is the probability of a Type I error, or in other words, it is the probability that the man is innocent but found guilty.

  • What is the Type II Error in this example?

Type II error is committed if we fail to reject \(H_0\) when it is false. In other words, when the man is guilty but found not guilty. \(\beta\) is the probability of a Type II error, or in other words, it is the probability that the man is guilty but found not guilty.

As you can see here, the Type I error (putting an innocent man in jail) is the more serious error. Ethically thinking, it is more serious to put an innocent man in jail than to let a guilty man go free. So to minimize the probability of a type I error we would choose a smaller significance level (i.e., smaller \(\alpha\)).

a. Hypothesis Test for One-sample parameter

  1. One-sample Mean Hypothesis Test

Recall that if the distribution of the population is Normal and/or the sample size is large (\(n>30\)), the sampling distribution of the sample mean \(\bar{X}\) is approximately normal with mean \(\mu\) and standard error \(\frac{\sigma}{\sqrt{n}}\). Under these conditions, we can calculate a z-score, which follows the standard normal distribution, \(Z \sim N(0, 1^2)\). Then we can check how likely it is to observe such a z-score under the standard normal distribution and decide whether we should reject the null hypothesis.

\[z = \frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\]

If the population standard deviation is unknown, we use the estimated standard error based on the sample standard deviation from the observed data, \(\frac{S_X}{\sqrt{n}}\). Under these conditions, we cannot calculate z-scores and instead have to calculate t-scores or t-statistics, which follow a t-distribution with \(n-1\) degrees of freedom.

\[t = \frac{\bar{x}-\mu}{\frac{S_X}{\sqrt{n}}}\]

Emergency Room Wait Time Example

The administrator at your local hospital states that on weekends the average wait time for emergency room visits is 10 minutes. Based on discussions you have had with friends who have complained on how long they waited to be seen in the ER over a weekend, you dispute the administrator’s claim. You decide to test your hypothesis. Over the course of a few weekends, you record the wait time for 40 randomly selected patients. The average wait time for these 40 patients is 11 minutes with a standard deviation of 3 minutes.

Do you have enough evidence to support your hypothesis that the average ER wait time exceeds 10 minutes? You opt to conduct the test at a 5% level of significance.

Answer:

  • Step 1: Set up the hypotheses and check conditions.

At this point we want to check whether we can apply the central limit theorem. The sample size is greater than 30, so we should be okay.

This is a right-tailed test.

\[ \begin{cases} H_0: \mu = 10 \\ H_a: \mu >10 \end{cases} \]

  • Step 2: Decide on the significance level, \(\alpha\).

    The problem states that \(\alpha = 0.05\).

  • Step 3: Calculate the test statistic.

Since the population standard deviation is unknown, we can only calculate a t-score based on the estimated standard error using the sample standard deviation, which is 3.

\[\begin{aligned} t^* &= \frac{\bar{x}-\mu}{\frac{S_X}{\sqrt{n}}}=\frac{11-10}{\frac{3}{\sqrt{40}}}= 2.108185 \end{aligned}\]

  • Step 4: Compute the appropriate p-value based on our alternative hypothesis.

        t <- (11-10)/(3/sqrt(40))
        pt(q = t, df = 40-1, lower.tail = FALSE)
[1] 0.020746

If the lower.tail argument is set to TRUE (the default), probabilities are \(P(X \leq x)\); otherwise, \(P(X > x)\). Since we are doing a right-tailed test, lower.tail should be set to FALSE.

  • Step 5: Make a decision about the null hypothesis.

Since our p-value is 0.020746, we know it is less than our significance level, 5%. Therefore, we reject the null hypothesis.

  • Step 6: State an overall conclusion.

There is enough evidence, at a significance level of 5%, to reject the null hypothesis and conclude that the mean waiting time is greater than 10 minutes.

  2. One-sample Proportion Hypothesis Test

Recall that if \(np\) and \(n(1-p)\) are both greater than five, then the sample proportion \(\hat{p}\) will have an approximately normal distribution with mean \(p\) and standard error \(\sqrt{\frac{p(1-p)}{n}}\). With this information, we are able to calculate the z-score:

\[z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}\]

Online Purchases Example

An e-commerce research company claims that 60% or more graduate students have bought merchandise online. A consumer group is suspicious of the claim and thinks that the proportion is lower than 60%. A random sample of 80 graduate students shows that only 22 students have ever done so. Is there enough evidence to show that the true proportion is lower than 60%? Please conduct the hypothesis test at a 5% Type I error rate.

  • Step 1: Set up the hypotheses and check conditions.

Since the research hypothesis is to check whether the proportion is less than 0.6, we set it up as a one (left)-tailed test:

\[ \begin{cases} H_0: p = 0.6 \\ H_a: p < 0.6 \end{cases} \]

Can we use the z-test statistic? The answer is yes since the hypothesized population value \(p\) is 0.6 and we can check that:

\(np = 80*0.6 = 48 > 5\)

\(n(1-p) = 80*0.4 = 32 > 5\)

  • Step 2: Decide on the significance level, \(\alpha\).

According to the question, \(\alpha = 0.05\).

  • Step 3: Calculate the test statistic.

\[z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{\frac{22}{80}-0.6}{\sqrt{\frac{0.6(1-0.6)}{80}}} = -5.933661\]

  • Step 4: Compute the appropriate p-value based on our alternative hypothesis.

        z <- ((22/80) - 0.6)/sqrt((0.6*0.4)/80)
        pnorm(z, mean = 0, sd = 1)
[1] 1.481266e-09

  • Step 5: Make a decision about the null hypothesis.

Since our p-value is very small and less than our significance level of 5%, we reject the null hypothesis.

  • Step 6: State an overall conclusion.

There is enough evidence in the data provided to suggest, at 5% level of significance, that the true proportion of students who made purchases online was less than 60%.

b. Comparing Two Population Parameters

When we compare two population means or two population proportions, we consider the difference between the two population parameters. In other words, we consider either \(\mu_x - \mu_y\) or \(p_x - p_y\).

  1. Two-sample Means Hypothesis Test

We present the theory here to give you a general idea of how we can apply the Central Limit Theorem.

  • Let \(X\) have a normal distribution with mean \(\mu_x\), variance \(\sigma^2_x\), and standard deviation \(\sigma_x\).

  • Let \(Y\) have a normal distribution with mean \(\mu_y\), variance \(\sigma^2_y\), and standard deviation \(\sigma_y\).

  • If \(X\) and \(Y\) are independent, then \(X-Y\) will follow a normal distribution with mean \(\mu_x-\mu_y\), variance \(\sigma^2_x + \sigma^2_y\), and standard deviation \(\sqrt{\sigma^2_x + \sigma^2_y}\).

The idea is that, if the two random variables are normal, then their difference will also be normal. This is wonderful but how can we apply the Central Limit Theorem?

If \(X\) and \(Y\) are normal, we know that \(\bar{X}\) and \(\bar{Y}\) will also be normal. If \(X\) and \(Y\) are not normal but the sample size is large (\(n > 30\)), then \(\bar{X}\) and \(\bar{Y}\) will be approximately normal (applying the CLT). Using the theorem above, then \(\bar{X}-\bar{Y}\) will be approximately normal with mean \(\mu_x-\mu_y\) and standard deviation \(\sqrt{\frac{\sigma^2_x}{n_x}+\frac{\sigma^2_y}{n_y}}\).

However, in most cases, \(\sigma_x\) and \(\sigma_y\) are unknown, and they have to be estimated. It seems natural to estimate \(\sigma_x\) by \(S_x\) and \(\sigma_y\) by \(S_y\), so the estimated standard error is \(\sqrt{\frac{S^2_x}{n_x}+\frac{S^2_y}{n_y}}\).

Theoretically, when the variances of the two populations are unknown and we estimate them with the sample variances of the observed groups, the test statistics we calculate are t-scores. These t-scores follow a t-distribution whose degrees of freedom depend on conditions such as whether the variances of the two groups are equal or unequal. However, in this class, for simplicity, and because the t-distribution approaches the standard normal distribution when the sample size is large, we will treat the t-scores we calculate as z-scores when doing hypothesis testing and use the standard normal distribution to calculate p-values.

\[z = \frac{(\bar{x}-\bar{y})-(\mu_x - \mu_y)}{\sqrt{\frac{S^2_x}{n_x}+\frac{S^2_y}{n_y}}}\]

Mean Weight of Two Species of Turtles Example

Suppose we want to know whether or not the mean weight between two different species of turtles is equal. Suppose we collect a random sample of turtles from each population with the following information:

Sample 1:

  • Sample size: \(n_1 = 40\).
  • Sample mean weight: \(\bar{x_1} = 300\).
  • Sample standard deviation: \(s_1 = 18.5\).

Sample 2:

  • Sample size: \(n_2 = 38\).
  • Sample mean weight: \(\bar{x_2} = 305\).
  • Sample standard deviation: \(s_2 = 16.7\).

Please perform a two-sample t-test at significance level \(\alpha = 0.05\).

  • Step 1: Set up the hypotheses and check conditions.




  • Step 2: Decide on the significance level, \(\alpha\).




  • Step 3: Calculate the test statistic.




  • Step 4: Compute the appropriate p-value based on our alternative hypothesis.




  • Step 5: Make a decision about the null hypothesis.




  • Step 6: State an overall conclusion.




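Once you have worked through the steps by hand, a short R sketch like the following can check your test statistic and p-value. It uses only the summary numbers given above and, as discussed earlier, treats the t-score as a z-score.

```r
# Check-your-work sketch for the turtle example (two-sided test),
# using the summary statistics from the problem.
xbar1 <- 300; s1 <- 18.5; n1 <- 40
xbar2 <- 305; s2 <- 16.7; n2 <- 38

z <- (xbar1 - xbar2) / sqrt(s1^2 / n1 + s2^2 / n2)
z                        # about -1.25
p_value <- 2 * pnorm(-abs(z))
p_value                  # about 0.21, so we fail to reject H0 at alpha = 0.05
```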
  2. Two-sample Proportions Hypothesis Test

For a test for two proportions, we are interested in the difference. If the difference is zero, then they are not different (i.e., they are equal). Therefore, the null hypothesis will always be:

\[H_0: p_x - p_y = 0\]

Another way to look at it is \(p_x = p_y\). This is worth stopping to think about. Remember, in hypothesis testing, we assume the null hypothesis is true; in this case, it means that \(p_x\) and \(p_y\) are equal. Under this assumption, \(\hat{p_x}\) and \(\hat{p_y}\) are both estimating the same proportion. Think of this common proportion as \(p^*\). Therefore, under certain conditions, the sampling distributions of both proportions, \(\hat{p_x}\) and \(\hat{p_y}\), will be approximately normal centered around \(p^*\), and their difference will have standard error \(\sqrt{p^*(1-p^*)\left(\frac{1}{n_x}+\frac{1}{n_y}\right)}\).

We take this into account by finding an estimate for this \(p^*\) using the two sample proportions. We can calculate an estimate of \(p^*\) using the following formula:

\[\frac{x+y}{n_x+n_y}\]

This value is the total number in the desired categories \(x+y\) from both samples over the total number of sampling units in the combined sample \(n_x+n_y\).

Putting everything together, if we assume \(p_x = p_y\), then the sampling distribution of \(\hat{p_x} - \hat{p_y}\) will be approximately normal with mean 0 and standard error \(\sqrt{p^*(1-p^*)\left(\frac{1}{n_x}+\frac{1}{n_y}\right)}\), under certain conditions. Therefore,

\[z = \frac{(\hat{p_x} - \hat{p_y})-0}{\sqrt{p^*(1-p^*)\left(\frac{1}{n_x}+\frac{1}{n_y}\right)}}\]

which will follow a standard normal distribution.

Received $100 by Mistake Example

Males and females were asked about what they would do if they received a $100 bill by mail, addressed to their neighbor, but wrongly delivered to them. Would they return it to their neighbor? Of the 69 males sampled, 52 said “yes” and of the 131 females sampled, 120 said “yes.”

Does the data indicate that the proportions that said “yes” are different for males and females at a 5% level of significance?

  • Step 1: Set up the hypotheses and check conditions.

\[\begin{cases} H_0: p_m - p_f = 0 \\ H_a: p_m - p_f \neq 0 \end{cases} \]

  • Step 2: Decide on the significance level, \(\alpha\).

According to the question, \(\alpha = 0.05\).

  • Step 3: Calculate the test statistic.

\(n_m = 69\), \(\hat{p_m} = \frac{52}{69}\)

\(n_f = 131\), \(\hat{p_f} = \frac{120}{131}\)

\(\hat{p^*} = \frac{52+120}{69+131} = \frac{172}{200} = 0.86\)

\(z^* = \frac{(\hat{p_m} - \hat{p_f})-0}{\sqrt{p^*(1-p^*)\left(\frac{1}{n_m}+\frac{1}{n_f}\right)}} = \frac{\frac{52}{69}-\frac{120}{131}}{\sqrt{0.86(1-0.86)\left(\frac{1}{69}+\frac{1}{131}\right)}}=-3.146572\)

  • Step 4: Compute the appropriate p-value based on our alternative hypothesis.
        z <- ((52/69)-(120/131))/sqrt((0.86*0.14)*((1/69)+1/131))
        pnorm(z, mean = 0, sd = 1) + (1 - pnorm(-z, mean = 0, sd = 1))
[1] 0.001651968
  • Step 5: Make a decision about the null hypothesis.

Since our p-value of 0.001651968 is less than our significance level of 5%, we reject the null hypothesis.

  • Step 6: State an overall conclusion.

There is enough evidence to suggest that the proportions of males and females who would return the money are different.
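Base R can also run this test directly with prop.test(); it reports a chi-squared statistic, which for two groups is just the square of our z. Setting correct = FALSE turns off the continuity correction so the result matches the hand calculation:

```r
# Two-sample proportion test for the $100 example
# (52 of 69 males and 120 of 131 females said "yes").
# correct = FALSE disables the continuity correction so that
# X-squared equals z^2 from the hand calculation.
res <- prop.test(x = c(52, 120), n = c(69, 131), correct = FALSE)
res
```

The reported X-squared is \((-3.146572)^2 \approx 9.90\), with the same p-value as above.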

c. T-test in R

  1. One-sample T-test in R

To conduct a one-sample t-test in R, we use the syntax t.test(x, mu = 0) where x is the name of our variable of interest and mu is set equal to the mean specified by the null hypothesis.

For example, if we wanted to test whether the volume of a shipment of lumber (\(x\)) was less than usual (\(\mu = 39000\) cubic feet), we would run the following. (Note that t.test() defaults to a two-sided alternative; for a strictly one-sided test of "less than," you would add alternative = "less".)

    # draw a sample
    set.seed(1234)
    treeVolume <- rnorm(100, mean = 36500, sd = 2000)

    t.test(treeVolume, mu = 39000) # H0: mu = 39000

    One Sample t-test

data:  treeVolume
t = -14.006, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 39000
95 percent confidence interval:
 35787.88 36585.07
sample estimates:
mean of x 
 36186.48 

You should get the same answer as calculated by hand. Note that pt() with lower.tail = TRUE gives the one-sided (lower-tail) p-value; the two-sided p-value reported by t.test() is twice that amount.

    t <- (mean(treeVolume)-39000)/(sd(treeVolume)/sqrt(100)); t
[1] -14.00592
    pt(t, 100-1, lower.tail = TRUE)
[1] 1.600489e-25
  2. Independent-samples T-test in R

The independent-samples test can take different forms, depending on the structure of your data and whether the two groups' variances are assumed equal. The general form of the test is t.test(y1, y2, paired = FALSE). By default, R assumes that the variances of y1 and y2 are unequal and runs Welch's test; to assume equal variances instead, set var.equal = TRUE.

Let’s test the hypothesis that Clevelanders and New Yorkers spend different amounts on eating out each month.

    # draw two samples 
    set.seed(1234)
    Spenders_Cleve <- rnorm(60, mean = 380, sd = 77)
    Spenders_NY <- rnorm(60, mean = 400, sd = 80)
  • Equal variance
        t.test(Spenders_Cleve, Spenders_NY, var.equal = TRUE)

    Two Sample t-test

data:  Spenders_Cleve and Spenders_NY
t = -4.7988, df = 118, p-value = 4.716e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -88.50983 -36.79977
sample estimates:
mean of x mean of y 
 347.3503  410.0051 
  • Unequal variance
        t.test(Spenders_Cleve, Spenders_NY, var.equal = FALSE)

    Welch Two Sample t-test

data:  Spenders_Cleve and Spenders_NY
t = -4.7988, df = 117.99, p-value = 4.716e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -88.50985 -36.79976
sample estimates:
mean of x mean of y 
 347.3503  410.0051 
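As with the one-sample test, you can reproduce the pooled (equal-variance) statistic by hand. This sketch re-draws the same samples with the same seed so it is self-contained and matches the t.test() output above:

```r
# Reproduce the pooled two-sample t-statistic by hand
# (same seed and draws as above, so the numbers match
# t.test(..., var.equal = TRUE)).
set.seed(1234)
Spenders_Cleve <- rnorm(60, mean = 380, sd = 77)
Spenders_NY <- rnorm(60, mean = 400, sd = 80)

n1 <- length(Spenders_Cleve)
n2 <- length(Spenders_NY)
# Pooled variance: degrees-of-freedom-weighted average of the sample variances
sp2 <- ((n1 - 1) * var(Spenders_Cleve) + (n2 - 1) * var(Spenders_NY)) / (n1 + n2 - 2)
t_hand <- (mean(Spenders_Cleve) - mean(Spenders_NY)) / sqrt(sp2 * (1 / n1 + 1 / n2))
t_hand                                   # matches t = -4.7988 above
2 * pt(-abs(t_hand), df = n1 + n2 - 2)   # two-sided p-value
```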

Download .qmd for Discussion Section V