Some Basic Statistics in R

Intro

Hello and thank you for coming! I’m Thomas Becker, an assistant researcher at the Institute for Policy & Social Research. Today we’re going to cover how to us R for a few basic statistical tasks. Much of this material is based on videos on Greg Martin’s free youtube channel.

Install packages

We’ll be using the gapminder and starwars datasets. Gapminder is primarily used for educational stuff like this, but you can find out more about it and the gapminder project here.

The star wars dataset is just a toy dataset that is part of the tidyverse package.

Dplyr will allow us to more easily manipulate our data.

Ggplot2 will let us visualize some of our results.

If you would like to follow along, please install the gapminder, dplyr, and ggplot2 packages using the following commands:

install.packages(“gapminder”) install.packages(“dplyr”) install.packages(“ggplot2”) install.packages(“tidyverse”)

Libraries

These are the libraries we will be using.

Please run the following:

library(gapminder)

library(dplyr)

library(tidyverse)

attach(gapminder)

Note the attach() function. This allows us to refer to varaibles in our dataset by their names alone. This way we can type gdp rather than gapminder$gdp, for instance

Taking a quick look

Lets see what is in our dataset. You can use the “summary()” command to get some descriptive statistics for each variable.

summary(gapminder)

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
##

unique(year)

##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007

length(unique(year))

## [1] 12

Looking at variables of interest

We can take a closer look at particular variables using the sd() and var() functions.

sd(gdpPercap)

## [1] 9857.455

var(pop)

## [1] 1.12695e+16

Histograms

We can look at the distribution of life expectancy and population graphically in our dataset using the hist() function.

hist(lifeExp)

hist(pop)

Because population is so skewed, we can use a log transformation to make a more useful graphic.

hist(log(pop))

Box plots

We may also be interested in depicting how life expectancy varies within and between continents.

boxplot(lifeExp ~ continent)

Relationship between gdp per capita and life expectency

Scatter plots can be a good way to get a quick look at the relationship between two numerical variables.

plot(lifeExp ~ gdpPercap)

Like the population distribution, we can use a log transformation to make this graph more useful

plot(lifeExp ~ log(gdpPercap))

Better data viz

Let’s use the ggplot2 package to make a more informative graphic depicting gdp per capita, population, and continent for each of our observations. We are also going to use the pipe operator, which looks like this: %>%

gapminder %>% 
filter(gdpPercap < 50000) %>%
  ggplot(aes(x=log(gdpPercap), y = lifeExp, size = pop, color = continent)) +
  geom_point(alpha= 0.5)

## Linear regression

Going back to just our gdp per capita and life expectancy variables, We can use the lm() function to fit a linear model to our data.

summary(lm(lifeExp ~ gdpPercap))

## 
## Call:
## lm(formula = lifeExp ~ gdpPercap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.754  -7.758   2.176   8.225  18.426 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.396e+01  3.150e-01  171.29   <2e-16 ***
## gdpPercap   7.649e-04  2.579e-05   29.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared:  0.3407, Adjusted R-squared:  0.3403 
## F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16

Lets depict our linear model graphically

Now we are going manipulate our data to make it easier to work with using the dplyr package.

gapminder %>% 
  ggplot(aes(x=log(gdpPercap), y = lifeExp)) +
  geom_point(alpha= 0.5) +
  geom_smooth(method = lm)

## `geom_smooth()` using formula = 'y ~ x'

Lets try a ttest using a different dataset

Lets look at height and sex of starwars characters. We’re going to use dplyr filter and select functions to specify what variables and which observations we want to keep, then we use drop_na to get rid of observations with missing values for height.

dfsw <- starwars %>%
  select(sex, height) %>%
  filter(sex %in% c("male", "female"))%>%
  drop_na(height)

Suppose we want to see if male star wars characters are, on average, taller than female characters. To look at our sample averages, we can do the following.

dfsw %>%
  group_by(sex)%>%
  summarise(average_height = mean(height))

## # A tibble: 2 × 2
##   sex    average_height
##   <chr>           <dbl>
## 1 female           172.
## 2 male             179.

If we are interested, however, in the entire star wars universe, we need to do a statistical test to see if average heights of all star wars universe males and females are different.

T-test

t.test (height ~ sex, data = dfsw)

## 
##  Welch Two Sample t-test
## 
## data:  height by sex
## t = -1.1817, df = 48.471, p-value = 0.2431
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -20.396341   5.293584
## sample estimates:
## mean in group female   mean in group male 
##             171.5714             179.1228

Our results indicate that the observed difference in height from are sample is not statistically significant. There is a good chance that the difference in mean observed is entirely due to chance.

Bonus: Reading data from online source

df2 <- read_csv('https://github.com/nytimes/covid-19-data/raw/refs/heads/master/us-counties-2023.csv')

## Rows: 267009 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): county, state, fips
## dbl  (2): cases, deaths
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.