Hello and thank you for coming! I’m Thomas Becker, an assistant researcher at the Institute for Policy & Social Research. Today we’re going to cover how to us R for a few basic statistical tasks. Much of this material is based on videos on Greg Martin’s free youtube channel.
We’ll be using the gapminder and starwars datasets. Gapminder is primarily used for educational stuff like this, but you can find out more about it and the gapminder project here.
The star wars dataset is just a toy dataset that is part of the tidyverse package.
Dplyr will allow us to more easily manipulate our data.
Ggplot2 will let us visualize some of our results.
If you would like to follow along, please install the gapminder, dplyr, and ggplot2 packages using the following commands:
install.packages(“gapminder”) install.packages(“dplyr”) install.packages(“ggplot2”) install.packages(“tidyverse”)
These are the libraries we will be using.
Please run the following:
library(gapminder)
library(dplyr)
library(tidyverse)
attach(gapminder)
Note the attach() function. This allows us to refer to varaibles in our dataset by their names alone. This way we can type gdp rather than gapminder$gdp, for instance
Lets see what is in our dataset. You can use the “summary()” command to get some descriptive statistics for each variable.
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
unique(year)
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
length(unique(year))
## [1] 12
We can take a closer look at particular variables using the sd() and var() functions.
sd(gdpPercap)
## [1] 9857.455
var(pop)
## [1] 1.12695e+16
We can look at the distribution of life expectancy and population graphically in our dataset using the hist() function.
hist(lifeExp)
hist(pop)
Because population is so skewed, we can use a log transformation to make a more useful graphic.
hist(log(pop))
We may also be interested in depicting how life expectancy varies within and between continents.
boxplot(lifeExp ~ continent)
Scatter plots can be a good way to get a quick look at the relationship between two numerical variables.
plot(lifeExp ~ gdpPercap)
Like the population distribution, we can use a log transformation to make this graph more useful
plot(lifeExp ~ log(gdpPercap))
Let’s use the ggplot2 package to make a more informative graphic depicting gdp per capita, population, and continent for each of our observations. We are also going to use the pipe operator, which looks like this: %>%
gapminder %>%
filter(gdpPercap < 50000) %>%
ggplot(aes(x=log(gdpPercap), y = lifeExp, size = pop, color = continent)) +
geom_point(alpha= 0.5)
## Linear regression
Going back to just our gdp per capita and life expectancy variables, We can use the lm() function to fit a linear model to our data.
summary(lm(lifeExp ~ gdpPercap))
##
## Call:
## lm(formula = lifeExp ~ gdpPercap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.754 -7.758 2.176 8.225 18.426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.396e+01 3.150e-01 171.29 <2e-16 ***
## gdpPercap 7.649e-04 2.579e-05 29.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared: 0.3407, Adjusted R-squared: 0.3403
## F-statistic: 879.6 on 1 and 1702 DF, p-value: < 2.2e-16
Now we are going manipulate our data to make it easier to work with using the dplyr package.
gapminder %>%
ggplot(aes(x=log(gdpPercap), y = lifeExp)) +
geom_point(alpha= 0.5) +
geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
Lets look at height and sex of starwars characters. We’re going to use dplyr filter and select functions to specify what variables and which observations we want to keep, then we use drop_na to get rid of observations with missing values for height.
dfsw <- starwars %>%
select(sex, height) %>%
filter(sex %in% c("male", "female"))%>%
drop_na(height)
Suppose we want to see if male star wars characters are, on average, taller than female characters. To look at our sample averages, we can do the following.
dfsw %>%
group_by(sex)%>%
summarise(average_height = mean(height))
## # A tibble: 2 × 2
## sex average_height
## <chr> <dbl>
## 1 female 172.
## 2 male 179.
If we are interested, however, in the entire star wars universe, we need to do a statistical test to see if average heights of all star wars universe males and females are different.
t.test (height ~ sex, data = dfsw)
##
## Welch Two Sample t-test
##
## data: height by sex
## t = -1.1817, df = 48.471, p-value = 0.2431
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -20.396341 5.293584
## sample estimates:
## mean in group female mean in group male
## 171.5714 179.1228
Our results indicate that the observed difference in height from are sample is not statistically significant. There is a good chance that the difference in mean observed is entirely due to chance.
df2 <- read_csv('https://github.com/nytimes/covid-19-data/raw/refs/heads/master/us-counties-2023.csv')
## Rows: 267009 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): county, state, fips
## dbl (2): cases, deaths
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.