Chapter 8 Representing distributions

This chapter introduces two more ways to represent univariate (single-variable) data distributions, as we start to learn how to compare two or more distributions and draw conclusions. So far we’ve represented univariate distributions using frequency histograms (by default, geom_histogram() sorts the data into bins and shows how many observations fall into each one). Frequency histograms are useful for examining the particulars of a single variable, but they have limited utility when directly comparing distributions that contain different numbers of observations, because the bar heights scale with the sample size. We resolve this issue by introducing the normalized version of the frequency histogram, the probability mass function (PMF), in which each count is divided by the total number of observations. As we’ll see, the PMF has limitations of its own, which will motivate us to consider other, more robust representations of data distributions, such as the very useful cumulative distribution function (CDF).
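
To make the idea of normalization concrete, here is a minimal sketch in base R. The two vectors are made-up toy data (they are not taken from county_complete), and the variable names are only illustrative. The two samples have the same shape but different sizes, so their raw counts differ while their PMFs agree.

```r
# Toy data, invented for illustration: same shape, different sample sizes
small_sample <- c(1, 2, 2, 3)
large_sample <- c(1, 1, 2, 2, 2, 2, 3, 3)

# Frequencies: raw counts per value, which grow with the sample size
freq_small <- table(small_sample)
freq_large <- table(large_sample)

# PMF: divide each count by the total number of observations
pmf_small <- freq_small / length(small_sample)
pmf_large <- freq_large / length(large_sample)

# CDF: running total of the PMF, ending at 1
cdf_small <- cumsum(pmf_small)

print(freq_small)
print(freq_large)
print(pmf_small)
print(pmf_large)
print(cdf_small)
```

The counts for the larger sample are twice those of the smaller one, yet both PMFs assign the same probability to each value, which is what makes them directly comparable. The same normalization can also be written as prop.table(table(x)).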

# Packages
library(dplyr)
library(ggplot2)
library(readr)
# Dataset
county_complete <- read_rds("data/county_complete.rds")