A detailed guide on how I scraped opioid prescription history from the CDC and created an animated map to highlight changing behavior.
For some reason, I really enjoy making maps as a way to illustrate state and regional variation. One area that has appropriately gotten a lot of press is opioid prescribing. The CDC is a great data resource, and maintains state and county-level prescibing data.
Unfortunately, I did not find that the data was available in an easily readable format, e.g. CSV, JSON, etc. What follows is a walk-through of how I went about scraping the CDC website to get this data.
It is good practice to check a site’s
robots.txt to make sure that they are cool with you accessing and scraping their data. The
polite package makes this pretty simple.
<polite session> https://www.cdc.gov/ User-agent: M Khan robots.txt: 44 rules are defined for 4 bots Crawl delay: 5 sec The path is scrapable for this user-agent
Now that I see that it is ok for me to scrape the data, I’d like to start out with a single case - the year 2017 - to test things out before replicating it for all available years.
Things seem reasonable for 2017, I can now apply the same process for all of the available years.
We find that the links for the years 2006 to 2017 are available at this link. I can use the same SelectorGadget method to identify and subsequently scrape the links for each of these years.
# Collect list of sublinks sublinks <- ralger::scrap( "https://www.cdc.gov/drugoverdose/maps/rxrate-maps.html", node = ".mb-3+ .mb-3 a")
Looks like I’ll need to append each of these with “https://www.cdc.gov/drugoverdose/maps/rxcounty”. Let’s create a dataframe that has the year and the corresponding link.
Now that we’ve got all of our links, we can perform the same function we did above with the 2017 for all the years.
Scrape all of the links:
opioid_df <- opioid_links %>% mutate(data_df = map(link, possibly(ralger::table_scrap, otherwise = NULL, quiet = FALSE)))
I’d like to point out that I used
possibly. This is to make sure that the mapping function is all for naught if there is an error of some sort. Check out the R4DS section on Iteration: Dealing with Failure, which also covers
possiblyarguments used here:
otherwise = NULL, i.e. if there is a failure, it will assign a value of
quiet = FALSEbecause I wanted a verbose output to see what’s going on
# Notice how 'data_df' column is just a nested df opioid_df %>% head(3)
# A tibble: 3 x 3 year link data_df <dbl> <chr> <list> 1 2006 https://www.cdc.gov/drugoverdose/maps/rxstate… <df[,3] [51 × … 2 2007 https://www.cdc.gov/drugoverdose/maps/rxstate… <df[,3] [51 × … 3 2008 https://www.cdc.gov/drugoverdose/maps/rxstate… <df[,3] [51 × …
For each year, I have a corresponding tibble that contains state level opioid prescribing data. I will have to use
unnest() to expand this.
complete_df <- opioid_df %>% unnest(cols = c(data_df)) %>% rename( "state" = "State", "state_abbrev" = "State Abbreviation", "opioid_Rx_rate" = "Opioid Dispensing Rate per 100" ) %>% # Name of abbreviation column changed from pre- and post-2017 mutate(state_abbrev = coalesce(state_abbrev, Abbreviation)) %>% select(-Abbreviation, -link) complete_df %>% head() %>% gt()
Above, I outlined how I went about scraping the CDC website for opioid prescribing data. Now, I’ll describe how I took this data and made a map and used
gganmiate to make a GIF that better shows the state-level trends from 2006-2019.
First, a blank canvas:
# library(sf) # follow installation instructions at: https://r-spatial.github.io/sf/ # library(albersusa) # library(gganimate) ggplot() + geom_sf(data = albersusa::usa_sf()) + ggthemes::theme_map() + coord_sf(crs = albersusa::us_aeqd_proj)
You may notice that the relevant geometric information for each state is provided in
albersusa::usa_sf(). Thus, we must map this to the data of interest with a join, e.g.
I’ll start by making a map of prescribing rates for 2006.
usa_sf() %>% inner_join(complete_df %>% filter(year == 2006), # filter for the year 2006 by = c("iso_3166_2" = "state_abbrev")) %>% ggplot() + geom_sf(aes(fill = opioid_Rx_rate)) + scale_fill_gradient2(name = "Opioid Rx rate", low = "blue", high = "red", midpoint = median(complete_df$opioid_Rx_rate)) + ggthemes::theme_map() + coord_sf(crs = albersusa::us_aeqd_proj) + theme(legend.position = "right") + labs(title = "Opioid Prescribing Rates in 2006", subtitle = "Rates reported as retail opioid prescriptions per 100 persons", caption = "by Mirza Khan (@mirzakhan_) \n Source: CDC National Center for Injury Prevention and Control")