I’d like to start off this series of posts by taking a look at the data we’re going to be working with. As a reminder, I’m sharing the code for this series at https://github.com/dkincaid/web-app-explore.

As I mentioned last post the data that we’re going to use comes from the Tycho project at University of Pittsburgh. I’m going to use the Level 2 data. I chose this data for a couple reasons. The first is that it is a little larger dataset (47 diseases over 1888-2013) and offers an API to access the data.

For more information on this data see:

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Overview

I have already downloaded and processed all the data for all the states into a file named states_cases.Rda. Let’s load that up and have a look.

options(width = 120, warn = -1)
load("states_cases.Rda")
str(states_cases)
## 'data.frame':	819885 obs. of  6 variables:
##  $ year   : chr  "1942" "1942" "1942" "1942" ...
##  $ week   : chr  "1" "2" "3" "4" ...
##  $ state  : Factor w/ 57 levels "AL","AR","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ number : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ event  : Factor w/ 1 level "CASES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ disease: Factor w/ 39 levels "ANTHRAX","BOTULISM",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(states_cases)
##      year               week               state            number        event       
##  Length:819885      Length:819885      NY     : 82686   Min.   :    0   CASES:819885  
##  Class :character   Class :character   CA     : 22953   1st Qu.:    1                 
##  Mode  :character   Mode  :character   TX     : 21116   Median :    5                 
##                                        OH     : 20316   Mean   :   65                 
##                                        FL     : 20070   3rd Qu.:   28                 
##                                        MI     : 19188   Max.   :89363                 
##                                        (Other):633556                                 
##                           disease      
##  MEASLES                      :118727  
##  WHOOPING COUGH [PERTUSSIS]   : 75424  
##  DIPHTHERIA                   : 65007  
##  SCARLET FEVER                : 64310  
##  SMALLPOX                     : 53461  
##  TYPHOID FEVER [ENTERIC FEVER]: 41711  
##  (Other)                      :401245

So we have counts of the number of cases reported in each state for each week of the year. What is the range of years?

c(min(states_cases$year), max(states_cases$year))
## [1] "1900" "2013"

Let’s see how the total cases for each disease in the data. Here I’m going to use the new package from Hadley Wickam, dplyr (which I am loving!).

library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
states_cases %.% group_by(disease) %.% summarise(total = sum(number)) %.% arrange(desc(total))
## Source: local data frame [39 x 2]
## 
##                               disease    total
## 1                             MEASLES 22591587
## 2                           INFLUENZA  6741366
## 3                           CHLAMYDIA  5237336
## 4                       SCARLET FEVER  5166016
## 5                           GONORRHEA  4117671
## 6          WHOOPING COUGH [PERTUSSIS]  2813601
## 7              CHICKENPOX [VARICELLA]  1808532
## 8                               MUMPS   914045
## 9                          DIPHTHERIA   910819
## 10          STREPTOCOCCAL SORE THROAT   638494
## 11                            RUBELLA   479160
## 12 TUBERCULOSIS [PHTHISIS PULMONALIS]   300760
## 13                          PNEUMONIA   286079
## 14      TYPHOID FEVER [ENTERIC FEVER]   259413
## 15                           SMALLPOX   242003
## 16                      SALMONELLOSIS   205408
## 17                  RABIES IN ANIMALS   148768
## 18                       LYME DISEASE   128033
## 19                         GIARDIASIS    96853
## 20                        SHIGELLOSIS    68185
## 21                          DYSENTERY    56341
## 22                            MALARIA    37629
## 23       BRUCELLOSIS [UNDULANT FEVER]    35940
## 24                 COCCIDIOIDOMYCOSIS    34481
## 25                  CRYPTOSPORIDIOSIS    30581
## 26       ROCKY MOUNTAIN SPOTTED FEVER    28198
## 27                      LEGIONELLOSIS    20464
## 28                          TULAREMIA    17088
## 29                       TYPHUS FEVER    16220
## 30                         MENINGITIS     4456
## 31                            TETANUS     2639
## 32                       TRICHINIASIS     2027
## 33               TOXIC SHOCK SYNDROME     1957
## 34                        PSITTACOSIS     1345
## 35                            ANTHRAX      204
## 36                            LEPROSY      150
## 37                      POLIOMYELITIS       64
## 38                           BOTULISM       15
## 39                           PELLAGRA        0

and what about the top 10 years?

states_cases %.% group_by(year) %.% summarise(total = sum(number)) %.% arrange(desc(total)) %.% head(10)
## Source: local data frame [10 x 2]
## 
##    year   total
## 1  1941 2063853
## 2  1928 1755548
## 3  1938 1738608
## 4  1943 1527009
## 5  1939 1490709
## 6  1929 1381464
## 7  1935 1359174
## 8  1946 1358645
## 9  1944 1344990
## 10 1973 1264263

How about looking at the worst 10 years for each disease (I’ll just show the first 3 diseases here to save some space)

topyear <- function(d) {
    subset(states_cases, disease == d) %.% group_by(year) %.% summarise(total = sum(number)) %.% arrange(desc(total)) %.% head(10)
}

topyears.by.disease <- lapply(levels(states_cases$disease), topyear)
names(topyears.by.disease) <- levels(states_cases$disease)
head(topyears.by.disease, 3)
## $ANTHRAX
## Source: local data frame [4 x 2]
## 
##   year total
## 1 1943    90
## 2 1942    67
## 3 1944    45
## 4 1945     2
## 
## $BOTULISM
## Source: local data frame [1 x 2]
## 
##   year total
## 1 1952    15
## 
## $`BRUCELLOSIS [UNDULANT FEVER]`
## Source: local data frame [10 x 2]
## 
##    year total
## 1  1947  6436
## 2  1946  5830
## 3  1945  4623
## 4  1948  2631
## 5  1952  2098
## 6  1953  1823
## 7  1954  1812
## 8  1955  1266
## 9  1956  1091
## 10 1957   915

So we can see that for each disease we have very different amounts of data. Let’s see what the min and max year is for each disease

states_cases %.% group_by(disease) %.% summarise(min = min(year), max = max(year), range = as.numeric(max) - as.numeric(min)) %.% arrange(disease)
## Source: local data frame [39 x 4]
## 
##                               disease  min  max range
## 1                             ANTHRAX 1942 1945     3
## 2                            BOTULISM 1952 1952     0
## 3        BRUCELLOSIS [UNDULANT FEVER] 1945 1981    36
## 4              CHICKENPOX [VARICELLA] 1972 2013    41
## 5                           CHLAMYDIA 2006 2013     7
## 6                  COCCIDIOIDOMYCOSIS 2006 2013     7
## 7                   CRYPTOSPORIDIOSIS 2006 2013     7
## 8                          DIPHTHERIA 1920 1981    61
## 9                           DYSENTERY 1942 1948     6
## 10                         GIARDIASIS 2006 2013     7
## 11                          GONORRHEA 1972 2013    41
## 12                          INFLUENZA 1919 1951    32
## 13                      LEGIONELLOSIS 1982 2013    31
## 14                            LEPROSY 1942 1945     3
## 15                       LYME DISEASE 2006 2013     7
## 16                            MALARIA 1952 2013    61
## 17                            MEASLES 1909 1982    73
## 18                         MENINGITIS 1917 1959    42
## 19                              MUMPS 1968 2013    45
## 20                           PELLAGRA 1928 1929     1
## 21                          PNEUMONIA 1920 1951    31
## 22                      POLIOMYELITIS 1939 1939     0
## 23                        PSITTACOSIS 1956 1961     5
## 24                  RABIES IN ANIMALS 1948 2013    65
## 25       ROCKY MOUNTAIN SPOTTED FEVER 1942 2011    69
## 26                            RUBELLA 1966 2001    35
## 27                      SALMONELLOSIS 2006 2013     7
## 28                      SCARLET FEVER 1909 1966    57
## 29                        SHIGELLOSIS 2006 2013     7
## 30                           SMALLPOX 1900 1952    52
## 31          STREPTOCOCCAL SORE THROAT 1960 1961     1
## 32                            TETANUS 1962 1972    10
## 33               TOXIC SHOCK SYNDROME 1983 1994    11
## 34                       TRICHINIASIS 1952 1955     3
## 35 TUBERCULOSIS [PHTHISIS PULMONALIS] 1921 2013    92
## 36                          TULAREMIA 1942 1973    31
## 37      TYPHOID FEVER [ENTERIC FEVER] 1909 1983    74
## 38                       TYPHUS FEVER 1942 1947     5
## 39         WHOOPING COUGH [PERTUSSIS] 1909 2013   104

Graphical visualizations

Finally, let’s look at a couple of plots. Here is a look at diphtheria and whooping cough in the states of New York, California, Texas and Illinois.

library(ggplot2, quietly = TRUE)
di <- c("DIPHTHERIA", "WHOOPING COUGH [PERTUSSIS]")

dip_whoop <- subset(states_cases, disease %in% di & state %in% c("NY", "CA", "TX", "IL"))

ggplot(dip_whoop, aes(x = as.Date(paste("1", week, year, sep = "-"), format = "%w-%W-%Y"), y = number)) + geom_bar(stat = "identity", width = 7) + scale_y_continuous("Cases", limits = c(0, 
    1500)) + scale_x_date("Week") + facet_grid(disease ~ state)

plot of chunk tycho-explore-1

Now I’ll create a matrix for Whooping Cough with states as rows and years as columns and create a heatmap.

library(reshape2, quietly = TRUE, warn.conflicts = FALSE)
library(RColorBrewer, quietly = TRUE, warn.conflicts = FALSE)
influenza.year.by.state <- acast(subset(states_cases, disease == "INFLUENZA"), state ~ year, value.var = "number", fun.aggregate = sum)
influenza.year.by.state[1:10, 1:10]
##    1919 1920 1921 1922 1923 1927  1928  1929 1930  1931
## AL   15    0    0    0    0  416 15705 69455 3281  5925
## AR  213   88    0    0    0  404 12323 20058 2529  3504
## AZ    0    0    0    0    0    1  4347   628  348   462
## CA  182  288    0    0    0  112 44906  5398 1716  5987
## CO    0    0    0    0    0    2  5677   998   11     0
## CT   60   15    0    0    0   49  1068 16748  251  1300
## DC    0  126    0    0    0    0   571  2028   36   219
## DE   14    1    0    0    0    3    43   486   18   275
## FL  114   12    0    0    0   39  1614  3737  104  1496
## GA  204   54    0    0    0  467 23688 29033 3348 12325
m <- melt(influenza.year.by.state)

ggplot(m, aes(x = Var2, y = Var1, fill = value)) + geom_tile() + scale_colour_brewer() + labs(x = "Year", y = "State", color = "Cases") + theme_bw()

plot of chunk tycho-explore-2

Next time

I think that is enough for now. That should give you a decent sense for the data we’ll be working with. Next time I’ll start creating the Shiny application that will let someone do some of this exploration on a web based UI.



blog comments powered by Disqus

Published

18 January 2014