The entire world has been affected by the Covid-19 pandemic. While some countries have been devastated by the quick spread of Covid-19, some countries have been surprisingly effective in controlling the oubreak. The varying approaches in handling the pandemic is our topic of interest for this data analysis. Many different countries have taken different approaches in fighting this pandemic and we wanted to find out which countries in Europe have been most effective. Various methods like strict contract tracing, stay at home orders, non-essential business closures, etc. have been implemented in varying countries. In this tutorial, we will take you through all of the steps involved in the data science pipeline. We will show you techniques for data curation, exploratory data analysis, hypothesis testing, and machine learning to provide analysis on the topic and use that to provide valuable insights on how best to move forward in managing this disease. As university students, we are eager to get things back to normal so that hopefully we can resume school in the fall. This project is our way of contributing to help fight this pandemic together!
The first step to data analysis is finding and managing data. To track the effects of this and future projections I have used data from http://www.healthdata.org/ For simplicity we will look only at data from the top 10 economies in Europe: Germany, United Kingdom, France, Italy, Spain, Netherlands, Switzerland, Sweden, Poland, and Belgium (Mullan). The data provides future projections as well, but we are not interested in them for our purposes, so we will filter out any entries that have been provided since the last time the data was updated, which was May 4 at the time the data was downloaded. It will also be necessary to provide summary data for each country in the data set such as population, date locked down, etc.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages --------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
library(pracma)
## Warning: package 'pracma' was built under R version 3.6.3
##
## Attaching package: 'pracma'
## The following object is masked from 'package:purrr':
##
## cross
library("RSQLite")
library(dplyr)
Hospitalizations <-read_csv("Hospitalization_all_locs.csv") %>%
filter(location_name %in% c("Germany", "United Kingdom", "France", "Italy", "Spain", "Netherlands", "Switzerland", "Sweden", "Poland","Belgium")) %>%
filter(date < as.Date("2020-05-05")) %>%
filter(allbed_mean > 0)
## Parsed with column specification:
## cols(
## .default = col_double(),
## location_name = col_character(),
## date = col_date(format = ""),
## mobility_data_type = col_character(),
## total_tests_data_type = col_character()
## )
## See spec(...) for full column specifications.
head(Hospitalizations)
## # A tibble: 6 x 44
## V1 location_name date allbed_mean allbed_lower allbed_upper
## <dbl> <chr> <date> <dbl> <dbl> <dbl>
## 1 3260 Belgium 2020-03-09 12.0 7 22.5
## 2 3261 Belgium 2020-03-10 15.9 9.5 28.5
## 3 3262 Belgium 2020-03-11 23.9 15.5 41.5
## 4 3263 Belgium 2020-03-12 35.5 24.5 55
## 5 3264 Belgium 2020-03-13 55.2 40 81.5
## 6 3265 Belgium 2020-03-14 82.7 62.5 114.
## # ... with 38 more variables: ICUbed_mean <dbl>, ICUbed_lower <dbl>,
## # ICUbed_upper <dbl>, InvVen_mean <dbl>, InvVen_lower <dbl>,
## # InvVen_upper <dbl>, deaths_mean <dbl>, deaths_lower <dbl>,
## # deaths_upper <dbl>, admis_mean <dbl>, admis_lower <dbl>, admis_upper <dbl>,
## # newICU_mean <dbl>, newICU_lower <dbl>, newICU_upper <dbl>,
## # totdea_mean <dbl>, totdea_lower <dbl>, totdea_upper <dbl>,
## # bedover_mean <dbl>, bedover_lower <dbl>, bedover_upper <dbl>,
## # icuover_mean <dbl>, icuover_lower <dbl>, icuover_upper <dbl>,
## # deaths_mean_smoothed <dbl>, deaths_lower_smoothed <dbl>,
## # deaths_upper_smoothed <dbl>, totdea_mean_smoothed <dbl>,
## # totdea_lower_smoothed <dbl>, totdea_upper_smoothed <dbl>,
## # mobility_data_type <chr>, mobility_composite <dbl>,
## # total_tests_data_type <chr>, total_tests <dbl>, confirmed_infections <dbl>,
## # est_infections_mean <dbl>, est_infections_lower <dbl>,
## # est_infections_upper <dbl>
Next we, wrote a function to include population size of each of the European countries in the dataset. We use these numbers to calculate the percentage of hospitalizations per country. We then use ggplot in order to visualize the differences between each country, where each colored line reperesents a different country. These population figures were provided by Eurostat at https://ec.europa.eu/eurostat/tgm/refreshTableAction.do?tab=table&plugin=1&pcode=tps00001&language=en.
popsize <- function(loc){
if (loc == "Germany"){
82576900
} else if (loc == "France"){
67187000
} else if (loc == "United Kingdom"){
65648100
} else if (loc == "Italy"){
60391000
} else if (loc == "Spain"){
46549045
} else if (loc == "Poland"){
38426000
} else if (loc == "Netherlands"){
17261622
} else if (loc == "Sweden"){
10142686
} else if (loc == "Belgium"){
11399335
} else{
8648907
}
}
Hospitalizations$DailyDeathsPerHundredThousand <- Hospitalizations$deaths_mean/popsize(Hospitalizations$location_name) * 100000
## Warning in if (loc == "Germany") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (loc == "France") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (loc == "United Kingdom") {: the condition has length > 1 and only
## the first element will be used
## Warning in if (loc == "Italy") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (loc == "Spain") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (loc == "Poland") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (loc == "Netherlands") {: the condition has length > 1 and only
## the first element will be used
## Warning in if (loc == "Sweden") {: the condition has length > 1 and only the
## first element will be used
## Warning in if (loc == "Belgium") {: the condition has length > 1 and only the
## first element will be used
Hospitalizations %>% ggplot(aes(x=Hospitalizations$date,y=Hospitalizations$DailyDeathsPerHundredThousand,color = location_name)) + geom_point() + geom_line()
We can observe a difference in trends across the different countries, as countries such as Italy and Spain experienced a peak right before April 1, and then started to decrease steadily and sharply into the more recent weeks, while countires such as the United Kingdom have a lot more irregular shapes and show evidence of peaking almost randomly with a steady decline nowhere in sight.
We would like to look into other metrics as mentioned before to see if we can identify factors that might have influenced the shapes of the curves in the graph above for each different country. To do that, we must parse and clean the summary data file and choose specific columns of interest, as the set of data is very large.
Summary <- read_csv("Summary_stats_all_locs.csv") %>%
filter(location_name %in% c("Germany", "United Kingdom", "France", "Italy", "Spain", "Netherlands", "Switzerland", "Sweden", "Poland","Belgium"))
## Parsed with column specification:
## cols(
## .default = col_date(),
## location_name = col_character(),
## all_bed_capacity = col_double(),
## icu_bed_capacity = col_double(),
## all_bed_usage = col_double(),
## icu_bed_usage = col_double(),
## available_all_nbr = col_double(),
## available_icu_nbr = col_double()
## )
## See spec(...) for full column specifications.
Summary
## # A tibble: 10 x 28
## location_name peak_bed_day_me~ peak_bed_day_lo~ peak_bed_day_up~
## <chr> <date> <date> <date>
## 1 Belgium 2020-04-11 2020-04-10 2020-05-09
## 2 France 2020-04-03 2020-04-02 2020-05-02
## 3 Germany 2020-04-16 2020-04-15 2020-05-03
## 4 Italy 2020-03-28 2020-03-27 2020-03-28
## 5 Netherlands 2020-04-04 2020-04-02 2020-05-10
## 6 Poland 2020-05-02 2020-04-26 2020-05-09
## 7 Spain 2020-03-30 2020-03-29 2020-03-30
## 8 Sweden 2020-05-31 2020-04-17 2020-06-07
## 9 Switzerland 2020-04-04 2020-04-02 2020-05-08
## 10 United Kingd~ 2020-04-11 2020-04-09 2020-05-10
## # ... with 24 more variables: peak_icu_bed_day_mean <date>,
## # peak_icu_bed_day_lower <date>, peak_icu_bed_day_upper <date>,
## # peak_vent_day_mean <date>, peak_vent_day_lower <date>,
## # peak_vent_day_upper <date>, all_bed_capacity <dbl>, icu_bed_capacity <dbl>,
## # all_bed_usage <dbl>, icu_bed_usage <dbl>, available_all_nbr <dbl>,
## # available_icu_nbr <dbl>, travel_limit_start_date <date>,
## # travel_limit_end_date <date>, stay_home_start_date <date>,
## # stay_home_end_date <date>, educational_fac_start_date <date>,
## # educational_fac_end_date <date>, any_gathering_restrict_start_date <date>,
## # any_gathering_restrict_end_date <date>, any_business_start_date <date>,
## # any_business_end_date <date>, `all_non-ess_business_start_date` <date>,
## # `all_non-ess_business_end_date` <date>
We may first view what countries have been most effective in mitigating deaths by running a linear model across all countries in our data set. Because we are measuring Death Rates, we can reasonably say that countries that are associated with lover coefficients have been more effective in mitigating deaths up till this point. Using the tidy() and kable() funtions we can view summary statistics clearly which will help us best draw conslusions. The null hypothesis is that all countries have been equally effective in itigating deaths and the p-value cutoff we will use is .05.
library(broom)
library(tidyverse)
location_lm <- lm(DailyDeathsPerHundredThousand~location_name, data=Hospitalizations)
location_lm %>% tidy() %>% knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.2195233 | 0.2459908 | 4.9575974 | 0.0000009 |
location_nameFrance | 2.2271015 | 0.3370095 | 6.6084245 | 0.0000000 |
location_nameGermany | -0.0905455 | 0.3526821 | -0.2567339 | 0.7974749 |
location_nameItaly | 2.1487952 | 0.3254152 | 6.6032412 | 0.0000000 |
location_nameNetherlands | -0.4486072 | 0.3463808 | -1.2951272 | 0.1957898 |
location_namePoland | -1.0873645 | 0.3680938 | -2.9540417 | 0.0032632 |
location_nameSpain | 2.3926624 | 0.3394995 | 7.0476159 | 0.0000000 |
location_nameSweden | -0.7778707 | 0.3510318 | -2.2159549 | 0.0270807 |
location_nameSwitzerland | -0.9629653 | 0.3421330 | -2.8145937 | 0.0050491 |
location_nameUnited Kingdom | 3.2024029 | 0.3478835 | 9.2053886 | 0.0000000 |
Looking at the results of our regression model, we can reasonably conclude that some European countries have been much more effective than others in mitigating Covid-19 deaths. For all countries except Germany and Netherlands we can reject the null hypothesis and say that their efforts or lack their of have made a statistically significant difference in death rates amongst their whole population. We can conclude this because they are the only two countries with associated p-values greater than .05. Poland have been most effective in mitigating deaths, showing one fewer death on average per 100,000 people than the baseline of Belgium. However, the United Kingdom has been least effective of the ten countries showing about three more deaths per 100,000 people on average.
Now that we have shown which countries have been most effective in mitigating deaths, we can also show which stretegies have been most effective.
We can observe columns such as peak_bed_day_mean, peak_icu_bed_day_mean, peak_vent_day_mean, all_bed_capacity, all_bed_usage, and stay_at_home_start_date that may be interesting in our goal to pick out factors that could contribute to us predicting the level of hospitalization certain countries are currently at.
Our goal will be to create a regression model where we will use the summary dataframe columns as independent variables in order to see if we can predict the number of deaths per hundred thousand on a particular date, May 4, which is the most recent date in our Hospitalizations dataframe. We can then compare the regression results with the actual numbers reported in the Hospitalization data frame.
The first step will be to filter the columns we want from the summary data frame and combine them with specific columns of the Hospitalization data based on country name. We will do an inner join of the two dataframes.
#we would like to retain the hospitalization data for the dates 5/4 as our independent variable, and 3/29 as our dependent variable.
df = inner_join(Hospitalizations,Summary, by='location_name')%>%
filter(date == as.Date("2020-05-04") | date == as.Date("2020-03-29"))
keeps <- c("location_name", "peak_bed_day_mean", "peak_icu_bed_day_mean", "peak_vent_day_mean", "all_bed_capacity", "all_bed_usage", "stay_home_start_date", "any_gathering_restrict_start_date", "all_non-ess_business_start_date","DailyDeathsPerHundredThousand", "date")
# keeping relevant columns and adding a conditional column for if the country instituted a stay at home order
df = df[keeps]
df$stay_home_binary <- ifelse(is.na(df$stay_home_start_date), 0, 1)
df$restrict_gathering_binary <- ifelse(is.na(df$any_gathering_restrict_start_date), 0, 1)
df$non_ess_restriction_binary <- ifelse(is.na(df$'all_non-ess_business_start_date'), 0, 1)
df$bed_usage_percentage <- df$all_bed_usage / df$all_bed_capacity
# we then need to make the each date a column with each row's value for the column being the specific deaths per 100,000 per country. Tidying up the data so it is in the correct format for regression.
df$may4 <- ifelse(df$date == as.Date("2020-05-04"), df$DailyDeathsPerHundredThousand, NA)
df$march29 <- ifelse(df$date == as.Date("2020-03-29"), df$DailyDeathsPerHundredThousand, NA)
dfmay4 = df[!is.na(df$may4), ]
dfmay4 = dfmay4[c("location_name", "may4")]
dfmarch29 = df[!is.na(df$march29), ]
dfmarch29 = dfmarch29[c("location_name", "march29")]
keeps <- c("location_name", "peak_bed_day_mean", "peak_icu_bed_day_mean", "peak_vent_day_mean", "bed_usage_percentage", "stay_home_binary", "non_ess_restriction_binary", "restrict_gathering_binary")
df = df[keeps]
df = df[!duplicated(df$location_name), ]
temp <- inner_join(dfmarch29, dfmay4, by="location_name")
df <- inner_join(df, temp, by='location_name')
df
## # A tibble: 10 x 10
## location_name peak_bed_day_me~ peak_icu_bed_da~ peak_vent_day_m~
## <chr> <date> <date> <date>
## 1 Belgium 2020-04-11 2020-04-12 2020-04-11
## 2 France 2020-04-03 2020-04-04 2020-04-04
## 3 Germany 2020-04-16 2020-04-16 2020-04-16
## 4 Italy 2020-03-28 2020-03-29 2020-03-28
## 5 Netherlands 2020-04-04 2020-04-05 2020-04-05
## 6 Poland 2020-05-02 2020-05-03 2020-05-01
## 7 Spain 2020-03-30 2020-03-31 2020-03-30
## 8 Sweden 2020-05-31 2020-06-01 2020-05-31
## 9 Switzerland 2020-04-04 2020-04-06 2020-04-05
## 10 United Kingd~ 2020-04-11 2020-04-14 2020-04-13
## # ... with 6 more variables: bed_usage_percentage <dbl>,
## # stay_home_binary <dbl>, non_ess_restriction_binary <dbl>,
## # restrict_gathering_binary <dbl>, march29 <dbl>, may4 <dbl>
#After all that work cleaning up the data now we have our final dataframe to work with for the linear regression.
The next step is to create our linear model with the may 4th death per 100k as the independent variable we are trying to predict, and the dependent variables being all the other columns in this dataframe that could act as indicators for how a particular country is dealing with the virus. This should ideally let us know which strategies for fighting the virus have been most effective.
For our model we are establishing that the null hypothesis is that none of the factors included in this data frame are significantly effective in lowering the number of deaths per 100,000 people. Our p-value will be alpha = .05 for this experiment.
library(broom)
auto_fit <- lm(may4~bed_usage_percentage+stay_home_binary+restrict_gathering_binary+non_ess_restriction_binary+march29, data=df)
auto_fit
##
## Call:
## lm(formula = may4 ~ bed_usage_percentage + stay_home_binary +
## restrict_gathering_binary + non_ess_restriction_binary +
## march29, data = df)
##
## Coefficients:
## (Intercept) bed_usage_percentage
## -2.79444 4.14002
## stay_home_binary restrict_gathering_binary
## 0.66795 -0.15731
## non_ess_restriction_binary march29
## 0.08547 0.29323
We can use the tidy() and knitr::kable() R functions to view the data in a clean and coherent way. We will take a look at some of the summary statistics from the model we just created to determine which disease fighting stretegies have been most effective.
auto_fit_stats <- auto_fit %>%
tidy()
auto_fit_stats %>% knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -2.7944414 | 4.7164069 | -0.5924937 | 0.5853914 |
bed_usage_percentage | 4.1400177 | 5.5772843 | 0.7422999 | 0.4991288 |
stay_home_binary | 0.6679535 | 0.6924462 | 0.9646288 | 0.3893565 |
restrict_gathering_binary | -0.1573097 | 1.0705478 | -0.1469432 | 0.8902856 |
non_ess_restriction_binary | 0.0854695 | 0.9488579 | 0.0900761 | 0.9325568 |
march29 | 0.2932309 | 0.1265186 | 2.3176894 | 0.0813426 |
Our Null Hypothesis for this experiment is that the factors tested will not affect death rate on May 04, 2020. None of the factors tested showed a p-value greater than .05. With a p-value of this size, we cannot reject the null hypothesis for any of the 5 factors tested. The regression model shows that bed_usage, stay at home restrictions, gethering restorctions, non-essential business restrictions, and the death rate on March 29 are not statistically significant for predicting death rates on May 4.
Now let’s visualize the residuals of our linear model based on each country, and also the residuals vs. the fitted values.
auto_fit %>% ggplot(aes(x=factor(df$location_name), y=.resid)) + geom_boxplot() + labs(title="Residuals vs Country", x="Country", y="Residual")
auto_fit %>%
augment() %>%
ggplot(aes(x=.fitted,y=.resid)) + geom_point() + labs(title="Fitted vs. R^2",
x = "fitted",
y = "R^2")
As you can see from the first plot, the residuals for each country are all far from 0. This is with the exception of Italy, which has a residual much closer to 0 than the other countries. However, since Italy is the only country which follows the model, overall this would indicate that our model is not very accurate in terms of using bed_usage, restrictions, and death rate on March 29, to predict death rates on May 4. This is significant to our analysis of the linear model.
In addition to this plot, the second plot shows the residuals against the fitted values. We can see that the Fitted vs. R^2 dotplot does not seem to violate iid assumptions as the residual values appear to be independent from fitted values.
Through our models we can draw some conclusions, but some others are yet to be seen. We can say with 95% confidence that different countries have been varyingly effective in mitigating deaths. However, we cannot draw conclusions about what methods have actually been most effective. Through our second model we were unable to conclude that the decision to enforce certain restriction such as stay-at-home orders and non-essential business closures were statistically significant factors in mitigating deaths as of May 04. This implies a few different possibilities. The first of which is that not enough data has been collected in order to draw legitimate conclusion on the matter. As our model has only considered the data for ten countries, we cannot conclude this with certainty. More data will be needed to draw conclusions about the efficacy of governemnt restrictions on movement. Furthermore, the statistically significant differences between countries may based on factors that have either not been tested or are not quatifiable. These may include culture, genetics, lifestyle, etc. Another interesting factor to consider is that while countries may be enacting stay at home or non-essential business closure orders, some countries may be less strict than others with actually enforcing it.
Given more time and resources, it would be interesting to further pursue this data analysis by looking at more countries. Another interesting analysis would be to compare countries within continents (based on the assumption that countries of the same continent have somehwat more similar culture, genetics, and lifestyle), and analyze data from this perspective.
Works Cited:
“Population on 1 January.” Eurostat - Tables, Graphs and Maps Interface (TGM) Table, ec.europa.eu/eurostat/tgm/refreshTableAction.do?tab=table&plugin=1&pcode=tps00001&language=en.
“IHME: COVID-19 Projections.” Institute for Health Metrics and Evaluation, covid19.healthdata.org/united-states-of-america.
Mullan, Laura. “Top 10 Economies in Europe.” Europe, 5 Oct. 2018, europe.businesschief.com/leadership/2285/Top-10-economies-in-Europe.