# Numbers
compute_descriptive_stats(nh2007$gender)
compute_descriptive_stats(nh2007$education)
compute_descriptive_stats(nh2007$education_child)
compute_descriptive_stats(nh2007$asthma)
compute_descriptive_stats(nh2007$heart_failure)
compute_descriptive_stats(nh2007$coronary_heart_disease)
compute_descriptive_stats(nh2007$creatinine)
compute_descriptive_stats(nh2007$lead)
compute_descriptive_stats(nh2007$barium)
compute_descriptive_stats(nh2007$cadmium)
# Graph
compute_descriptive_graph(nh2007$creatinine)
compute_descriptive_graph(nh2007$lead)
compute_descriptive_graph(nh2007$barium)
compute_descriptive_graph(nh2007$cadmium)
Hands-on Activity 2
Introduction
Goal
At the end of this exercise, you should be able to produce an HTML document containing at least one table summarizing model results, using gt and quarto with the nh2007 dataset. If we have enough time, you can reproduce the analysis with the nh2009 dataset.
Tasks
- Improve and simplify the code using functional programming
- Create a QMD document to generate a report using Quarto
- Make a table using gt to summarize model results
- Continue to use Git to save your progress
Functional Programming
Descriptive Statistics
Let’s look at the script that we finished before the break. We created some functions that we now use for all the descriptive statistics. However, we still have to copy and paste the same call many times.
Exercise: Can you figure out how to use map to avoid copy-pasting the call?
The map function can take a data frame as its first argument and will apply the function to each column of the data frame:
purrr::map(.x = nh2007, .f = compute_descriptive_stats)
For the descriptive graphs, we can first use dplyr::select() to select the four columns of interest:

nh2007 |>
  dplyr::select(creatinine, lead, barium, cadmium) |>
  purrr::map(.f = compute_descriptive_graph)
map returns a list by default:

purrr::map(.x = nh2007, .f = compute_descriptive_stats) |>
  head(3)
$id
$id$mean
[1] 46562.61
$id$sd
[1] 2966.326
$id$quantiles
0% 25% 50% 75% 100%
41477.00 44060.75 46513.00 49178.75 51622.00
$gender
variable
1 2 <NA>
1022 1012 0
$age_screening
$age_screening$mean
[1] 47.41593
$age_screening$sd
[1] 19.2215
$age_screening$quantiles
0% 25% 50% 75% 100%
16 31 47 63 80
Lists are not so easy to work with because most functions are built to operate on dataframes. However, it's very easy to modify compute_table() and compute_numeric() to return statistics in a different format.
When the two functions return a dataframe, it will be possible to use dplyr::bind_rows() to bind all the elements of the list from map into a single dataframe.
Let’s first change the two compute functions and load them into our environment:
compute_table <- function(variable) {
  # Return frequency table as a dataframe
  table(variable, useNA = "always", dnn = "level") |>
    as.data.frame()
}

compute_numeric <- function(variable) {
  mean_value <- mean(variable, na.rm = TRUE)
  sd_value <- sd(variable, na.rm = TRUE)
  quantiles <- quantile(variable, na.rm = TRUE)

  # Return statistics as a dataframe
  cbind(
    data.frame(
      "mean" = mean_value,
      "sd" = sd_value
    ),
    t(quantiles)
  )
}
Then we can run the lines with map again:

purrr::map(.x = nh2007, .f = compute_descriptive_stats) |>
  dplyr::bind_rows(.id = "column") |>
  head(5)
         column        mean        sd    0%      25%   50%      75%  100% level Freq
1            id 46562.61406 2966.3259 41477 44060.75 46513 49178.75 51622  <NA>   NA
2        gender          NA        NA    NA       NA    NA       NA    NA     1 1022
3        gender          NA        NA    NA       NA    NA       NA    NA     2 1012
4        gender          NA        NA    NA       NA    NA       NA    NA  <NA>    0
5 age_screening    47.41593   19.2215    16    31.00    47    63.00    80  <NA>   NA
The result is already much better and easier to work with. Notice that we didn't have to change the compute_descriptive_stats() function, only the computation functions. This is because no calculations happen in compute_descriptive_stats(): it only calls the appropriate compute function for each type of variable.
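For reference, here is a minimal sketch of what such a dispatching function might look like (your version from the earlier activity may differ):

compute_descriptive_stats <- function(variable) {
  # Dispatch to the appropriate compute function based on the variable type
  if (is.numeric(variable)) {
    compute_numeric(variable)
  } else {
    compute_table(variable)
  }
}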
When modifying functions, it’s a good idea to commit the changes frequently so you can revert any unwanted changes if needed.
Jump back into Git-gui and commit the new changes to the functions.
Models
We can also use functional programming to simplify the creation of the models. For the 4 outcomes and the 4 exposures, the models are always the same:
# Creatinine
model.1a <- glm(asthma ~ barium + age_screening + gender, data = nh2007)
model.1b <- glm(heart_failure ~ barium + age_screening + gender, data = nh2007)
model.1c <- glm(coronary_heart_disease ~ barium + age_screening + gender, data = nh2007)
model.1d <- glm(heart_attack ~ barium + age_screening + gender, data = nh2007)
model.1e <- glm(asthma ~ barium + age_screening + gender, data = nh2007) # Notice that this model is identical to model.1a, but it's hard to notice
The custom function we create needs at least two arguments: one argument for the outcome and another for the exposure. It's also a good idea to add another argument to specify the dataset to use in the model.
map2 and pmap are variants of map that can iterate over two or many arguments simultaneously.
All the functions that we will create can be saved in R/models.R. Remember to add a line to source this file in your script.
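For example, assuming your main script runs from the project root, that line could simply be:

source("R/models.R")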
Exercise: Create a function to build the glm models. Use map2 to apply the function to all combinations of outcomes and exposures.

build_model <- function(outcome, exposure, dataset) {
  formula <- paste0(
    outcome, " ~ ",
    exposure, " + age_screening + gender"
  ) |>
    as.formula()
  try(
    glm(formula, data = dataset)
  )
}
# List outcomes
outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")

models_parameters <- tidyr::expand_grid(outcomes, exposures)

models <- map2(
  .x = models_parameters$outcomes,
  .y = models_parameters$exposures,
  .f = \(x, y) build_model(x, y, dataset = nh2007)
)
- try() is used to catch any errors when the models don't run for any reason. This allows map to continue instead of stopping.
- .f = \(x, y) build_model(x, y, dataset = nh2007) helps pass a common element to all the models. Here, it's the dataset that we want to use for the models. Of course, the dataset could also be passed directly by pmap to make the dataset vary, but it can be a little tricky.
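Purely as an illustration of that pmap variant (the models_parameters_multi name below is made up for this sketch), the dataset can be iterated over like any other argument:

# Hypothetical sketch: cross outcomes, exposures, and datasets
models_parameters_multi <- tidyr::expand_grid(
  outcomes,
  exposures,
  dataset = list(nh2007) # nh2009 could be added to this list later
)

models_multi <- purrr::pmap(
  .l = list(
    outcome = models_parameters_multi$outcomes,
    exposure = models_parameters_multi$exposures,
    dataset = models_parameters_multi$dataset
  ),
  .f = build_model
)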
Once again, all the models are in a list. It's easy to access a model using indexes like models[[1]], but it's a bit crude. It is possible to improve that by:
- Assigning names to each list element from the model parameters, so you can access models using the $ sign.
- Putting the models in a column of the model parameters dataframe (often a better solution), as sketched below.
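A quick sketch of both options (the element names below are only illustrative):

# Option 1: name the list elements so models can be accessed with $
names(models) <- paste(
  models_parameters$outcomes,
  models_parameters$exposures,
  sep = "_"
)
models$asthma_creatinine

# Option 2: store the models as a list column of the parameters dataframe
models_parameters |>
  dplyr::mutate(models = models)

Option 2 is essentially what we do below when consolidating everything into the results table.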
Extract model results
Exercise: Create a function to extract the estimates, CI, p-value, and AIC from a model. Use map to apply it to every model.
It can be difficult to create a function from scratch, especially when working with lists. It's sometimes easier to first work with one element of the list to test the function.

test_model <- models[[1]]
extract_model_result <- function(model) {
  # Get coefficients
  coefs <- coef(summary(model)) |>
    as.data.frame()
  # Confidence interval
  ci <- confint(model) |>
    as.data.frame()
  # AIC
  aic <- model$aic

  # Return a dataframe
  cbind(
    coefs,
    ci,
    aic
  ) |>
    rownames_to_column(var = "term")
}

# Extract model results
models_results <- map(models, extract_model_result)
The broom package is a very nice interface to reliably extract model results in a consistent way. It can easily replace this custom-made function: https://broom.tidymodels.org/
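As a brief sketch (assuming broom is installed): tidy() returns the per-term estimates and confidence intervals, while glance() returns model-level statistics such as the AIC.

# Per-term estimates, p-values, and confidence intervals
broom::tidy(test_model, conf.int = TRUE)

# Model-level statistics, including the AIC
broom::glance(test_model)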
Make a table using gt
Prepare the data
In this section, we will focus on creating just one table because the gt package offers many options, which could be the topic of an entire workshop. Feel free to experiment with it later if you have time, for example, to create a table for the descriptive statistics.
Before creating a gt table, it's best to have a clean table with only the information that we want to present. The first step is to consolidate the model parameters, models, and model results into one table.
results <- models_parameters |>
  dplyr::mutate(
    models = models,
    models_results = models_results
  )
Then we can unnest() the models_results column in the dataframe, and filter the terms from the model results to keep only the terms matching the exposure of the model.

results_short <- results |>
  unnest(models_results) |>
  dplyr::filter(exposures == term) |>
  select(-models)

head(results_short)
# A tibble: 6 × 10
outcomes exposures term Estimate `Std. Error` `t value` `Pr(>|t|)` `2.5 %`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 asthma creatini… crea… -1.67e-4 0.000111 -1.50 0.135 -3.85e-4
2 asthma lead lead 2.38e-3 0.00555 0.429 0.668 -8.49e-3
3 asthma barium bari… -3.03e-3 0.00212 -1.43 0.153 -7.18e-3
4 asthma cadmium cadm… 1.20e-3 0.0183 0.0654 0.948 -3.47e-2
5 heart_fai… creatini… crea… -2.77e-5 0.0000542 -0.511 0.609 -1.34e-4
6 heart_fai… lead lead -2.68e-3 0.00255 -1.05 0.293 -7.67e-3
# ℹ 2 more variables: `97.5 %` <dbl>, aic <dbl>
Now the table is simple, and it's possible to create a gt table and handle all the formatting.
Create the gt table
To create the table, follow these steps:
- Initialize the gt table object
library(gt)
gt_table <- gt(
  data = results_short,
  groupname_col = "outcomes"
)

gt_table
exposures | term | Estimate | Std. Error | t value | Pr(>|t|) | 2.5 % | 97.5 % | aic |
---|---|---|---|---|---|---|---|---|
asthma | ||||||||
creatinine | creatinine | -1.666218e-04 | 1.114468e-04 | -1.49507911 | 0.135049228 | -0.0003850536 | 5.180995e-05 | 1802.2351 |
lead | lead | 2.379870e-03 | 5.548226e-03 | 0.42894247 | 0.668010610 | -0.0084944533 | 1.325419e-02 | 1804.2892 |
barium | barium | -3.028811e-03 | 2.117844e-03 | -1.43013881 | 0.152831073 | -0.0071797093 | 1.122087e-03 | 1802.4252 |
cadmium | cadmium | 1.198378e-03 | 1.831328e-02 | 0.06543767 | 0.947831976 | -0.0346949935 | 3.709175e-02 | 1804.4692 |
heart_failure | ||||||||
creatinine | creatinine | -2.772027e-05 | 5.424036e-05 | -0.51106355 | 0.609367443 | -0.0001340294 | 7.858888e-05 | -1270.1983 |
lead | lead | -2.676537e-03 | 2.545335e-03 | -1.05154615 | 0.293144934 | -0.0076653009 | 2.312227e-03 | -1271.0444 |
barium | barium | -4.963558e-04 | 9.896629e-04 | -0.50154023 | 0.616050549 | -0.0024360595 | 1.443348e-03 | -1270.1887 |
cadmium | cadmium | 2.474920e-03 | 8.393849e-03 | 0.29484929 | 0.768142044 | -0.0139767213 | 1.892656e-02 | -1270.0237 |
coronary_heart_disease | ||||||||
creatinine | creatinine | -4.562735e-05 | 5.605650e-05 | -0.81395298 | 0.415776468 | -0.0001554961 | 6.424136e-05 | -1147.8785 |
lead | lead | -3.041374e-03 | 2.630681e-03 | -1.15611644 | 0.247782569 | -0.0081974150 | 2.114667e-03 | -1148.5536 |
barium | barium | -4.016007e-04 | 1.022938e-03 | -0.39259548 | 0.694663436 | -0.0024065216 | 1.603320e-03 | -1147.3691 |
cadmium | cadmium | 1.448436e-02 | 8.669517e-03 | 1.67072316 | 0.094945244 | -0.0025075783 | 3.147630e-02 | -1150.0099 |
heart_attack | ||||||||
creatinine | creatinine | -5.655582e-06 | 6.140079e-05 | -0.09210927 | 0.926621182 | -0.0001259989 | 1.146878e-04 | -809.6718 |
lead | lead | -3.008399e-03 | 2.881167e-03 | -1.04415985 | 0.296547637 | -0.0086553836 | 2.638585e-03 | -810.7556 |
barium | barium | -1.040694e-03 | 1.120050e-03 | -0.92914986 | 0.352932446 | -0.0032359522 | 1.154563e-03 | -810.5283 |
cadmium | cadmium | 2.642442e-02 | 9.481671e-03 | 2.78689532 | 0.005375826 | 0.0078406907 | 4.500816e-02 | -817.4306 |
- Format numbers
gt_table <- gt_table |>
  fmt_number(
    columns = c("Estimate", "2.5 %", "97.5 %"),
    decimals = 3
  ) |>
  fmt(
    columns = "Pr(>|t|)",
    fns = \(x) format.pval(x, digits = 3)
  )
- Merge and hide columns
gt_table <- gt_table |>
  cols_merge(
    columns = c("2.5 %", "97.5 %"),
    pattern = "{1} - {2}"
  ) |>
  cols_hide(
    columns = c("Std. Error", "t value", "term")
  ) |>
  cols_label("2.5 %" = "95% CI")
- Add table title and subtitle
gt_table <- gt_table %>%
  tab_header(
    title = "Models results",
    subtitle = "data = nh2007"
  )

gt_table
Models results | ||||
---|---|---|---|---|
data = nh2007 | ||||
exposures | Estimate | Pr(>|t|) | 95% CI | aic |
asthma | ||||
creatinine | 0.000 | 0.13505 | 0.000 - 0.000 | 1802.2351 |
lead | 0.002 | 0.66801 | −0.008 - 0.013 | 1804.2892 |
barium | −0.003 | 0.15283 | −0.007 - 0.001 | 1802.4252 |
cadmium | 0.001 | 0.94783 | −0.035 - 0.037 | 1804.4692 |
heart_failure | ||||
creatinine | 0.000 | 0.60937 | 0.000 - 0.000 | -1270.1983 |
lead | −0.003 | 0.29314 | −0.008 - 0.002 | -1271.0444 |
barium | 0.000 | 0.61605 | −0.002 - 0.001 | -1270.1887 |
cadmium | 0.002 | 0.76814 | −0.014 - 0.019 | -1270.0237 |
coronary_heart_disease | ||||
creatinine | 0.000 | 0.41578 | 0.000 - 0.000 | -1147.8785 |
lead | −0.003 | 0.24778 | −0.008 - 0.002 | -1148.5536 |
barium | 0.000 | 0.69466 | −0.002 - 0.002 | -1147.3691 |
cadmium | 0.014 | 0.09495 | −0.003 - 0.031 | -1150.0099 |
heart_attack | ||||
creatinine | 0.000 | 0.92662 | 0.000 - 0.000 | -809.6718 |
lead | −0.003 | 0.29655 | −0.009 - 0.003 | -810.7556 |
barium | −0.001 | 0.35293 | −0.003 - 0.001 | -810.5283 |
cadmium | 0.026 | 0.00538 | 0.008 - 0.045 | -817.4306 |
All these steps in the gt pipeline can be grouped into one function, for example gt_models(), with only results_clean as an argument. This means that you can easily reapply the function to each results table and get an identically formatted table!
Exercise: Create this function and add it to R/gt_models.R.
gt_models <- function(results_clean) {
  library(gt)
  # 1. Initialize the gt table object
  gt(
    data = results_clean,
    groupname_col = "outcomes"
  ) |>
    # 2. Format numbers
    fmt_number(
      columns = c("Estimate", "2.5 %", "97.5 %"),
      decimals = 3
    ) |>
    fmt(
      columns = "Pr(>|t|)",
      fns = \(x) format.pval(x, digits = 3)
    ) |>
    # 3. Merge and hide columns
    cols_merge(
      columns = c("2.5 %", "97.5 %"),
      pattern = "{1} - {2}"
    ) |>
    cols_hide(
      columns = c("Std. Error", "t value", "term")
    ) |>
    cols_label("2.5 %" = "95% CI") |>
    # 4. Add table title and subtitle
    tab_header(
      title = "Models results"
    )
}
Remember to commit this new function and to source it in your main script.
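For example, regenerating the 2007 table then becomes a single call:

gt_models(results_short)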
Generate the report with quarto
Now that we have many functions and a nice table, we can put them in a Quarto document to generate either HTML or DOCX documents.
In RStudio, create a new file: qmd/report.qmd
In the report, we should include a YAML header defining some execution parameters:
---
title: "NHANES Report"
author: "Your Name"
date: "2024-05-31"
format: html # or docx
---
R code can then be executed within chunks. The first chunk should load all the packages and source all your functions:
```{r}
# Load packages
library(tidyverse)
source("../R/descriptive.R")
source("../R/models.R")
source("../R/gt_models.R")
# .. is needed to refer to the project root
# Alternative using the here package (requires library(here))
source(here("R/descriptive.R"))
source(here("R/models.R"))
source(here("R/gt_models.R"))
```
In between each chunk, you can add text to describe what you did, or even write some method points that you need to remember for later. This text is written in Markdown.
The following chunks can then contain all the lines that use the map functions.
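For instance, a minimal descriptive-statistics chunk could look like this sketch (it assumes the data has been loaded in an earlier chunk, e.g. with load(here("data/nh2007.RData"))):

```{r}
# Descriptive statistics for every column
purrr::map(.x = nh2007, .f = compute_descriptive_stats) |>
  dplyr::bind_rows(.id = "column")
```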
Exercise: Convert your starting script into a Quarto document.
- Create the file qmd/report.qmd
- Open it and add the YAML header and a chunk to load all the packages and source your functions.
- Add as many chunks as you wish for each step of the analysis.
You can check the following file to see an example: report example
2009 report
Now that the functions are all created, adding the analysis for the nh2009 dataset is simple, as it only requires applying the same functions to another dataset.
The report can be divided into two sections:
# 2007
... 2007 analysis
# 2009
... 2009 analysis
The only thing that needs to be copied and pasted is the data management part. The script for the 2009 analysis can look like this:
load(here("data/nh2009.RData"))

# Descriptive stats
purrr::map(.x = nh2009, .f = compute_descriptive_stats) |>
  dplyr::bind_rows(.id = "column")

# List outcomes
outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")

models_parameters_2009 <- tidyr::expand_grid(outcomes, exposures)

models_2009 <- map2(
  .x = models_parameters_2009$outcomes,
  .y = models_parameters_2009$exposures,
  .f = \(x, y) build_model(x, y, dataset = nh2009) # we need to change the dataset
)

# Extract model results
models_results_2009 <- map(models_2009, extract_model_result)

results_2009 <- models_parameters_2009 |>
  dplyr::mutate(
    models = models_2009,
    models_results = models_results_2009
  )

results_model_clean_2009 <- results_2009 |>
  unnest(models_results) |>
  dplyr::filter(exposures == term) |>
  select(-models)

gt_models(results_model_clean_2009)
Exercise: Integrate these lines into the report document.
Check the following files to see the final state of the project:
- report_example.qmd, which contains the computations and results. Render it by clicking "Render" in RStudio or by running `quarto render qmd/report_example.qmd`. The rendered report can be seen here: report example
- R/descriptive.R, R/models.R, and R/gt_models.R, which contain all the function definitions