# Numbers
compute_descriptive_stats(nh2007$gender)
compute_descriptive_stats(nh2007$education)
compute_descriptive_stats(nh2007$education_child)
compute_descriptive_stats(nh2007$asthma)
compute_descriptive_stats(nh2007$heart_failure)
compute_descriptive_stats(nh2007$coronary_heart_disease)
compute_descriptive_stats(nh2007$creatinine)
compute_descriptive_stats(nh2007$lead)
compute_descriptive_stats(nh2007$barium)
compute_descriptive_stats(nh2007$cadmium)
# Graph
compute_descriptive_graph(nh2007$creatinine)
compute_descriptive_graph(nh2007$lead)
compute_descriptive_graph(nh2007$barium)
compute_descriptive_graph(nh2007$cadmium)Hands-on Activity 2
Introduction
Goal
At the end of this exercise, you should be able to produce an HTML document containing at least one table to summarize model results using gt and quarto with the nh2007 dataset. If we have enough time, you can reproduce the analysis with the nh2009 dataset.
Tasks
- Improve and simplify the code using functional programming
- Create a QMD document to generate a report using Quarto
- Make a table using
gtto summarize model results - Continue to use Git to save your progress
Functional Programming
First use of map()
Let’s look at the script that we finished before the break. We created some functions that we now use for all the descriptive statistics. However, we still have to copy and paste the same line many times.
Exercise: Try to use map(), from the purrr:: package, to avoid copy pasting?
map() can take a data frame as its first argument .x and it will apply the function to each column of the data frame:
purrr::map(.x = nh2007, .f = compute_descriptive_stats)This will output a list which look like this:
If we replace the .f argument by another function, like compute_numeric() or compute_descriptive_graph() this will work but, only if the column is of the right type. The following will fail, as mean cannot be calculated for categorical variables:
purrr::map(.x = nh2007, .f = compute_numeric)To make it work, we need to restrict to a subset of columns using dplyr::select():
nh2007 |>
# Retrict to numeric variables
dplyr::select(dplyr::where(fn = is.numeric)) |>
purrr::map(.f = compute_numeric)
nh2007 |>
# Restrict to variables of interest for graphs
dplyr::select(creatinine, lead, barium, cadmium) |>
purrr::map(.f = compute_descriptive_graph)The list that map() return are not so easy to work with because functions are mainly build to use a dataframe. However it’s very easy to modify compute_table() and compute_numeric() to return the results in a different format.
When the two functions return a dataframe, it’s will be possible to use dplyr::bind_rows() to bind all the elements of the list from map into a single dataframe.
Let’s modify the two compute functions:
compute_table <- function(variable) {
# Return frequency table as a dataframe
table(variable, useNA = "always", dnn = "level") |>
as.data.frame()
}
compute_numeric <- function(variable) {
mean_value <- mean(variable, na.rm = TRUE)
sd_value <- sd(variable, na.rm = TRUE)
quantiles <- quantile(variable, na.rm = TRUE)
# Return statistics as a dataframe
cbind(
data.frame(
"mean" = mean_value,
"sd" = sd_value
),
t(quantiles)
)
}Save the function file, and source it to load the updated version of the function into our environment. Then we can try the following lines and combine map() results with bind_rows():
purrr::map(.x = nh2007, .f = compute_descriptive_stats) |>
dplyr::bind_rows(.id = "column") |>
head(5) column mean sd 0% 25% 50% 75% 100% level
1 id 46562.61406 2966.3259 41477 44060.75 46513 49178.75 51622 <NA>
2 gender NA NA NA NA NA NA NA 1
3 gender NA NA NA NA NA NA NA 2
4 gender NA NA NA NA NA NA NA <NA>
5 age_screening 47.41593 19.2215 16 31.00 47 63.00 80 <NA>
Freq
1 NA
2 1022
3 1012
4 0
5 NA
The result is already much better and easier to work with. Notice that we didn’t had to change the compute_descriptive_stats() function, only the computation functions, compute_table() and compute_numeric(). This is because no calculations are happening in compute_descriptive_stats(), it is only calling the appropriate compute function for each type of variable.
We have done some modification to our function it’s a good idea to commit the changes so you can revert to the previous version of the function if needed.
Jump back in Gitnuro, verify the modifications you just introduce, stage them, find a good name for the commit and press the commit button 😊
Create models with map()
We can also use functional programming to simplify the creation of the models. For the 4 outcomes and the 4 exposures, the models are always the same:
# Creatinine
model.1a <- glm(asthma ~ barium + age_screening + gender, data = nh2007)
model.1.b <- glm(heart_failure ~ barium + age_screening + gender, data = nh2007)
model.1.c <- glm(coronary_heart_disease ~ barium + age_screening + gender, data = nh2007)
model.1.d <- glm(heart_attack ~ barium + age_screening + gender, data = nh2007)
model.1.e <- glm(asthma ~ barium + age_screening + gender, data = nh2007) # Notice that this model is identical to 1a, but it's hard to noticeWe can create a new function! This function need to have at least 2 arugments: one argument for the outcome and another for the exposure. It’s also a good idea to add another argument to specify the dataset to use in the model. The function can look like that .
map2() and pmap() are variant of map() that can iterate over two or many arguments simultaneously.
All the functions that we will create for the models can be saved in a new file R/models.R. This will help to quickly find functions based on their type when we need to modify them. Remember to add a line to source this file in the starting script.
Exercise: Create a function to build the glm models. Use map2() to apply the function to all combinations of outcomes and exposures. Here are some useful pieces of code that you can use:
By looking at the code, we can see that the outcomes and exposures are the
outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")We can create a formula using paste() and as.formula():
formula <- paste0("asthma", " ~ ", "creatinine", " + age_screening + gender" ) |>
as.formula()
formulaasthma ~ creatinine + age_screening + gender
tidyr::expand_grid() creates a table from combinations of vectors. You can use it to get all the combinations of outcomes and exposures.
tidyr::expand_grid(outcomes, exposures)# A tibble: 16 × 2
outcomes exposures
<chr> <chr>
1 asthma creatinine
2 asthma lead
3 asthma barium
4 asthma cadmium
5 heart_failure creatinine
6 heart_failure lead
7 heart_failure barium
8 heart_failure cadmium
9 coronary_heart_disease creatinine
10 coronary_heart_disease lead
11 coronary_heart_disease barium
12 coronary_heart_disease cadmium
13 heart_attack creatinine
14 heart_attack lead
15 heart_attack barium
16 heart_attack cadmium
With these three things, you can create a table containing the combination of outcomes and exposures and create a function that take every row of the table to build all the models from it.
build_model <- function(outcome, exposure, dataset) {
formula <- paste0(
outcome, " ~ ",
exposure, " + age_screening + gender"
) |>
as.formula()
try(
glm(formula, data = dataset)
)
}
# List outcomes
outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")
models_parameters <- tidyr::expand_grid(outcomes, exposures)
models <- map2(
.x = models_parameters$outcomes,
.y = models_parameters$exposures,
.f = function(x, y) build_model(x, y, dataset = nh2007)
)try()is used to catch any errors when the models don’t run for any reason. This allowsmapto continue instead of stopping..f = function(x, y) build_model(x, y, dataset = nh2007)is an anonymous function. It helps pass a common element to all the models. Here, it’s the dataset that we want to use for the models. Of course, this dataset can also be passed directly bypmapto make the dataset vary, or written in the build model function, but it can be a little tricky.
Once again, all the models are in a list. It’s easy to access a model using indexes like models[[1]], but it’s a bit crude. It is possible to improve that by:
- Assigning names to each list element from the model parameters, so you can access models using a
$sign. - Putting the models in a column in the models parameters dataframe (often a better solution).
This is a big modification, take some time to move build_model() to the R/models.R file, rewrite the code in the starting script and commit the changes in Gitnuro!
Extract models results with map()
We have seen how to use the map() function family to loop through columns of a dataframe and to create many models from a dataframe. Their use is not limited to that. It’s possible to extract the results of many models too.
In the starting-script.R, many line just print the summary of each model. A simple solution would be to replace them with the following code:
map(models, summary)This will print the summary of all the models created at the previous steps. The inconvenient of the summary function is that it’s not easy to export as a table.
Exercice: Create a function to extract estimates, CI, p-value and AIC from a model. Use map to apply it to every model.
It can be difficult to create a function from scratch, especially when working with lists. It’s sometimes easier to first work with one element of the list to test the function.
test_model <- models[[1]]Check the following functions to help you:
coef(summary(test_model)) Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8584448409 0.0321187539 57.861673 0.0000000000
creatinine -0.0001666218 0.0001114468 -1.495079 0.1350492277
age_screening 0.0012687634 0.0004504500 2.816658 0.0048995453
gender2 -0.0618565055 0.0171497804 -3.606840 0.0003174377
confint(test_model)Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 1.7954932401 1.921396e+00
creatinine -0.0003850536 5.180995e-05
age_screening 0.0003858977 2.151629e-03
gender2 -0.0954694575 -2.824355e-02
test_model$aic[1] 1802.235
cbind(confint(test_model), test_model$aic)Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 1.7954932401 1.921396e+00 1802.235
creatinine -0.0003850536 5.180995e-05 1802.235
age_screening 0.0003858977 2.151629e-03 1802.235
gender2 -0.0954694575 -2.824355e-02 1802.235
extract_model_result <- function(model) {
# Get coefficients
coefs <- coef(summary(model)) |>
as.data.frame()
# confidence interval
ci <- confint(model) |>
as.data.frame()
# AIC
aic <- model$aic
# Return a dataframe
cbind(
coefs,
ci,
aic
) |>
# Get model variables in a column
rownames_to_column(var = "term")
}
# Test the function with one model
extract_model_result(test_model) term Estimate Std. Error t value Pr(>|t|) 2.5 %
1 (Intercept) 1.8584448409 0.0321187539 57.861673 0.0000000000 1.7954932401
2 creatinine -0.0001666218 0.0001114468 -1.495079 0.1350492277 -0.0003850536
3 age_screening 0.0012687634 0.0004504500 2.816658 0.0048995453 0.0003858977
4 gender2 -0.0618565055 0.0171497804 -3.606840 0.0003174377 -0.0954694575
97.5 % aic
1 1.921396e+00 1802.235
2 5.180995e-05 1802.235
3 2.151629e-03 1802.235
4 -2.824355e-02 1802.235
# Extract all model results
models_results <- map(models, extract_model_result)The broom: package is a very nice interface to reliably extract models results in a consistent way. The tidy() function can easily replace this custom-made function: https://broom.tidymodels.org/
It’s time to do another commit! Move extract_model_result() to the R/models.R file, rewrite the code in the starting script and commit the changes!
Present the model results with a table using gt
In this section, we will focus on creating just one table because the gt package offers many options, which could be the topic of an entire workshop.
The gt package is a powerful tool for creating presentation-ready tables in R, allowing for extensive customization and styling. With gt, you can easily format numbers, add headers and footers, apply conditional formatting, and even integrate visual elements directly into your tables. Unlike manual formatting in Excel, which is time-consuming, `gt automates the process, ensuring consistency and saving time.
Feel free to experiment with it later if you have time, for example, to create a table for the descriptive statistics.
Prepare the data
Before creating a gt table, it’s best to have a clean table with only the information that we want to present. The first step is to clean the model parameters, models, and model results into one table.
# Add models and results to the models_parameter dataframe
results <- models_parameters |>
dplyr::mutate(
models = models,
models_results = models_results
)Then we can unnest() the models_results column in the dataframe, and filter the terms from the model results to keep only the terms matching the exposure of the model.
results_short <- results |>
unnest(models_results) |>
dplyr::filter(exposures == term) |>
select(-models)
head(results_short)# A tibble: 6 × 10
outcomes exposures term Estimate `Std. Error` `t value` `Pr(>|t|)` `2.5 %`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 asthma creatini… crea… -1.67e-4 0.000111 -1.50 0.135 -3.85e-4
2 asthma lead lead 2.38e-3 0.00555 0.429 0.668 -8.49e-3
3 asthma barium bari… -3.03e-3 0.00212 -1.43 0.153 -7.18e-3
4 asthma cadmium cadm… 1.20e-3 0.0183 0.0654 0.948 -3.47e-2
5 heart_fai… creatini… crea… -2.77e-5 0.0000542 -0.511 0.609 -1.34e-4
6 heart_fai… lead lead -2.68e-3 0.00255 -1.05 0.293 -7.67e-3
# ℹ 2 more variables: `97.5 %` <dbl>, aic <dbl>
Now the table is simpler. Let’s create a gt table and apply all the formating
Create the gt table
To create the table, follow these steps:
- Initialize the gt table object
library(gt)
gt_table <- gt(
data = results_short,
groupname_col = "outcomes"
)- Format numbers
gt_table <- gt_table |>
fmt_number(
columns = c("Estimate", "2.5 %", "97.5 %"),
decimals = 3
) |>
fmt(
columns = "Pr(>|t|)",
fns = function(x) format.pval(x, digits = 3)
)- Merge and hide columns
gt_table <- gt_table |>
# Merge confidence interval values together
cols_merge(
columns = c("2.5 %", "97.5 %"),
pattern = "{1} - {2}"
) |>
# Hide non essential columns
cols_hide(
columns = c("Std. Error", "t value", "term", "aic")
) |>
# Change column label
cols_label("2.5 %" = "95% CI")- Add table title and subtitle
gt_table <- gt_table %>%
tab_header(
title = "Models results",
subtitle = "data = nh2007"
)
gt_table| Models results | |||
| data = nh2007 | |||
| exposures | Estimate | Pr(>|t|) | 95% CI |
|---|---|---|---|
| asthma | |||
| creatinine | 0.000 | 0.13505 | 0.000 - 0.000 |
| lead | 0.002 | 0.66801 | −0.008 - 0.013 |
| barium | −0.003 | 0.15283 | −0.007 - 0.001 |
| cadmium | 0.001 | 0.94783 | −0.035 - 0.037 |
| heart_failure | |||
| creatinine | 0.000 | 0.60937 | 0.000 - 0.000 |
| lead | −0.003 | 0.29314 | −0.008 - 0.002 |
| barium | 0.000 | 0.61605 | −0.002 - 0.001 |
| cadmium | 0.002 | 0.76814 | −0.014 - 0.019 |
| coronary_heart_disease | |||
| creatinine | 0.000 | 0.41578 | 0.000 - 0.000 |
| lead | −0.003 | 0.24778 | −0.008 - 0.002 |
| barium | 0.000 | 0.69466 | −0.002 - 0.002 |
| cadmium | 0.014 | 0.09495 | −0.003 - 0.031 |
| heart_attack | |||
| creatinine | 0.000 | 0.92662 | 0.000 - 0.000 |
| lead | −0.003 | 0.29655 | −0.009 - 0.003 |
| barium | −0.001 | 0.35293 | −0.003 - 0.001 |
| cadmium | 0.026 | 0.00538 | 0.008 - 0.045 |
All these steps in the gt pipeline can be grouped into one function, for example, gt_models(), with only a results table as an argument. This means that you can easily apply the function to new model results and get a identical table!
Exercice: Create this function, add it to R/gt_models.R and add a line to source the file in your maim script.
gt_models <- function(results_clean) {
library(gt)
# 1. Initialize the gt table object
gt(
data = results_clean,
groupname_col = "outcomes"
) |>
# 2. Format numbers
fmt_number(
columns = c("Estimate", "2.5 %", "97.5 %"),
decimals = 3
) |>
fmt(
columns = "Pr(>|t|)",
fns = \(x) format.pval(x, digits = 3)
) |>
# 3. Merge and hide columns
cols_merge(
columns = c("2.5 %", "97.5 %"),
pattern = "{1} - {2}"
) |>
cols_hide(
columns = c("Std. Error", "t value", "term", "aic")
) |>
cols_label("2.5 %" = "95% CI") |>
# 4. Add table title and subtitle
tab_header(
title = "Models results"
)
}
# Testing the function
gt_models(results_short)| Models results | |||
| exposures | Estimate | Pr(>|t|) | 95% CI |
|---|---|---|---|
| asthma | |||
| creatinine | 0.000 | 0.13505 | 0.000 - 0.000 |
| lead | 0.002 | 0.66801 | −0.008 - 0.013 |
| barium | −0.003 | 0.15283 | −0.007 - 0.001 |
| cadmium | 0.001 | 0.94783 | −0.035 - 0.037 |
| heart_failure | |||
| creatinine | 0.000 | 0.60937 | 0.000 - 0.000 |
| lead | −0.003 | 0.29314 | −0.008 - 0.002 |
| barium | 0.000 | 0.61605 | −0.002 - 0.001 |
| cadmium | 0.002 | 0.76814 | −0.014 - 0.019 |
| coronary_heart_disease | |||
| creatinine | 0.000 | 0.41578 | 0.000 - 0.000 |
| lead | −0.003 | 0.24778 | −0.008 - 0.002 |
| barium | 0.000 | 0.69466 | −0.002 - 0.002 |
| cadmium | 0.014 | 0.09495 | −0.003 - 0.031 |
| heart_attack | |||
| creatinine | 0.000 | 0.92662 | 0.000 - 0.000 |
| lead | −0.003 | 0.29655 | −0.009 - 0.003 |
| barium | −0.001 | 0.35293 | −0.003 - 0.001 |
| cadmium | 0.026 | 0.00538 | 0.008 - 0.045 |
Remember to commit the new changes.
Generate a report with quarto
2007 report
Now we can put the nice table in a Quarto document to generate directly HTML or DOCX documents.
In RStudio create a new file: qmd/report.qmd In the report we should include a yaml header defining some execution parameters:
---
title: "NHANES Report"
author: "Your Name"
date: "date"
output: html_document # or docx
---R code can then be executed within chunks. The first chunk should load all the packages and source all the functions:
```{r}
# Load packages
library(tidyverse)
library(here)
# Loading the functions
# here() gets the location of the project
source(here("R/descriptive.R"))
source(here("R/models.R"))
source(here("R/gt_models.R"))
# Alternative is to use .. to refer to the project root
# source("../R/descriptive.R")
# source("../R/models.R")
# source("../R/gt_models.R")
# .. refer to the project root (one folder above the current one)
```In between each chunk, you can add text to describe what you did, or even write some method points that you need to remember for later. The syntax in this part is done in markdown.
The following chunks can contain all the lines that include map functions.
Exercise: Convert your starting script into a Quarto document.
- Create the file
qmd/report.qmd - Open it and add the yaml header and a chunck to load all the package and library.
- Add as many chuncks as you wish for each step of the analysis.
You can check the following file to see a example: report example
2009 report
This report is only based on the 2007 data. Since the 2009 data in the nh2009 dataset is very similar, we can reuse the functions with it. It’s very easy to replicate the analysis with new data: it only requires applying the same functions to the new dataset.
Let’s create a new section in the report:
# 2007
... 2007 analysis
# 2009
... 2009 analysisThe script for the 2009 analysis can look like this:
load(here("data/nh2009.RData"))
# Descriptive stats
purrr::map(.x = nh2009, .f = compute_descriptive_stats) |>
dplyr::bind_rows(.id = "column")
# List outcomes
outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")
models_parameters_2009 <- tidyr::expand_grid(outcomes, exposures)
models_2009 <- map2(
.x = models_parameters_2009$outcomes,
.y = models_parameters_2009$exposures,
.f = \(x, y) build_model(x, y, dataset = nh2009) # we need to change the dataset
)
# Extract model results
models_results_2009 <- map(models_2009, extract_model_result)
results_2009 <- models_parameters_2009 |>
dplyr::mutate(
models = models_2009,
models_results = models_results_2009
)
results_model_clean_2009 <- results_2009 |>
unnest(models_results) |>
dplyr::filter(exposures == term) |>
select(-models)
gt_models(results_model_clean_2009)Exercice: Integrate the following lines in the report document, and render the document (Button in RStudio toolbar).
Congratulation, you finished the whole workshop!
You can check the following files to see the final files of the project:
- report_example.qmd which contains the code of the report. Click on “Render” in RStudio or type
quarto render qmd/report_example.qmdin the terminal. The rendered report can be seen here: report example - R/descriptive.R, R/models.R, and R/gt_models.R to see all the functions definitions
{fig-align=“center”)