Hands-on Activity 2

Published

February 3, 2026

Introduction

Goal

At the end of this exercise, you should be able to produce an HTML document containing at least one table to summarize model results using gt and quarto with the nh2007 dataset. If we have enough time, you can reproduce the analysis with the nh2009 dataset.

Tasks

  • Improve and simplify the code using functional programming
  • Create a QMD document to generate a report using Quarto
  • Make a table using gt to summarize model results
  • Continue to use Git to save your progress

Functional Programming

First use of map()

Let’s look at the script that we finished before the break. We created some functions that we now use for all the descriptive statistics. However, we still have to copy and paste the same line many times.

# Numbers
compute_descriptive_stats(nh2007$gender)
compute_descriptive_stats(nh2007$education)
compute_descriptive_stats(nh2007$education_child)
compute_descriptive_stats(nh2007$asthma)
compute_descriptive_stats(nh2007$heart_failure)
compute_descriptive_stats(nh2007$coronary_heart_disease)
compute_descriptive_stats(nh2007$creatinine)
compute_descriptive_stats(nh2007$lead)
compute_descriptive_stats(nh2007$barium)
compute_descriptive_stats(nh2007$cadmium)


# Graph
compute_descriptive_graph(nh2007$creatinine)
compute_descriptive_graph(nh2007$lead)
compute_descriptive_graph(nh2007$barium)
compute_descriptive_graph(nh2007$cadmium)
ImportantExercice

Exercise: Try to use map(), from the purrr:: package, to avoid copy pasting?

map() can take a data frame as its first argument .x and it will apply the function to each column of the data frame:

purrr::map(.x = nh2007, .f = compute_descriptive_stats)

This will output a list which look like this:

If we replace the .f argument by another function, like compute_numeric() or compute_descriptive_graph() this will work but, only if the column is of the right type. The following will fail, as mean cannot be calculated for categorical variables:

purrr::map(.x = nh2007, .f = compute_numeric)

To make it work, we need to restrict to a subset of columns using dplyr::select():

nh2007 |>
  # Retrict to numeric variables
  dplyr::select(dplyr::where(fn = is.numeric)) |>
  purrr::map(.f = compute_numeric)

nh2007 |>
  # Restrict to variables of interest for graphs
  dplyr::select(creatinine, lead, barium, cadmium) |>
  purrr::map(.f = compute_descriptive_graph)

The list that map() return are not so easy to work with because functions are mainly build to use a dataframe. However it’s very easy to modify compute_table() and compute_numeric() to return the results in a different format.

When the two functions return a dataframe, it’s will be possible to use dplyr::bind_rows() to bind all the elements of the list from map into a single dataframe.

Let’s modify the two compute functions:

compute_table <- function(variable) {
  # Return frequency table as a dataframe
  table(variable, useNA = "always", dnn = "level") |>
    as.data.frame() 
}

compute_numeric <- function(variable) {
  mean_value <- mean(variable, na.rm = TRUE)
  sd_value <- sd(variable, na.rm = TRUE)
  quantiles <- quantile(variable, na.rm = TRUE)

  # Return statistics as a dataframe
  cbind(
    data.frame(
      "mean" = mean_value,
      "sd" = sd_value
    ),
    t(quantiles)
  )
}

Save the function file, and source it to load the updated version of the function into our environment. Then we can try the following lines and combine map() results with bind_rows():

purrr::map(.x = nh2007, .f = compute_descriptive_stats) |>
  dplyr::bind_rows(.id = "column") |>
  head(5)
         column        mean        sd    0%      25%   50%      75%  100% level
1            id 46562.61406 2966.3259 41477 44060.75 46513 49178.75 51622  <NA>
2        gender          NA        NA    NA       NA    NA       NA    NA     1
3        gender          NA        NA    NA       NA    NA       NA    NA     2
4        gender          NA        NA    NA       NA    NA       NA    NA  <NA>
5 age_screening    47.41593   19.2215    16    31.00    47    63.00    80  <NA>
  Freq
1   NA
2 1022
3 1012
4    0
5   NA

The result is already much better and easier to work with. Notice that we didn’t had to change the compute_descriptive_stats() function, only the computation functions, compute_table() and compute_numeric(). This is because no calculations are happening in compute_descriptive_stats(), it is only calling the appropriate compute function for each type of variable.

Tip

We have done some modification to our function it’s a good idea to commit the changes so you can revert to the previous version of the function if needed.

Jump back in Gitnuro, verify the modifications you just introduce, stage them, find a good name for the commit and press the commit button 😊

Create models with map()

We can also use functional programming to simplify the creation of the models. For the 4 outcomes and the 4 exposures, the models are always the same:

# Creatinine
model.1a <- glm(asthma ~ barium + age_screening + gender, data = nh2007)
model.1.b <- glm(heart_failure ~ barium + age_screening + gender, data = nh2007)
model.1.c <- glm(coronary_heart_disease ~ barium + age_screening + gender, data = nh2007)
model.1.d <- glm(heart_attack ~ barium + age_screening + gender, data = nh2007)
model.1.e <- glm(asthma ~ barium + age_screening + gender, data = nh2007) # Notice that this model is identical to 1a, but it's hard to notice

We can create a new function! This function need to have at least 2 arugments: one argument for the outcome and another for the exposure. It’s also a good idea to add another argument to specify the dataset to use in the model. The function can look like that .

map2() and pmap() are variant of map() that can iterate over two or many arguments simultaneously.

All the functions that we will create for the models can be saved in a new file R/models.R. This will help to quickly find functions based on their type when we need to modify them. Remember to add a line to source this file in the starting script.

ImportantExercice

Exercise: Create a function to build the glm models. Use map2() to apply the function to all combinations of outcomes and exposures. Here are some useful pieces of code that you can use:

By looking at the code, we can see that the outcomes and exposures are the

outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")

We can create a formula using paste() and as.formula():

formula <- paste0("asthma", " ~ ", "creatinine", " + age_screening + gender" ) |>
    as.formula()

formula
asthma ~ creatinine + age_screening + gender

tidyr::expand_grid() creates a table from combinations of vectors. You can use it to get all the combinations of outcomes and exposures.

tidyr::expand_grid(outcomes, exposures)
# A tibble: 16 × 2
   outcomes               exposures 
   <chr>                  <chr>     
 1 asthma                 creatinine
 2 asthma                 lead      
 3 asthma                 barium    
 4 asthma                 cadmium   
 5 heart_failure          creatinine
 6 heart_failure          lead      
 7 heart_failure          barium    
 8 heart_failure          cadmium   
 9 coronary_heart_disease creatinine
10 coronary_heart_disease lead      
11 coronary_heart_disease barium    
12 coronary_heart_disease cadmium   
13 heart_attack           creatinine
14 heart_attack           lead      
15 heart_attack           barium    
16 heart_attack           cadmium   

With these three things, you can create a table containing the combination of outcomes and exposures and create a function that take every row of the table to build all the models from it.

build_model <- function(outcome, exposure, dataset) {
  formula <- paste0(
    outcome, " ~ ",
    exposure, " + age_screening + gender"
  ) |>
    as.formula()

  try(
    glm(formula, data = dataset)
  )
}

# List outcomes
outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")


models_parameters <- tidyr::expand_grid(outcomes, exposures)

models <- map2(
  .x = models_parameters$outcomes,
  .y = models_parameters$exposures,
  .f = function(x, y) build_model(x, y, dataset = nh2007)
)
  • try() is used to catch any errors when the models don’t run for any reason. This allows map to continue instead of stopping.
  • .f = function(x, y) build_model(x, y, dataset = nh2007) is an anonymous function. It helps pass a common element to all the models. Here, it’s the dataset that we want to use for the models. Of course, this dataset can also be passed directly by pmap to make the dataset vary, or written in the build model function, but it can be a little tricky.

Once again, all the models are in a list. It’s easy to access a model using indexes like models[[1]], but it’s a bit crude. It is possible to improve that by:

  • Assigning names to each list element from the model parameters, so you can access models using a $ sign.
  • Putting the models in a column in the models parameters dataframe (often a better solution).
Tip

This is a big modification, take some time to move build_model() to the R/models.R file, rewrite the code in the starting script and commit the changes in Gitnuro!

Extract models results with map()

We have seen how to use the map() function family to loop through columns of a dataframe and to create many models from a dataframe. Their use is not limited to that. It’s possible to extract the results of many models too.

In the starting-script.R, many line just print the summary of each model. A simple solution would be to replace them with the following code:

map(models, summary)

This will print the summary of all the models created at the previous steps. The inconvenient of the summary function is that it’s not easy to export as a table.

ImportantExercice

Exercice: Create a function to extract estimates, CI, p-value and AIC from a model. Use map to apply it to every model.

It can be difficult to create a function from scratch, especially when working with lists. It’s sometimes easier to first work with one element of the list to test the function.

test_model <- models[[1]]

Check the following functions to help you:

coef(summary(test_model))
                   Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)    1.8584448409 0.0321187539 57.861673 0.0000000000
creatinine    -0.0001666218 0.0001114468 -1.495079 0.1350492277
age_screening  0.0012687634 0.0004504500  2.816658 0.0048995453
gender2       -0.0618565055 0.0171497804 -3.606840 0.0003174377
confint(test_model)
Waiting for profiling to be done...
                      2.5 %        97.5 %
(Intercept)    1.7954932401  1.921396e+00
creatinine    -0.0003850536  5.180995e-05
age_screening  0.0003858977  2.151629e-03
gender2       -0.0954694575 -2.824355e-02
test_model$aic
[1] 1802.235
cbind(confint(test_model), test_model$aic)
Waiting for profiling to be done...
                      2.5 %        97.5 %         
(Intercept)    1.7954932401  1.921396e+00 1802.235
creatinine    -0.0003850536  5.180995e-05 1802.235
age_screening  0.0003858977  2.151629e-03 1802.235
gender2       -0.0954694575 -2.824355e-02 1802.235
extract_model_result <- function(model) {
  # Get coefficients
  coefs <- coef(summary(model)) |>
    as.data.frame()

  # confidence interval
  ci <- confint(model) |>
    as.data.frame()

  # AIC
  aic <- model$aic

  # Return a dataframe
  cbind(
    coefs,
    ci,
    aic
  ) |>
    # Get model variables in a column
    rownames_to_column(var = "term")
}

# Test the function with one model
extract_model_result(test_model)
           term      Estimate   Std. Error   t value     Pr(>|t|)         2.5 %
1   (Intercept)  1.8584448409 0.0321187539 57.861673 0.0000000000  1.7954932401
2    creatinine -0.0001666218 0.0001114468 -1.495079 0.1350492277 -0.0003850536
3 age_screening  0.0012687634 0.0004504500  2.816658 0.0048995453  0.0003858977
4       gender2 -0.0618565055 0.0171497804 -3.606840 0.0003174377 -0.0954694575
         97.5 %      aic
1  1.921396e+00 1802.235
2  5.180995e-05 1802.235
3  2.151629e-03 1802.235
4 -2.824355e-02 1802.235
# Extract all model results
models_results <- map(models, extract_model_result)

The broom: package is a very nice interface to reliably extract models results in a consistent way. The tidy() function can easily replace this custom-made function: https://broom.tidymodels.org/

Tip

It’s time to do another commit! Move extract_model_result() to the R/models.R file, rewrite the code in the starting script and commit the changes!

Present the model results with a table using gt

In this section, we will focus on creating just one table because the gt package offers many options, which could be the topic of an entire workshop.

The gt package is a powerful tool for creating presentation-ready tables in R, allowing for extensive customization and styling. With gt, you can easily format numbers, add headers and footers, apply conditional formatting, and even integrate visual elements directly into your tables. Unlike manual formatting in Excel, which is time-consuming, `gt automates the process, ensuring consistency and saving time.

Feel free to experiment with it later if you have time, for example, to create a table for the descriptive statistics.

Prepare the data

Before creating a gt table, it’s best to have a clean table with only the information that we want to present. The first step is to clean the model parameters, models, and model results into one table.

# Add models and results to the models_parameter dataframe
results <- models_parameters |>
  dplyr::mutate(
    models = models,
    models_results = models_results
  )

Then we can unnest() the models_results column in the dataframe, and filter the terms from the model results to keep only the terms matching the exposure of the model.

results_short <- results |>
  unnest(models_results) |>
  dplyr::filter(exposures == term) |>
  select(-models)

head(results_short)
# A tibble: 6 × 10
  outcomes   exposures term  Estimate `Std. Error` `t value` `Pr(>|t|)`  `2.5 %`
  <chr>      <chr>     <chr>    <dbl>        <dbl>     <dbl>      <dbl>    <dbl>
1 asthma     creatini… crea… -1.67e-4    0.000111    -1.50        0.135 -3.85e-4
2 asthma     lead      lead   2.38e-3    0.00555      0.429       0.668 -8.49e-3
3 asthma     barium    bari… -3.03e-3    0.00212     -1.43        0.153 -7.18e-3
4 asthma     cadmium   cadm…  1.20e-3    0.0183       0.0654      0.948 -3.47e-2
5 heart_fai… creatini… crea… -2.77e-5    0.0000542   -0.511       0.609 -1.34e-4
6 heart_fai… lead      lead  -2.68e-3    0.00255     -1.05        0.293 -7.67e-3
# ℹ 2 more variables: `97.5 %` <dbl>, aic <dbl>

Now the table is simpler. Let’s create a gt table and apply all the formating

Create the gt table

To create the table, follow these steps:

  1. Initialize the gt table object
library(gt)
gt_table <- gt(
  data = results_short,
  groupname_col = "outcomes"
)
  1. Format numbers
gt_table <- gt_table |>
  fmt_number(
    columns = c("Estimate", "2.5 %", "97.5 %"),
    decimals = 3
  ) |>
  fmt(
    columns = "Pr(>|t|)",
    fns = function(x) format.pval(x, digits = 3)
  )
  1. Merge and hide columns
gt_table <- gt_table |>
  # Merge confidence interval values together
  cols_merge(
    columns = c("2.5 %", "97.5 %"),
    pattern = "{1} - {2}"
  ) |>
  # Hide non essential columns
  cols_hide(
    columns = c("Std. Error", "t value", "term", "aic")
  ) |>
  # Change column label
  cols_label("2.5 %" = "95% CI")
  1. Add table title and subtitle
gt_table <- gt_table %>%
  tab_header(
    title = "Models results",
    subtitle = "data = nh2007"
  )

gt_table
Models results
data = nh2007
exposures Estimate Pr(>|t|) 95% CI
asthma
creatinine 0.000 0.13505 0.000 - 0.000
lead 0.002 0.66801 −0.008 - 0.013
barium −0.003 0.15283 −0.007 - 0.001
cadmium 0.001 0.94783 −0.035 - 0.037
heart_failure
creatinine 0.000 0.60937 0.000 - 0.000
lead −0.003 0.29314 −0.008 - 0.002
barium 0.000 0.61605 −0.002 - 0.001
cadmium 0.002 0.76814 −0.014 - 0.019
coronary_heart_disease
creatinine 0.000 0.41578 0.000 - 0.000
lead −0.003 0.24778 −0.008 - 0.002
barium 0.000 0.69466 −0.002 - 0.002
cadmium 0.014 0.09495 −0.003 - 0.031
heart_attack
creatinine 0.000 0.92662 0.000 - 0.000
lead −0.003 0.29655 −0.009 - 0.003
barium −0.001 0.35293 −0.003 - 0.001
cadmium 0.026 0.00538 0.008 - 0.045

All these steps in the gt pipeline can be grouped into one function, for example, gt_models(), with only a results table as an argument. This means that you can easily apply the function to new model results and get a identical table!

Exercice: Create this function, add it to R/gt_models.R and add a line to source the file in your maim script.

gt_models <- function(results_clean) {
  library(gt)
  # 1. Initialize the gt table object
  gt(
    data = results_clean,
    groupname_col = "outcomes"
  ) |>
    # 2. Format numbers
    fmt_number(
      columns = c("Estimate", "2.5 %", "97.5 %"),
      decimals = 3
    ) |>
    fmt(
      columns = "Pr(>|t|)",
      fns = \(x) format.pval(x, digits = 3)
    ) |>
    # 3. Merge and hide columns
    cols_merge(
      columns = c("2.5 %", "97.5 %"),
      pattern = "{1} - {2}"
    ) |>
    cols_hide(
      columns = c("Std. Error", "t value", "term", "aic")
    ) |>
    cols_label("2.5 %" = "95% CI") |>
    # 4. Add table title and subtitle
    tab_header(
      title = "Models results"
  )
}

# Testing the function
gt_models(results_short)
Models results
exposures Estimate Pr(>|t|) 95% CI
asthma
creatinine 0.000 0.13505 0.000 - 0.000
lead 0.002 0.66801 −0.008 - 0.013
barium −0.003 0.15283 −0.007 - 0.001
cadmium 0.001 0.94783 −0.035 - 0.037
heart_failure
creatinine 0.000 0.60937 0.000 - 0.000
lead −0.003 0.29314 −0.008 - 0.002
barium 0.000 0.61605 −0.002 - 0.001
cadmium 0.002 0.76814 −0.014 - 0.019
coronary_heart_disease
creatinine 0.000 0.41578 0.000 - 0.000
lead −0.003 0.24778 −0.008 - 0.002
barium 0.000 0.69466 −0.002 - 0.002
cadmium 0.014 0.09495 −0.003 - 0.031
heart_attack
creatinine 0.000 0.92662 0.000 - 0.000
lead −0.003 0.29655 −0.009 - 0.003
barium −0.001 0.35293 −0.003 - 0.001
cadmium 0.026 0.00538 0.008 - 0.045

Remember to commit the new changes.

Generate a report with quarto

2007 report

Now we can put the nice table in a Quarto document to generate directly HTML or DOCX documents.

In RStudio create a new file: qmd/report.qmd In the report we should include a yaml header defining some execution parameters:

---
title: "NHANES Report"
author: "Your Name"
date: "date"
output: html_document # or docx
---

R code can then be executed within chunks. The first chunk should load all the packages and source all the functions:

```{r}
# Load packages
library(tidyverse)
library(here)

# Loading the functions
# here() gets the location of the project
source(here("R/descriptive.R"))
source(here("R/models.R"))
source(here("R/gt_models.R"))

# Alternative is to use .. to refer to the project root
# source("../R/descriptive.R")
# source("../R/models.R")
# source("../R/gt_models.R")
# ..  refer to the project root (one folder above the current one)

```

In between each chunk, you can add text to describe what you did, or even write some method points that you need to remember for later. The syntax in this part is done in markdown.

The following chunks can contain all the lines that include map functions.

ImportantExercise

Exercise: Convert your starting script into a Quarto document.

  • Create the file qmd/report.qmd
  • Open it and add the yaml header and a chunck to load all the package and library.
  • Add as many chuncks as you wish for each step of the analysis.

You can check the following file to see a example: report example

2009 report

This report is only based on the 2007 data. Since the 2009 data in the nh2009 dataset is very similar, we can reuse the functions with it. It’s very easy to replicate the analysis with new data: it only requires applying the same functions to the new dataset.

Let’s create a new section in the report:

# 2007

... 2007 analysis

# 2009
... 2009 analysis

The script for the 2009 analysis can look like this:

load(here("data/nh2009.RData"))

# Descriptive stats
purrr::map(.x = nh2009, .f = compute_descriptive_stats) |>
  dplyr::bind_rows(.id = "column")

# List outcomes
outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")


models_parameters_2009 <- tidyr::expand_grid(outcomes, exposures)

models_2009 <- map2(
  .x = models_parameters_2009$outcomes,
  .y = models_parameters_2009$exposures,
  .f = \(x, y) build_model(x, y, dataset = nh2009) # we need to change the dataset
)

# Extract model results
models_results_2009 <- map(models_2009, extract_model_result)

results_2009 <- models_parameters_2009 |>
  dplyr::mutate(
    models = models_2009,
    models_results = models_results_2009
  )

results_model_clean_2009 <- results_2009 |>
  unnest(models_results) |>
  dplyr::filter(exposures == term) |>
  select(-models)

gt_models(results_model_clean_2009)

Exercice: Integrate the following lines in the report document, and render the document (Button in RStudio toolbar).

Congratulation, you finished the whole workshop!

You can check the following files to see the final files of the project:


{fig-align=“center”)