Hands-on Activity 2

Published

June 5, 2024

Introduction

Goal

At the end of this exercise, you should be able to produce an HTML document containing at least one table to summarize model results using gt and quarto with the nh2007 dataset. If we have enough time, you can reproduce the analysis with the nh2009 dataset.

Tasks

  • Improve and simplify the code using functional programming
  • Create a QMD document to generate a report using Quarto
  • Make a table using gt to summarize model results
  • Continue to use Git to save your progress

Functional Programming

Descriptive Statistics

Let’s look at the script that we finished before the break. We created some functions that we now use for all the descriptive statistics. However, we still have to copy and paste the same call many times.

# Numbers
compute_descriptive_stats(nh2007$gender)
compute_descriptive_stats(nh2007$education)
compute_descriptive_stats(nh2007$education_child)
compute_descriptive_stats(nh2007$asthma)
compute_descriptive_stats(nh2007$heart_failure)
compute_descriptive_stats(nh2007$coronary_heart_disease)
compute_descriptive_stats(nh2007$creatinine)
compute_descriptive_stats(nh2007$lead)
compute_descriptive_stats(nh2007$barium)
compute_descriptive_stats(nh2007$cadmium)


# Graph
compute_descriptive_graph(nh2007$creatinine)
compute_descriptive_graph(nh2007$lead)
compute_descriptive_graph(nh2007$barium)
compute_descriptive_graph(nh2007$cadmium)

Exercice: Can you figure out how to use map to avoid copy pasting the call?

The map function can take a data frame as its first argument and will apply the function to each column of the data frame:

purrr::map(.x = nh2007, .f = compute_descriptive_stats)

For the descriptive graphs we can first use dplyr::select() to select the four columns of interest:

nh2007 |>
  dplyr::select(creatinine, lead, barium, cadmium) |>
  purrr::map(.f = compute_descriptive_graph)

map return a list by default:

purrr::map(.x = nh2007, .f = compute_descriptive_stats) |>
  head(3)
$id
$id$mean
[1] 46562.61

$id$sd
[1] 2966.326

$id$quantiles
      0%      25%      50%      75%     100% 
41477.00 44060.75 46513.00 49178.75 51622.00 


$gender
variable
   1    2 <NA> 
1022 1012    0 

$age_screening
$age_screening$mean
[1] 47.41593

$age_screening$sd
[1] 19.2215

$age_screening$quantiles
  0%  25%  50%  75% 100% 
  16   31   47   63   80 

List are not so easy to work with because functions are mainly build to use a dataframe. However it’s very easy to modify compute_table() and compute_numeric() to return statistics in a different format.

When the two functions return a dataframe, it’s will be possible to use dplyr::bind_rows() to bind all the elements of the list from map into a single dataframe.

Let’s first change the two compute functions and load them into our environment:

compute_table <- function(variable) {
  # Return frequency table as a dataframe
  table(variable, useNA = "always", dnn = "level") |>
    as.data.frame()
}

compute_numeric <- function(variable) {
  mean_value <- mean(variable, na.rm = TRUE)
  sd_value <- sd(variable, na.rm = TRUE)
  quantiles <- quantile(variable, na.rm = TRUE)

  # Return statistics as a dataframe
  cbind(
    data.frame(
      "mean" = mean_value,
      "sd" = sd_value
    ),
    t(quantiles)
  )
}

Then we can run the lines with map again:

purrr::map(.x = nh2007, .f = compute_descriptive_stats) |>
  dplyr::bind_rows(.id = "column") |>
  head(5)
         column        mean        sd    0%      25%   50%      75%  100% level
1            id 46562.61406 2966.3259 41477 44060.75 46513 49178.75 51622  <NA>
2        gender          NA        NA    NA       NA    NA       NA    NA     1
3        gender          NA        NA    NA       NA    NA       NA    NA     2
4        gender          NA        NA    NA       NA    NA       NA    NA  <NA>
5 age_screening    47.41593   19.2215    16    31.00    47    63.00    80  <NA>
  Freq
1   NA
2 1022
3 1012
4    0
5   NA

The result is already much better and easier to work with. Notice that we didn’t have to change the compute_descriptive_stats() function, only the computation functions. This is because no calculations are happening in compute_descriptive_stats(): it’s only calling the appropriate compute function for each type of variable.

Tip

When modifying functions, it’s a good idea to commit the changes frequently so you can revert any unwanted changes if needed.

Jump back in Git-gui and commit the new changes to the functions.

Models

We can also use functional programming to simplify the creation of the models. For the 4 outcomes and the 4 exposures, the models are always the same:

# Creatinine
model.1a <- glm(asthma ~ barium + age_screening + gender, data = nh2007)
model.1.b <- glm(heart_failure ~ barium + age_screening + gender, data = nh2007)
model.1.c <- glm(coronary_heart_disease ~ barium + age_screening + gender, data = nh2007)
model.1.d <- glm(heart_attack ~ barium + age_screening + gender, data = nh2007)
model.1.e <- glm(asthma ~ barium + age_screening + gender, data = nh2007) # Notice that this model is identical to 1a, but it's hard to notice

The custom function to create need to have at least 2 arugments: one argument for the outcome and another for the exposure. It’s also a good idea to add another argument to specify the dataset to use in the model.

map2 and pmap are variant of map that can iterate over two or many arguments simultaneously.

All the functions that we will create can be saved in R/models.R. Remember to add a line to source this file in your script.

Exercise: Create a function to build the glm models. Use map2 to apply the function to all combinations of outcomes and exposures.

Note
build_model <- function(outcome, exposure, dataset) {
  formula <- paste0(
    outcome, " ~ ",
    exposure, " + age_screening + gender"
  ) |>
    as.formula()

  try(
    glm(formula, data = dataset)
  )
}

# List outcomes
outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")


models_parameters <- tidyr::expand_grid(outcomes, exposures)

models <- map2(
  .x = models_parameters$outcomes,
  .y = models_parameters$exposures,
  .f = \(x, y) build_model(x, y, dataset = nh2007)
)
  • try() is used to catch any errors when the models don’t run for any reason. This allows map to continue instead of stopping.
  • .f = \(x, y) build_model(x, y, dataset = nh2007) helps pass a common element to all the models. Here, it’s the dataset that we want to use for the models. Of course, this dataset can also be passed firectly by pmap to make the dataset vary, but it can be a little tricky.

Once again, all the models are in a list. It’s easy to access a model using indexes like models[[1]], but it’s a bit crude. It is possible to improve that by:

  • Assigning names to each list element from the model parameters, so you can access models using a $ sign.
  • Putting the models in a column in the models parameters dataframe (often a better solution).

Extract models results

Exercice: Create a function to extract estimates, CI, p-pvalue and AIC from a model. Use map to apply it to every model.

Note

It can be difficult to create a function from scratch, especially when working with lists. It’s sometimes easier to first work with one element of the list to test the function.

test_model <- models[[1]]
extract_model_result <- function(model) {
  # Get coefficients
  coefs <- coef(summary(model)) |>
    as.data.frame()

  # confidence interval
  ci <- confint(model) |>
    as.data.frame()

  # AIC
  aic <- model$aic

  # Return a dataframe
  cbind(
    coefs,
    ci,
    aic
  ) |>
    rownames_to_column(var = "term")
}

# Extract model results
models_results <- map(models, extract_model_result)

The broom package is a very nice interface to reliably extract models results in a consistent way. It can easily replace this custom-made function: https://broom.tidymodels.org/

Make a table using gt

Prepare the data

In this section, we will focus on creating just one table because the gt package offers many options, which could be the topic of an entire workshop. Feel free to experiment with it later if you have time, for example, to create a table for the descriptive statistics.

Before creating a gt table, it’s best to have a clean table with only the information that we want to present. The first step is to consolidate the model parameters, models, and model results into one table.

results <- models_parameters |>
  dplyr::mutate(
    models = models,
    models_results = models_results
  )

Then we can unnest() the models_results column in the dataframe, and filter the terms from the model results to keep only the terms matching the exposure of the model.

results_short <- results |>
  unnest(models_results) |>
  dplyr::filter(exposures == term) |>
  select(-models)

head(results_short)
# A tibble: 6 × 10
  outcomes   exposures term  Estimate `Std. Error` `t value` `Pr(>|t|)`  `2.5 %`
  <chr>      <chr>     <chr>    <dbl>        <dbl>     <dbl>      <dbl>    <dbl>
1 asthma     creatini… crea… -1.67e-4    0.000111    -1.50        0.135 -3.85e-4
2 asthma     lead      lead   2.38e-3    0.00555      0.429       0.668 -8.49e-3
3 asthma     barium    bari… -3.03e-3    0.00212     -1.43        0.153 -7.18e-3
4 asthma     cadmium   cadm…  1.20e-3    0.0183       0.0654      0.948 -3.47e-2
5 heart_fai… creatini… crea… -2.77e-5    0.0000542   -0.511       0.609 -1.34e-4
6 heart_fai… lead      lead  -2.68e-3    0.00255     -1.05        0.293 -7.67e-3
# ℹ 2 more variables: `97.5 %` <dbl>, aic <dbl>

Now the table is simple and it’s possible to create a gt table and to go handle all the formating

Create the gt table

To create the table, follow these steps:

  1. Initialize the gt table object
library(gt)
Warning: package 'gt' was built under R version 4.3.3
gt_table <- gt(
  data = results_short,
  groupname_col = "outcomes"
)

gt_table
exposures term Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 % aic
asthma
creatinine creatinine -1.666218e-04 1.114468e-04 -1.49507911 0.135049228 -0.0003850536 5.180995e-05 1802.2351
lead lead 2.379870e-03 5.548226e-03 0.42894247 0.668010610 -0.0084944533 1.325419e-02 1804.2892
barium barium -3.028811e-03 2.117844e-03 -1.43013881 0.152831073 -0.0071797093 1.122087e-03 1802.4252
cadmium cadmium 1.198378e-03 1.831328e-02 0.06543767 0.947831976 -0.0346949935 3.709175e-02 1804.4692
heart_failure
creatinine creatinine -2.772027e-05 5.424036e-05 -0.51106355 0.609367443 -0.0001340294 7.858888e-05 -1270.1983
lead lead -2.676537e-03 2.545335e-03 -1.05154615 0.293144934 -0.0076653009 2.312227e-03 -1271.0444
barium barium -4.963558e-04 9.896629e-04 -0.50154023 0.616050549 -0.0024360595 1.443348e-03 -1270.1887
cadmium cadmium 2.474920e-03 8.393849e-03 0.29484929 0.768142044 -0.0139767213 1.892656e-02 -1270.0237
coronary_heart_disease
creatinine creatinine -4.562735e-05 5.605650e-05 -0.81395298 0.415776468 -0.0001554961 6.424136e-05 -1147.8785
lead lead -3.041374e-03 2.630681e-03 -1.15611644 0.247782569 -0.0081974150 2.114667e-03 -1148.5536
barium barium -4.016007e-04 1.022938e-03 -0.39259548 0.694663436 -0.0024065216 1.603320e-03 -1147.3691
cadmium cadmium 1.448436e-02 8.669517e-03 1.67072316 0.094945244 -0.0025075783 3.147630e-02 -1150.0099
heart_attack
creatinine creatinine -5.655582e-06 6.140079e-05 -0.09210927 0.926621182 -0.0001259989 1.146878e-04 -809.6718
lead lead -3.008399e-03 2.881167e-03 -1.04415985 0.296547637 -0.0086553836 2.638585e-03 -810.7556
barium barium -1.040694e-03 1.120050e-03 -0.92914986 0.352932446 -0.0032359522 1.154563e-03 -810.5283
cadmium cadmium 2.642442e-02 9.481671e-03 2.78689532 0.005375826 0.0078406907 4.500816e-02 -817.4306
  1. Format numbers
gt_table <- gt_table |>
  fmt_number(
    columns = c("Estimate", "2.5 %", "97.5 %"),
    decimals = 3
  ) |>
  fmt(
    columns = "Pr(>|t|)",
    fns = \(x) format.pval(x, digits = 3)
  )
  1. Merge and hide columns
gt_table <- gt_table |>
  cols_merge(
    columns = c("2.5 %", "97.5 %"),
    pattern = "{1} - {2}"
  ) |>
  cols_hide(
    columns = c("Std. Error", "t value", "term")
  ) |>
  cols_label("2.5 %" = "95% CI")
  1. Add table title and subtitle
gt_table <- gt_table %>%
  tab_header(
    title = "Models results",
    subtitle = "data = nh2007"
  )

gt_table
Models results
data = nh2007
exposures Estimate Pr(>|t|) 95% CI aic
asthma
creatinine 0.000 0.13505 0.000 - 0.000 1802.2351
lead 0.002 0.66801 −0.008 - 0.013 1804.2892
barium −0.003 0.15283 −0.007 - 0.001 1802.4252
cadmium 0.001 0.94783 −0.035 - 0.037 1804.4692
heart_failure
creatinine 0.000 0.60937 0.000 - 0.000 -1270.1983
lead −0.003 0.29314 −0.008 - 0.002 -1271.0444
barium 0.000 0.61605 −0.002 - 0.001 -1270.1887
cadmium 0.002 0.76814 −0.014 - 0.019 -1270.0237
coronary_heart_disease
creatinine 0.000 0.41578 0.000 - 0.000 -1147.8785
lead −0.003 0.24778 −0.008 - 0.002 -1148.5536
barium 0.000 0.69466 −0.002 - 0.002 -1147.3691
cadmium 0.014 0.09495 −0.003 - 0.031 -1150.0099
heart_attack
creatinine 0.000 0.92662 0.000 - 0.000 -809.6718
lead −0.003 0.29655 −0.009 - 0.003 -810.7556
barium −0.001 0.35293 −0.003 - 0.001 -810.5283
cadmium 0.026 0.00538 0.008 - 0.045 -817.4306

All these steps in the gt pipeline can be grouped into one function, for example, gt_models(), with only results_clean as an argument. This means that you can easily reapply the function to each results table and get a identical table!

Exercice: Create this function and add it to R/gt_models.R.

gt_models <- function(results_clean) {
  library(gt)
  # 1. Initialize the gt table object
  gt(
    data = results_clean,
    groupname_col = "outcomes"
  ) |>
    # 2. Format numbers
    fmt_number(
      columns = c("Estimate", "2.5 %", "97.5 %"),
      decimals = 3
    ) |>
    fmt(
      columns = "Pr(>|t|)",
      fns = \(x) format.pval(x, digits = 3)
    ) |>
    # 3. Merge and hide columns
    cols_merge(
      columns = c("2.5 %", "97.5 %"),
      pattern = "{1} - {2}"
    ) |>
    cols_hide(
      columns = c("Std. Error", "t value", "term")
    ) |>
    cols_label("2.5 %" = "95% CI") |>
    # 4. Add table title and subtitle
    tab_header(
      title = "Models results"
  )
  
}

Remember to commit this new function and to source it in your main script.

Generate the report with quarto

Now that we have many functions and a nice table, we can put them in a Quarto document to generate either HTML or DOCX documents.

In RStudio create a new file: qmd/report.qmd In the report we should include a yaml header defining some execution parameters:

---
title: "NHANES Report"
author: "Your Name"
date: "2024-05-31"
output: html_document # or docx
---

R code can then be executed within chunks. The first chunk should load all the packages and source all your functions:

```{r}
# Load packages
library(tidyverse)
source("../R/descriptive.R")
source("../R/models.R")
source("../R/gt_models.R")
# .. is needed to refer to the project root

# Alternative using the here package
source(here("R/descriptive.R"))
source(here("R/models.R"))
source(here("R/gt_models.R"))
```

In between each chunk, you can add text to describe what you did, or even write some method points that you need to remember for later. The syntax in this part is done in markdown.

The following chunks can contain all the lines that include map functions.

Exercise: Convert your starting script into a Quarto document.

  • Create the file qmd/report.qmd
  • Open it and add the yaml header and a chunck to load all the package and library.
  • Add as many chuncks as you wish for each step of the analysis.

You can check the following file to see a example: report example

2009 report

Now that the functions are all created, adding the analysis for the nh2009 dataset is simple, as it only requires applying the same functions to another dataset.

The report can be divided into two sections:

# 2007

... 2007 analysis

# 2009
... 2009 analysis

The only thing that needs to be copied and pasted is the data management part. The script for the 2009 analysis can look like this:

load(here("data/nh2009.RData"))

# Descriptive stats
purrr::map(.x = nh2009, .f = compute_descriptive_stats) |>
  dplyr::bind_rows(.id = "column")

# List outcomes
outcomes <- c("asthma", "heart_failure", "coronary_heart_disease", "heart_attack")
exposures <- c("creatinine", "lead", "barium", "cadmium")


models_parameters_2009 <- tidyr::expand_grid(outcomes, exposures)

models_2009 <- map2(
  .x = models_parameters_2009$outcomes,
  .y = models_parameters_2009$exposures,
  .f = \(x, y) build_model(x, y, dataset = nh2009) # we need to change the dataset
)

# Extract model results
models_results_2009 <- map(models_2009, extract_model_result)

results_2009 <- models_parameters_2009 |>
  dplyr::mutate(
    models = models_2009,
    models_results = models_results_2009
  )

results_model_clean_2009 <- results_2009 |>
  unnest(models_results) |>
  dplyr::filter(exposures == term) |>
  select(-models)

gt_models(results_model_clean_2009)

Exercice: Intergrate the following lines in the report document

Check the following files to see the final state of the project: