real time web analytics

Making tables for multinomial models with {modelsummary} and {brms}

Academia
Statistics
Models
Publishing
Methods
Data Science
brms
Author

Michael E. Flynn

Published

June 14, 2022

NOTE Since moving the site to Quarto I’ve been having some trouble getting Quarto to render this particular post. Something to do witht he particular model objects that I read in in the second code chunk. I think the post is still useful, but I’m going to omit the output from the code chunks since it breaks the post. I’ll try to follow up with someone smarter than I to see how I can go about fixing things.

tl;dr: Learn how to make some cool and customizable tables for multinomial logit models using {brms} and {modelsummary}.

UPDATE: Vincent informed me that the most recent version of {modelsummary} relies entirely on the {parameters} package. Apparently the {broom} package will no longer be actively developing. Keep this in mind if you’re trying this approach and get stuck.

Background

I’m coming off a couple of long projects where we were using a lot of multinomial logit models, and making publication quality tables was a major challenge. Actually, started using {brms} several years ago partly because I had data where we had 1) lots of individuals making choices, 2) those individuals were all grouped in some pretty clear ways, and 3) we were also interested in modeling group-level characteristics that might relate to individuals’ choices. This more or less marked my full transition from Stata to R and from frequentist stats to Bayesian stats.

Early on the biggest problem I ran into was finding a way to generate tables for multinomial models run using {brms}. Initially most packages didn’t support {brms} and/or developing tables for multilevel/hierarchical models required a lot of extra legwork. Building tables to accommodate choice models was another issue. Taken together, these problems meant that I had to write a lot of extensive code by hand to generate clean Latex tables for the models I was using. The process has, thankfully, become much simpler over the last couple of years.

First, for those who aren’t familiar {brms} is an amazing package created by Paul Bürkner. It stands for Bayesian Regression Modeling using Stan, and, as the name suggests, provides users with a convenient front-end for building regression models using Stan as a back end. This is particularly useful for building multilevel models, but also lots of other stuff.

Second, {modelsummary} is another amazing and ever-expanding package created by Vincent Arel-Bundock. This packages does a few different things, but most importantly and prominently it helps users to create excellent tables for summarizing data and building tables for regression output. The package is incredibly flexible, supports dozens of different model types, and Vincent is constantly adding new features.

Both packages are fantastic. If you’re interested in Bayesian modeling, or just looking for a new and easy way to build tables for whatever modeling package you already use, you should check out one or both of these packages.

Multilevel Multinomial Logit Models

Lots of regression models are going to be fairly simple to present in a table format, and there are some fairly easy ways to go about generating those tables. Typically you’ll have a single column per model, and each row of your table will be for a single variable. You might also have some summary statistics for the model at the bottom (think $ N $, $ R^2 $, etc.).

Multinomial models get a little more complicated because you’ll typically have multiple outcomes. Specifically, you’ll have $k-1 $ columns to present in the table, where $k $ is the number of choices respondents have. In our case we had lots of models where survey respondents offered their assessments of various actors and we condensed those assessments down into four general categories: Positive, Neutral, Negative, and Don’t Know. This means we ultimately ended up with three columns per model, with the “Neutral” response serving as the baseline category against which the others were compared.

Multilevel models complicate things slightly because you may also have summary statistics for the groups in your data in addition to the general summary statistics for the model.

Last, Bayesian models and {brms} specifically provide users with a ton of additional information they might want to present beyond the traditional stuff you’d find in frequentist models. Some of this information can take a long time to compile. There’s often going to be an efficiency and transparency tradeoff here, and so you may want to customize what you present in your table. Even if you end up presenting lots of information in the end, having the ability to control what’s in your table at the outset can be really useful as you run the code to make sure the basic output looks right.

Anyway, the goal here is rather niche, but it’s to talk through the process of building nice and readable tables when we’re using these models and have lots of information to present. {modelsummary} lets us do this, but also requires a bit of additional effort to fully customize our output.

Getting Started

OK, first we’re going to load our libraries. The relevance of some of these is immediately obvious given what I’ve already said, but I’ll talk more about some of the additional packages we need below.

# Load libraries
library(tidyverse)
library(here)
library(modelsummary)
library(brms)
library(parameters)
library(kableExtra)
library(broom)
library(broom.mixed)
library(data.table)

In addition to loading the libraries we want to load our model objects. In this case I have three models, each with three outcome categories. Each of these ends with a “p1” or a “t1” to denote the reference group for the model’s outcome variable (e.g. if the outcome variable is asking about troops, people, or government).

Then I’m creating a list object to store the three model objects. Note here that I’m leaving the label for the individual models blank by including the empty quotation marks in the list function. Depending on your situation you can go ahead and name these if you want. In my case it makes more sense to keep them empty because I plan to add a grouping header/title later in the final table and based on how {modelsummary} works including the titles here would create some redundancies in the final output and take up extra horizontal space. If you’re working with HTML output and you have scrollable tables maybe this doesn’t matter, but it can matter a lot for print versions..

m.c.t1 <- readRDS(here::here("files/data-files/m.c.t1.rds"))
m.c.p1 <- readRDS(here::here("files/data-files/m.c.p1.rds"))
m.c.g1 <- readRDS(here::here("files/data-files/m.c.g1.rds"))

# Create a list object to store the three separate model objects.
mod.list <- list(" " = m.c.t1,
                 " " = m.c.p1,
                 " " = m.c.g1)

Before we move on, let’s take a quick look at the models and what they look like.

summary(mod.list[[1]])

There’s a lot going on here, but you can see from the summary output that organizing this could be a bit of a bear. We have three outcome categories, lots of categorical variables (some conceptually related), model summary statistics, etc.

Customization

The first two chunks are pretty boiler plate, and for lots of types of models you can probably just go on to use {modelsummary} directly and get some nice tables. But this section is going to dive into some of the extra steps required to make the tables look nice and clean, but also to save us a ton of time.

The big issue that we have to address is that {modelsummary} is going to automatically try to generate lots of different types of goodness-of-fit or model summary statistics for the models you’re including. If this were OLS it would be super quick. But since we’re using {brms} models the defaults can run for a very long time and, depending on your workflow needs, cause some major slowdowns.

{modelsummary} is using the packages like {tidy}, {parameters}, and {broom} to extract information from the models in your list and to generate a basic data frame with that output that serves as the basis for the final table. Things like WAIC and LOOIC can take a long time to calculate, and depending on your needs you might not want them right way.

Beyond that, you might just want to customize how the footer of your table looks, what information is included in which places, etc. This is a good way to do that, but it takes a little extra work.

First, we need to assign a “custom” class to each of the three model objects stored in the mod.list object. Assigning the custom class is going allow us to write some custom {tidy} and {glance} functions that will let us select the specific summary stats and other model info that we want to include.

for (i in seq_along(mod.list)){
  class(mod.list[[i]]) <- c("custom", class(mod.list[[i]]))
}

Next we can write out custom tidy function, appropriately labeled here tidy.custom. Actually, I think you have to name it this so {modelsummary} recognizes that you want to use a custom function.

There’s a lot going on here, and some of it is going to be specific to the models I’m using as an example.

First, you create the function, where x is the object placeholder, and conf.level=.95 is specifying the default confidence/credible interval threshold. The ... just leaves it open for other arguments.

Next we create an object named out using the {parameters} package. First we start off creating a data frame using this function, then we standardize all of the variable/column names (i.e. make them lower case). So far pretty standard.

The mutate chunk is where things get more complicated, and where you’ll need to substitute your own model-specific factors. {modelsummary} also has a formula-based argument for helping you to arrange your models in the resulting table, but we need to do a little extra work here to properly identify the outcome variable levels. This is partly a function of the fact that {brms} does some weird things like attaching outcome choices (i.e. the binary outcome variable for the individual equations) as prefixes to each variable name, meaning there are no outcome levels for {parameters} to automatically detect. So we need to create them.

All I’m going here is using case_when(...) to identify the outcome variable levels/choices and creating a categorical y.level variable containing the full name for each of the choices for the outcome variable. As I mentioned above, those are “Don’t know”, “Negative”, and “Positive” (with “Neutral” as the omitted reference category).

Next, I take the term column that {parameters} generates, which contains all of the predictor variables, and I remove the outcome level prefixes that {brms} attaches. This way I have two columns with variable names (term) and the outcome variable choice/model equation (i.e. y.level).

# tidy method extracts level names into new column
tidy.custom <- function(x, conf.level=.95, ...) {                                   # Create function
  out = parameters::parameters(x, ci=conf.level, verbose=FALSE) |>                 # Call {parameters} to pull model parameter info with specified credible interval
        parameters::standardize_names(style="broom") |>                            # make names lower case
        mutate(y.level = case_when(grepl("mudk", term) ~ "Don't Know",              # Change outcome level values to plain meaning for output table
                                 grepl("muneg", term) ~ "Negative",
                                 grepl("mupos", term) ~ "Positive"),
               term = gsub("dk|neg|pos", "", term))                                 # remove outcome prefix {brms} attaches to variable names
  return(out)
}

Next we can move on to writing a custom function to pull the relevant model and summary statistic information. First, we need to tell glance() to quiet down since it’s going to do lots of stuff we don’t necessarily want it to do right now. I can’t remember if the gof.check lines are necessary at this point (I seem to recall it had no effect one way or the other when I was initially working on this), but I’ll turn those off just to be safe, too.

# Write custom glance function to extract summary information.
glance.custom <- function(x, ...) {
  ret <- tibble::tibble(N = summary(x)$nobs)
  ret
}

# Turn off GOF stuff
gof.check <- modelsummary::gof_map
gof.check$omit <- TRUE

Next we can move on to using the {parameters} package to full information on the “random” part of our varying intercepts model. We’ll also use this step to include some basic summary information on the models as mentioned above.

Ultimately the goal here is to generate a clean data frame containing summary information that we want. I’ve included more specific comments in the code chunk below for readers who want to scrutinize each step, but much of this mirrors what we just did with the tidy.custom() function above, it’s just doing it to the “random” or “varying” part of the model rather than the population-level coefficients.

# Write function to loop over list of models
rows <- lapply(mod.list, function(x){

  temp.sd <- parameters::parameters(x, effect = "random")  |>                   # start with parameters and pull "random" component of model
    filter(grepl(".*sd.*", Parameter)) |>                                       # filter out parameters containing standard deviation info
    filter(grepl(".*persYes.*|.*nonpersYes.*|.*Intercept.*", Parameter)) |>     # further filter parameters containing SD info for relevant variables
    dplyr::select(Parameter, Median) |>                                         # Keep only the Parameter (name) column and the Median column
    dplyr::mutate(Median = as.character(round(Median, 2)),
      y.level = case_when(                                                       # Like before, create a y.level column for outcome variable level/equation
      grepl(".*mudk.*", Parameter) ~ "dk",
      grepl(".*mupos.*", Parameter) ~ "pos",
      grepl(".*muneg.*", Parameter) ~ "neg",
    ),
    Parameter = gsub("_mudk_|_mupos_|_muneg_", "_", Parameter)) |>              # Remove the {brms} prefix from the Parameter column
    pivot_wider(id_cols = Parameter,                                             # Rearrange  columns from long to wide
                values_from = Median,
                names_from = y.level) |> 
    mutate(Parameter = case_when(                                                # Rename relevant parameters to appear how you want in text
      grepl(".*Intercept.*", Parameter) ~ "sd(Intercept)"
    ))
  
  temp.obs <- tibble::tribble(~Parameter, ~dk, ~neg, ~pos,                       # Create another data frame containing observation count and grouping info
                              "N", as.character(nobs(x)), "", "",
                              "Group", "Country", "", "",
                              "\\# Groups", as.character(length(unique(x$data$country))), "", "")
  

  temp.com <- bind_rows(temp.obs, temp.sd)                                       # Bind two data frames together for consolidated footer data frame

  return(temp.com)
  
  }
)

# Group everything from the three models together
# Also select relevant columns containing information
rows.com <- bind_cols(rows[[1]], rows[[2]], rows[[3]]) |> 
  dplyr::select(1, 2, 3, 4, 6, 7, 8, 10, 11, 12) 

# Rename those columns so they'll match eventual output data frame names
names(rows.com) <- c("term", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")

Great! Next we can start cleaning up the variable names for presenting them in the table, and we can even go a step further to add grouping labels to help readers move between broad categories of predictor variables (e.g. income, age, etc.).

Here’s we’re going to generate a tribble with the “raw” variable names and the “clean” label for presentation. Note that we have to arrange them in the order in which we want them to appear.

coef.list <- tibble::tribble(~raw, ~clean,
                            "b_mu_contact_persYes", "Personal Contact: Yes",
                            "b_mu_contact_persDontknowDdeclinetoanswer", "Personal Contact: DK/Decline",
                            "b_mu_contact_nonpersYes", "Network Contact: Yes",
                            "b_mu_contact_nonpersDontknowDdeclinetoanswer", "Network Contact: DK/Decline",
                            "b_mu_benefit_persYes", "Personal Benefit: Yes",
                            "b_mu_benefit_persDontknowDdeclinetoanswer", "Personal Benefit: DK/Decline",
                            "b_mu_benefit_nonpersYes", "Network Benefit: Yes",
                            "b_mu_benefit_nonpersDontknowDdeclinetoanswer", "Network Benefit: DK/Decline",
                            "b_mu_age25to34years", "25-34",
                            "b_mu_age35to44years", "35-44",
                            "b_mu_age45to54years", "45-54",
                            "b_mu_age55to64years", "55-65",
                            "b_mu_ageAge65orolder", ">65",
                            "b_mu_income.5.cat21M40%", "21-40",
                            "b_mu_income.5.cat41M60%", "41-60",
                            "b_mu_income.5.cat61M80%", "61-80",
                            "b_mu_income.5.cat81M100%", "81-100",
                            "b_mu_genderFemale", "Female",
                            "b_mu_genderNonMbinary", "Non-binary",
                            "b_mu_genderNoneoftheabove", "None of the above",
                            "b_mu_minorityYes", "Minority: Yes",
                            "b_mu_minorityDeclinetoanswer", "Minoriy: Decline to answer",
                            "b_mu_religCatholicism", "Catholic",
                            "b_mu_religChristianityprotestant", "Protestant",
                            "b_mu_religBuddhism", "Buddhism",
                            "b_mu_religHinduism", "Hindu",
                            "b_mu_religIslam", "Islam",
                            "b_mu_religJudaism", "Judaism",
                            "b_mu_religShinto", "Shinto",
                            "b_mu_religMormonism", "Mormonism",
                            "b_mu_religLocal", "Local Religion",
                            "b_mu_religOther", "Other",
                            "b_mu_religDeclinetoanswer", "Religion: Decline to answer",
                            "b_mu_ed_z", "Education",
                            "b_mu_ideology_z", "Ideology",
                            "b_mu_troops_crime_persYes", "Personal Crime Experience: Yes",
                            "b_mu_american_inf_1DontknowDdeclinetoanswer", "Influence 1: DK/Decline",
                            "b_mu_american_inf_1Alittle", "Influence 1: A little",
                            "b_mu_american_inf_1Some", "Influence 1: Some",
                            "b_mu_american_inf_1Alot", "Influence 1: A lot",
                            "b_mu_american_inf_2DontknowDdeclinetoanswer", "Influence 2: DK/Decline",
                            "b_mu_american_inf_2Veryative", "Influence 2: Very negative",
                            "b_mu_american_inf_2Negative", "Influence 2: Negative",
                            "b_mu_american_inf_2Positive", "Influence 2: Positive",
                            "b_mu_american_inf_2Veryitive", "Influence 2: Very positive",
                            "b_mu_basecount_z", "Base count",
                            "b_mu_gdp_z", "GDP",
                            "b_mu_pop_z", "Population",
                            "b_mu_troops_z", "Troop deployment size",
                            "b_mu_Intercept", "Intercept")

Next we’re going to add a new column that contains the grouping list. Since we’re using lots of categorical predictor variables we want to make sure they’re grouped in a sensible way.

coef.list <- coef.list |> 
  mutate(group = case_when(
           grepl(".*ontact.*", raw) ~ "Contact Status",
           grepl(".*enefit.*", raw) ~ "Economic Benefits",
           grepl(".*age.*", raw) ~ "Age",
           grepl(".*ncome.*", raw) ~ "Income Quintile",
           grepl(".*gender.*", raw) ~ "Gender Identification",
           grepl(".*minority.*", raw) ~ "Minority Self-Identification",
           grepl(".*relig.*", raw) ~ "Religious Identification",
           grepl(".*ed_z.*", raw) ~ "Education",
           grepl(".*ideology_z.*", raw) ~ "Ideology",
           grepl(".*crime.*", raw) ~ "Crime Experience",
           grepl(".*inf_1.*", raw) ~ "American Influence (Amount)",
           grepl(".*inf_2.*", raw) ~ "American Influence (Quality)",
           TRUE ~ "Group-Level Variables"
         )) 

Next, because we’re dealing with a really wide table we’ll have coefficients with standard errors underneath. This means that each variable name is actually going to take up two lines. So we need to do a little extra here to properly format this bit. I have to admit I think Vincent actually came up with this particular solution as I was puttering with it for a while. So more props to him!

# Find how long the coefficient list is for the final table hline
last.line <- length(coef.list[[1]]) * 2

coef_map <- setNames(coef.list$clean, coef.list$raw)
idx <- rle(coef.list$group)
idx <- setNames(idx$lengths  * 2, idx$values)

Putting it all together

Finally, we’ve done all of the prep work. Now we can generate the actual table itself! I’m going to change just a couple of things here. Since the output is going to appear on a webpage, I’ll change the “output” argument to HTML. I’ll also delete the save_kable() bit that tells it to save to a latex file, but you can add that in at the end if you want output.

You can see that we use the last.line values from the previous chunk in that last row_spec() line to tell it where to put an horizontal line.

modelsummary::modelsummary(mod.list,
                  estimate = "{estimate}",
                  statistic = "conf.int",
                  fmt = 2,
                  group = term ~ model + y.level,
                  gof_map = gof.check,
                  coef_map = coef_map,
                  add_rows = rows.com,
                  stars = FALSE,
                  output = "kableExtra",
                  caption = "Bayesian multilevel multinomial logistic regressions. Population level effects. \\label{tab:contactfull}") |> 
  kable_styling(bootstrap_options = c("striped"), font_size = 10, position = "center", full_width = FALSE) |>
  add_header_above(c(" ", "US Presence" = 3, "US People" = 3, "US Government" = 3)) |> 
  group_rows(index = idx, bold = TRUE, background = "gray", color = "white", hline_after = TRUE)  |>  
  row_spec(last.line, hline_after = TRUE) |> 
  column_spec(1, width = "5cm") |>
  column_spec(2:10, width = "3cm") |> 
  kableExtra::scroll_box(width = "100%", height = "600px") 

What’s left?

There are a few things I’d like to tweak. For example, {modelsummary} adds little slashes before the outcome variable name and inserts the name from the model list before that. I leave the list entries blank to avoid this since horizontal space is a premium.

Blogdown/hugo is also being a bit frustrating with the final table. I’ve tried making the font larger, but for whatever reason the horizontal scrollbar isn’t working with the kableExtra table in this environment. Not great, but it looks ok for now. Sizing and scale were a littl tough to tweak in Tex files, but I got it more or less where it needed to be by the end. I’m sure someone with a little more skill than me can figure out some of the hiccups.

There’s way more you can do to cusomtize the output of these tables, particularly in the footer. You might want to add particular model fit or summary statistics, and the glance.custom() function is a great place to specify what you want. Much is possible, but I’m going to stop here for now since it took me something like a month to get this posted.