tibble::add_row() or similar with automatic type promotion

Time:08-17

Edit: Simply use rbind from base!


I have a list of tibbles with the same column names and order, but possibly incompatible column types. I would like to vertically concatenate the tables into one, à la tibble::add_row(), automatically converting types to the greatest common denominator where necessary (in the same way that, e.g., c(1, 2, "a") returns c("1", "2", "a")). I don’t know the types of the columns in advance.

For example,

> X = tibble(a = 1:3, b = c("a", "b", "c"))
# A tibble: 3 × 2
      a b    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    

> Y = tibble(a = "Any", b = 1)
# A tibble: 1 × 2
  a         b
  <chr> <dbl>
1 Any       1

Desired output:

# A tibble: 4 × 2
  a     b    
  <chr> <chr>
1 1     a    
2 2     b    
3 3     c    
4 Any   1    

Is there a way to do this generically? I’m trying to write code for a package that is agnostic about data frames and tibbles (i.e., it doesn’t convert into one or the other).

Ideally, type promotion should reflect the behaviour of c(...) (NULL < raw < logical < integer < double < complex < character < list < expression) — except for factors, where I’d like to preserve the factor label (whatever its type), not the underlying index.
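As a quick illustration of the hierarchy above, the base R sketch below checks what c() returns for a few mixed inputs. It also shows why factors need special handling: a factor is stored as integer codes, so the labels have to be extracted explicitly. The variable name f is only illustrative.

```r
# Each call mixes two types; c() promotes to the higher one in the hierarchy
typeof(c(as.raw(1), TRUE))   # "logical"
typeof(c(TRUE, 1L))          # "integer"
typeof(c(1L, 2.5))           # "double"
typeof(c(2.5, 1i))           # "complex"
typeof(c(1i, "a"))           # "character"
typeof(c("a", list(1)))      # "list"

# A factor stores integer codes; as.character() recovers the labels
f <- factor(c("b", "a"))
as.integer(f)    # 2 1
as.character(f)  # "b" "a"
```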

CodePudding user response:

I think rbind(X, Y) already does what you want. Here is another idea: assuming that X and Y have the same column names and order, you could use map2_dfc() from purrr to apply c() to the corresponding columns of X and Y.

purrr::map2_dfc(X, Y, c)

# # A tibble: 4 × 2
#   a     b
#   <chr> <chr>
# 1 1     a
# 2 2     b
# 3 3     c
# 4 Any   1

If X and Y do not have the same column names and order, you could intersect their names first and then proceed in the same way:

cols <- intersect(names(X), names(Y))
purrr::map2_dfc(X[cols], Y[cols], c)
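The question mentions a whole list of tables rather than just two. One way to extend the same idea, sketched here in base R only, is to call c() once per column with the pieces from every table. The helper name combine_columns is hypothetical (it is not part of purrr or tibble), it returns a plain data frame, and it assumes atomic columns only (no list columns or factors).

```r
# Hypothetical helper: combine any number of data frames column by column,
# letting c() choose the common type for each column.
combine_columns <- function(dfs) {
  # Keep only the columns present in every table, in the first table's order
  cols <- Reduce(intersect, lapply(dfs, names))
  # Turn each table into a plain named list of columns; drop any list names
  # on dfs so they are not passed on as argument names
  pieces <- unname(lapply(dfs, function(df) as.list(df[cols])))
  # Map(c, t1, t2, ...) calls c() once per column with all pieces at once
  combined <- do.call(Map, c(list(f = c), pieces))
  as.data.frame(combined, stringsAsFactors = FALSE)
}

X <- data.frame(a = 1:3, b = c("a", "b", "c"), stringsAsFactors = FALSE)
Y <- data.frame(a = "Any", b = 1, stringsAsFactors = FALSE)

out <- combine_columns(list(X, Y))
str(out)
```

Because each column is combined in a single c() call, the result does not depend on the order in which intermediate pairs would otherwise be merged.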

CodePudding user response:

Relying on the overly liberal behaviour of base R with do.call(rbind, list(X, Y)) would get you some of the way there, but it comes with some downsides, such as the result depending on the order in which you combine things (compare the output of as.character(TRUE) with that of as.character(as.integer(TRUE))).
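The order effect can be seen with plain vectors, independently of any data frame method. This is only a small sketch; the variable names pairwise and all_at_once are illustrative.

```r
# Pairwise combination: the logical is promoted to integer before it
# ever meets the string, so TRUE ends up as "1"
pairwise <- c(c(TRUE, 1L), "x")
pairwise     # "1" "1" "x"

# Single combination: every element is converted straight to character
all_at_once <- c(TRUE, 1L, "x")
all_at_once  # "TRUE" "1" "x"
```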

A better approach would probably be to look at all of your data frames to work out what final column types you need to cast to, and cast your columns to these types separately before combining the data frames. Here's a function that will do this:

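library(tidyverse)

coerce_bind_rows <- function(...) {
  
  # Cast functions, ordered from the lowest to the highest type in the
  # coercion hierarchy; the names match the values returned by typeof()
  casts <- list(
    raw = as.raw,
    logical = as.logical,
    integer = as.integer,
    double = as.double,
    complex = as.complex,
    character = as.character,
    list = as.list
  )
  
  dfs <- list(...)
  
  # Convert classed columns (e.g. factors, dates) to character with format(),
  # so that factors keep their labels rather than their integer codes.
  # Note that format() pads strings to a common width.
  dfs_fmt_objs <- map(dfs, mutate, across(where(is.object), format))
  
  # For each column, collect its type in every data frame and keep the
  # highest one in the hierarchy as the target type
  targets <-
    dfs_fmt_objs |>
    map(partial(map_chr, ... = , typeof)) |>
    pmap(c) |>
    map(factor, levels = names(casts), ordered = TRUE) |>
    map(compose(as.character, max))
  
  # Cast every column of every data frame to its target type
  dfs_casted <-
    dfs_fmt_objs |>
    map(function(.data, .types = targets) {
      for (.col in names(.types)) {
        .fn <- casts[[.types[[.col]]]]
        .data[[.col]] <- .fn(.data[[.col]])
      }
      .data
    })
  
  # The column types now agree, so the rows can be bound safely
  bind_rows(dfs_casted)
  
}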

[Edited to format classed objects to handle factors as specified in update to the question]
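One caveat worth checking when format() is used on factors: it pads the labels to a common width, whereas as.character() returns them unchanged. The base R sketch below shows the difference on an illustrative factor f; trimws() is one possible workaround, on the assumption that the labels themselves carry no meaningful leading or trailing whitespace.

```r
f <- factor(c("a", "bbb"))

format(f)          # "a  " "bbb"  (padded to a common width)
as.character(f)    # "a"   "bbb"  (labels unchanged)

# trimws() strips the padding again if format() is preferred
trimws(format(f))  # "a" "bbb"
```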

Testing on your examples above:


X <- tibble(a = 1:3, b = c("a", "b", "c"))
Y <- tibble(a = "Any", b = 1)

coerce_bind_rows(X, Y)
#> # A tibble: 4 x 2
#>   a     b    
#>   <chr> <chr>
#> 1 1     a    
#> 2 2     b    
#> 3 3     c    
#> 4 Any   1

Testing on some data frames with a broader range of types:

W <- tibble(a = FALSE, b = raw(1L))
Z <- tibble(a = list(4), b = "d")

coerce_bind_rows(W, X, Y, Z)
#> # A tibble: 6 x 2
#>   a         b    
#>   <list>    <chr>
#> 1 <lgl [1]> 00   
#> 2 <int [1]> a    
#> 3 <int [1]> b    
#> 4 <int [1]> c    
#> 5 <chr [1]> 1    
#> 6 <dbl [1]> d

By the way, data frame columns have to be vectors (either atomic vectors or lists), so you can't have a data frame whose columns are NULLs or expressions. This approach should therefore work for every vector type from raw to list.
