Home > Mobile >  How to drop duplicated column names in nested tibbles before unnesting
How to drop duplicated column names in nested tibbles before unnesting

Time:09-25

I have a tibble with a column containing (nested) tibbles. The nested tibbles have duplicated data (same names, same values):

df <- tibble(id = 1:2, data = list(tibble(id = 1, var1 = "a", var2 = "b"), tibble(id = 2, var1 = "c", var2 = "d")))

df

# # A tibble: 2 x 2
#      id data            
#   <int> <list>          
# 1     1 <tibble [1 x 3]>
# 2     2 <tibble [1 x 3]>  

Calling df %>% unnest(data) results in

Error: Names must be unique..

I would like to write a function that drops these columns beforehand but don't know how. My goal is to be run to use something along the lines of:

df %>% 
  drop_duplicated_cols(data) %>% 
  unnest(data)

Which would result in:

#> # A tibble: 2 x 4
#>      id var1  var2 
#>   <int> <chr> <chr>
#> 1     1 a     b    
#> 2     2 c     d    

CodePudding user response:

You can just use unnest_longer

library(tidyverse)
df %>%
  unnest_longer(data)

which gives:

# A tibble: 2 x 2
     id data$id $var1 $var2
  <int>   <dbl> <chr> <chr>
1     1       1 a     b    
2     2       2 c     d 

If you want to truly convert the tibble column, you can also use:

df %>%
  left_join(reduce(df$data, bind_rows), by = "id")

which gives:

# A tibble: 2 x 4
     id data             var1  var2 
  <dbl> <list>           <chr> <chr>
1     1 <tibble [1 x 3]> a     b    
2     2 <tibble [1 x 3]> c     d   

And from there you can e.g. unselect the data column.

CodePudding user response:

There is a names_repair argument in unnest. By default, it is "check_unique" and this causes the error. If we change it to another option i.e. "unique", it will make the columns that are duplicated with .. followed by some digits, then we use select to get the columns not duplicated

library(dplyr)
library(tidyr)
library(stringr)
df %>%
    unnest(data, names_repair = "unique") %>%
    select(names(.)[!duplicated(str_remove(names(.), "\\. .*"))]) %>%
    rename_with(~ str_remove(., "\\. .*"), contains(".."))

-output

# A tibble: 2 x 3
     id var1  var2 
  <int> <chr> <chr>
1     1 a     b    
2     2 c     d   

Or another option to avoid the warnings is to remove the duplicate columns within the list by looping with map

library(purrr)
out <- df %>% 
   mutate(data = map(data, ~ .x %>% 
            select(-any_of(names(df))))) %>%
   unnest(data)
out
# A tibble: 2 x 3
     id var1  var2 
  <int> <chr> <chr>
1     1 a     b    
2     2 c     d    
> str(out)
tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
 $ id  : int [1:2] 1 2
 $ var1: chr [1:2] "a" "c"
 $ var2: chr [1:2] "b" "d"

NOTE: Both solutions give the expected output as in the OP's post along with the structure

  • Related