Home > Software engineering >  how to separate a col into mutiple cols without knowing the final col number ahead of time?
how to separate a col into mutiple cols without knowing the final col number ahead of time?

Time:04-13

If I have a df and I want to separate player by sep="_". Is it a smart way that I can do this without manually check how many col I will need? For my example, it is easy to tell I need to separate player into 3 new cols. What is I have 100 rows. Is it a way to tell how many cols I need to separate player into? or is it a way to split player without knowing this info?

df <- data.frame(player=c('John_Wall', 'Dirk_Nowitzki', 'Steve_Nash_try'),
                 points=c(22, 29, 18),
                 assists=c(8, 4, 15))
## current method:
df %>% separate(player, c('v1', 'v2', 'v3'), sep ="_")

CodePudding user response:

We may use str_count to find the count of _, get the max count, convert to sequence, paste (str_c) with 'v' to create the column names. Add remove = FALSE, if the original column needs to kept

library(tidyr)
library(dplyr)
library(stringr)
df %>% 
   separate(player, str_c('v', 
      seq_len(max(str_count(.$player, '_')   1))), sep = "_", fill = "right")

-output

     v1       v2   v3 points assists
1  John     Wall <NA>     22       8
2  Dirk Nowitzki <NA>     29       4
3 Steve     Nash  try     18      15

Or instead of using separate, try with cSplit, which does automatically detect and create those columns

library(splitstackshape)
cSplit(df, 'player', sep = '_')
   points assists player_1 player_2 player_3
    <num>   <num>   <char>   <char>   <char>
1:     22       8     John     Wall     <NA>
2:     29       4     Dirk Nowitzki     <NA>
3:     18      15    Steve     Nash      try

If we want to use str_extract_all, it returns a list column, which can be converted to columns with unnest_wider

library(tidyr)
df %>% 
  mutate(v = str_extract_all(player, "[^_] ")) %>% 
  unnest_wider(v, names_sep = "_")
# A tibble: 3 × 6
  player         points assists v_1   v_2      v_3  
  <chr>           <dbl>   <dbl> <chr> <chr>    <chr>
1 John_Wall          22       8 John  Wall     <NA> 
2 Dirk_Nowitzki      29       4 Dirk  Nowitzki <NA> 
3 Steve_Nash_try     18      15 Steve Nash     try  

Or using base R

cbind(df,  read.table(text = df$player, sep = "_", fill = TRUE, 
    header = FALSE, na.strings = ""))
          player points assists    V1       V2   V3
1      John_Wall     22       8  John     Wall <NA>
2  Dirk_Nowitzki     29       4  Dirk Nowitzki <NA>
3 Steve_Nash_try     18      15 Steve     Nash  try

CodePudding user response:

Without packages you could adapt the `length<-` by maximum lengths.

strsplit(df$player, '_') |>
  {\(.) t(sapply(., `length<-`, max(lengths(.))))} () |>
  cbind(df[setdiff(names(df), 'player')])
#       1        2    3 points assists
# 1  John     Wall <NA>     22       8
# 2  Dirk Nowitzki <NA>     29       4
# 3 Steve     Nash  try     18      15
  •  Tags:  
  • r
  • Related