splitting a column in columns, different number of elements-CodePudding

I have a column to split made like A, with a different number of elements(genres) for each observations(song). I can I split the column without specifing the destination columns in R?

Column A
"['hip hop', 'pop', 'pop rap', 'r&b', 'southern hip hop', 'trap', 'trap soul']"
"['dance pop', 'girl group', 'pop', 'post-teen pop', 'talent show', 'uk pop']"

I'm trying to have a result like that

genre 1	genre 2	genre ...	genre 6	genre 7
Hip Hop	pop	....	trap	Neo soul
dance pop	girl group	....	Uk pop	N/A

with the number of new columns equal to the maximum number of genre a song can have( for example if the song with more genres has ten genres, I should have have ten columns) .

Another option is to create a dummy column for every genre found in the column

Hip Hop	Pop	Pop Rap	r&b	....
1	1	1	1	....
0	1	0	0	....

I tried with separate in R but it gave me error

CodePudding user response：

Without really knowing your desired output, but heres an idea:

df %>%  
  mutate(col_a = col_a %>% str_remove_all("\\[") %>% 
           str_remove_all("\\]") %>% 
           str_split(pattern = ", ")) %>% 
  unnest(col_a) %>% 
  count(col_a) %>% 
  pivot_wider(names_from = col_a, values_from = n)

# A tibble: 1 × 12
  `'dance pop'` 'girl group…¹ 'hip …² `'pop'` 'pop …³ 'post…⁴ `'r&b'` 'sout…⁵ 'tale…⁶ 'trap…⁷ 'trap…⁸ 'uk p…⁹
          <int>         <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
1             1             1       1       2       1       1       1       1       1       1       1       1
# … with abbreviated variable names ¹`'girl group'`, ²`'hip hop'`, ³`'pop rap'`, ⁴`'post-teen pop'`,
#   ⁵`'southern hip hop'`, ⁶`'talent show'`, ⁷`'trap'`, ⁸`'trap soul'`, ⁹`'uk pop'`

CodePudding user response：

In base R, we could use read.csv after removing the [, ], and quotes (', ")

df2 <- read.csv(text = gsub('\\[|\\]|\'|"', "", df1$ColumnA), 
 header = FALSE, na.strings = "", col.names = paste0("genre", 1:7))

-output

df2
     genre1      genre2   genre3         genre4            genre5  genre6     genre7
1   hip hop         pop  pop rap            r&b  southern hip hop    trap  trap soul
2 dance pop  girl group      pop  post-teen pop       talent show  uk pop       <NA>

The second dataset can be created with mtabulate on the output above

library(qdapTools)
mtabulate(as.data.frame(t(df2)))

-output

    girl group  pop  pop rap  post-teen pop  r&b  southern hip hop  talent show  trap  trap soul  uk pop dance pop hip hop
V1           0    1        1              0    1                 1            0     1          1       0         0       1
V2           1    1        0              1    0                 0            1     0          0       1         1       0

data

df1 <- structure(list(ColumnA = c("['hip hop', 'pop', 'pop rap', 'r&b', 
'southern hip hop', 'trap', 'trap soul']", 
"['dance pop', 'girl group', 'pop', 'post-teen pop', 'talent show', 'uk pop']"
)), class = "data.frame", row.names = c(NA, -2L))