I have a column to split made like A, with a different number of elements(genres) for each observations(song). I can I split the column without specifing the destination columns in R?
Column A |
---|
"['hip hop', 'pop', 'pop rap', 'r&b', 'southern hip hop', 'trap', 'trap soul']" |
"['dance pop', 'girl group', 'pop', 'post-teen pop', 'talent show', 'uk pop']" |
I'm trying to have a result like that
genre 1 | genre 2 | genre ... | genre 6 | genre 7 |
---|---|---|---|---|
Hip Hop | pop | .... | trap | Neo soul |
dance pop | girl group | .... | Uk pop | N/A |
with the number of new columns equal to the maximum number of genre a song can have( for example if the song with more genres has ten genres, I should have have ten columns) .
Another option is to create a dummy column for every genre found in the column
Hip Hop | Pop | Pop Rap | r&b | .... |
---|---|---|---|---|
1 | 1 | 1 | 1 | .... |
0 | 1 | 0 | 0 | .... |
I tried with separate in R but it gave me error
CodePudding user response:
Without really knowing your desired output, but heres an idea:
df %>%
mutate(col_a = col_a %>% str_remove_all("\\[") %>%
str_remove_all("\\]") %>%
str_split(pattern = ", ")) %>%
unnest(col_a) %>%
count(col_a) %>%
pivot_wider(names_from = col_a, values_from = n)
# A tibble: 1 × 12
`'dance pop'` 'girl group…¹ 'hip …² `'pop'` 'pop …³ 'post…⁴ `'r&b'` 'sout…⁵ 'tale…⁶ 'trap…⁷ 'trap…⁸ 'uk p…⁹
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 2 1 1 1 1 1 1 1 1
# … with abbreviated variable names ¹`'girl group'`, ²`'hip hop'`, ³`'pop rap'`, ⁴`'post-teen pop'`,
# ⁵`'southern hip hop'`, ⁶`'talent show'`, ⁷`'trap'`, ⁸`'trap soul'`, ⁹`'uk pop'`
CodePudding user response:
In base R
, we could use read.csv
after removing the [
, ]
, and quotes ('
, "
)
df2 <- read.csv(text = gsub('\\[|\\]|\'|"', "", df1$ColumnA),
header = FALSE, na.strings = "", col.names = paste0("genre", 1:7))
-output
df2
genre1 genre2 genre3 genre4 genre5 genre6 genre7
1 hip hop pop pop rap r&b southern hip hop trap trap soul
2 dance pop girl group pop post-teen pop talent show uk pop <NA>
The second dataset can be created with mtabulate
on the output above
library(qdapTools)
mtabulate(as.data.frame(t(df2)))
-output
girl group pop pop rap post-teen pop r&b southern hip hop talent show trap trap soul uk pop dance pop hip hop
V1 0 1 1 0 1 1 0 1 1 0 0 1
V2 1 1 0 1 0 0 1 0 0 1 1 0
data
df1 <- structure(list(ColumnA = c("['hip hop', 'pop', 'pop rap', 'r&b',
'southern hip hop', 'trap', 'trap soul']",
"['dance pop', 'girl group', 'pop', 'post-teen pop', 'talent show', 'uk pop']"
)), class = "data.frame", row.names = c(NA, -2L))