Trouble converting a list column into factor in R

Time:11-08

I am trying to use data to predict some music scores. One of the columns is the genre, and it looks like this:

genre column

c("['rock-and-roll', 'space age pop', 'surf music']", "['dance pop', 'pop', 'post-teen pop']", 
"['pop', 'post-teen pop']", "['country', 'country dawn', 'nashville sound']", 
"['australian country', 'contemporary country', 'country', 'country road']", 
"['blues rock', 'garage rock', 'modern blues rock', 'neo-psychedelic', 'nu gaze', 'punk blues']", 
"['pop', 'post-teen pop']", "['adult standards', 'brill building pop', 'folk', 'folk rock', 'mellow gold', 'singer-songwriter', 'soft rock', 'yacht rock']", 
"['adult standards', 'brill building pop', 'bubblegum pop', 'folk rock', 'lounge', 'mellow gold', 'rock-and-roll', 'rockabilly', 'sunshine pop']", 
"['adult standards', 'brill building pop', 'canadian pop', 'easy listening', 'lounge', 'rock-and-roll']", 
"[]", "['boston rock', 'dance rock', 'new romantic', 'new wave', 'new wave pop']", 
"['classic soul']", "['classic country pop', 'country', 'nashville sound', 'outlaw country', 'singer-songwriter', 'texas country']", 
"['adult standards', 'brill building pop', 'bubblegum pop', 'doo-wop', 'rock-and-roll', 'rockabilly']", 
"['brill building pop', 'doo-wop', 'rhythm and blues']", "[]", 
"['album rock', 'art rock', 'blues rock', 'classic rock', 'hard rock', 'metal', 'psychedelic rock', 'rock', 'soft rock']", 
"['blues', 'blues rock', 'classic rock', 'electric blues', 'folk rock', 'funk', 'jazz blues', 'louisiana blues', 'new orleans blues', 'piano blues', 'psychedelic rock', 'roots rock', 'soul']", 
"['album rock', 'canadian pop', 'canadian singer-songwriter', 'classic canadian rock', 'heartland rock', 'mellow gold', 'rock', 'soft rock']", 
"['art rock', 'dance rock', 'new romantic', 'new wave', 'new wave pop', 'permanent wave', 'rock', 'synthpop']", 
"['album rock', 'blues rock', 'classic rock', 'country rock', 'hard rock', 'mellow gold', 'rock', 'soft rock', 'southern rock']", 
"['adult standards', 'brill building pop', 'easy listening', 'lounge', 'rock-and-roll', 'rockabilly']", 
"['christmas instrumental']", "['adult standards', 'brill building pop', 'bubblegum pop', 'classic country pop', 'country rock', 'folk', 'folk rock', 'mellow gold', 'soft rock']", 
"['adult standards', 'brill building pop', 'chicago soul', 'classic soul', 'motown', 'quiet storm', 'rhythm and blues', 'rock-and-roll', 'rockabilly', 'soul']")

I want to use it as a factor variable (or dummy factor variable) for prediction. How can I extract the genre names from the list and turn them into a dummy variable column?

What happens now when I convert the genre into dummy columns:

'adult standards', 'brill building pop', 'easy listening', 'mellow gold'  'dance pop', 'pop', 'post-teen pop'
                                                                       1                                    0
                                                                       0                                    1

What I want:

adult standards  brill building pop
              1                   1
              0                   0

Answer:

A tidyverse solution

This is built around tidyr::separate_rows(), which puts each genre on its own row, and tidyr::pivot_wider(), which then spreads the genres into one 0/1 column each:

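library(tidyr)
library(dplyr)
library(stringr)

gdata <- gdata %>% 
  mutate(
    id = row_number(),
    # strip brackets and quotes; rows with no genres ("[]") become NA
    genre = na_if(str_remove_all(genre, "\\[|\\]|'"), ""),
    value = 1
  ) %>% 
  # one row per (id, genre) pair
  separate_rows(genre, sep = ", ") %>% 
  # one 0/1 column per genre; absent genres are filled with 0
  pivot_wider(names_from = genre, values_from = value, values_fill = 0) %>% 
  # drop the "NA" column produced by rows with no genres, if there is one
  select(!any_of("NA"))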

gdata
# A tibble: 26 × 71
      id rock-and-r…¹ space…² surf …³ dance…⁴   pop post-…⁵ country count…⁶ nashv…⁷
   <int>        <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1     1            1       1       1       0     0       0       0       0       0
 2     2            0       0       0       1     1       1       0       0       0
 3     3            0       0       0       0     1       1       0       0       0
 4     4            0       0       0       0     0       0       1       1       1
 5     5            0       0       0       0     0       0       1       0       0
 6     6            0       0       0       0     0       0       0       0       0
 7     7            0       0       0       0     1       1       0       0       0
 8     8            0       0       0       0     0       0       0       0       0
 9     9            1       0       0       0     0       0       0       0       0
10    10            1       0       0       0     0       0       0       0       0
# … with 16 more rows, 61 more variables: `australian country` <dbl>,
#   `contemporary country` <dbl>, `country road` <dbl>, `blues rock` <dbl>,
#   `garage rock` <dbl>, `modern blues rock` <dbl>, `neo-psychedelic` <dbl>,
#   `nu gaze` <dbl>, `punk blues` <dbl>, `adult standards` <dbl>,
#   `brill building pop` <dbl>, folk <dbl>, `folk rock` <dbl>,
#   `mellow gold` <dbl>, `singer-songwriter` <dbl>, `soft rock` <dbl>,
#   `yacht rock` <dbl>, `bubblegum pop` <dbl>, lounge <dbl>, rockabilly <dbl>, …
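
To make the two reshaping steps easier to follow, here is a minimal sketch on a made-up three-row sample. The toy tibble and its values are hypothetical, not taken from the question's data, and the sketch assumes dplyr, tidyr and stringr are installed. It keeps the intermediate long table so you can inspect it before widening.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical sample in the same format as the genre column
toy <- tibble(
  genre = c("['pop', 'post-teen pop']", "[]", "['pop', 'rock']")
)

# Step 1: clean the strings and put each genre on its own row.
# The empty list in row 2 becomes a single NA row, so id 2 is not lost.
long <- toy %>%
  mutate(
    id = row_number(),
    genre = na_if(str_remove_all(genre, "\\[|\\]|'"), "")
  ) %>%
  separate_rows(genre, sep = ", ")

# Step 2: spread the genres into indicator columns, filling gaps with 0,
# then drop the placeholder "NA" column if it exists.
wide <- long %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = genre, values_from = value, values_fill = 0) %>%
  select(!any_of("NA"))

print(long)
print(wide)
```

Printing long shows five rows (two for id 1, one NA row for id 2, two for id 3), and wide has one row per id with the columns pop, post-teen pop and rock.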

A base R solution

  1. For each value, remove the brackets and quotes and split on ", ". This gives you a list containing one character vector for each row.
  2. Use unique(unlist()) to get a vector of all unique genres.
  3. Loop over the unique genres; for each, add a column to your data frame testing whether that genre appears in each row. If you prefer 0s and 1s to TRUE/FALSE, wrap the test in as.integer().

# Step 1: one character vector of genres per row
genre_list <- sapply(
  gdata$genre, 
  \(x) strsplit(gsub("\\[|\\]|'", "", x), ", ")
)

# Step 2: every genre that occurs anywhere in the column
all_genres <- unique(unlist(genre_list))

# Step 3: one logical column per genre
for (g in all_genres) {
  gdata[[g]] <- sapply(genre_list, \(x) g %in% x)
}

gdata[1:10, 2:8]
   rock-and-roll space age pop surf music dance pop   pop post-teen pop country
1           TRUE          TRUE       TRUE     FALSE FALSE         FALSE   FALSE
2          FALSE         FALSE      FALSE      TRUE  TRUE          TRUE   FALSE
3          FALSE         FALSE      FALSE     FALSE  TRUE          TRUE   FALSE
4          FALSE         FALSE      FALSE     FALSE FALSE         FALSE    TRUE
5          FALSE         FALSE      FALSE     FALSE FALSE         FALSE    TRUE
6          FALSE         FALSE      FALSE     FALSE FALSE         FALSE   FALSE
7          FALSE         FALSE      FALSE     FALSE  TRUE          TRUE   FALSE
8          FALSE         FALSE      FALSE     FALSE FALSE         FALSE   FALSE
9           TRUE         FALSE      FALSE     FALSE FALSE         FALSE   FALSE
10          TRUE         FALSE      FALSE     FALSE FALSE         FALSE   FALSE
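
If you want 0/1 columns rather than TRUE/FALSE, the as.integer() variant mentioned in step 3 could look like the sketch below. It runs on a made-up three-row data frame (toy and its values are hypothetical), relies on strsplit() being vectorised instead of looping with sapply(), and uses vapply() so that each new column is guaranteed to hold one value per row.

```r
# Hypothetical sample in the same format as the genre column
toy <- data.frame(
  genre = c("['pop', 'post-teen pop']", "[]", "['pop', 'rock']"),
  stringsAsFactors = FALSE
)

# Strip brackets and quotes, then split every string on ", ".
# An empty string splits to character(0), so "[]" rows get no genres.
genre_list <- strsplit(gsub("\\[|\\]|'", "", toy$genre), ", ", fixed = TRUE)

# All distinct genres, in order of first appearance
all_genres <- unique(unlist(genre_list))

# One integer indicator column per genre
for (g in all_genres) {
  toy[[g]] <- as.integer(
    vapply(genre_list, function(x) g %in% x, logical(1))
  )
}

print(toy[, all_genres])
```

Row 2, which had an empty genre list, ends up with 0 in every indicator column.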