Home > Software design >  How can I split sentence into new variables in R (with zero-one encoding)?
How can I split sentence into new variables in R (with zero-one encoding)?

Time:11-04

I have a data like below:

V1  V2
1   orange, apple
2   orange, lemon
3   lemon, apple
4   orange, lemon, apple
5   lemon
6   apple
7   orange
8   lemon, apple

I want to split the V2 variable like this:

  • I have three categories of the V2 column: "orange", "lemon", "apple"
  • for each of the categories I want to create a new column (variable) that will inform about whether such a name appeared in V2 (0,1)

I tried this

df %>% separate(V2, into = c("orange", "lemon", "apple"))

.. and I got this result, but it's not what I expect.

  V1 orange lemon apple
1  1   orange   apple    <NA>
2  2   orange   lemon    <NA>
3  3    lemon   apple    <NA>
4  4   orange   lemon   apple
5  5    lemon    <NA>    <NA>
6  6    apple    <NA>    <NA>
7  7   orange    <NA>    <NA>
8  8    lemon   apple    <NA>

The result I mean is below.

V1  orange  lemon   apple
1   1   0   1
2   1   1   0
3   0   1   1
4   1   1   0
5   0   1   0
6   0   0   1
7   1   0   0
8   0   1   1

CodePudding user response:

you could try pivoting:

library(dplyr)
library(tidyr)
df |> 
  separate_rows(V2, sep = ", ") |> 
  mutate(ind = 1) |> 
  pivot_wider(names_from = V2,
              values_from = ind,
              values_fill = 0)

Output is:

# A tibble: 8 × 4
     V1 orange apple lemon
  <int>  <dbl> <dbl> <dbl>
1     1      1     1     0
2     2      1     0     1
3     3      0     1     1
4     4      1     1     1
5     5      0     0     1
6     6      0     1     0
7     7      1     0     0
8     8      0     1     1

data I used:

V1 <- 1:8
V2 <- c("orange, apple", "orange, lemon", 
        "lemon, apple", "orange, lemon, apple",
        "lemon", "apple", "orange", 
        "lemon, apple")
df <- tibble(V1, V2) 

CodePudding user response:

We may use dummy_cols

library(stringr)
library(fastDummies)
library(dplyr)
dummy_cols(df, "V2", split = ",\\s ", remove_selected_columns = TRUE) %>% 
  rename_with(~ str_remove(.x, '.*_'))

-output

# A tibble: 8 × 4
     V1 apple lemon orange
  <int> <int> <int>  <int>
1     1     1     0      1
2     2     0     1      1
3     3     1     1      0
4     4     1     1      1
5     5     0     1      0
6     6     1     0      0
7     7     0     0      1
8     8     1     1      0
  • Related