Can I use regular expressions in R to turn a list into a data frame, with each column containing sim

Time:11-17

I have a list of ~700 character strings in a vector, something like this:

list <- c("orange", "Orange", "Orange juice", "orange juice", "lemon drink", "Lemon", "lemonade", "apple pie", "Apple", "Apples", "Grapefruit", "grape fruit", "Grapefruit tea")

And I'd like to create a data frame with each variant of the same item in a separate column, like this:

dataframe <- data.frame(column1 = c("orange", "lemon drink", "apple pie", "Grapefruit"),
                        column2 = c("Orange", "Lemon", "Apple", "grape fruit"),
                        column3 = c("Orange juice", "lemonade", "Apples", "Grapefruit tea"),
                        column4 = c("orange juice", NA, NA, NA)
                        )

Is there a way of doing this in R that reduces manual work as much as possible? I'm a beginner with regular expressions, but I suspect they might be the way to go. If anyone can give me any pointers, that would be greatly appreciated. Thank you!

User response:

Group the words by phonetic similarity using their Soundex codes, split the vector of words into a list by those codes, and convert the list back to a data.frame, padding the shorter groups with NAs at the end (needed when the groups have different lengths):

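library(phonics)  # soundex()
library(stringi)  # stri_list2matrix()
# Group the strings by their Soundex code (one list element per code)
lst1 <- split(list, soundex(list, maxCodeLen = 3, clean = FALSE))
# One row per group; shorter groups are padded with NA
df1 <- as.data.frame(stri_list2matrix(lst1, byrow = TRUE))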

Output:

> df1
           V1          V2             V3           V4
1   apple pie       Apple         Apples         <NA>
2  Grapefruit grape fruit Grapefruit tea         <NA>
3 lemon drink       Lemon       lemonade         <NA>
4      orange      Orange   Orange juice orange juice
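
Since the question asks about regular expressions, here is a rough base R sketch of the same split-and-pad idea without extra packages. It is not the Soundex approach above: the grouping key is a hypothetical one (lowercase the string, strip non-letters with a regex, keep the first four characters), chosen only because it happens to separate the sample data. On the full ~700 strings it may merge or split groups incorrectly, so the key would likely need tuning. The names `words`, `key`, `groups`, `padded` and `df2` are made up for this example.

```r
# Sample input from the question (named `words` here to avoid masking base::list).
words <- c("orange", "Orange", "Orange juice", "orange juice", "lemon drink",
           "Lemon", "lemonade", "apple pie", "Apple", "Apples",
           "Grapefruit", "grape fruit", "Grapefruit tea")

# Hypothetical grouping key (an assumption, not part of the answer above):
# lowercase, drop everything that is not a letter, keep the first four characters.
key <- substr(gsub("[^a-z]", "", tolower(words)), 1, 4)

# One character vector per key; split() orders the groups alphabetically by key.
groups <- split(words, key)

# Pad every group with NA up to the length of the longest group.
n_max <- max(lengths(groups))
padded <- lapply(groups, function(g) c(g, rep(NA_character_, n_max - length(g))))

# Bind the padded groups as rows; the columns are named V1, V2, ...
df2 <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(df2)
```

The row names of `df2` are the keys, which makes it easy to spot groups that were formed wrongly. The padding step does by hand what `stri_list2matrix(lst1, byrow = TRUE)` does in the answer above.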