I have a list of ~700 character strings in a vector, something like this:
list <- c("orange", "Orange", "Orange juice", "orange juice", "lemon drink", "Lemon", "lemonade", "apple pie", "Apple", "Apples", "Grapefruit", "grape fruit", "Grapefruit tea")
And I'd like to create a dataframe with each different variant in a separate column, like this:
dataframe <- data.frame(
  column1 = c("orange", "lemon drink", "apple pie", "grapefruit"),
  column2 = c("Orange", "Lemon", "Apple", "grape fruit"),
  column3 = c("Orange juice", "lemonade", "Apples", "Grapefruit tea"),
  column4 = c("orange juice", NA, NA, NA)
)
Is there a way of doing this using R, reducing manual work as much as possible? I'm a beginner with Regular Expressions, but I suspect this might be the way to go? If anyone can give me any pointers, that would be greatly appreciated. Thank you!
CodePudding user response:
Use a phonetic encoding (soundex) to group the words by their similarity, split the vector of words into a list, and convert the list back to a data.frame by appending NAs at the end (if the lengths are different).
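library(phonics)
library(stringi)
# Split the vector into groups that share the same soundex code (at most 3 characters)
lst1 <- split(list, soundex(list, maxCodeLen = 3, clean = FALSE))
# Pad the shorter groups with NA and bind the groups as rows of a data.frame
df1 <- as.data.frame(stri_list2matrix(lst1, byrow = TRUE))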
-output
> df1
           V1          V2             V3           V4
1   apple pie       Apple         Apples         <NA>
2  Grapefruit grape fruit Grapefruit tea         <NA>
3 lemon drink       Lemon       lemonade         <NA>
4      orange      Orange   Orange juice orange juice
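If you would rather not depend on stringi for the padding step, the same "append NAs to the shorter groups" idea can be sketched in base R. This is only an illustration: pad_groups is a hypothetical helper name, not part of any package, and the groups list below is toy data standing in for the result of split().

```r
# Hypothetical helper (not from phonics or stringi): pad every group with NA
# up to the length of the longest group, then bind the groups as rows.
pad_groups <- function(groups) {
  n <- max(lengths(groups))
  padded <- lapply(groups, function(x) c(x, rep(NA_character_, n - length(x))))
  as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
}

# Toy groups standing in for the list returned by split()
groups <- list(
  a = c("Apple", "Apples"),
  o = c("orange", "Orange", "Orange juice")
)
df2 <- pad_groups(groups)
```

With the answer's objects, pad_groups(lst1) should give the same layout as df1, except that the row names are the grouping codes rather than 1 to 4.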