I have a column in a data frame with multiple words in each cell separated by ";" (second column).
my_dataframe <- data.frame( first_column = c("x", "y", "x", "x", "y"),
second_column = c("important; very important; not important",
"not important; important; very important",
"very important; important",
"important; not important",
"not important"))
> my_dataframe
first_column second_column
1 x important; very important; not important
2 y not important; important; very important
3 x very important; important
4 x important; not important
5 y not important
I want to keep one word per cell: the most important one.
So I made a list of the words by order of priority:
reference_importance <- list("very important", "important", "not important")
What I would like to get as a second column:
second_column
1 very important
2 very important
3 very important
4 important
5 not important
I tried
for (i in 1:dim(my_dataframe)[1]) {
for (j in 1:length(reference_importance)) {
if (j %in% my_dataframe$second_column){
my_dataframe$second_column[i] <- paste(j)
break}
}
}
Then I thought the problem was that it didn't consider the different words separated by ";" so I tried this:
for (i in 1:dim(my_dataframe)[1]) {
value_as_list <- strsplit(my_dataframe$second_column[i], ";")
print(value_as_list)
for (j in reference_importance) {
if (j %in% value_as_list){
my_dataframe$second_column[i] == j
break}
}
}
But these don't change anything in my column...
(I made this example to simplify, but in reality I have a huge table with many more words and possibilities. That's why I try to do it with a loop and I don't just assign the possible answers manually.)
CodePudding user response:
Using strsplit
and match
basically.
my_dataframe <- transform(my_dataframe, z=strsplit(second_column, '; ') |>
lapply(match, reference_importance) |>
sapply(min) |>
{\(x) unlist(reference_importance)[x]}())
my_dataframe
# first_column second_column z
# 1 x important; very important; not important very important
# 2 y not important; important; very important very important
# 3 x very important; important very important
# 4 x important; not important important
# 5 y not important not important
Note: R >= 4.1 used.
If you need a loop you may do
spl <- strsplit(my_dataframe$second_column, '; ')
my_dataframe$z <- NA_character_
for (i in seq_along(spl)) {
my_dataframe$z[i] <- reference_importance[[min(match(spl[[i]], reference_importance))]]
}
my_dataframe
# first_column second_column z
# 1 x important; very important; not important very important
# 2 y not important; important; very important very important
# 3 x very important; important very important
# 4 x important; not important important
# 5 y not important not important
Of course I used z
for demonstration purposes, actually you would use second_column
instead of z
.
CodePudding user response:
If you want to use a loop, the following worked for me:
my_dataframe <- data.frame( first_column = c("x", "y", "x", "x", "y"),
second_column = c("important; very important; not important",
"not important; important; very important",
"very important; important",
"important; not important",
"not important"))
reference_importance <- list("very important", "important", "not important")
# add new column for priority word
my_dataframe <- my_dataframe %>%
mutate(Priority_importance = NA)
# use a loop to identify highest priority substring
for (i in 1:nrow(my_dataframe)) {
value_as_list <- strsplit(my_dataframe$second_column[i], ";")
for (j in 1:length(reference_importance)) {
if (value_as_list == as.character((reference_importance[j]))) {
my_dataframe$Priority_importance[i] <- reference_importance[j] # paste importance level
break # move to next iteration
}
}
}
my_dataframe
first_column second_column Priority_importance
1 x important; very important; not important very important
2 y not important; important; very important very important
3 x very important; important very important
4 x important; not important important
5 y not important not important
CodePudding user response:
One option with dplyr
and tidyr
:
my_dataframe %>%
rowid_to_column() %>%
separate_rows(second_column, sep = "; ") %>%
group_by(rowid) %>%
slice_min(match(second_column, reference_importance))
rowid first_column second_column
<int> <chr> <chr>
1 1 x very important
2 2 y very important
3 3 x very important
4 4 x important
5 5 y not important
I used reference_importance as a character vector instead of a list:
reference_importance <- c("very important", "important", "not important")
CodePudding user response:
Another possible solution, based on tidyverse
:
library(tidyverse)
my_dataframe %>%
mutate(id = row_number()) %>%
separate_rows(second_column, sep = "\\s*;\\s*") %>%
group_by(id) %>%
slice(match(reference_importance, second_column) %>% na.omit() %>% .[1]) %>%
ungroup %>%
select(-id)
#> # A tibble: 5 × 2
#> first_column second_column
#> <chr> <chr>
#> 1 x very important
#> 2 y very important
#> 3 x very important
#> 4 x important
#> 5 y not important