I have a table containing two rows for each ID.
table <- tibble(
id = c(1,1,2,2,3,3,4,4,5,5),
row1 = c(2,5,2,5,1,3,2,5,3,2),
row2 = c("foo", "other foo", "bar", "bar", "bar", "bar other", "other", "foo", "other", "other")
)
> table
# A tibble: 10 × 3
id row1 row2
<dbl> <dbl> <chr>
1 1 2 foo
2 1 5 other foo
3 2 2 bar
4 2 5 bar
5 3 1 bar
6 3 3 bar other
7 4 2 other
8 4 4 foo
9 5 3 other
10 5 2 other
I would like to resolve the table to a single row for each ID based on three rules in succession:
- If, for each ID, there is one row in which row1 is 5 or more, then choose the row where row1 is less than 5.
- Else if, for each ID, there is a row in which row2 contains the word 'other', choose the row where row2 does not contain the word 'other'
- Otherwise, for each ID, pick the first row.
I feel there must be a more straightforward way of doing this. This is my attempt so far, but I've can't work out how to resolve the NA to return 'bar'.
table %>%
group_by(id) %>%
summarise(
row1 = ifelse(max(row1) >= 5,
first(row1[row1 < 5]),
ifelse(
grep("other", row2),
ifelse(
!is.na(first(row1[grep("other", row2, invert = T)])),
first(row1[grep("other", row2, invert = T)]),
first(row1)),
first(row1))
),
row2 = ifelse(
max(row1) >= 5,
first(row2[row1 < 5]),
ifelse(
grep("other", row2),
ifelse(
!is.na(first(row2[grep("other", row2, invert = T)])),
first(row2[grep("other", row2, invert = T)]),
first(row2)),
first(row2)
)
)
)
# A tibble: 5 × 3
id row1 row2
<dbl> <dbl> <chr>
1 1 2 foo
2 2 2 NA
3 3 1 bar
4 4 2 foo
5 5 3 other
Desired output:
id | row1 | row2 |
---|---|---|
1 | 2 | foo |
2 | 2 | bar |
3 | 1 | bar |
4 | 2 | other |
5 | 3 | other |
Many thanks for your help.
CodePudding user response:
Here is how we can do it:
library(dplyr)
library(tidyr)
library(stringr)
table %>%
group_by(id) %>%
separate_rows(row2) %>%
mutate(x = ifelse(row1>=5, min(row1),NA),
y = ifelse(str_detect(row2, 'other'), !str_detect(row2, 'other'), NA)) %>%
slice(1) %>%
select(-c(x, y))
id row1 row2
<dbl> <dbl> <chr>
1 1 2 foo
2 2 2 bar
3 3 1 bar
4 4 2 other
5 5 3 other
CodePudding user response:
A dplyr
solution:
table %>%
group_by(id) %>%
filter(row1 < 5 | n_distinct(row1 < 5) == 1) %>%
filter(!grepl("other", row2) | n_distinct(grepl("other", row2)) == 1) %>%
slice(1) %>% ungroup()
# # A tibble: 5 × 3
# id row1 row2
# <dbl> <dbl> <chr>
# 1 1 2 foo
# 2 2 2 bar
# 3 3 1 bar
# 4 4 2 other
# 5 5 3 other
n_distinct(...) == 1
is used to determine if a condition is all TRUE
or all FALSE
.
CodePudding user response:
Here is one solution, that leverages this small function, f()
using tidyverse
or data.table
f <- function(r1,r2) {
if(sum(r1>=5)==1) return(list("row1" =r1[r1<5], "row2"=r2[r1<5]))
if(sum(grepl("other",r2))==1) return(list("row1" = r1[!grepl("other",r2)], "row2"=r2[!grepl("other",r2)]))
list("row1"=r1[1],"row2"=r2[1])
}
Usage
library(tidyverse)
table %>%
group_by(id) %>%
summarize(n=list(f(row1,row2))) %>%
unnest_wider(n)
or
library(data.table)
setDT(table)[, f(row1,row2), by=id]
Output:
id row1 row2
<num> <num> <char>
1: 1 2 foo
2: 2 2 bar
3: 3 1 bar
4: 4 2 other
5: 5 3 other
CodePudding user response:
table %>%
group_by(id) %>%
subset(
case_when(
any(row1 >= 5) ~ row1 < 5,
any(grepl("other", row2)) ~ !grepl("other", row2),
T ~ T
)
) %>%
filter(row_number() == 1) %>%
ungroup()
This answer takes advantage of dplyr's grouping abilities to check for any() within each group, so it gets easy to know if a certain condition happens within a group.
It also uses case_when() to check for a series of conditions in a prioritized order, implementing what would be a series of if/else's.
Finally, since in whatever case we would like only the first row that matches the criteria, it uses the function row_number() to check whether we're on the first row within the group, in order to select it.
Output is:
# A tibble: 5 x 3
id row1 row2
<dbl> <dbl> <chr>
1 1 2 foo
2 2 2 bar
3 3 1 bar other
4 4 2 other
5 5 3 other
>