I have a table of sequential records of different individuals that look something like this:
(Setting this up in R, although Python solutions are also OK.)
set.seed(22)
df <- data.frame(
score = c(1, 4, 15, 12, 6, 7, 12, 7, 8, 5, 12, 19, 1, 4, 7, 12, 3, 5, 9, 4),
info = sample(1:100, 20)
)
score | info |
---|---|
1 | 9 |
4 | 88 |
15 | 74 |
12 | 94 |
6 | 44 |
7 | 59 |
12 | 81 |
7 | 67 |
8 | 48 |
5 | 16 |
12 | 58 |
19 | 72 |
1 | 62 |
4 | 31 |
7 | 65 |
12 | 49 |
3 | 21 |
5 | 68 |
9 | 33 |
4 | 32 |
The goal is to filter out certain rows of the table based on the score sequence. Here are the rules in human logic. Take the scores as a list:
1, 4, 15, 12, 6, 7, 12, 7, 8, 5 12, 19, 1, 4, 7 12, 3, 5, 9, 4
Step 1: Divide the list into sub-lists by instances of score == 12.
(The last sub-list is included regardless if it ends with the score == 12. And if the list do not contain any instances of score == 12, there would be one sub-list, i.e.: the entire list.)
[1, 4, 15, 12], [6, 7, 12], [7, 8, 5, 12], [19, 1, 4, 7, 12], [3, 5, 9, 4]
Step 2: Filter out all sub-lists that do not contain one or more instances of scores == 4.
[1, 4, 15, 12], [19, 1, 4, 7, 12], [3, 5, 9, 4]
Step 3: Then filter the data frame to include the rows corresponding to these scores only.
score | info |
---|---|
1 | 9 |
4 | 88 |
15 | 74 |
12 | 94 |
19 | 72 |
1 | 62 |
4 | 31 |
7 | 65 |
12 | 49 |
3 | 21 |
5 | 68 |
9 | 33 |
4 | 32 |
Note that the scores are neither unique nor ordinal.
What would be the best way to implement this routine in either R or Python, hopefully without using a loop?
CodePudding user response:
Here is a base R method.
- First create a
vec
vector containing the index position wheredf$score == 12
. Also include 0 andnow(df)
to it, which will be used later inlapply
- Create sublists
mylist
that ends withscore == 12
.deframe
is used to create a named vector from a two-column dataframe usinglapply
- Use the
%in%
operator to see if 4 is contained in the sublist, remove ones that don't contain 4 - And use
enframe
to create a two-column dataframe from a named vector.
vec <- sort(c(0, which(df$score == 12), nrow(df)))
mylist <- lapply(1:(length(vec) - 1), function(x) deframe(df[2:1])[(vec[x] 1):vec[x 1]])
enframe(mylist[sapply(mylist, function(x) 4 %in% x)] %>% unlist(), value = "score", name = "info")[2:1]
# A tibble: 13 × 2
score info
<int> <chr>
1 1 9
2 4 88
3 15 74
4 12 94
5 19 72
6 1 62
7 4 31
8 7 65
9 12 49
10 3 21
11 5 68
12 9 33
13 4 32
CodePudding user response:
Using base R
, we can use cumsum
on a logical vector (score == 12
) to create a grouping column and then filter
if there is a 4 %in%
score to keep only those groups and finally remove the grp
column created - dplyr
is used for the piping and the lag
library(dplyr)
df %>%
group_by(grp = lag(cumsum(score == 12), default = 0)) %>%
filter(4 %in% score) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 13 × 2
score info
<dbl> <int>
1 1 9
2 4 88
3 15 74
4 12 94
5 19 72
6 1 62
7 4 31
8 7 65
9 12 49
10 3 21
11 5 68
12 9 33
13 4 32