Filter out sequences of records in R/Python-CodePudding

I have a table of sequential records of different individuals that look something like this:

(Setting this up in R, although Python solutions are also OK.)

set.seed(22)

df <- data.frame(
  score = c(1, 4, 15, 12, 6,  7, 12, 7, 8, 5,  12, 19, 1, 4, 7, 12, 3, 5, 9, 4),
  info = sample(1:100, 20)
)

score	info
1	9
4	88
15	74
12	94
6	44
7	59
12	81
7	67
8	48
5	16
12	58
19	72
1	62
4	31
7	65
12	49
3	21
5	68
9	33
4	32

The goal is to filter out certain rows of the table based on the score sequence. Here are the rules in human logic. Take the scores as a list:

1, 4, 15, 12, 6, 7, 12, 7, 8, 5 12, 19, 1, 4, 7 12, 3, 5, 9, 4

Step 1: Divide the list into sub-lists by instances of score == 12.

(The last sub-list is included regardless if it ends with the score == 12. And if the list do not contain any instances of score == 12, there would be one sub-list, i.e.: the entire list.)

[1, 4, 15, 12], [6, 7, 12], [7, 8, 5, 12], [19, 1, 4, 7, 12], [3, 5, 9, 4]

Step 2: Filter out all sub-lists that do not contain one or more instances of scores == 4.

[1, 4, 15, 12], [19, 1, 4, 7, 12], [3, 5, 9, 4]

Step 3: Then filter the data frame to include the rows corresponding to these scores only.

score	info
1	9
4	88
15	74
12	94
19	72
1	62
4	31
7	65
12	49
3	21
5	68
9	33
4	32

Note that the scores are neither unique nor ordinal.

What would be the best way to implement this routine in either R or Python, hopefully without using a loop?

CodePudding user response：

Here is a base R method.

First create a vec vector containing the index position where df$score == 12. Also include 0 and now(df) to it, which will be used later in lapply
Create sublists mylist that ends with score == 12. deframe is used to create a named vector from a two-column dataframe using lapply
Use the %in% operator to see if 4 is contained in the sublist, remove ones that don't contain 4
And use enframe to create a two-column dataframe from a named vector.

vec <- sort(c(0, which(df$score == 12), nrow(df)))
mylist <- lapply(1:(length(vec) - 1), function(x) deframe(df[2:1])[(vec[x] 1):vec[x 1]])
enframe(mylist[sapply(mylist, function(x) 4 %in% x)] %>% unlist(), value = "score", name = "info")[2:1]

# A tibble: 13 × 2
   score info 
   <int> <chr>
 1     1 9    
 2     4 88   
 3    15 74   
 4    12 94   
 5    19 72   
 6     1 62   
 7     4 31   
 8     7 65   
 9    12 49   
10     3 21   
11     5 68   
12     9 33   
13     4 32

CodePudding user response：

Using base R, we can use cumsum on a logical vector (score == 12) to create a grouping column and then filter if there is a 4 %in% score to keep only those groups and finally remove the grp column created - dplyr is used for the piping and the lag

library(dplyr)
df %>%
   group_by(grp = lag(cumsum(score == 12), default = 0)) %>% 
   filter(4 %in% score) %>%
   ungroup %>%
   select(-grp)

-output

# A tibble: 13 × 2
   score  info
   <dbl> <int>
 1     1     9
 2     4    88
 3    15    74
 4    12    94
 5    19    72
 6     1    62
 7     4    31
 8     7    65
 9    12    49
10     3    21
11     5    68
12     9    33
13     4    32

score	info
1	9
4	88
15	74
12	94
6	44
7	59
12	81
7	67
8	48
5	16
12	58
19	72
1	62
4	31
7	65
12	49
3	21
5	68
9	33
4	32

score	info
1	9
4	88
15	74
12	94
6	44
7	59
12	81
7	67
8	48
5	16
12	58
19	72
1	62
4	31
7	65
12	49
3	21
5	68
9	33
4	32

score	info
1	9
4	88
15	74
12	94
6	44
7	59
12	81
7	67
8	48
5	16
12	58
19	72
1	62
4	31
7	65
12	49
3	21
5	68
9	33
4	32