Home > Software engineering >  Filter out sequences of records in R/Python
Filter out sequences of records in R/Python

Time:04-13

I have a table of sequential records of different individuals that look something like this:

(Setting this up in R, although Python solutions are also OK.)

set.seed(22)

df <- data.frame(
  score = c(1, 4, 15, 12, 6,  7, 12, 7, 8, 5,  12, 19, 1, 4, 7, 12, 3, 5, 9, 4),
  info = sample(1:100, 20)
)
score info
1 9
4 88
15 74
12 94
6 44
7 59
12 81
7 67
8 48
5 16
12 58
19 72
1 62
4 31
7 65
12 49
3 21
5 68
9 33
4 32

The goal is to filter out certain rows of the table based on the score sequence. Here are the rules in human logic. Take the scores as a list:

1, 4, 15, 12, 6, 7, 12, 7, 8, 5 12, 19, 1, 4, 7 12, 3, 5, 9, 4

Step 1: Divide the list into sub-lists by instances of score == 12.

(The last sub-list is included regardless if it ends with the score == 12. And if the list do not contain any instances of score == 12, there would be one sub-list, i.e.: the entire list.)

[1, 4, 15, 12], [6, 7, 12], [7, 8, 5, 12], [19, 1, 4, 7, 12], [3, 5, 9, 4]

Step 2: Filter out all sub-lists that do not contain one or more instances of scores == 4.

[1, 4, 15, 12], [19, 1, 4, 7, 12], [3, 5, 9, 4]

Step 3: Then filter the data frame to include the rows corresponding to these scores only.

score info
1 9
4 88
15 74
12 94
19 72
1 62
4 31
7 65
12 49
3 21
5 68
9 33
4 32

Note that the scores are neither unique nor ordinal.

What would be the best way to implement this routine in either R or Python, hopefully without using a loop?

CodePudding user response:

Here is a base R method.

  1. First create a vec vector containing the index position where df$score == 12. Also include 0 and now(df) to it, which will be used later in lapply
  2. Create sublists mylist that ends with score == 12. deframe is used to create a named vector from a two-column dataframe using lapply
  3. Use the %in% operator to see if 4 is contained in the sublist, remove ones that don't contain 4
  4. And use enframe to create a two-column dataframe from a named vector.
vec <- sort(c(0, which(df$score == 12), nrow(df)))
mylist <- lapply(1:(length(vec) - 1), function(x) deframe(df[2:1])[(vec[x] 1):vec[x 1]])
enframe(mylist[sapply(mylist, function(x) 4 %in% x)] %>% unlist(), value = "score", name = "info")[2:1]

# A tibble: 13 × 2
   score info 
   <int> <chr>
 1     1 9    
 2     4 88   
 3    15 74   
 4    12 94   
 5    19 72   
 6     1 62   
 7     4 31   
 8     7 65   
 9    12 49   
10     3 21   
11     5 68   
12     9 33   
13     4 32   

CodePudding user response:

Using base R, we can use cumsum on a logical vector (score == 12) to create a grouping column and then filter if there is a 4 %in% score to keep only those groups and finally remove the grp column created - dplyr is used for the piping and the lag

library(dplyr)
df %>%
   group_by(grp = lag(cumsum(score == 12), default = 0)) %>% 
   filter(4 %in% score) %>%
   ungroup %>%
   select(-grp)

-output

# A tibble: 13 × 2
   score  info
   <dbl> <int>
 1     1     9
 2     4    88
 3    15    74
 4    12    94
 5    19    72
 6     1    62
 7     4    31
 8     7    65
 9    12    49
10     3    21
11     5    68
12     9    33
13     4    32
  • Related