R - Filtering rows - Keep data between two repeated values-CodePudding

I would like to keep only rows before and after a specific values of a column in my data frame. My data frame below has a structure, where you can see that I have some sort of blocks if I can say so. For example, the data I am interested with always starts at group and ends at section, anything else outside I would like to remove.

# Sample Data
df = data.frame(
  Name = c ("x1","NA","group","Jean","Philippe","Celine","Dion","section","NA",
            "y2","z1","NA","group","Rosa","Albert","Stromae","section","NA","abc","something"), 
  value = as.character(seq(1:20))
)

df
        Name value
1         x1     1
2         NA     2
3      group     3
4       Jean     4
5   Philippe     5
6     Celine     6
7       Dion     7
8    section     8
9         NA     9
10        y2    10
11        z1    11
12        NA    12
13     group    13
14      Rosa    14
15    Albert    15
16   Stromae    16
17   section    17
18        NA    18
19       abc    19
20 something    20

Since the block group:section does not always have the same information, I don't know how can I tell R to keep rows between group andsection even if they are repeated. I only came up with this, which just keeps the rows the first time R sees group andsection.

df[which(df$Name=="group")[1]:which(df$Name=="section")[1],]
      Name value
3    group     3
4     Jean     4
5 Philippe     5
6   Celine     6
7     Dion     7
8  section     8

The desired output would be


4       Jean     4
5   Philippe     5
6     Celine     6
7       Dion     7
14      Rosa    14
15    Albert    15
16   Stromae    16

Thank you guys in advance for your help.

CodePudding user response：

There are multiple instances of those cases. So, we may need to loop to get the : as : is not vectorized

i1 <- which(df$Name %in% 'group')
i2 <- which(df$Name %in% 'section')
df[unlist(Map(`:`, i1 1, i2-1)),]

-output

       Name value
4      Jean     4
5  Philippe     5
6    Celine     6
7      Dion     7
14     Rosa    14
15   Albert    15
16  Stromae    16

CodePudding user response：

You can use cumsum like so:

df %>% 
  mutate(cum = lag(cumsum(Name == "group"))   cumsum(Name == "section")) %>% 
  filter(cum %% 2 == 1)

Or, using sequence in base R:

start <- which(df$Name == "group")
end <- which(df$Name == "section")

df[sequence(end-start-1, start 1),]

output

# A tibble: 7 x 3
  Name     value   cum
  <chr>    <chr> <int>
1 Jean     4         1
2 Philippe 5         1
3 Celine   6         1
4 Dion     7         1
5 Rosa     14        3
6 Albert   15        3
7 Stromae  16        3

CodePudding user response：

Another possible solution:

library(tidyverse)

df %>% 
  mutate(aux1 = if_else(Name == "group", 1, NA_real_),
         aux2 = if_else(Name == "section", 2, NA_real_),
         aux = coalesce(aux1, aux2)) %>% 
  fill(aux) %>% 
  filter(aux == 1 & Name != "group") %>% 
  select(Name, value)

#>       Name value
#> 1     Jean     4
#> 2 Philippe     5
#> 3   Celine     6
#> 4     Dion     7
#> 5     Rosa    14
#> 6   Albert    15
#> 7  Stromae    16

CodePudding user response：

The following use a sort of state machine where the last state is found in the .x term within the accumulate cycle and the input is in .y:

library(dplyr)
library(purrr)

df |>
  mutate(state = accumulate(tail(Name, -1), ~{
    if (.y == "section")
      "end"
    else
      if (.y == "group")
        "start"
    else
      if (.x == "start")
        "within"
    else
      if (.x == "end")
        "outside"
    else
      .x
  }, .init = "outside"))

##>        Name value   state
##>1         x1     1 outside
##>2         NA     2 outside
##>3      group     3   start
##>4       Jean     4  within
##>5   Philippe     5  within
##>6     Celine     6  within
##>7       Dion     7  within
##>8    section     8     end
##>9         NA     9 outside
##>10        y2    10 outside
##>11        z1    11 outside
##>12        NA    12 outside
##>13     group    13   start
##>14      Rosa    14  within
##>15    Albert    15  within
##>16   Stromae    16  within
##>17   section    17     end
##>18        NA    18 outside
##>19       abc    19 outside
##>20 something    20 outside

Then you can filter by state == "within" to obtain:

#>       Name value 
#> 1     Jean     4 
#> 2 Philippe     5 
#> 3   Celine     6 
#> 4     Dion     7 
#> 5     Rosa    14 
#> 6   Albert    15 
#> 7  Stromae    16