Home > Software engineering >  R - Filtering rows - Keep data between two repeated values
R - Filtering rows - Keep data between two repeated values

Time:03-04

I would like to keep only rows before and after a specific values of a column in my data frame. My data frame below has a structure, where you can see that I have some sort of blocks if I can say so. For example, the data I am interested with always starts at group and ends at section, anything else outside I would like to remove.

# Sample Data
df = data.frame(
  Name = c ("x1","NA","group","Jean","Philippe","Celine","Dion","section","NA",
            "y2","z1","NA","group","Rosa","Albert","Stromae","section","NA","abc","something"), 
  value = as.character(seq(1:20))
)

df
        Name value
1         x1     1
2         NA     2
3      group     3
4       Jean     4
5   Philippe     5
6     Celine     6
7       Dion     7
8    section     8
9         NA     9
10        y2    10
11        z1    11
12        NA    12
13     group    13
14      Rosa    14
15    Albert    15
16   Stromae    16
17   section    17
18        NA    18
19       abc    19
20 something    20

Since the block group:section does not always have the same information, I don't know how can I tell R to keep rows between group andsection even if they are repeated. I only came up with this, which just keeps the rows the first time R sees group andsection.

df[which(df$Name=="group")[1]:which(df$Name=="section")[1],]
      Name value
3    group     3
4     Jean     4
5 Philippe     5
6   Celine     6
7     Dion     7
8  section     8

The desired output would be


4       Jean     4
5   Philippe     5
6     Celine     6
7       Dion     7
14      Rosa    14
15    Albert    15
16   Stromae    16

Thank you guys in advance for your help.

CodePudding user response:

There are multiple instances of those cases. So, we may need to loop to get the : as : is not vectorized

i1 <- which(df$Name %in% 'group')
i2 <- which(df$Name %in% 'section')
df[unlist(Map(`:`, i1 1, i2-1)),]

-output

       Name value
4      Jean     4
5  Philippe     5
6    Celine     6
7      Dion     7
14     Rosa    14
15   Albert    15
16  Stromae    16

CodePudding user response:

You can use cumsum like so:

df %>% 
  mutate(cum = lag(cumsum(Name == "group"))   cumsum(Name == "section")) %>% 
  filter(cum %% 2 == 1)

Or, using sequence in base R:

start <- which(df$Name == "group")
end <- which(df$Name == "section")

df[sequence(end-start-1, start 1),]

output

# A tibble: 7 x 3
  Name     value   cum
  <chr>    <chr> <int>
1 Jean     4         1
2 Philippe 5         1
3 Celine   6         1
4 Dion     7         1
5 Rosa     14        3
6 Albert   15        3
7 Stromae  16        3

CodePudding user response:

Another possible solution:

library(tidyverse)

df %>% 
  mutate(aux1 = if_else(Name == "group", 1, NA_real_),
         aux2 = if_else(Name == "section", 2, NA_real_),
         aux = coalesce(aux1, aux2)) %>% 
  fill(aux) %>% 
  filter(aux == 1 & Name != "group") %>% 
  select(Name, value)

#>       Name value
#> 1     Jean     4
#> 2 Philippe     5
#> 3   Celine     6
#> 4     Dion     7
#> 5     Rosa    14
#> 6   Albert    15
#> 7  Stromae    16

CodePudding user response:

The following use a sort of state machine where the last state is found in the .x term within the accumulate cycle and the input is in .y:

library(dplyr)
library(purrr)

df |>
  mutate(state = accumulate(tail(Name, -1), ~{
    if (.y == "section")
      "end"
    else
      if (.y == "group")
        "start"
    else
      if (.x == "start")
        "within"
    else
      if (.x == "end")
        "outside"
    else
      .x
  }, .init = "outside"))

##>        Name value   state
##>1         x1     1 outside
##>2         NA     2 outside
##>3      group     3   start
##>4       Jean     4  within
##>5   Philippe     5  within
##>6     Celine     6  within
##>7       Dion     7  within
##>8    section     8     end
##>9         NA     9 outside
##>10        y2    10 outside
##>11        z1    11 outside
##>12        NA    12 outside
##>13     group    13   start
##>14      Rosa    14  within
##>15    Albert    15  within
##>16   Stromae    16  within
##>17   section    17     end
##>18        NA    18 outside
##>19       abc    19 outside
##>20 something    20 outside

Then you can filter by state == "within" to obtain:

#>       Name value 
#> 1     Jean     4 
#> 2 Philippe     5 
#> 3   Celine     6 
#> 4     Dion     7 
#> 5     Rosa    14 
#> 6   Albert    15 
#> 7  Stromae    16 
  • Related