I would like to keep only the rows that lie between two specific values of a column in my data frame. The data frame below is organized in blocks, so to speak: the data I am interested in always starts at "group" and ends at "section", and I would like to remove everything outside those blocks.
# Sample Data
df <- data.frame(
  Name = c("x1", "NA", "group", "Jean", "Philippe", "Celine", "Dion", "section", "NA",
           "y2", "z1", "NA", "group", "Rosa", "Albert", "Stromae", "section", "NA", "abc", "something"),
  value = as.character(1:20)
)
df
Name value
1 x1 1
2 NA 2
3 group 3
4 Jean 4
5 Philippe 5
6 Celine 6
7 Dion 7
8 section 8
9 NA 9
10 y2 10
11 z1 11
12 NA 12
13 group 13
14 Rosa 14
15 Albert 15
16 Stromae 16
17 section 17
18 NA 18
19 abc 19
20 something 20
Since the group:section block does not always contain the same information, I don't know how to tell R to keep the rows between "group" and "section" even when the block is repeated. I only came up with this, which keeps just the rows for the first occurrence of "group" and "section":
df[which(df$Name=="group")[1]:which(df$Name=="section")[1],]
Name value
3 group 3
4 Jean 4
5 Philippe 5
6 Celine 6
7 Dion 7
8 section 8
The desired output would be
4 Jean 4
5 Philippe 5
6 Celine 6
7 Dion 7
14 Rosa 14
15 Albert 15
16 Stromae 16
Thank you guys in advance for your help.
CodePudding user response:
There are multiple instances of those blocks, so we need to loop over the start and end positions to build each index range, because : is not vectorized over its arguments:
i1 <- which(df$Name %in% 'group')
i2 <- which(df$Name %in% 'section')
df[unlist(Map(`:`, i1 + 1, i2 - 1)), ]
-output
Name value
4 Jean 4
5 Philippe 5
6 Celine 6
7 Dion 7
14 Rosa 14
15 Albert 15
16 Stromae 16
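One caveat with building the ranges via `:` is that an empty block (a "section" row directly after a "group" row) would yield a descending range such as 4:3 and pull in the marker rows. The following is only a sketch of one way to guard against that; `rows_between` and its arguments are hypothetical names, not part of the answer above, and it assumes every "group" has a matching later "section".

```r
# Sample data from the question
df <- data.frame(
  Name = c("x1", "NA", "group", "Jean", "Philippe", "Celine", "Dion", "section", "NA",
           "y2", "z1", "NA", "group", "Rosa", "Albert", "Stromae", "section", "NA",
           "abc", "something"),
  value = as.character(1:20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows strictly between each start/end marker pair.
# Assumption: markers are properly paired (same count, each start before its end).
rows_between <- function(data, col, start_marker, end_marker) {
  i1 <- which(data[[col]] == start_marker)
  i2 <- which(data[[col]] == end_marker)
  stopifnot(length(i1) == length(i2), all(i1 < i2))
  # s + seq_len(e - s - 1) is empty when the markers are adjacent,
  # whereas (s + 1):(e - 1) would count downwards
  idx <- unlist(Map(function(s, e) s + seq_len(e - s - 1), i1, i2))
  data[idx, , drop = FALSE]
}

result <- rows_between(df, "Name", "group", "section")
result
```

For the sample data this should select the same seven rows as the Map(`:`, ...) call; the difference only shows up when a block has nothing between its markers.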
CodePudding user response:
You can use cumsum() like so:
library(dplyr)

df %>%
  mutate(cum = lag(cumsum(Name == "group")) + cumsum(Name == "section")) %>%
  filter(cum %% 2 == 1)
Or, using sequence() in base R (the second argument, from, needs R >= 4.0.0):
start <- which(df$Name == "group")
end <- which(df$Name == "section")
df[sequence(end - start - 1, start + 1), ]
output (dplyr version)
# A tibble: 7 x 3
Name value cum
<chr> <chr> <int>
1 Jean 4 1
2 Philippe 5 1
3 Celine 6 1
4 Dion 7 1
5 Rosa 14 3
6 Albert 15 3
7 Stromae 16 3
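To make the counting idea more explicit, here is a minimal base R sketch that subtracts the two running counts instead of checking parity. It is a variation on the cumsum() approach above, not the answer's own code; the names `opened`, `closed` and `inside` are made up for illustration, and it assumes the markers strictly alternate and that Name contains no real missing values (the "NA" entries in the sample are strings).

```r
# Sample data from the question
df <- data.frame(
  Name = c("x1", "NA", "group", "Jean", "Philippe", "Celine", "Dion", "section", "NA",
           "y2", "z1", "NA", "group", "Rosa", "Albert", "Stromae", "section", "NA",
           "abc", "something"),
  value = as.character(1:20),
  stringsAsFactors = FALSE
)

# Running count of blocks opened and closed so far (illustrative names)
opened <- cumsum(df$Name == "group")
closed <- cumsum(df$Name == "section")

# A row is inside a block while one more block has been opened than closed;
# the "group" row itself also satisfies that, so it is excluded explicitly.
inside <- (opened - closed) == 1 & df$Name != "group"

result <- df[inside, ]
result
```

The "section" row drops out on its own because the closed count already includes it, which brings the difference back to zero.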
CodePudding user response:
Another possible solution:
library(tidyverse)
df %>%
  mutate(aux1 = if_else(Name == "group", 1, NA_real_),
         aux2 = if_else(Name == "section", 2, NA_real_),
         aux = coalesce(aux1, aux2)) %>%
  fill(aux) %>%
  filter(aux == 1 & Name != "group") %>%
  select(Name, value)
#> Name value
#> 1 Jean 4
#> 2 Philippe 5
#> 3 Celine 6
#> 4 Dion 7
#> 5 Rosa 14
#> 6 Albert 15
#> 7 Stromae 16
CodePudding user response:
The following uses a sort of state machine: within each accumulate() step, .x holds the previous state and .y holds the current input:
library(dplyr)
library(purrr)
df |>
  mutate(state = accumulate(tail(Name, -1), ~{
    if (.y == "section")
      "end"
    else if (.y == "group")
      "start"
    else if (.x == "start")
      "within"
    else if (.x == "end")
      "outside"
    else
      .x
  }, .init = "outside"))
##> Name value state
##>1 x1 1 outside
##>2 NA 2 outside
##>3 group 3 start
##>4 Jean 4 within
##>5 Philippe 5 within
##>6 Celine 6 within
##>7 Dion 7 within
##>8 section 8 end
##>9 NA 9 outside
##>10 y2 10 outside
##>11 z1 11 outside
##>12 NA 12 outside
##>13 group 13 start
##>14 Rosa 14 within
##>15 Albert 15 within
##>16 Stromae 16 within
##>17 section 17 end
##>18 NA 18 outside
##>19 abc 19 outside
##>20 something 20 outside
Then you can filter by state == "within" to obtain:
#> Name value
#> 1 Jean 4
#> 2 Philippe 5
#> 3 Celine 6
#> 4 Dion 7
#> 5 Rosa 14
#> 6 Albert 15
#> 7 Stromae 16
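The same state-machine idea, including the final filtering step, can be sketched without extra packages using base R's Reduce() with accumulate = TRUE. This is a loose adaptation rather than the code above: the `step` function is a hypothetical name, the transitions are slightly simplified, and the whole Name column is fed in (dropping the initial state afterwards) so that a "group" in the first row would also be recognised.

```r
# Sample data from the question
df <- data.frame(
  Name = c("x1", "NA", "group", "Jean", "Philippe", "Celine", "Dion", "section", "NA",
           "y2", "z1", "NA", "group", "Rosa", "Albert", "Stromae", "section", "NA",
           "abc", "something"),
  value = as.character(1:20),
  stringsAsFactors = FALSE
)

# Hypothetical transition function: prev is the previous state, name the current input
step <- function(prev, name) {
  if (name == "section") {
    "end"
  } else if (name == "group") {
    "start"
  } else if (prev %in% c("start", "within")) {
    "within"
  } else {
    "outside"
  }
}

# accumulate = TRUE keeps every intermediate state; the first element is the
# initial state itself, so it is dropped to align the states with the rows
state <- Reduce(step, df$Name, accumulate = TRUE, init = "outside")[-1]

result <- df[state == "within", ]
result
```

Keeping the `state` vector around can also help when debugging, since it shows where each block starts and ends.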