I have this type of data:
df <- structure(list(Utterance = c("(5.127)", ">like I don't understand< sorry like how old's your mom¿",
"(0.855)", "eh six:ty:::-one=", "(0.101)", "(0.487)", "[((v: gasps)) she said] ~no you're [not?]~",
"[((v: gasps)) she said] ~no you're [not?]~", "~<[NO YOU'RE] NOT (.) you can't go !in!>~",
"(0.260)", "show her [your boobs] next time"),
Q = c(NA, "q_wh", "", "", NA, NA, "q_really", "", "", NA, NA),
Sequ = c(NA, 1L, 1L, 1L, NA, NA, 0L, 0L, 0L, NA, NA)), class = "data.frame", row.names = c(NA, -11L))
I would like to extract/filter
- those rows where Sequ is not NA, and
- the immediately preceding row (where Sequ is NA).
My attempt so far is to define a function that gets the indices of the relevant rows:
QA_sequ <- function(value) {
inds <- which(!is.na(value) & lag(is.na(value)))
sort(unique(c(inds-1, inds)))
}
and then to slice out the rows via the indices:
library(dplyr)
df %>%
slice(QA_sequ(Sequ))
Utterance Q Sequ
1 (5.127) <NA> NA
2 >like I don't understand< sorry like how old's your mom¿ q_wh 1
3 (0.487) <NA> NA
4 [((v: gasps)) she said] ~no you're [not?]~ q_really 0
However, only the immediately preceding row and the first Sequ row of each sequence are returned. The result I want to obtain is this:
Utterance Q Sequ
1 (5.127) <NA> NA
2 >like I don't understand< sorry like how old's your mom¿ q_wh 1
3 (0.855) 1
4 eh six:ty:::-one= 1
5 (0.487) <NA> NA
6 [((v: gasps)) she said] ~no you're [not?]~ q_really 0
7 [((v: gasps)) she said] ~no you're [not?]~ 0
8 ~<[NO YOU'RE] NOT (.) you can't go !in!>~ 0
EDIT:
The solution I've come up with feels cumbersome:
QA_sequ <- function(value) {
inds <- which(!is.na(value) & lag(is.na(value)))
sort(unique(c(inds-1))) # extract only preceding row!
}
library(dplyr)
df %>%
mutate(id = row_number()) %>%
slice(QA_sequ(Sequ)) %>%
bind_rows(., df %>% mutate(id = row_number()) %>% filter(!is.na(Sequ))) %>%
arrange(id)
CodePudding user response:
How about this?
df %>%
filter(!is.na(Sequ) | lead(!is.na(Sequ), default=FALSE))
# Utterance Q Sequ
# 1 (5.127) <NA> NA
# 2 >like I don't understand< sorry like how old's your mom¿ q_wh 1
# 3 (0.855) 1
# 4 eh six:ty:::-one= 1
# 5 (0.487) <NA> NA
# 6 [((v: gasps)) she said] ~no you're [not?]~ q_really 0
# 7 [((v: gasps)) she said] ~no you're [not?]~ 0
# 8 ~<[NO YOU'RE] NOT (.) you can't go !in!>~ 0
The logic filters (extracts) both of:
- all non-NA values
- any NA value where the next value is not NA
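To make the two conditions concrete, here is a minimal base-R sketch that builds the same mask step by step. It assumes a plain vector copied from the Sequ column of the question, and it uses c(not_na[-1], FALSE) as a stand-in for lead(..., default = FALSE); the variable names are only illustrative.

```r
# Values copied from the Sequ column in the question.
sequ <- c(NA, 1L, 1L, 1L, NA, NA, 0L, 0L, 0L, NA, NA)

# Condition 1: the value itself is not NA.
not_na <- !is.na(sequ)

# Condition 2: the next value is not NA.
# Shifting left by one and padding with FALSE mimics lead(..., default = FALSE).
next_not_na <- c(not_na[-1], FALSE)

# A row is kept if either condition holds.
keep <- not_na | next_not_na
which(keep)
# [1] 1 2 3 4 6 7 8 9
```

Rows 5, 10 and 11 are dropped because they are NA and are not followed by a non-NA value.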
CodePudding user response:
Just add an additional OR to collect the rows where Sequ is not NA but which are not directly preceded by an NA ...
QA_sequ <- function(value) {
inds <- which((!is.na(value) & lag(is.na(value))) | !is.na(value))
sort(unique(c(inds-1, inds)))
}
df %>% slice(QA_sequ(Sequ))
Utterance Q Sequ
1 (5.127) <NA> NA
2 >like I don't understand< sorry like how old's your mom¿ q_wh 1
3 (0.855) 1
4 eh six:ty:::-one= 1
5 (0.487) <NA> NA
6 [((v: gasps)) she said] ~no you're [not?]~ q_really 0
7 [((v: gasps)) she said] ~no you're [not?]~ 0
8 ~<[NO YOU'RE] NOT (.) you can't go !in!>~ 0
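As a side note, the combined condition above appears to reduce to plain !is.na(value), since (A & B) | A is just A. The sketch below checks this on the Sequ values from the question; it uses a base-R shift as a stand-in for dplyr's lag(), which is an assumption made only to keep the example free of dependencies.

```r
# Values copied from the Sequ column in the question.
value <- c(NA, 1L, 1L, 1L, NA, NA, 0L, 0L, 0L, NA, NA)

not_na <- !is.na(value)

# Base-R stand-in for lag(is.na(value)): shift right by one, pad with NA.
prev_is_na <- c(NA, head(is.na(value), -1))

# The condition used inside which() in QA_sequ above.
combined <- (not_na & prev_is_na) | not_na

identical(combined, not_na)
# [1] TRUE
```

So the work of picking up the preceding rows is done entirely by c(inds-1, inds), not by the lag() term.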
CodePudding user response:
This uses base R. As in your approach, take the indices of the non-NA rows, then add the indices of the rows immediately preceding them.
x <- which(!is.na(df$Sequ))      # rows where Sequ is not NA
x <- sort(unique(c(x, x - 1)))   # add the preceding rows, drop duplicates
df[x, ]
You can also pass the same vector to slice:
df %>% slice(x)
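If the index logic is needed in more than one place, it could be wrapped in a small function. The sketch below is only an illustration: the name rows_with_preceding is hypothetical, and the explicit idx >= 1L filter is an added safeguard for the case where the very first entry is non-NA (which would otherwise produce an index of 0).

```r
# Hypothetical helper: indices of the non-NA entries plus the entry
# immediately before each of them.
rows_with_preceding <- function(value) {
  hits <- which(!is.na(value))
  idx <- sort(unique(c(hits - 1L, hits)))
  idx[idx >= 1L]  # drop the 0 produced when the first entry is non-NA
}

# Values copied from the Sequ column in the question.
sequ <- c(NA, 1L, 1L, 1L, NA, NA, 0L, 0L, 0L, NA, NA)
rows_with_preceding(sequ)
# [1] 1 2 3 4 6 7 8 9

# Edge case: the first entry is already non-NA.
rows_with_preceding(c(1L, NA, 2L))
# [1] 1 2 3
```

The result can be used either as df[rows_with_preceding(df$Sequ), ] or inside slice().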