I have data such as this:
df <- structure(list(line = c("001", "002", "003", "004", "005", "006",
"007", "008", "009", "010", "011", "012", "013", "014"),
utterance = c("((m: both hands",
"((m: both hands",
"((i: DH=1, SZ=0", "((i: DH=1, SZ=0",
"((s: Preface))", "((m: both hands",
"((m: both hands clasped",
"((m: both hands clasped",
"((s: Background))", "((m: enumerating",
"((m: enumerating",
"((s: End))", "((i: DH=1, SZ=0", "((m: relax gesture))"
)), row.names = c(NA, 14L), class = "data.frame")
I want to create a new column story and fill that column with those values from column utterance that match the regex pattern \\(\\(s. But I want the fill to stop at the last value matching that pattern, which is ((s: End)).
This fill command does not stop at that pattern. How can I stop fill at the pattern?
library(dplyr)
library(tidyr)
df %>%
mutate(story = ifelse(grepl("\\(\\(s", utterance), utterance, NA)) %>%
fill(story, .direction = "down")
   line               utterance             story
1   001         ((m: both hands              <NA>
2   002         ((m: both hands              <NA>
3   003         ((i: DH=1, SZ=0              <NA>
4   004         ((i: DH=1, SZ=0              <NA>
5   005          ((s: Preface))    ((s: Preface))
6   006         ((m: both hands    ((s: Preface))
7   007 ((m: both hands clasped    ((s: Preface))
8   008 ((m: both hands clasped    ((s: Preface))
9   009       ((s: Background)) ((s: Background))
10  010        ((m: enumerating ((s: Background))
11  011        ((m: enumerating ((s: Background))
12  012              ((s: End))        ((s: End))
13  013         ((i: DH=1, SZ=0        ((s: End))
14  014    ((m: relax gesture))        ((s: End))
Desired:
   line               utterance             story
1   001         ((m: both hands              <NA>
2   002         ((m: both hands              <NA>
3   003         ((i: DH=1, SZ=0              <NA>
4   004         ((i: DH=1, SZ=0              <NA>
5   005          ((s: Preface))    ((s: Preface))
6   006         ((m: both hands    ((s: Preface))
7   007 ((m: both hands clasped    ((s: Preface))
8   008 ((m: both hands clasped    ((s: Preface))
9   009       ((s: Background)) ((s: Background))
10  010        ((m: enumerating ((s: Background))
11  011        ((m: enumerating ((s: Background))
12  012              ((s: End))        ((s: End))
13  013         ((i: DH=1, SZ=0              <NA>
14  014    ((m: relax gesture))              <NA>
CodePudding user response:
tidyr::fill itself doesn't do that, but you can add one more mutate:
library(dplyr)
library(tidyr)
df %>%
mutate(story = if_else(grepl("\\(\\(s", utterance), utterance, NA_character_)) %>%
fill(story, .direction = "down") %>%
mutate(story = if_else(story == last(story) & duplicated(story), NA_character_, story))
#    line               utterance             story
# 1   001         ((m: both hands              <NA>
# 2   002         ((m: both hands              <NA>
# 3   003         ((i: DH=1, SZ=0              <NA>
# 4   004         ((i: DH=1, SZ=0              <NA>
# 5   005          ((s: Preface))    ((s: Preface))
# 6   006         ((m: both hands    ((s: Preface))
# 7   007 ((m: both hands clasped    ((s: Preface))
# 8   008 ((m: both hands clasped    ((s: Preface))
# 9   009       ((s: Background)) ((s: Background))
# 10  010        ((m: enumerating ((s: Background))
# 11  011        ((m: enumerating ((s: Background))
# 12  012              ((s: End))        ((s: End))
# 13  013         ((i: DH=1, SZ=0              <NA>
# 14  014    ((m: relax gesture))              <NA>
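This takes the last value of story after the fill and sets every occurrence of it except the first to NA. It assumes that row order matters, and it does not assume that the last story must include the literal s: End, though you can update the logic accordingly if you prefer. Because duplicated() looks at the whole column, it also assumes that the final story label does not appear earlier in the data.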
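If you would rather anchor on the literal ((s: End)) marker, one way to update the logic is sketched below. The small data frame is a made-up stand-in for the question's data, and the nomatch = n() fallback (leave the fill untouched when the marker is absent) is my own assumption about what you would want in that case.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's data: same markers, fewer rows
df <- data.frame(
  line = c("001", "002", "003", "004", "005"),
  utterance = c("((m: both hands", "((s: Preface))", "((m: enumerating",
                "((s: End))", "((m: relax gesture))")
)

res <- df %>%
  mutate(story = if_else(grepl("\\(\\(s", utterance), utterance, NA_character_)) %>%
  fill(story, .direction = "down") %>%
  # Blank out every row after the first literal "((s: End))".
  # If the marker is missing, nomatch = n() makes the comparison FALSE everywhere.
  mutate(story = if_else(
    row_number() > match("((s: End))", utterance, nomatch = n()),
    NA_character_, story
  ))

res$story
# NA "((s: Preface))" "((s: Preface))" "((s: End))" NA
```

Unlike the duplicated() version, this does not depend on the filled values being unique, but it does hard-code the end marker.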
FYI, I changed from ifelse to if_else, as it is type-safe (base::ifelse is not). It requires being specific about which NA to use, here NA_character_ (base R has several typed variants, such as NA_integer_ and NA_real_).
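To make the type-safety point concrete, here is a small sketch; the vector x and the "big"/0 branches are invented purely for illustration.

```r
library(dplyr)

# NA on its own is logical; the typed variants carry their own class
class(NA)             # "logical"
class(NA_character_)  # "character"
class(NA_integer_)    # "integer"

x <- c(1, 2, 3)

# base::ifelse takes its result type from whichever branches end up being used
loose_a <- ifelse(x > 5, "big", 0)  # only the `no` branch is used
loose_b <- ifelse(x > 1, "big", 0)  # both branches used, numbers become strings
class(loose_a)  # "numeric"
class(loose_b)  # "character"

# dplyr::if_else refuses to mix a character branch with a numeric one
strict_fails <- tryCatch({
  if_else(x > 1, "big", 0)
  FALSE
}, error = function(e) TRUE)
strict_fails  # TRUE
```

So with ifelse the output type can silently change with the data, whereas if_else raises an error as soon as the branches disagree.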
CodePudding user response:
We can use na.locf0 from zoo, applied only to the rows up to and including ((s: End)):
library(dplyr)
library(zoo)
df %>%
mutate(story = ifelse(grepl("\\(\\(s", utterance), utterance, NA),
ind = match("((s: End))", story),
story = replace(story, seq_len(first(ind)),
zoo::na.locf0(story[seq_len(first(ind))])), ind = NULL)
-output
   line               utterance             story
1   001         ((m: both hands              <NA>
2   002         ((m: both hands              <NA>
3   003         ((i: DH=1, SZ=0              <NA>
4   004         ((i: DH=1, SZ=0              <NA>
5   005          ((s: Preface))    ((s: Preface))
6   006         ((m: both hands    ((s: Preface))
7   007 ((m: both hands clasped    ((s: Preface))
8   008 ((m: both hands clasped    ((s: Preface))
9   009       ((s: Background)) ((s: Background))
10  010        ((m: enumerating ((s: Background))
11  011        ((m: enumerating ((s: Background))
12  012              ((s: End))        ((s: End))
13  013         ((i: DH=1, SZ=0              <NA>
14  014    ((m: relax gesture))              <NA>
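Neither answer above stops at "the last value matching the pattern" directly: one relies on the final filled value being unique, the other on the literal ((s: End)). A possible variant that keys on the position of the last regex match is sketched here. The helper column is_story and the small data frame are hypothetical names and data introduced only for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's data: same markers, fewer rows
df <- data.frame(
  line = c("001", "002", "003", "004", "005"),
  utterance = c("((m: both hands", "((s: Preface))", "((m: enumerating",
                "((s: End))", "((m: relax gesture))")
)

res <- df %>%
  mutate(
    is_story = grepl("\\(\\(s", utterance),
    story = if_else(is_story, utterance, NA_character_)
  ) %>%
  fill(story, .direction = "down") %>%
  mutate(
    # Once the running count of markers equals the total, the last marker has
    # been seen; blank every later row but keep the marker row itself.
    story = if_else(cumsum(is_story) == sum(is_story) & !is_story,
                    NA_character_, story),
    is_story = NULL  # drop the helper column
  )

res$story
# NA "((s: Preface))" "((s: Preface))" "((s: End))" NA
```

This keeps working if the last marker is renamed or if a story label repeats earlier in the data, at the cost of one temporary column.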