Restrict `fill` to last occurrence of pattern

I have data such as this:

df <- structure(list(line = c("001", "002", "003", "004", "005", "006", 
                              "007", "008", "009", "010", "011", "012", "013", "014"), 
                     utterance = c("((m: both hands", 
                                   "((m: both hands", 
                                   "((i: DH=1, SZ=0", "((i: DH=1, SZ=0", 
                                   "((s: Preface))", "((m: both hands", 
                                   "((m: both hands clasped", 
                                   "((m: both hands clasped", 
                                   "((s: Background))", "((m: enumerating", 
                                   "((m: enumerating", 
                                   "((s: End))", "((i: DH=1, SZ=0", "((m: relax gesture))"
                              )), row.names = c(NA, 14L), class = "data.frame")

I want to create a new column, story, and fill it with the values from column utterance that match the regex pattern \\(\\(s. However, the fill should stop at the last value matching that pattern, which here is ((s: End)).

This fill command does not stop at that pattern - how can I stop fill at the pattern?

library(dplyr)
library(tidyr)
df %>%
  mutate(story = ifelse(grepl("\\(\\(s", utterance), utterance, NA)) %>%
  fill(story, .direction = "down")
   line               utterance             story
1   001         ((m: both hands              <NA>
2   002         ((m: both hands              <NA>
3   003         ((i: DH=1, SZ=0              <NA>
4   004         ((i: DH=1, SZ=0              <NA>
5   005          ((s: Preface))    ((s: Preface))
6   006         ((m: both hands    ((s: Preface))
7   007 ((m: both hands clasped    ((s: Preface))
8   008 ((m: both hands clasped    ((s: Preface))
9   009       ((s: Background)) ((s: Background))
10  010        ((m: enumerating ((s: Background))
11  011        ((m: enumerating ((s: Background))
12  012              ((s: End))        ((s: End))
13  013         ((i: DH=1, SZ=0        ((s: End))
14  014    ((m: relax gesture))        ((s: End))

Desired:

   line               utterance             story
1   001         ((m: both hands              <NA>
2   002         ((m: both hands              <NA>
3   003         ((i: DH=1, SZ=0              <NA>
4   004         ((i: DH=1, SZ=0              <NA>
5   005          ((s: Preface))    ((s: Preface))
6   006         ((m: both hands    ((s: Preface))
7   007 ((m: both hands clasped    ((s: Preface))
8   008 ((m: both hands clasped    ((s: Preface))
9   009       ((s: Background)) ((s: Background))
10  010        ((m: enumerating ((s: Background))
11  011        ((m: enumerating ((s: Background))
12  012              ((s: End))        ((s: End))
13  013         ((i: DH=1, SZ=0              <NA>
14  014    ((m: relax gesture))              <NA>

CodePudding user response：

tidyr::fill itself doesn't do that, but you can add one more mutate:

df %>%
  mutate(story = if_else(grepl("\\(\\(s", utterance), utterance, NA_character_)) %>%
  fill(story, .direction = "down") %>%
  mutate(story = if_else(story == last(story) & duplicated(story), NA_character_, story))
#    line               utterance             story
# 1   001         ((m: both hands              <NA>
# 2   002         ((m: both hands              <NA>
# 3   003         ((i: DH=1, SZ=0              <NA>
# 4   004         ((i: DH=1, SZ=0              <NA>
# 5   005          ((s: Preface))    ((s: Preface))
# 6   006         ((m: both hands    ((s: Preface))
# 7   007 ((m: both hands clasped    ((s: Preface))
# 8   008 ((m: both hands clasped    ((s: Preface))
# 9   009       ((s: Background)) ((s: Background))
# 10  010        ((m: enumerating ((s: Background))
# 11  011        ((m: enumerating ((s: Background))
# 12  012              ((s: End))        ((s: End))
# 13  013         ((i: DH=1, SZ=0              <NA>
# 14  014    ((m: relax gesture))              <NA>

This takes the last value of story after filling and sets every repeat of it, apart from its first appearance, back to NA. It assumes that row order matters, and it does not assume that the last story must be the literal ((s: End)), though you can update the logic accordingly if you prefer.
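One case the duplicated() step above may not handle is a final story label that also occurs earlier in the column, because every repeat after the very first appearance is blanked. As a sketch of an alternative (not the answerer's code), the cutoff can be taken from the row index of the last pattern match instead. The toy data frame and the names toy, is_story, by_value and by_index are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the final label "((s: Part))" also appears earlier
toy <- data.frame(
  utterance = c("((m: a", "((s: Part))", "((m: b", "((s: Part))", "((m: c")
)

filled <- toy %>%
  mutate(is_story = grepl("\\(\\(s", utterance),
         story = if_else(is_story, utterance, NA_character_)) %>%
  fill(story, .direction = "down")

# Value-based cleanup from the answer above: also blanks rows 3 and 4
by_value <- filled %>%
  mutate(story = if_else(story == last(story) & duplicated(story),
                         NA_character_, story))

# Index-based cleanup: blank only the rows after the last matching row
by_index <- filled %>%
  mutate(story = if_else(row_number() > max(which(is_story)),
                         NA_character_, story))

by_value$story
by_index$story
```

On the question's data both versions should agree, since no story label repeats there; the index-based variant does assume that at least one row matches the pattern.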

FYI, I changed from ifelse to if_else because it is type-safe (base::ifelse is not). It can require being specific about which NA to use (R has several typed variants, such as NA_integer_, NA_real_, and NA_character_).
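To illustrate the type-safety point, here is a small sketch comparing the two functions on a Date value and on mixed types. The variable names are made up for the example, and the exact error message from if_else() depends on the installed dplyr version, so only the fact that an error is raised is captured:

```r
library(dplyr)

d <- as.Date("2024-01-01")

# base::ifelse() strips attributes, so the Date comes back as a plain number
base_res <- ifelse(TRUE, d, d)
class(base_res)    # "numeric"

# dplyr::if_else() keeps the Date class
dplyr_res <- if_else(TRUE, d, d)
class(dplyr_res)   # "Date"

# base::ifelse() silently coerces mixed types to character
base_mixed <- ifelse(c(TRUE, FALSE), 1L, "a")

# dplyr::if_else() refuses to mix integer and character
dplyr_mixed <- tryCatch(
  if_else(c(TRUE, FALSE), 1L, "a"),
  error = function(e) "error"
)
```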

CodePudding user response：

We can use zoo::na.locf0, applied only to the rows up to and including ((s: End)):

library(dplyr)
library(zoo)
df %>%
  mutate(
    story = ifelse(grepl("\\(\\(s", utterance), utterance, NA),
    # row index of the closing marker
    ind = match("((s: End))", story),
    # carry values forward only within rows 1..ind
    story = replace(story, seq_len(first(ind)),
                    zoo::na.locf0(story[seq_len(first(ind))])),
    ind = NULL
  )

-output

   line               utterance             story
1   001         ((m: both hands              <NA>
2   002         ((m: both hands              <NA>
3   003         ((i: DH=1, SZ=0              <NA>
4   004         ((i: DH=1, SZ=0              <NA>
5   005          ((s: Preface))    ((s: Preface))
6   006         ((m: both hands    ((s: Preface))
7   007 ((m: both hands clasped    ((s: Preface))
8   008 ((m: both hands clasped    ((s: Preface))
9   009       ((s: Background)) ((s: Background))
10  010        ((m: enumerating ((s: Background))
11  011        ((m: enumerating ((s: Background))
12  012              ((s: End))        ((s: End))
13  013         ((i: DH=1, SZ=0              <NA>
14  014    ((m: relax gesture))              <NA>
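
The answer above hard-codes the literal "((s: End))". If the closing marker is not known in advance, one possible base R sketch fills by group and cuts off after the last row that matches the pattern. The helper name fill_until_last is hypothetical, and the sketch assumes the input is a character vector:

```r
# Hypothetical helper: forward-fill pattern matches, but stop at the last match
fill_until_last <- function(x, pattern) {
  hit <- grepl(pattern, x)
  # no match at all: nothing to fill
  if (!any(hit)) return(rep(NA_character_, length(x)))
  grp <- cumsum(hit)                        # 0 before the first match, then 1, 2, ...
  out <- c(NA_character_, x[hit])[grp + 1]  # label of the group each row belongs to
  out[seq_along(x) > max(which(hit))] <- NA_character_  # blank rows after last match
  out
}

x <- c("((m: a", "((s: Preface))", "((m: b", "((s: End))", "((i: c")
res <- fill_until_last(x, "\\(\\(s")
res
```

With the question's data this could be used as df$story <- fill_until_last(df$utterance, "\\(\\(s"), without loading any packages.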