I am processing keystroke data and need to find the word that each keystroke falls within. Because there can be invisible keystrokes (like Shift), this is not a trivial problem where I can just step through the keystroke index and look up the word. Rather, I need to find the space-delimited word that the keystroke was produced within. I do have the full text (final_text) and the existing text at each keystroke (existing_text) available, which I should be able to leverage. I've tried solutions using fill(), lag(), and cumsum(), but none are working.
I have a dataframe like the one below, where I group by experiment_id:
library(tibble)

x <- tibble(
  experiment_id = rep(c('1a','1b'), each = 10),
  keystroke = rep(c('a','SPACE','SHIFT','b','a','d','SPACE','m','a','n'), 2),
  existing_text = rep(c('a','a ','a ','a B','a Ba','a Bad','a Bad ',
                        'a Bad m','a Bad ma','a Bad man'), 2),
  final_text = 'a Bad man'
)
The additional column should look like this for each experiment:
within_word = c('a',NA,'Bad','Bad','Bad','Bad',NA,'man','man','man')
Is there a way to derive this?
CodePudding user response:
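library(dplyr)
library(stringr)
x %>%
  # ww: text added by each keystroke ("" for invisible keys such as SHIFT)
  mutate(ww = str_remove(existing_text, fixed(lag(existing_text, default = ".")))) %>%
  # a new group starts at every space and at the keystroke right after it
  group_by(grp = cumsum(ww == ' ' | lag(ww == ' ', default = FALSE))) %>%
  # paste the pieces of each group together; a lone space becomes NA
  mutate(within_word = str_c(ww, collapse = ''),
         within_word = na_if(within_word, ' '))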
# A tibble: 10 x 6
# Groups:   grp [5]
   keystroke existing_text final_text ww      grp within_word
   <chr>     <chr>         <chr>      <chr> <int> <chr>
 1 a         "a"           a Bad man  "a"       0 a
 2 SPACE     "a "          a Bad man  " "       1 NA
 3 SHIFT     "a "          a Bad man  ""        2 Bad
 4 b         "a B"         a Bad man  "B"       2 Bad
 5 a         "a Ba"        a Bad man  "a"       2 Bad
 6 d         "a Bad"       a Bad man  "d"       2 Bad
 7 SPACE     "a Bad "      a Bad man  " "       3 NA
 8 m         "a Bad m"     a Bad man  "m"       4 man
 9 a         "a Bad ma"    a Bad man  "a"       4 man
10 n         "a Bad man"   a Bad man  "n"       4 man
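
The output above covers a single experiment. Since the question groups by experiment_id, here is a sketch of the same idea applied per experiment, so that lag() and cumsum() restart at the first keystroke of each one. Without that, the last word of one experiment would be pasted together with the first keystroke of the next. It assumes each keystroke only appends to existing_text (no deletions or edits in the middle of the text), and `result` is just an illustrative name.

```r
library(dplyr)
library(stringr)
library(tibble)

# Same example data as in the question: two experiments, ten keystrokes each
x <- tibble(
  experiment_id = rep(c('1a', '1b'), each = 10),
  keystroke = rep(c('a','SPACE','SHIFT','b','a','d','SPACE','m','a','n'), 2),
  existing_text = rep(c('a','a ','a ','a B','a Ba','a Bad','a Bad ',
                        'a Bad m','a Bad ma','a Bad man'), 2),
  final_text = 'a Bad man'
)

result <- x %>%
  # restart lag() and cumsum() at the first keystroke of every experiment
  group_by(experiment_id) %>%
  mutate(
    # text added by this keystroke ("" for invisible keys such as SHIFT)
    ww  = str_remove(existing_text, fixed(lag(existing_text, default = "."))),
    # new group at every space and at the keystroke right after a space
    grp = cumsum(ww == ' ' | lag(ww == ' ', default = FALSE))
  ) %>%
  # paste the pieces within each (experiment, group) pair; a lone space becomes NA
  group_by(experiment_id, grp) %>%
  mutate(within_word = na_if(str_c(ww, collapse = ''), ' ')) %>%
  ungroup() %>%
  select(-ww, -grp)

print(result, n = 20)
```

On the example data this should give the within_word values from the question, repeated for both experiments.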